tailcalled

Comments

Those are more like counterstatements against doom, explaining that you don't see certain problems that doomers raise. The OP, by contrast, attempts to make an independently standing argument about what is present.

It's still not obvious to me why adversaries are a big issue. If I'm acting against an adversary, it seems like I won't make counter-plans that lead to lots of side-effects either, for the same reasons they won't. 

I mean, we can start by noticing that historically, optimization in the presence of adversaries has led to huge things. The world wars wrecked Europe. States and large bureaucratic organizations probably exist mainly as a consequence of farm raids. The immune system tends to stress the body a lot when it is dealing with an infection. The nuclear arms race created existential risk for humanity, and even though that risk never actually materialized, it still made people quite afraid of e.g. nuclear power. Etc.

Now, why does trying to destroy a hostile optimizer tend to cause so much destruction? I feel like the question almost answers itself.

Or if we want to go mechanistic about it: one of the ways to fight back against the Nazis is with bombs, which deliver a sudden shockwave of energy that destroys Nazi structures and everything else alike. It's almost constitutive of the alignment problem: we have many ways of influencing the world a great deal, but those methods do not discriminate between good and bad.

From an abstract point of view, many coherence theorems rely on e.g. Dutch books, and thus become much more applicable in the case of adversaries. The coherence theorem "if an agent achieves its goals robustly regardless of environment, then it stops people who want to shut it down" can be trivially restated as "either an agent does not achieve its goals robustly regardless of environment, or it stops people who want to shut it down", and here non-adversarial agents should obviously choose the former branch (to be corrigible, you need to not achieve your goals in an environment where someone is trying to shut you down).
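
Spelled out symbolically (my notation, with R for "achieves its goals robustly regardless of environment" and S for "stops people who want to shut it down"), the restatement above is just material implication:

```latex
% R: the agent achieves its goals robustly regardless of environment
% S: the agent stops people who want to shut it down
(R \Rightarrow S) \;\equiv\; (\lnot R \lor S)
% Corrigibility amounts to picking the \lnot R disjunct: accepting that
% your goals won't be robustly achieved in environments where someone
% is trying to shut you down.
```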

From a more strategic point of view, when dealing with an adversary, you tend to become a lot more constrained on resources, because if the adversary can find a way to drain your resources, it will try to do so. Ways to succeed include:

  • Making it harder for people to trick you into losing resources, by e.g. making yourself harder to predict, being less trusting of what people tell you, and winning as quickly as possible
  • Gaining more resources by grabbing them from elsewhere

Also, in an adversarial context, a natural prior is that inconveniences are there for a reason, namely to interfere with you. This tends to make enemies.

I think mesa-optimizers could be a major problem, but there are good odds we live in a world where they aren't. Why do I think they're plausible? Because optimization is a pretty natural capability, and a mind being/becoming an optimizer at the top level doesn't seem like a very complex claim, so I assign decent odds to it. There's some weak evidence in favour of this too, e.g. humans not optimizing for what the local, myopic evolutionary optimizer acting on them is optimizing for, coherence theorems, etc. But that's not super strong, and there are other simple hypotheses for how things could go, so I don't assign more than like 10% credence to the hypothesis.

Mesa-optimizers definitely exist to varying degrees, but they generally try not to get too involved with other things. Mechanistically, we can attribute this to imitation learning, since they're trying to mimic humans' tendency to stitch together strategies in a reasonable way. Abstractly, the friendliness of instrumental goals shows us why unbounded unfriendly utility maximizers are not the only or even the main attractor here.

(... Some people might say that we have a mathematical model of unbounded unfriendly utility maximizers but not of friendlier bounded instrumental optimizers. But those people are wrong because the model of utility maximizers assumes we have an epistemic oracle to handle the updating, prediction and optimization for us, and really that's the computationally heavy part. One of the advantages of more bounded optimization like in the OP is that it ought to be more computationally tractable because different parts of the plans interfere less with each other. It's not really fair to say that we know how utility maximizers work when they outsource the important part to the assumptions.)
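
To make that concrete, here is a toy sketch of my own (the names are purely illustrative): the expected-utility argmax itself is a few lines, while all the computational weight sits in the `posterior` argument, which the formalism simply assumes is handed to us.

```python
# Toy sketch (my own illustration, not a standard library): the textbook
# utility-maximizer picture, with the epistemic oracle made explicit.
from typing import Callable, Iterable

def argmax_expected_utility(
    actions: Iterable[str],
    worlds: Iterable[str],
    posterior: Callable[[str, str], float],  # P(world | action) -- the assumed oracle
    utility: Callable[[str], float],
) -> str:
    """Return the action maximizing the sum over worlds of P(world | action) * U(world)."""
    worlds = list(worlds)

    def expected_utility(action: str) -> float:
        return sum(posterior(world, action) * utility(world) for world in worlds)

    # The argmax is trivial; the updating/prediction hidden inside `posterior`
    # is the computationally heavy part that the model outsources to assumption.
    return max(actions, key=expected_utility)
```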

Gases typically aren't assembled by trillions of repetitions of isolating an atom and inserting it into a container. Gas canisters are (I assume) assembled by e.g. compressing some reservoir (even simply a fraction of the atmosphere) or via a chemical reaction that produces the gas, and in these cases such procedures constitute the long-tailed variable that I am talking about in this series. (They are large relative to the individual particle velocities, and the particle velocities are a diminished form of the creation procedure, as e.g. some ways of creating the gas leave it hotter and so on.) Gases in nature also have long-tailed causes, e.g. the atmosphere is collected by the Earth's gravitational pull. (I think particles in outer space would technically not constitute a gas, but their velocities are AFAIK long-tailed due to coming from quasars and such.)
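
As a rough numerical illustration of what "long-tailed relative to the particle velocities" means (an illustrative sketch with arbitrary parameters, not part of the argument): thermal particle speeds are comparatively thin-tailed, while a lognormal stand-in for the creation procedure puts far more of its mass in extreme values.

```python
# Illustrative sketch (arbitrary parameters): compare tail heaviness of
# Maxwell-Boltzmann particle speeds with a lognormal "creation" variable.
import numpy as np

rng = np.random.default_rng(0)
n = 1_000_000

# Maxwell-Boltzmann speeds: the norm of an isotropic 3D Gaussian velocity.
speeds = np.linalg.norm(rng.normal(size=(n, 3)), axis=1)

# Lognormal stand-in for the long-tailed creation procedure.
creation = rng.lognormal(mean=0.0, sigma=1.5, size=n)

for name, sample in [("Maxwell-Boltzmann speeds", speeds), ("lognormal creation", creation)]:
    ratio = np.quantile(sample, 0.9999) / np.median(sample)
    print(f"{name}: 99.99th percentile / median = {ratio:.1f}")
# The lognormal's extreme quantiles sit far further from its median,
# i.e. the "creation" variable is the long-tailed one.
```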

Generally you wouldn't since it's busy using that matter/energy for whatever you asked it to do. If you wanted to use it, presumably you could turn down its intensity, or maybe it exposes some simplified summary that it uses to coordinate economies of scale.

Once you start getting involved with governance, you're going to need law enforcement and defense, which is an adversarial context and thus means the whole instrumental-goal-niceness argument collapses.

If you're assuming that verification is easier than generation, you're pretty much a non-player when it comes to alignment.

I'm not interested in your key property; I'm interested in a more proper end-to-end description. Like superficially this just sounds like it immediately runs into the failure mode John Wentworth described last time, but your description is kind of too vague to say for sure.

I was considering doing something like this, but I kept getting stuck on the issue that gradients don't seem to be an accurate attribution method. Have you tried comparing the attribution made by the gradients to a more straightforward attribution based on the counterfactual of enabling vs disabling a network component, to check how accurate they are? I guess I would especially be curious about its accuracy on real-world data, even if that data is relatively simple.
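
Concretely, the comparison I have in mind is something like the following (a minimal sketch with a made-up toy model, not your setup; zero-ablation is just one choice of counterfactual):

```python
# Toy sketch (assumed architecture): gradient attribution vs. the
# counterfactual of zero-ablating each hidden unit.
import torch
import torch.nn as nn

torch.manual_seed(0)
front = nn.Sequential(nn.Linear(8, 16), nn.GELU())                    # produces the units we attribute
back = nn.Sequential(nn.Linear(16, 16), nn.GELU(), nn.Linear(16, 1))  # nonlinear downstream computation
x = torch.randn(64, 8)

hidden = front(x)  # (64, 16) activations of the components of interest
hidden.retain_grad()
back(hidden).sum().backward()

# Gradient attribution: grad * activation per unit, summed over the batch.
grad_attr = (hidden.grad * hidden).sum(dim=0).detach()

# Counterfactual attribution: change in output from disabling (zeroing) each unit.
with torch.no_grad():
    base = back(hidden).sum()
    ablation_attr = torch.empty(16)
    for j in range(16):
        ablated = hidden.clone()
        ablated[:, j] = 0.0
        ablation_attr[j] = base - back(ablated).sum()

# How well do the two attribution methods agree across units?
corr = torch.corrcoef(torch.stack([grad_attr, ablation_attr]))[0, 1]
print(f"correlation(gradient attribution, ablation attribution) = {corr.item():.3f}")
```

With a purely linear readout the two agree exactly, so the interesting case is a model with nonlinear computation downstream of the component, and ideally real data rather than random inputs.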

I don't understand your whole end-to-end point, like how does this connect to making AIs produce texts on alignment, and how does that lead to a pivotal act?

For the former, I'd need to hear your favorite argument in favor of the neurosis that inner alignment is a major problem.

For the latter, in the presence of adversaries, every subgoal has to be robust against those adversaries, which is very unfriendly.
