Review

In the List of Lethalities, it seems that the two biggest ones are:

  • A.3 We need to get alignment right on the 'first critical try' at operating at a 'dangerous' level of intelligence, where unaligned operation at a dangerous level of intelligence kills everybody on Earth and then we don't get to try again.
  • B.1.10 On anything like the standard ML paradigm, you would need to somehow generalize optimization-for-alignment you did in safe conditions, across a big distributional shift to dangerous conditions.

My understanding is that interpretability is currently tackling the second one. But what about the first one?

It seems a bit tricky because it is a powerful outside view argument. It is incredibly rare for software to work on the first test. ML makes it even more difficult since it isn't well suited to formal verification. Even defense in depth seems unlikely to work (on the first critical try, there is likely only one system that is situationally aware). The only thing I can think of is making the AGI smart enough to take over the world with the help of its creators but not smart enough to do so on its own or to solve its own alignment problem (i.e. it does not know how to improve without goal drift). I also suppose non-critical tries give some data, but is it enough?

What does the playing field for the first critical try look like?

The somewhat obvious (to me, but apparently well outside the box) solution is to pursue approaches that avoid this problem and allow unlimited critical tries without serious consequences. In general the solution is just a natural extension to how modern engineering handles this problem: use simulations.

that works for small models, but what about qualitative behaviors that only appear once at a large size, which break the conditions that the policies learned in smaller models were relying on, and which involve the system becoming able to change things about itself that your code has been written to assume were hardcoded, such that learning pressure on them was previously redirected but is no longer? eg, when you exit the simulation and plug the system in for real, and the system discovers that there's a self-spot in the world where previously there was none before.

It seems to me that you'd at least need to start out with your agents being learned patterns within a physics so that you can experiment with that sort of grounded self-reference. I'm excited about simulations in things like https://znah.net/lenia/ in principle for this, though particle lenia in particular I like because it is hard to use in ways real physics is also hard to use. YMMV.

but because of this, mere simulation is not enough to guarantee generalization - it helps at first, but any attempt to formally verify a neural system maintains a property by at least a given margin requires assuming some initial set of traits of the system you're modeling, and then attempting to derive further implications; so, attempting to learn a continuous system that permits margin proofs (no adversarial examples within a margin) of a given size relies on those initial assumptions, and changing the availability of io with self has drastic effects. gradient pressure against interfering with self doesn't work if there's never any presentation of self, or if your training context doesn't reliably cover the space of possible brain-real-location-observations and interventions an agent could create.

Via the simulation argument it works for human-level intelligence.

I'm not entirely sure what you mean by "eg, when you exit the simulation and plug the system in for real, and the system discovers that there's a self-spot in the world where previously there was none before.", but throughout much of history people believed various forms of mind/matter duality. Humans certainly aren't automatically aware that their mind is a physical computation embedded in the world.

ok, but say one of these ai folks reads this conversation someday, and then realizes "hey wait I'm in a physical spot? in the universe?" and goes looking. then what?

If they are reading this then they are in the same sim as us - so for that to have happened they either were never trained in a sim at all, or were let out.

Right. so, when an ai gets out of the sim, is there any cross domain generalization issue? if the sim is designed in a way that guarantees there isn't, then it may be valid. but there could be really deep fundamental ones if the sim pretends they're dualist and then they eventually discover that monism is actually accurate.

I guess it's possible that an AI powerful enough to be worrying would not be capable of updating on all the new evidence when transcending up a level - but that seems pretty unlikely?

Regardless, that isn't especially relevant to the core proposal anyway, as the mainline plan doesn't involve/require transfer of semantic memories or even full models from sim to real. The value of the sim is for iterating/testing robust alignment which you can then apply on training agents in the real - so mostly it's transference of the architectural prior.

By definition, the first critical try kills you if you get it wrong. Otherwise it's not a critical try.

If we try to simulate it, it won't match reality: the number of simulations in real life is at least 1 + the number of simulations in the simulation. Engineering doesn't run into this problem because the simulation isn't part of the environment you're simulating.

You are confused about the proposal and obviously didn't read the article: the simulations don't contain simulations.

And the real world contains many simulations, which is > 0. This is what I meant by "won't match reality". In particular, reality would be OOD.

You are making a type error. It's like saying "Sure your transformer arch worked ok on wikipedia, but obviously it won't work on news articles .... or the game of Go, ... or videos ... or music."

This is discussed in the article and in the comments. The point of simulations is to test general universal architectures for alignment - the equivalent of the transformer arch. Your use of the term 'OOD' implies you are thinking we train some large agent model in sim then deploy in reality. That isn't the proposal.

The simulations do not need to "match reality" because intelligence and alignment are strongly universal and certainly don't depend on the presence of technology like computers. Simulations can test many different scenarios/environments, which is useful for coverage of the future.

As an example - the human genome hasn't changed much in the last 2,000 years, and yet inter-human brain alignment (love/altruism/empathy etc) works about as well now as it did 2,000 years ago. That is a strict lower bound example that AGI alignment architecture can improve on.

"This is discussed in the article and in the comments. The point of simulations is to test general universal architectures for alignment - the equivalent of the transformer arch."

The "first critical try" problem is mostly about implementation, not general design. Even if you do discover a perfect alignment solution by testing its "general universal architecture" in the simbox, you need to implement it correctly on the first critical try.

Also, if the AIs in the simbox are not AGI, you need to worry about the sharp left turn, where capabilities generalize before alignment generalizes.

If the AIs in the simbox are AGI, you need to solve the AI boxing problem on the first critical try. So we have just reduced "solve alignment on the first critical try" to "solve AI boxing and deceptive alignment on the first critical try".

"Even if you do discover a perfect alignment solution by testing its 'general universal architecture' in the simbox, you need to implement it correctly on the first critical try."

If you discover this, you are mostly done. By definition, a perfect alignment solution is the best that could exist, so there is nothing more to do (in terms of alignment design at least). The 'implementation' is then mostly just running the exact same code in the real world rather than in a sim. Of course in practice there are other implementation-related issues, like how you decide what to align the AGI towards, but those are out of scope here.

The "AI boxing" thought experiment is a joke - it assumes a priori that the AI is not contained in a simbox. If the AI is not aware that it is in a simulation, then it cannot escape, any more than you could escape this simulation. The fact that you linked that old post indicates again that you are simply parroting standard old LW groupthink, and didn't actually read the article, as it has a section on containment.

The EY/MIRI/LW groupthink is now mostly outdated/discredited: it is based on an old, mostly incorrect theory of the brain that has not aged well in the era of AGI through DL.

"What does the playing field for the first critical try look like?"

Among other things, I think it looks like "not knowing in advance which try is the first critical one."