In the List of Lethalities, it seems that the two biggest ones are:
- A.3 We need to get alignment right on the 'first critical try' at operating at a 'dangerous' level of intelligence, where unaligned operation at a dangerous level of intelligence kills everybody on Earth and then we don't get to try again.
- B.1.10 On anything like the standard ML paradigm, you would need to somehow generalize optimization-for-alignment you did in safe conditions, across a big distributional shift to dangerous conditions.
My understanding is that interpretability is currently tackling the second one. But what about the first one?
It seems a bit tricky because it is a powerful outside-view argument. It is incredibly rare for software to work on the first test. ML makes it even more difficult, since it isn't well suited to formal verification. Even defense in depth seems unlikely to work (on the first critical try, there is likely only one system that is situationally aware). The only thing I can think of is making the AGI smart enough to take over the world with the help of its creators, but not smart enough to do so on its own or to solve its own alignment problem (i.e. it does not know how to improve itself without goal drift). I also suppose non-critical tries give some data, but is it enough?
What does the playing field for the first critical try look like?
You are making a type error. It's like saying "Sure, your transformer arch worked ok on Wikipedia, but obviously it won't work on news articles ... or the game of Go ... or videos ... or music."
This is discussed in the article and in the comments. The point of the simulations is to test general, universal architectures for alignment - the equivalent of the transformer arch. Your use of the term 'OOD' implies you are thinking we train some large agent model in sim and then deploy it in reality. That isn't the proposal.
The simulations do not need to "match reality", because intelligence and alignment are strongly universal and certainly don't depend on the presence of technology like computers. Simulations can test many different scenarios/environments, which is useful for coverage of the future.
As an example: the human genome hasn't changed much in the last 2,000 years, and yet inter-human brain alignment (love/altruism/empathy, etc.) works about as well now as it did 2,000 years ago. That is a strict lower bound that AGI alignment architectures can improve on.