+1 for interesting investigations. I want to push back on your second point, though - the framing of the problem of high-level distributional shift. I don't think this actually captures the core thing we're worried about. For example, we can imagine a model that remains in the same environment, but becomes increasingly intelligent during training, until it realises that it has the option of doing a treacherous turn. Or we can think about the case of humans - the core skills and goals that make us dangerous to other species developed in our ancestral environment, which led to us changing our own environments. So the distributional shift was downstream of the underlying problem.
Also, in the real world, everything undergoes distributional shift all the time, so the concept doesn't narrow things down.
Thanks, Richard!
I do think both of those cases fit into the framework fine (unless I'm misunderstanding what you have in mind):
In other words, if we imagine a model misbehaving in the wild, I think it'll usually either be the case that (1) it behaved that way during training but we didn't notice the badness (evaluation breakdown), or (2) we didn't train it on a similar enough situation (high-level distribution shift).
As we move further away from standard DL training practices, we could see failure modes that don't fit into these two categories -- e.g. there could be some bad fixed-point behaviors in amplification that aren't productively thought of as "evaluation breakdown" or "high-level distribution shift." But these two categories do seem like the most obvious ways that current DL practice could produce systematically harmful behavior, and I think they take up a pretty large part of the space of possible failures.
(ETA: I want to reiterate that these two problems are restatements of earlier thinking, esp. by Paul and Evan, and not ideas I'm claiming are new at all; I'm using my own terms for them because "inner" and "outer" alignment mean different things to different people.)
(Short low-effort reply since we'll be talking soon.)
"we don't visit those situations during training, but they do in fact come up in practice (distribution shift)"
If you're using this definition of distributional shift, then isn't any catastrophic misbehaviour a distributional shift problem by definition, since the agent didn't cause catastrophes in the training environment?
In general I'm not claiming that distributional shift isn't happening in the lead-up to catastrophes; I'm denying that it's an interesting way to describe what's going on. An unfair straw analogy: it feels kinda like saying "the main problem in trying to make humans safe is that some humans might live in different places now than we did when we evolved. Especially harmful behaviour could occur under big locational shifts". Which is... not wrong, most dangerous behaviour doesn't happen in sub-Saharan Africa. But it doesn't shed much light on what's happening: the danger is being driven by our cognition, not by high-level shifts in our environments.
I've been clarifying my own understanding of the alignment problem over the past few months, and wanted to share my first writeups with folks here in case they're useful:
https://www.danieldewey.net/risk/
The site currently has 3 pages:
None of the ideas on the site are particularly new, and as I note, they're not consensus views. But the version of the basic case I lay out on the site is very short, doesn't have a lot of outside dependencies, and is put together out of nuts-and-bolts arguments that I think will be useful as a starting point for alignment work. I'm particularly hoping to avoid semantic arguments about "what counts as" inner vs outer alignment, optimization, agency, etc., in favor of more mechanical statements of how models could behave in different situations.
I think some readers on this forum will already have been thinking about alignment this way, and won't get much that's new out of the site; some (like me) will find it to be a helpful distillation of some of the major arguments that have come out over the past ~5 years; and some will have disagreements (which I'm curious to hear about).
I thought about posting all of this directly on the Alignment Forum / LessWrong, but ultimately decided I wanted a dedicated home for these ideas.
Out of everything on the site, the part I'm most hoping will be helpful to you is my (re)statement of two main problems in AI alignment. These map roughly onto outer and inner alignment, though different people use those terms differently, so not everyone will agree:
What's next? Ultimately, I'm hoping to figure out what kinds of research projects are most likely to produce forward progress towards training methods that avoid evaluation breakdown and high-level distribution shift. A world where we're making clear year-over-year progress towards these goals looks achievable to me.