I recently gave a two-part talk on the big picture of alignment, as I see it. The talk is not at all polished, but it contains a lot of stuff for which I don't currently know of any good writeup. Major pieces in part one:
- Some semitechnical intuition-building for high-dimensional problem-spaces.
- Optimization compresses information "by default".
- Resources and "instrumental convergence" without any explicit reference to agents.
- A frame for thinking about the alignment problem which only talks about high-dimensional problem-spaces, without reference to AI per se.
  - The central challenge is to get enough bits-of-information about human values to narrow down a search-space to solutions compatible with human values. (There's a toy version of this calculation just after the list.)
  - Details like whether an AI is a singleton, tool AI, multipolar, oracle, etc. are mostly irrelevant.
- Fermi estimate: just how complex are human values?
- Coherence arguments, presented the way I think they should be done.
  - Also subagents!
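To spell out the bits-of-information framing from the list: one bit of information cuts a search space roughly in half, so the number of bits you need is just the log-ratio of the full solution space to the value-compatible subset. Here's a toy version of that arithmetic in Python; the space sizes are placeholders for illustration, not my actual Fermi estimate.

```python
import math

# Toy arithmetic for the "bits of information narrow down a search space"
# framing. The space sizes below are placeholders for illustration, not an
# actual estimate of how complex human values are.

def bits_needed(log2_space_size: float, log2_target_size: float) -> float:
    """Bits required to narrow a space of 2**log2_space_size candidates
    down to a target set of 2**log2_target_size candidates. Each bit of
    information cuts the candidate set roughly in half, so the answer is
    just the difference of the logs."""
    return log2_space_size - log2_target_size

# Warm-up: singling out one card from a 52-card deck takes log2(52) bits.
print(bits_needed(math.log2(52), 0.0))   # ~5.7 bits

# Placeholder alignment-flavored version: if value-compatible solutions make
# up 2**100 points of a 2**1_000_000-point solution space, we'd need roughly
# a million bits of information about human values to pin them down.
print(bits_needed(1_000_000, 100))       # 999900 bits
```

The point of the sketch is just that "how complex are human values?" cashes out as "how many halvings of the search space do we need?".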
Note that I don't talk about timelines or takeoff scenarios; this talk is just about the technical problem of alignment.
Here's the video for part one:
Big thanks to Rob Miles for editing! Also, the video includes some good questions and discussion from Adam Shimi, Alex Flint, and Rob Miles.
> Like, ability-to-narrow-down-a-search-space-or-behavior-space-by-a-factor-of-two is what a bit of information is.
Information is an upper bound, not a lower bound. The capacity of a channel gives you an upper bound on how many distinct messages you can send, not a lower bound on your performance on some task using messages sent over the channel. If you have a very high info-capacity channel with someone who speaks a different language from you, you don't have an informational problem; you have some other problem (a translation problem).
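To make the "upper bound" sense explicit, this is just the (weak) converse of the channel coding theorem, stated informally: over $n$ uses of a channel with capacity $C$, any code that distinguishes $M$ messages with vanishing error probability must satisfy

$$\log_2 M \;\le\; nC + n\epsilon_n, \qquad \epsilon_n \to 0,$$

i.e. capacity bounds how many distinctions *can* get across the channel; nothing in it guarantees that sender and receiver share a code mapping those distinctions onto the task you care about.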
> If we can't use the information to narrow down a search space closer to the thing-the-information-is-supposedly-about, then we don't actually have any information about that thing.
This seems to render the word "information" equivalent to "what we know how to do", which is not the technical meaning of information. Do you mean to do that? If so, why? It seems like a misframing of the problem: what's hard about the problem is that you don't know how to do something, and don't know how to gather data about how to do that thing, because you don't have a clear space of possibilities with a set of clear, observable implications that shatters those possibilities. When you don't know how to do something and don't have a clear space of possibilities, the sorts of progress you want to make aren't fungible with each other the way information is fungible with other information.
[ETA: Like, if the space in question is the space of which "human values" is a member, then I'm saying, our problem isn't locating human values in that space, our problem is that none of the points in the space are things we can actually implement, because we don't know how to give any particular values to an AGI.]