I recently gave a two-part talk on the big picture of alignment, as I see it. The talk is not-at-all polished, but contains a lot of stuff for which I don't currently know of any good writeup. Major pieces in part one:
- Some semitechnical intuition-building for high-dimensional problem-spaces.
- Optimization compresses information "by default" (a toy demonstration follows this list).
- Resources and "instrumental convergence" without any explicit reference to agents
- A frame for thinking about the alignment problem which only talks about high-dimensional problem-spaces, without reference to AI per se.
  - The central challenge is to get enough bits-of-information about human values to narrow down a search-space to solutions compatible with human values. (A minimal version of this counting argument is also sketched after the list.)
  - Details like whether an AI is a singleton, tool AI, multipolar, oracle, etc. are mostly irrelevant.
- Fermi estimate: just how complex are human values?
- Coherence arguments, presented the way I think they should be done.
  - Also subagents!
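On the "optimization compresses information" point, here's a toy demonstration of the kind of thing I have in mind (my own sketch, not code from the talk; the quadratic objective, step size, and measurement resolution are arbitrary choices): gradient descent maps many distinct initial conditions onto essentially one final state, so the optimized state retains almost no information about where the system started.

```python
# Toy sketch (not from the talk): optimization maps many distinct initial
# states onto a few final states, i.e. it throws away bits about the start.
import numpy as np

rng = np.random.default_rng(0)
x0 = rng.uniform(-10, 10, size=10_000)  # 10k random initial conditions
x = x0.copy()

# Gradient descent on f(x) = x^2 (gradient is 2x).
for _ in range(200):
    x -= 0.1 * (2 * x)

def distinct_states(v, resolution=1e-3):
    """Count distinguishable states at a fixed measurement resolution."""
    return len(np.unique(np.round(v / resolution)))

before, after = distinct_states(x0), distinct_states(x)
print(f"distinct states before: {before}, after: {after}")
print(f"~{np.log2(before / after):.1f} bits about the initial condition destroyed")
```

The exact numbers don't matter; the point is that the before-to-after map is many-to-one, so the distribution over states loses entropy as optimization proceeds.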
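And for the bits-of-information framing, a minimal way to make the counting concrete (again my own sketch; $N$ and $k$ are stand-in symbols, not numbers from the talk): if the search space $S$ contains $2^N$ candidate solutions and only a subset $S_{\text{good}}$ of $2^{N-k}$ of them is compatible with human values, then any process which reliably lands in $S_{\text{good}}$ must use at least

$$k \;=\; \log_2 \frac{|S|}{|S_{\text{good}}|}$$

bits of information about human values, since each bit can at best cut the remaining space in half.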
Note that I don't talk about timelines or takeoff scenarios; this talk is just about the technical problem of alignment.
Here's the video for part one:
Big thanks to Rob Miles for editing! Also, the video includes some good questions and discussion from Adam Shimi, Alex Flint, and Rob Miles.
>The central challenge is to get enough bits-of-information about human values to narrow down a search-space to solutions compatible with human values.
This sounds like a fundamental disagreement with Yudkowsky's view. (I think) Yudkowsky thinks the hardest part of alignment is getting an AGI to do any particular specified thing at all (any thing that requires superhuman general intelligence), whatever it may be, because by default an AGI will optimize hard for something no programmer had in mind; on that view the problem is not primarily about pointing at particular values. Do you recognize this as a disagreement, and what do you think of it? Do you think aiming at anything at all is not that hard, or that it isn't usefully separated from pointing at human values?
In that case, it's not about human values, which is one of the very nice things the natural abstraction hypothesis buys us.