I recently gave a two-part talk on the big picture of alignment, as I see it. The talk is not at all polished, but it contains a lot of stuff for which I don't currently know of any good writeup. Major pieces in part one:
- Some semitechnical intuition-building for high-dimensional problem-spaces.
- Optimization compresses information "by default"
- Resources and "instrumental convergence" without any explicit reference to agents
- A frame for thinking about the alignment problem which only talks about high-dimensional problem-spaces, without reference to AI per se.
  - The central challenge is to get enough bits-of-information about human values to narrow down a search-space to solutions compatible with human values (see the toy worked example after this list).
  - Details like whether an AI is a singleton, tool AI, multipolar, oracle, etc. are mostly irrelevant.
- Fermi estimate: just how complex are human values?
- Coherence arguments, presented the way I think they should be done.
  - Also subagents!
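To make the bits-of-information framing concrete, here is a toy worked example (my own illustration, not from the talk, with made-up numbers): each bit of information about human values cuts the remaining search space roughly in half, so narrowing a space of N candidates down to K compatible solutions takes about log2(N/K) bits.

```python
import math

# Toy illustration (not from the talk; all numbers made up): treating
# alignment as a search problem, each bit of information about human values
# halves the set of remaining candidate solutions.
N = 2**40   # hypothetical size of the search space
K = 1       # hypothetical number of solutions compatible with human values

# To narrow N candidates down to the K compatible ones, you need about
# log2(N / K) bits of information about human values.
print(f"bits needed: {math.log2(N / K):.1f}")   # -> 40.0

# With a bigger space and a bigger target, the same formula applies:
N, K = 10**12, 10**3
print(f"bits needed: {math.log2(N / K):.1f}")   # -> ~29.9
```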
Note that I don't talk about timelines or takeoff scenarios; this talk is just about the technical problem of alignment.
Here's the video for part one:
Big thanks to Rob Miles for editing! Also, the video includes some good questions and discussion from Adam Shimi, Alex Flint, and Rob Miles.
Yeah, I get that's what Mingard et al. are trying to show, but the meaning of their empirical results isn't clear to me. I'll try to properly read the actual paper, rather than just the blog post, before saying any more in that direction.
I get that a truly flat area is synonymous with compression, but I think being surrounded by areas of good performance is anti-correlated with compression, because it indicates redundancy and less-than-maximal sensitivity.
I agree that viewing it as flat eigendimensions in parameter space is the right way to think about it, but I still worry that the same concern applies: maximal compression in this space is traded against the ease of finding a region that is a flat plain along many dimensions but a maximally steep ravine along all of the other directions. I can imagine this being investigated with some small experiments (a rough sketch of one is below), or such experiments may well already exist. I can't promise I'll follow up, but if anyone is interested, let me know.
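For what it's worth, here is a rough sketch of the kind of small experiment I have in mind (entirely hypothetical and my own, not anything from Mingard et al.): construct a toy loss whose minimum is nearly flat along some parameter directions and steep along the rest, then read the flat/steep split off the Hessian eigenvalue spectrum at the minimum. On a real network you would use the actual training loss and Hessian-vector products rather than a dense finite-difference Hessian.

```python
import numpy as np

# Hypothetical sketch (mine, not an existing experiment): build a toy loss
# whose minimum is a "flat plain" along some directions and a "steep ravine"
# along the rest, then read that split off the Hessian eigenvalue spectrum.
N_FLAT, N_STEEP = 8, 8

def loss(theta):
    # Nearly flat in the first N_FLAT coordinates, steep in the rest.
    return 1e-4 * np.sum(theta[:N_FLAT] ** 2) + np.sum(theta[N_FLAT:] ** 2)

def numerical_hessian(f, theta, eps=1e-4):
    # Dense finite-difference Hessian; fine at toy scale. A real experiment
    # would use Hessian-vector products on the actual training loss instead.
    n = len(theta)
    H = np.zeros((n, n))
    basis = np.eye(n) * eps
    for i in range(n):
        for j in range(n):
            H[i, j] = (f(theta + basis[i] + basis[j]) - f(theta + basis[i])
                       - f(theta + basis[j]) + f(theta)) / eps ** 2
    return H

theta_star = np.zeros(N_FLAT + N_STEEP)   # the minimum of the toy loss
eigvals = np.linalg.eigvalsh(numerical_hessian(loss, theta_star))

# Near-zero eigenvalues are the flat eigendimensions; large eigenvalues are
# the ravine walls. The counts should come out as N_FLAT and N_STEEP.
print("eigenvalues:", np.round(eigvals, 6))
print("flat directions:", int(np.sum(eigvals < 1e-2)),
      "| steep directions:", int(np.sum(eigvals >= 1e-2)))
```

The interesting follow-up question is then how the fraction of flat eigendimensions at minima actually found by SGD trades off against how easy those minima are to find.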