I recently gave a two-part talk on the big picture of alignment, as I see it. The talk is not-at-all polished, but contains a lot of stuff for which I don't currently know of any good writeup. Major pieces in part one:
- Some semitechnical intuition-building for high-dimensional problem-spaces.
- Optimization compresses information "by default"
- Resources and "instrumental convergence" without any explicit reference to agents
- A frame for thinking about the alignment problem which only talks about high-dimensional problem-spaces, without reference to AI per se.
- The central challenge is to get enough bits-of-information about human values to narrow down a search-space to solutions compatible with human values.
- Details like whether an AI is a singleton, tool AI, multipolar, oracle, etc are mostly irrelevant.
- Fermi estimate: just how complex are human values?
- Coherence arguments, presented the way I think they should be done.
- Also subagents!
Note that I don't talk about timelines or takeoff scenarios; this talk is just about the technical problem of alignment.
Here's the video for part one:
Big thanks to Rob Miles for editing! Also, the video includes some good questions and discussion from Adam Shimi, Alex Flint, and Rob Miles.
Cheers for posting! I've got a question about the claim that optimizers compress by default, due to the entropy maximization-style argument given around 20:00 (apologies if you covered this, it's not easy to check back through a video):
Let's say that we have a neural network of width 100, trained on a dataset that a network of width only 30 could fit to perfect accuracy. If training compresses the solution into only 30 weights, there's a 70-dimensional space of free parameters, and we should expect a randomly selected solution to be of this kind.
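As a rough linear analogue of this counting argument (my own toy setup, not the construction from the talk): take 100 weights and suppose the data only imposes 30 independent linear constraints. The sizes, the random data, and the names `X`, `y`, `w0` are all assumptions for illustration. The zero-loss set then has 70 free directions, and moving along any of them leaves the loss at zero.

```python
import numpy as np

rng = np.random.default_rng(0)

n_params = 100      # stand-in for the "width 100" network
n_constraints = 30  # stand-in for what the data actually pins down

# Hypothetical toy data: each row of X is one linear constraint on the weights.
X = rng.normal(size=(n_constraints, n_params))
y = rng.normal(size=n_constraints)

# One zero-loss solution (lstsq returns the minimum-norm one here).
w0, *_ = np.linalg.lstsq(X, y, rcond=None)

# Directions that leave every prediction unchanged: the null space of X.
rank = np.linalg.matrix_rank(X)
_, _, vt = np.linalg.svd(X)       # vt is (n_params, n_params)
null_basis = vt[rank:]            # shape (n_params - rank, n_params)
free_dims = null_basis.shape[0]

# Moving along a random combination of free directions keeps the loss at zero.
w1 = w0 + null_basis.T @ rng.normal(size=free_dims)
loss0 = float(np.mean((X @ w0 - y) ** 2))
loss1 = float(np.mean((X @ w1 - y) ** 2))
print(free_dims, loss0, loss1)
```

This only shows the size of the zero-loss set in a linear model; it says nothing about which point in that set a training procedure would actually reach.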
I agree that if we randomly sample zero-loss weight configurations, we end up with this kind of compression, but it seems that any kind of learning we know how to do is dependent on the paths that one can take to reach it, and that abstracting this away can give very different results to any high-dimensional optimization that we actually know how to do.
Assuming that the network is parameterized by, say, float16s, maximal compression of the data would make the network's output sensitive to the final bit of the weights in as many cases as possible, thereby leaving the largest number of free bits. So 16 bits of info would be compressed into one weight, rather than spread among 3-4.
My intuition is that these highly compressed arrangements would be very sensitive to perturbations, which would render them incredibly difficult to reach in practice (they would also have a big problem with unseen examples, and are therefore screened off by techniques like dropout and regularization). There is therefore a competing incentive towards minima which are easy to land on - probably flat minima surrounded by areas of relatively good performance. Further, I expect that these kinds of minima tend to leverage the whole network for redundancy and flatness (not needing to depend tightly on the final bit of weights).
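To get a feel for how small the "final bit" is: float16 has 10 explicit mantissa bits, so flipping the last one changes a normal-range weight by somewhere between 2^-11 and 2^-10 of its value, i.e. under one part in a thousand. A minimal sketch, with 0.3 as an arbitrary example weight:

```python
import numpy as np

w = np.array([0.3], dtype=np.float16)

# Reinterpret the 16 bits as an integer and flip the least significant mantissa bit.
bits = w.view(np.uint16)
w_flipped = (bits ^ np.uint16(1)).view(np.float16)

# Relative size of the change caused by that single bit flip.
rel_change = abs(float(w_flipped[0]) - float(w[0])) / abs(float(w[0]))
print(float(w[0]), float(w_flipped[0]), rel_change)
```

A solution that stores information in that bit needs the network's output to change meaningfully under a perturbation of this size.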
The properties of the solutions found would then be not just compression but some combination of compression and smoothness (smoothness being sort of a variant of compression where the final bits don't matter much), which would not result in some subset of the parameters holding all the useful information.
If you agree that this is what happens, in what sense is there really compression, if the info is spread among multiple bits? Perhaps given the structure of NNs, we should expect to be able to compress by removing the last bits of weights as these are the easiest to leave free given the structure of training?
If you disagree, I'd be curious to know where. I sense that Mingard et al. share your conclusion, but I don't yet understand the claimed empirical demonstration.
tldr: optimization may compress by default, but learning seems to counteract this by choosing easy-to-find minima.
Yeah, I get that that's what Mingard et al. are trying to show, but the meaning of their empirical results isn't clear to me. I'll try to properly read the actual paper, rather than the blog post, before saying any more in that direction.