I recently gave a two-part talk on the big picture of alignment, as I see it. The talk is not-at-all polished, but contains a lot of stuff for which I don't currently know of any good writeup. Linkpost for the first part is here; this linkpost is for the second part.
Compared to the first part, less of the second part's material is new (i.e. not already written up elsewhere), though it does a better job of tying everything into the bigger picture than any existing writeup. I will link to relevant posts in the outline below.
Major pieces in part two:
- Programs as a compressed representation for large (potentially infinite) probabilistic causal models with symmetry
  - Potentially allows models of worlds larger than the data structure representing the model, including models of worlds in which the model itself is embedded.
  - Can't brute-force evaluate the whole model; it must be a lazy data structure with efficient methods for inference (see the sketch after this outline)
- The Pointers Problem: the inputs to human values are latent variables in humans' world models
  - This is IMO the single most important barrier to alignment
- Other aspects of the "type signature of human values" problem (just a quick list of things which I'm not really the right person to talk about)
- Abstraction (a.k.a. ontology identification)
  - Three roughly-equivalent models of natural abstraction
- Summary (around 1:30:00 in video)
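To make the first bullet a bit more concrete, here's a minimal sketch (my own illustration, not taken from the talk) of how a short program can stand in for an infinite probabilistic causal model, evaluated lazily so that only the variables a query actually touches ever get sampled. The particular variable `x` and its dynamics are purely hypothetical choices for illustration.

```python
import random
from functools import lru_cache

random.seed(0)

@lru_cache(maxsize=None)  # memoize: each X_t gets sampled at most once
def x(t: int) -> float:
    """Variable X_t in an infinite causal chain X_0 -> X_1 -> X_2 -> ...
    The same few lines define every X_t; that repeated structure (symmetry)
    is what lets a small program compress an infinite model."""
    if t == 0:
        return random.gauss(0.0, 1.0)
    return 0.9 * x(t - 1) + random.gauss(0.0, 0.1)

# Queries only force the finite part of the model they actually depend on.
print(x(5))  # samples X_0 through X_5 and nothing else
print(x(3))  # no new sampling needed; already cached
```

The same trick is what makes it possible, in principle, for the model to describe a world larger than the data structure representing it: the program stays small even though the model it defines is infinite.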
I ended up rushing a bit on the earlier parts in order to go into detail on abstraction. That was optimal for the group I was presenting to at the time, but probably not for most people reading this. Sorry.
Here's the video:
Again, big thanks to Rob Miles for editing! (Note that the video had some issues - don't worry, the part where the camera goes bonkers and adjusts the brightness up and down repeatedly does not go on for very long.) The video includes some good questions and discussion from Adam Shimi, Alex Flint, and Rob Miles.