Epistemic status: I am just learning about alignment and have just read Human Compatible. Below is a summary of the paradigm Russell outlines for aligning AI in the last third of the book, and the questions I have about this project as a new reader.
AI researcher Stuart Russell’s 2019 book Human Compatible is one of the most popular and widely circulated books on AI alignment right now. It argues that we need to change the paradigm of AI development in general from what he calls goal-directed behavior, in which a machine optimizes a reward function written by humans, to behavior that attempts to learn and then follow human objectives.
Russell provides a set of general design principles to guide this new paradigm. As I understand them from the book, they are roughly: (1) the machine’s only objective is to maximize the realization of human preferences; (2) the machine is initially uncertain about what those preferences are; and (3) the ultimate source of information about human preferences is human behavior.
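To make the second and third principles concrete for myself, here is a toy sketch (my own illustration, not code from the book) of the preference-learning idea: the machine starts uncertain over a few candidate human reward functions and does a Bayesian update on its beliefs by watching which action the human chooses, under a standard Boltzmann-rational model of human choice. The candidate reward functions and actions are made up for the example.

```python
import math

# Hypothetical reward functions the human might have over three actions.
candidate_rewards = {
    "likes_coffee": {"coffee": 1.0, "tea": 0.0, "water": 0.2},
    "likes_tea":    {"coffee": 0.0, "tea": 1.0, "water": 0.2},
}

# Uniform prior: the machine is initially uncertain about human preferences.
belief = {h: 1.0 / len(candidate_rewards) for h in candidate_rewards}

def likelihood(action, rewards, beta=2.0):
    """Boltzmann-rational model: humans pick higher-reward actions more often."""
    z = sum(math.exp(beta * r) for r in rewards.values())
    return math.exp(beta * rewards[action]) / z

def update(belief, action):
    """Bayesian update over reward hypotheses after observing one human action."""
    posterior = {h: belief[h] * likelihood(action, candidate_rewards[h])
                 for h in belief}
    total = sum(posterior.values())
    return {h: p / total for h, p in posterior.items()}

# Observing the human choose tea shifts belief toward the "likes_tea" hypothesis,
# so human behavior is the machine's source of information about preferences.
belief = update(belief, "tea")
```

The point of the sketch is just that the objective is never hard-coded: the machine acts under persistent uncertainty and lets observed behavior move its beliefs.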
TBH, my naive thought is that if John's project succeeds it will solve most of what I think of as the hard part of alignment, so it seems like one of the more promising approaches to me. But on my model of the world, it seems quite unlikely that there are natural abstractions in the way that John seems to think there are.