Richard Ngo writes
Since Stuart Russell's proposed alignment solution in Human Compatible is the most publicly-prominent alignment agenda, I should be more explicit about my belief that it almost entirely fails to address the core problems I expect on realistic pathways to AGI.
Specifying an update rule which converges to a desirable goal is just a reframing of the problem of specifying a desirable goal, with the "uncertainty" part as a red herring. https://arbital.com/p/updated_deference/… In other words, Russell gives a wrong-way reduction.
I originally included CIRL in my curriculum (https://docs.google.com/document/d/1mTm_sT2YQx3mRXQD6J2xD2QJG1c3kHyvX8kQc_IQ0ns/edit?usp=drivesdk…) out of some kind of deferent/catering to academic mainstream instinct. Probably a mistake; my current annoyance about deferential thinking has reminded me to take it out.
Howie writes:
My impression is that ~everyone I know in the alignment community is very pessimistic about SR's agenda. Does it sound right that your view is basically a consensus? (There's prob some selection bias in who I know).
Richard responds:
I think it's fair to say that this is a pretty widespread opinion. Partly it's because Stuart is much more skeptical of deep learning (and even machine learning more generally!) than almost any other alignment researcher, and so he's working in a different paradigm.
Is Richard correct and if so why? (I would also like a clearer explanation why Richard is skeptical of Stuart's agenda. I agree that the reframing doesn't completely solve the problem, but I don't understand why it can't be a useful piece).
I'm not sure why Rohin thinks the arguments against CIRL are bad, but I wrote a post today on why I think the argument from fully updated deference / corrigibility is weak. I also found Paul Christiano's response very helpful as an outline of objections to the utility uncertainty agenda.
Also relevant is this old comment from Rohin on difficulties with utility uncertainty.