Yeah, I think this is the fundamental problem. But it's a very simple way to state it. Perhaps useful for someone who doesn't believe AI alignment is a problem?
Here's my summary: Even at the limit of the amount and variety of data you can provide via RLHF, when the learned policy generalizes perfectly to every new situation you can throw at it, the result will still almost certainly be malign, because there are still near-infinitely many such policies, and they each behave differently on the infinitely many remaining types of situation you didn't manage to train on. Because the particular policy learned is just one of many, it is unlikely to be the correct one.
But more importantly, behavior upon self-improvement and reflection is likely something we didn't test, because we can't. The alignment problem now requires that we look into the details of generalization. This is where all the interesting stuff is.
I read two very different things interacting in that scheme. There is authoring ontologies/frames, and then there is the issue of mulliganing over valuations. Using a fancy (human- or outside-provided) representation to get a good grip on data, which is relatively easy, is a different beast from being able to make a representation where there previously was none. But the backsies issue happens even if we are stuck with a constant representation scheme: what we would naively train for now, we might at some future date not train for, or train for the opposite.
Then there is the whole issue of valuations changing because the representations used to derive the values change structure, the way you do things differently depending on whether or not you use probabilities in your thinking.
Post-AGI, bleeding-edge scientific understanding is likely to be based on silicon-generated representations. There the yardstick can't be whether the result is correct, as humans are in a worse position to tell what the correct result is. So a big trick/challenge is recognising/trusting that a machine conclusion should be put into societal use when you would not have confidence in a biological brain coming up with a similar claim.
I don't know why this is a critique of RLHF. You can use RLHF to train a model to ask you questions when it's confused about your preferences. To the extent that you can easily identify when a system has this behavior and when it doesn't, you can use RLHF to produce this behavior just like you can use it to produce many other desired behaviors.
I dunno, I'd agree with L.A. Paul here. There's a difference between cases where you're not sure whether it's A, or it's B, or it's C, and cases where A, B, and C are all valid outcomes, and you're doing something more like calling one of them into existence by picking it.
The first cases are times when the AI doesn't know what's right, but to the human it's obvious and uncomplicated which is right. The second cases are where human preferences are underdetermined - where there are multiple ways we could be in the future that are all acceptably compatible with how we've been up until now.
I think models that treat the thing they're learning as entirely the first sort of thing are going to do fine on the obvious and uncomplicated questions, but would learn to resolve questions of the second type using processes we wouldn't approve of.
This is a broader criticism of alignment to preferences or intent in general, since these things can change (and sometimes, you can even choose whether to change them or not). L.A. Paul wrote a whole book about this sort of thing; if you're interested, here's a good talk.
That's fair. I think it's a critique of RLHF as it is currently done (just get lots of preferences over outputs and train your model). I don't think just asking you questions "when it's confused" is sufficient; it also has to know when to be confused. But RLHF is a pretty general framework, so you could theoretically expose a model to lots of black swan events (not just mildly OOD events) and make sure it reacts to them appropriately (or asks questions). But as far as I know, that's not research that's currently happening (though there might be something I'm not aware of).
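For concreteness, here's a minimal sketch of the pairwise-comparison objective that RLHF reward modelling typically uses. The names (`reward_model`, the idea of deliberately including black-swan prompts in the comparison data) are my own illustration, not anyone's actual setup; the point is just that "ask a clarifying question when unsure" only gets reinforced if comparisons favoring that behavior actually appear in the data.

```python
# Minimal sketch, not anyone's actual implementation. Assumes a scalar-output
# reward_model and batches of (preferred, rejected) response pairs, where some
# pairs deliberately favor "ask a clarifying question" completions on
# out-of-distribution / black-swan prompts.
import torch.nn.functional as F

def preference_loss(reward_model, preferred, rejected):
    # Bradley-Terry-style objective used in RLHF reward modelling:
    # push the reward of the preferred response above the rejected one.
    r_pref = reward_model(preferred)  # scalar reward per response, shape (batch,)
    r_rej = reward_model(rejected)
    return -F.logsigmoid(r_pref - r_rej).mean()

# If the comparison data never contains cases where "I'm not sure, can you
# clarify?" beats a confident-but-wrong answer, nothing in this loss pushes
# the model toward knowing when to be confused.
```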
I don't mean to distract from your overall point, though, which I take to be "a philosopher said a smart thing about AI alignment despite not having much exposure." That's useful data.
In the spring, I went to a talk with Brian Christian at Yale. He talked about his book, The Alignment Problem, and then there was an audience Q&A. There was a really remarkable question in that Q&A, which I have transcribed here. It came from the Yale philosophy professor L.A. Paul. I have since spoken to Professor Paul, and she has done some work on AI (and coauthored the paper “Effective Altruism and Transformative Experience”) but my general impression was that she hasn't yet spent a huge amount of time thinking about AI safety. Partly because of this question, I invited her to speak at the CAIS Philosophy Fellowship, which she will be doing in the spring.
The transcript below really doesn't do her question justice, so I'd recommend watching the recording, starting at 55 minutes.
During the talk, Brian Christian described reinforcement learning from human feedback (RLHF), specifically the original paper, in which a model was trained on a reward signal learned from humans rating which of two videos of a simulated robot was closer to a backflip. Paul's question is about this (punctuation added, obviously):
I'm not claiming that these are original ideas or that they represent all possible critiques of RLHF. Rather:
Also, the rest of the talk is pretty good too! Especially the Q&A: there were some other pretty good questions (including one from me). But this one stood out for me.