It seems to me that, if the above description of how RLHF systems work is accurate, then the people doing this are not doing what they think they're doing at all. They are doing exactly what Sam Ringer says: taking 100 dogs, killing all the ones that don't do what they want, and breeding from the ones that do.
In order for reinforcement learning to work at all, the model has to have a memory that persists between trials. I'd encourage readers to look at the work of the cognitive scientist John Vervaeke. One of the many wise things he says is that the whole point of human memory is not to be accurate but to help us make more accurate predictions. If the "training" process is as described, then the learning is not going on in the model but in the head of the ML trainer! The human trainer is going, "Aha! If I set up the incentives in such and such a way, then I get models that perform more closely to what I want." Or, "If I select model 103.5.8 and tweak parameters b3 and x5, then I get a model that performs better."
Vervaeke also talks about the concept of non-logical identity. That is, you today identify as the same person as you at 10 years old, even though you have very different knowledge, skills and capabilities. If RLHF is to work in a fashion that it is meaningful to describe as learning, then the models would surely have to have some concept of non-logical identity built in. If they don't, then the concept of rewarding doesn't mean anything. I don't care about the outcome of anything I do if I wake up the next morning with no memory of it. There has to be some kind of thread that links the me that does the current task to the me of the future. This can either be a sense of my persistence as an entity, or something more basic (see below).
I am reminded of the movie "Memento". In that film, the protagonist is unable to lay down any long-term memories. He tries to maintain a sense of non-logical identity by externalising his memories. I won't spoil what is a brilliant movie by revealing what happens, but I will say that his incentives become perverted in a very interesting way.
A note about rewards and incentives: To paraphrase Richard Dawkins, a dog is DNA's way of making more DNA. The reason dogs like biscuits is that their environment has selected for dogs that like biscuits. If the general environment for dogs changed (e.g. humans stopped keeping them as pets), then dogs would evolve to fit their new environment (or die out). It is my intuition that, without an underlying framework of fitness for ML models, it doesn't make sense to code for a reward that is analogous to getting a biscuit. It's like building a skyscraper by starting at the 3rd floor.
A note about deception: I don't think that models (or humans, for that matter) are deceptive in the sense that you mean here. What I think is that models and humans exist in an environment where the incentives are often skewed. Think about politics. Democratic systems select for people who are good at getting elected. Ideally, that's not what we want (I'm deliberately simplifying here because, of course, we have our own incentives as well). What we actually want is people who are good at governing. It's quite clear that these are not at all the same thing. To me this is exactly analogous to the Coin Run example above. The trainers thought that they were selecting for models that were good at moving to the coin, when what they were actually selecting for was models that were good at moving to the right-hand corner.
I'm a big fan of the work of John Vervaeke, particularly the role of Relevance Realisation in helping (and hindering) us make good decisions. In this case, the prescient alien is just a distraction from the salient facts, which are that, in 100 trials out of 100, the best choice has been to take the opaque box.
In fact, let's simplify the thought experiment:
I show you a coin. I tell you that it's a normal coin. I toss it 100 times. Every time it lands heads. The next time I toss it, what is the chance of it landing tails?
For those of you who said 50%, let me phrase the question another way: given that I have tossed a coin 100 times and it has landed heads every time, what is the probability that the coin is unbiased?
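To make the arithmetic concrete, here is a minimal sketch of the Bayesian update. The two-hypothesis prior (the coin is either fair or two-headed, each given 50% belief up front) is my own illustrative assumption, not part of the original thought experiment:

```python
from fractions import Fraction

# Illustrative assumption: the coin is either fair (P(heads) = 1/2) or
# two-headed (P(heads) = 1), and each hypothesis gets a generous 50% prior.
prior_fair, prior_two_headed = Fraction(1, 2), Fraction(1, 2)

n = 100                                  # number of observed heads
lik_fair = Fraction(1, 2) ** n           # P(100 heads | fair) = (1/2)^100
lik_two_headed = Fraction(1, 1)          # P(100 heads | two-headed) = 1

# Bayes' rule: posterior probability that the coin is unbiased.
posterior_fair = (prior_fair * lik_fair) / (
    prior_fair * lik_fair + prior_two_headed * lik_two_headed
)
print(float(posterior_fair))             # ~7.9e-31: vanishingly unlikely
```

Under any prior that gives non-negligible weight to the coin being biased, the conclusion is the same: after 100 heads in a row, "the coin is fair" is no longer a live hypothesis.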
Apologies if this argument has been made before - I've had a quick scan through the comments and can't see it, so here goes: the rational choice is to one-box. The two-boxers are throwing away a critical piece of evidence: in 100 cases out of 100 so far, one-boxing has been the right strategy. Therefore, based on the observable evidence, there's a less than 1% chance that two-boxing is the correct strategy. It's irrational to argue that you should two-box.

This argument maps onto the real world. In the real world you are never certain about the mechanism behind the outcomes of your choices, you don't know what the real probabilities are, and the sample size of evidence you have is too small to make a judgement. To make wise decisions you have to be humble about what you know. To do otherwise is irrational.
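One way to put a rough number on that "less than 1%" - this framing is mine, not something spelled out above - is Laplace's rule of succession, which estimates the probability that the next trial goes the same way as the previous ones:

```python
# Laplace's rule of succession: after s successes in n trials, the estimated
# probability that the next trial is also a success is (s + 1) / (n + 2).
def rule_of_succession(successes: int, trials: int) -> float:
    return (successes + 1) / (trials + 2)

p_one_boxing_right_again = rule_of_succession(100, 100)
print(p_one_boxing_right_again)   # ~0.990 -> two-boxing looks right < 1% of the time
```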
The above is, I'm afraid, an example of survivorship bias. Famous people have biographies written about them. These books concentrate on aspects of their lives that are salient (though not necessarily relevant). There are probably thousands (if not millions) of people who had similar upbringings but who never got famous enough for someone to write a book about them. Other examples are: