Their empirical result rhymes with adversarial robustness issues: we can train adversaries to maximise near-arbitrary functions subject to a constraint of small perturbation from the ground truth. Here the maximised function is a faulty reward model, and the constraint is KL divergence to a base model rather than distance to a ground-truth image.
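As a toy sketch of the objective in question (the function name, the β value, and the single-sample KL estimator are my own illustration, not taken from the paper):

```python
import numpy as np

def penalised_reward(reward, policy_logprobs, base_logprobs, beta=0.1):
    """Score a sampled generation: the (possibly faulty) reward model's
    output, minus a KL penalty estimated from the log-prob ratio of the
    sampled tokens under the policy vs. the frozen base model."""
    policy_logprobs = np.asarray(policy_logprobs)
    base_logprobs = np.asarray(base_logprobs)
    kl_estimate = np.sum(policy_logprobs - base_logprobs)
    return reward - beta * kl_estimate
```

The adversarial-examples analogy: the optimiser pushes `reward` up however it can, and the penalty term only constrains how far the policy drifts from the base model, not whether the result is actually desirable.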
I wonder if multiscale aggregation could help here too, as it does with image adversarial robustness. We want the KL penalty to ensure that generations look normal at any "scale", whether we look at them token by token or read a high-level summary of them. However, I suspect their "weird, low-KL" generations will have weird high-level summaries, whereas more desirable policies would look more normal in summary (though it's not immediately obvious whether this translates to low- and high-probability summaries respectively - one would need to test). I think a KL penalty to the "true base policy" should behave this way automatically, but, as the authors note, we can't actually implement that.
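One crude way to make the "scale" idea concrete is to penalise KL at the token level and also at a coarser chunk level, pooling next-token distributions over spans of tokens. A real version would presumably compute the coarse term on model-generated summaries rather than pooled token distributions; all names and the pooling choice below are mine:

```python
import numpy as np

def kl(p, q):
    # KL(p || q) for discrete distributions given as probability vectors
    return float(np.sum(p * np.log(p / q)))

def multiscale_kl(policy_probs, base_probs, chunk=4):
    """Token-level KL plus a coarse-scale KL on chunk-pooled distributions.
    policy_probs, base_probs: arrays of shape (n_tokens, vocab_size)."""
    # fine scale: mean per-token KL
    token_term = np.mean([kl(p, q) for p, q in zip(policy_probs, base_probs)])
    # coarse scale: pool distributions within each chunk, then take KL
    coarse_terms = []
    for i in range(0, len(policy_probs), chunk):
        p_pooled = np.mean(policy_probs[i:i + chunk], axis=0)
        q_pooled = np.mean(base_probs[i:i + chunk], axis=0)
        coarse_terms.append(kl(p_pooled, q_pooled))
    return token_term + float(np.mean(coarse_terms))
```

The intent: a policy that is token-by-token close to the base model but drifts into a weird region at the paragraph level would pick up extra penalty from the coarse term.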
Is your view closer to:
Mathematical reasoning might be specifically conducive to language invention because our ability to automatically verify reasoning means that we can potentially get lots of training data. The reason I expect the invented language to be “intelligible” is that it is coupled (albeit with some slack) to automatic verification.
There's a regularization problem to solve for 3.9 and 4, and it's not obvious to me that glee will be enough to solve it (3.9 = "unintelligible CoT").
I'm not sure how o1 works in detail, but, for example, backtracking (which o1 seems to use) makes heavy use of the pretrained distribution to decide on the best next moves. So, at the very least, it's not easy to do away with the native understanding of language. While it's true that there is some amount of data that will enable large divergences from the pretrained distribution - and I could imagine mathematical proof generation eventually reaching this point, for example - more ambitious goals inherently come with less data, and it's not obvious to me that there will be enough data in alignment-critical applications to cause such a large divergence.
There's an alternative version of language invention where the model invents a better language for (e.g.) maths and then uses that for more ambitious projects, but that language is probably quite intelligible!
For what it's worth, one idea I had as a result of our discussion was this:
So philosophers like "pain is bad" as a moral foundation because we want to believe it + it is hard to challenge with evidence or reason. Laypeople probably have lots of foundational moral beliefs that don't stand up as well to evidence or reason, but (perhaps) are equally attributable to motivated reasoning.
Social pressure is a bit iffy to include because I think lots of people relate to beliefs that they adopted because of social pressure as moral foundations, and believing something because you're under pressure to do so is an instance of motivated reasoning.
I don't think this is a response to your objections, but I'm leaving it here in case it interests you.
I can explain why I believe bachelors are unmarried: I learned that this is what the word bachelor means, and I learned this because it is what bachelor means; the fact that there's a word "bachelor" that means "unmarried man" is contingent on some unimportant accidents in the evolution of language. A) this is certainly not the result of an axiomatic game, and B) if moral beliefs were also contingent on accidents in the evolution of language (I think most are not), that would have profound implications for metaethics.
Motivated belief can explain non-purely-selfish beliefs. I might believe pain is bad because I am motivated to believe it, but the belief still concerns other people. This is even more true when we go about constructing higher order beliefs and trying to enforce consistency among beliefs. Undesirable moral beliefs could be a mark against this theory, but you need more than not-purely-selfish moral beliefs.
I'm going to bow out at this point because I think we're getting stuck covering the same ground.
Thanks for your continued engagement.
I’m interested in explaining foundational moral beliefs like “suffering is bad”, not beliefs like “animals do/don’t suffer”, which is about badness only because we accept the foundational assumption that suffering is bad. Is that clear in the updated text?
Now, I don’t think these beliefs come from playing axiomatic games like “define good as that which increases welfare”. There are several lines of evidence for this. First: “define bad as that which increases suffering” is far more plausible than “define good as that which increases suffering”; we have pre-existing beliefs about which definitions fit.
Second: you talk about philosophers analysing welfare. However, the method philosophers use to do this usually involves analysing a bunch of fundamental moral assumptions. For example, from the Stanford Encyclopedia of Philosophy:
“Correspondingly, no amount of empirical investigation seems by itself, without some moral assumption(s) in play, sufficient to settle a moral question” (https://plato.stanford.edu/entries/metaethics/)
I am suggesting that the source of these fundamental moral assumptions may not be mysterious - we have a known ability to form beliefs based on what we want, and fundamental moral beliefs often align with what we want.
I think precisely defining "good" and "bad" is a bit beside the point - it's a theory about how people come to believe things are good and bad, and we're perfectly capable of having vague beliefs about goodness and badness. That said, the theory is lacking a precise account of what kind of beliefs it is meant to explain.
The LLM section isn't meant as support for the theory, but speculation about what it would say about the status of "experiences" that language models can have. Compared to my pre-existing notions, the theory seems quite willing to accommodate LLMs having good and bad experiences on par with those that people have.
I have a pedantic and a non-pedantic answer to this. Pedantic: you say X is "usually considered good" if it increases welfare. Perhaps you mean to imply that if X is usually considered good then it is good. In this case, I refer you to the rest of the paragraph you quote.
Non-pedantic: yes, it's true that once you accept some fundamental assumptions about goodness and badness you can go about theorising and looking for evidence. I'm suggesting that motivated reasoning is the mechanism that makes those fundamental assumptions believable.
I added a paragraph mentioning this, because I think your reaction is probably common.
If you're in a situation where you can reasonably extrapolate from past rewards to future reward, you can probably extrapolate previously seen "normal behaviour" to normal behaviour in your situation. Reinforcement learning is limited - you can't always extrapolate past reward - but it's not obvious that imitative regularisation is fundamentally more limited.
(normal does not imply safe, of course)