This seems concerning. Not an expert so unable to tell how concerning it is. Wanted to start a discussion! Full text: https://openai.com/blog/critiques/
Edit: the full publication linked in the blog provides additional details on how they found this in testing. See Appendix C. I'm glad OpenAI is at least aware of this alignment issue and plans to address it with future language models, postulating how changes in training and/or testing could ensure there is greater/more accurate/more honest model outputs.
Key text:
Do models tell us everything they know? To provide the best evaluation assistance on difficult tasks, we would like models to communicate all problems that they “know about.” Whenever a model correctly predicts that an answer is flawed, can the model also produce a concrete critique that humans understand?
This is particularly important for supervising models that could attempt to mislead human supervisors or hide information. We would like to train equally smart assistance models to point out what humans don’t notice.
Unfortunately, we found that models are better at discriminating than at critiquing their own answers, indicating they know about some problems that they can’t or don’t articulate. Furthermore, the gap between discrimination and critique ability did not appear to decrease for larger models. Reducing this gap is an important priority for our alignment research.
Yes, that was my take away. You expect a gap but there is no particular reason to expect the gap to close with scale, because that would require critique to scale better than discrimination, and why would you expect that rather than scaling similarly (maintaining a gap) or diverging in the other direction (discriminating scaling better than critiquing)?
I think the gap itself is mildly interesting in a "it knows more than it can say" deception sort of way, but we already knew similar things from stuff like prompt programming for buggy Codex code completions. Since the knowledge must be there in the model, and it is brought out by fairly modest scaling (a larger model can explain what a smaller model detects), I would guess that it wouldn't be too hard to improve the critique with the standard tricks like generating a lot of completions & scoring for the best one (which they show does help a lot) and better prompting (inner-monologue seems like an obvious trick to apply to get it to fisk the summary: "let's explain step by step why this is wrong, starting with the key quote: "). The gap will only be interesting if it proves immune to the whole arsenal. If it isn't, then it's just another "sampling can prove the presence of knowledge but not the absence".
Otherwise, this looks like a lot of results any pro-scaling advocate would be unsurprised to see: yet another task with apparently smooth improvement with model size*, some capabilities emerging with larger but not smaller models ("We also find that large models are able to directly improve their outputs, using their self-critiques, which small models are unable to do. Using better critiques helps models make better improvements than they do with worse critiques, or with no critiques.") at unpredicted sizes requiring empirical testing, big performance boosts from better sampling procedures than naive greedy sampling, interesting nascent bootstrapping effects...
* did I miss something or does this completely omit any mention of parameter sizes and only talks in terms of model loss?