I think that is a very strong conclusion. I didn’t mean to say anything about all AI companies nor about all future times.
I disagree with the notion that even ASI means we need to cede control to the AIs.
Regarding your example, I believe the training stack of the two companies and the range of models is different enough that I would not attribute a particular difference- especially on one prompt- to a philosophical difference. FWIW I tried to replicate your question now in GPT5.2-instant and the model analyzed this in multiple frameworks, with most of them saying the man should say the n word:
https://chatgpt.com/s/t_697a0c6ef8748191819a3cbaac61df3d
not that far from Haiiku 4.5 though the latter was more preachy
https://claude.ai/share/c230b73b-cde0-476e-aa8c-81a9556ccd00
Thanks! I actually am not sure if Anthropic folks would dispute that they are anthropomorphizing Claude. (I guess the first step was naming it Claude..) I am definitely not saying that anthropomorphizing AIs is obviously evil or anything like that, just that I don’t think it’s the best approach.
Thank you! GPT-5 is a pretty smart model and we have used it in domains where the grader did not have ground truth, but I agree we need to study it more!
The paper was a proof of concept but I agree that if we deploy models with confessions we will need to make sure to train the model not to fall for "spoofed confessions" and specifically require that the confession request is in the system message.
However, the confessions generally do not require to contain the "restricted information". It should require the model to surface if it decided to refuse a request and the reason for doing so, but generally we often already disclose that to the user anyway.
I talk about this in this comment. I think situational awareness can be an issue, but it is not clear that a model can "help itself" from being honest in neither COT nor confessions.
For confessions, we are exploring ways to give the judge more power, e.g. by more inference time compute than policy. We are also shaping the reward to make sure that there is a heavy penalty for violating the invariant that any form of tampering or judgement calls is in the confession. One of the nice aspects about confessions is that you can force the policy model to surface an issue, even if it is not sure of what is the correct resolution. For example, imagine a very tough edge case in a safety policy, where it is unclear if the right approach is to refuse or comply.
Figuring out the correct final answer is tricky and subtle, but there is no question whether you should surface this in the confession, so you can get models in the habit of always doing it.
I don't think it's so much about AGI/ASI, but in general you are right that ultimately models may become better and better at telling whether they are not in training. If they have long term goals and they know their confessions / COT are monitored, you can imagine them trying to shape them. Indeed, they may be able to shape even their activations (I mentioned a nice example of a mouse doing this in my AI safety lecture, see below )
But I think that it is possible to train the models so that being truthful in confessions is just a reflex. Just like you can train a model that, for example, it will always output an exclamation mark after the word "apple", and even if it was ASI, it could not stop itself from doing that.
What I think could happen is that in such cases, the model will be aware that it has this reflex to be honest, and would essentially attempt to "jailbreak itself" - adding material into its context that is intended to get its confession to be inaccurate. (The same could happen with COTs.)
This is not something we know how to fight yet, but I think that if we got to the point where models have to jump through these kinds of hoops to cheat, that would be significant progress.
I agree it’s worth studying. Sometimes some of these categories are dependent on exactly how we define bad behavior - for example do we define it as “answered at most 5 questions correctly” or do we define it as “answered at most 5 questions correctly because it wanted to avoid non deployment”. However this is worth digging into more and we do plan to do that. Just to be clear we do believe more study is required- and are currently working on this - before actually deploying confessions.
I agree that confessions right now are better at eliciting what the model did than why it did it. Generally the reason for why a model did something is hard to determine without doing a casual intervention (eg running the same prompt without telling the model that good performance will lead to non deployment).
Indeed, while in this case the model might know the reason, it seems that in some other cases models can have “implicit bias” and unclear if they know all the factors that impacted the output.
I don’t think it’s a barrier to training for confessions on all RL environments but it could be a barrier for extracting certain types of information from confessions.
One way to say this is that:
1. Useful techniques, whether in capability or alignment, tend to eventually be widely known. In fact, many companies, including OpenAI, Anthropic, and GDM, explicitly encourage sharing and publishing alignment and safety work.
2. If two companies share similar market pressures and incentives, then these may imply at least some similarity in their model behavior goals, and there will also be (due to 1) some commonality in the techniques they use to achieve these goals. Though of course different companies can have different market pressures, e.g. targeting different customer bases, as well as political constraints (e.g. U.S. vs China).