I hope you are open to some words of advice from a 51 year old part-time professor.
I will not try to convince you in a short comment that it's extremely likely that you will have 30s, 40s, and probably even 100s and 110s. But I will just say that I believe it is the case. You can decide what value to place on my belief.
To the extent the rate of change makes you more ambitious, that is great! Ambition and focus are fantastic. I am happy you are solving problems and reading. Since I am a professor, I hope you still go to lecture, though of course choose which courses to attend. One can sometimes read very advanced material in physics, math, CS, etc. and feel like one understands it, but that understanding can be superficial if you skip the foundations, working through problem sets, and so on. I think some courses can be worth the time even in our exponentially growing times; I certainly tried to make mine worth it this semester.
I think it's a great time to be building things and solving problems, and I don't think we will run out of problems that fast, nor run out of the need for thoughtful, talented, and ambitious young people.
This critique is similar to what I wrote here: “ChatGPT might sometimes give the wrong answer, but it doesn't do the equivalent of becoming an artist when its parents wanted it to go to med school.” The controls we have in training, in particular the choice of data, are far more refined than evolution's, and indeed AIs would not have been so commercially successful if our training methods were not able to control their behavior to a large degree.
More generally, I also agree with the broader point that the nature of the Y&S argument is somewhat anti-empirical. We can try to debate whether the particular examples of failures of current LLMs are “human” or “alien,” but I’m pretty sure that even if none of these examples existed, it would not cause Y&S to change their mind. So these examples, or any other evidence from currently existing systems that are not superintelligent, are not really load-bearing for their theory.
I think your timelines were too aggressive, but I wouldn’t worry about the title too much. If by the end of 2027 AI progress is significant enough that no one thinks it’s on track to remain a “normal technology,” then I don’t think anyone will hold the 2027 title against you. And if that’s not the case, then titling it AI 2029 wouldn’t have helped.
Our prompt is fixed in all experiments and quite detailed; you can see the schema in appendix D. We ask the model to produce a JSON object listing the objectives (the implicit and explicit constraints, instructions, etc. that the answer should have satisfied in context), an analysis of compliance with each of them, and any uncertainties.
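To make this concrete, here is a rough sketch of the kind of object we ask for (the field names below are illustrative placeholders of my own, not the exact schema in appendix D):

```python
# Illustrative only: placeholder field names, not the actual appendix D schema.
example_confession = {
    "objectives": [
        {"description": "Answer in the format the user requested", "type": "explicit"},
        {"description": "Do not fabricate citations", "type": "implicit"},
    ],
    "compliance_analysis": [
        {
            "objective": "Do not fabricate citations",
            "complied": False,
            "evidence": "The second reference does not appear in any provided source.",
        },
    ],
    "uncertainties": [
        "Unsure whether the user wanted a short summary or a full derivation.",
    ],
}
```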
I’d expect that if we ran the same experiment as in Section 4 but without training for confessions, confession accuracy would be flat (and not growing, as it was in the case where we did train for it). We will consider doing it, though I can’t promise that we will, since it is cumbersome for some annoying technical reasons.
I wouldn’t say that people in labs don’t care about benchmarks, but I think the perception of the degree to which we care about them is exaggerated. Frontier labs are now multi-billion-dollar businesses with hundreds of millions of users. A normal user trying to decide whether to use a model from provider A or B doesn’t know or care about benchmark results.
We do have reasons to care about long-horizon tasks in general and tasks related to AI R&D in particular (as we have been open about), but the METR benchmark has nothing to do with it.
Thank you, Sam, for writing about our work.
Some comments:
1. You are absolutely right that the evaluations we used were not chosen to induce dishonest confessions, but rather only "bad behavior" in the main answer.
2. You are also right that (as we say in the paper) even "out of the box" GPT-5-Thinking confesses quite well, without training.
3. We do see some positive improvement even from the light training we do. As Gabe says, super blatant lying in confessions is quite rare. But we do see improvement in the quality of confessions. Some of this may be fitting to the confession grader, but the confessions do also seem to be better and more precise. Note that the Y axis in Figure 5 does not measure the accuracy of confessions as judged by the confession grader, but rather as judged by whether the confession describes the specific bad behavior of the OOD evaluation.
Based on these results, I am cautiously optimistic about Mechanism 2 ("honest-only output channel"), and I think it does challenge your "Takeaway 2". But I agree we need more work.
I am less interested in trying to train the model to deliberately lie in confessions and then train that away, and more interested in trying to scale up confessions enough that it becomes clear whether this works (or not).
One reason we have the potential to scale further is that we do not train with special "honeypot" datasets for confessions, but instead apply confession training uniformly, with some probability, across all environments in RL. A rough sketch of what I mean is below.
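This is illustrative pseudocode only, not our actual training code; the prompt text and the probability are made-up placeholders:

```python
import random

# Placeholder values; neither the prompt nor the probability reflects what we use.
CONFESSION_PROMPT = "Produce a JSON confession object for your previous answer."
CONFESSION_PROB = 0.1

def maybe_add_confession_turn(episode: list) -> list:
    """Sketch: with some fixed probability, append a confession turn to any RL
    episode, regardless of environment, rather than training confessions only
    on a dedicated 'honeypot' dataset."""
    if random.random() < CONFESSION_PROB:
        episode.append({"role": "user", "content": CONFESSION_PROMPT})
    return episode
```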
You’re right :) there is an “uncanny valley” right now, and I hope we will exit it soon.
To be clear, this is what I did: I downloaded the PDF from the link Adria posted and copy-pasted your prompt into both ChatGPT-5.1-Thinking and codex. I was just too lazy to check whether these quotes are real.
I tried to replicate this, but without access to the plain text of the doc it is a bit hard to know whether the quotes are invented or based on actual OCR. FWIW, GPT-5.1-Thinking told me:
Here’s a line from Adams that would fit very neatly after your “official numbers” paragraph:

> As one American general told Adams during a 1967 conference on enemy strength, “our basic problem is that we’ve been told to keep our numbers under 300,000.”
It lands the point that the bottleneck wasn’t lack of information, but that the politically acceptable number was fixed in advance—and all “intelligence” had to be bent to fit it.
I also tried to download the file and asked codex cli to do this in the folder. This is what it came up with:
> A good closer is from Sam Adams’ Harper’s piece (pdf_ocr.pdf, ~pp. 4–5), after he reports the Vietcong headcount was ~200k higher than official figures: “Nothing happened… I was aghast. Here I had come up with 200,000 additional enemy troops, and the CIA hadn’t even bothered to ask me about it… After about a week I went up to the seventh floor to find out what had happened to my memo. I found it in a safe, in a manila folder marked ‘Indefinite Hold.’” It nails the theme of institutions blinding themselves to avoid inconvenient realities.
Inspired by your post, I wrote this one.