To be clear, I'm probably not (highly) counterfactual for other work in this area. As I note in this post:
While I expect that the sort of proposal I discuss here is well known, there are many specific details I discuss here which I haven't seen discussed elsewhere. If you are reasonably familiar with this sort of proposal, consider just reading the “Summary of key considerations” section which summarizes the specific and somewhat non-obvious points I discuss in this post.
I think this idea has been independently invented many times; this post mostly adds additional considerations and popularization.
(TBC, I tentatively believe this post had a reasonably large impact, but through these mechanisms, not by inventing the idea!)
A well-designed needle in a haystack should be hard to detect with "normal" software. (Like maybe a very weak LLM can notice the high perplexity, but I think it should be very hard to write a short ("normal") Python function that doesn't use libraries and does this.)
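To illustrate, here's a rough sketch of the weak-LLM perplexity check (not from the post; it assumes GPT-2 via Hugging Face transformers as the "very weak LLM" and a simple z-score threshold, which is exactly the kind of thing you can't do in a short library-free function):

```
# Sketch: flag "needle" sentences whose perplexity under a weak LM is
# anomalously high relative to the rest of the haystack.
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")
model.eval()

def sentence_perplexity(sentence: str) -> float:
    """Perplexity of one sentence under the weak LM."""
    ids = tokenizer(sentence, return_tensors="pt").input_ids
    with torch.no_grad():
        loss = model(ids, labels=ids).loss  # mean next-token negative log-likelihood
    return torch.exp(loss).item()

def flag_needles(sentences, z_threshold=3.0):
    """Return sentences whose perplexity is a large outlier."""
    ppls = [sentence_perplexity(s) for s in sentences]
    mean = sum(ppls) / len(ppls)
    std = (sum((p - mean) ** 2 for p in ppls) / len(ppls)) ** 0.5 or 1.0
    return [s for s, p in zip(sentences, ppls) if (p - mean) / std > z_threshold]
```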
The fit is notably better for "cumulative investment over time", though years still produces a slightly better fit.
I've cut off the fit as of 2010, which is about when the original version of Moore's law stops. If you try to project out after 2010, then I think cumulative investment would do better, but only because investment slowed in response to Moore's law dying.
(Doing the fit to an investment series lagged by 3 years doesn't make any important difference.)
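Roughly the kind of comparison I have in mind, as a sketch (not the actual analysis; the series below are synthetic placeholders just so the snippet runs, and the real data would be truncated at 2010 as above):

```
# Sketch: compare R^2 of log(transistor count) fit against calendar year
# vs. against log(cumulative investment), plus the lag-by-3-years variant.
import numpy as np

rng = np.random.default_rng(0)
years = np.arange(1971, 2011)
# Placeholder data: swap in the real transistor-count and investment series.
transistors = 2.3e3 * np.exp(0.35 * (years - 1971) + rng.normal(0, 0.3, years.size))
cum_investment = np.cumsum(np.exp(0.12 * (years - 1971)))

def r_squared(x, y):
    """R^2 of an ordinary least-squares line y ~ a*x + b."""
    a, b = np.polyfit(x, y, 1)
    resid = y - (a * x + b)
    return 1 - resid.var() / y.var()

log_t = np.log(transistors)
print("fit vs. year:                 ", r_squared(years, log_t))
print("fit vs. log cumulative invest:", r_squared(np.log(cum_investment), log_t))
# Lagged variant: current transistor counts vs. investment from 3 years earlier.
print("fit vs. lagged investment:    ", r_squared(np.log(cum_investment[:-3]), log_t[3:]))
```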
Easy for Claude to bypass, though (it does actually fetch a PDF that you can directly download from Anthropic's servers).
Here is a URL where you can download the PDF.
I asked Claude to download the PDF mostly as an experiment; here is its explanation of how it did this:
Claude's explanation
Vanta trust portals use a JavaScript single-page application that doesn't expose direct PDF download links in the HTML. The PDF URL is only revealed through network traffic when the page loads.
I used Puppeteer to intercept network requests with content-type: application/pdf:
// Log the URL of any response whose content-type indicates a PDF
page.on('response', async response => {
  const contentType = response.headers()['content-type'] || '';
  if (contentType.includes('pdf')) {
    console.log('Found PDF:', response.url());
  }
});
When the page loads, it fetches the PDF for its viewer, revealing the actual URL with these parameters:
rid - the document's internal ID
r - the trust report slug (from the HTML's data-slugid attribute)
view=true - returns the PDF content

The model must respond immediately with the answer. Any prefix would result in the answer being wrong. So only "[movie]" would be correct.
(Other than a few fixed generic ones like "the answer is" or "answer:" that I strip away, but the models virtually never output these, so considering these incorrect wouldn't alter the results.)
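Concretely, the grading is roughly the sketch below (the exact prefix list and matching details are illustrative assumptions on my part, as is the "Jaws" example):

```
# Sketch: an output counts as correct only if it starts immediately with the
# expected answer, after stripping a few fixed generic prefixes.
GENERIC_PREFIXES = ("the answer is", "answer:")  # assumed prefix list

def is_correct(model_output: str, expected: str) -> bool:
    out = model_output.strip().lower()
    for prefix in GENERIC_PREFIXES:
        if out.startswith(prefix):
            out = out[len(prefix):].lstrip()
            break
    # Any other prefix before the answer makes the response count as wrong.
    return out.startswith(expected.strip().lower())

assert is_correct("Answer: Jaws", "Jaws")
assert not is_correct("Let me think... the movie is Jaws", "Jaws")
```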
I think it's more that LLMs were interestingly bad at n-hop latent reasoning, and this might be generally representative of a certain type of within-a-single-forward-pass weakness of current architectures. It's not clear exactly how many "hops" reasoning effectively about scheming without this information in context corresponds to.
I do think that relevant information won't be in context during the best opportunities for misaligned AIs to sabotage, escape, or otherwise covertly achieve their aims, but the information the AIs need will be much more closely associated, rather than being unrelated hops.
Note that this linked video is talking about tokens that the LLM picks rather than fixed tokens.
I think the distribution of errors for numerical n-hop questions is going to be uninteresting/random most of the time, because the only questions with numerical answers are either "day of month X was born" or "number of counties in US state X", where there isn't any real reason for AIs to be close if they are wrong (either about X or the property of X).
However, I was interested in this question for "adding the result of N 1-hop questions", so I got Opus to do some additional analysis (for Gemini 3 Pro with 300 filler tokens):
The most interesting result here is that even on 6 addends, the model gets 75% within +-10 even though the answers are pretty big (median is 288 and many answers are much bigger).
Also, for some reason the model is systematically low, especially for 3 and 4 addends. Maybe because it sometimes skips one of the numbers?
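The analysis here is roughly the following kind of sketch (not Opus's actual code; the numbers at the bottom are made up purely to show the output format):

```
# Sketch: per addend count, report the fraction of answers within +-10 of the
# true sum and the median signed error (which would surface a low bias).
from statistics import median

def summarize(results):
    """results maps number of addends -> list of (model_answer, true_sum) pairs."""
    for n_addends, pairs in sorted(results.items()):
        errors = [model - true for model, true in pairs]
        within_10 = sum(abs(e) <= 10 for e in errors) / len(errors)
        print(f"{n_addends} addends: {within_10:.0%} within +-10, "
              f"median signed error {median(errors):+}")

# Made-up example input; the real pairs would come from the model's transcripts.
summarize({3: [(118, 120), (95, 95), (200, 214)],
           6: [(290, 288), (512, 530), (331, 333)]})
```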
Apparently, similar work on prompt repetition came out a bit before I published this post (but after I ran my experiments) and reproduces the effect. It looks like this paper doesn't test the "no output/CoT prior to the answer" setting and instead tests the setting where you use non-reasoning LLMs (e.g., Opus 4.5 without extended thinking enabled).