LESSWRONG
LW

LorenzoPacchiardi — LessWrong

Hi, thanks for this interesting work. You may also be interested in our new work where we investigate whether internal linear probes (before an answer is produced) capture whether a model is going to answer correctly: https://www.lesswrong.com/posts/KwYpFHAJrh6C84ShD/no-answer-needed-predicting-llm-answer-accuracy-from

We also compare that with verbalised self-confidence and we find internals have more predictive power, so potentially you can apply internals to your setup

•••

Two Mathematical Perspectives on AI Hallucinations and Uncertainty

LorenzoPacchiardi

5mo

A recent OpenAI preprint (and blogpost) examines the sources of AI hallucinations. This reminded me of a 2024 preprint which was similar in scope. These two papers use different mathematical lenses but arrive at complementary conclusions about the necessity of uncertainty expressions in AI systems. I briefly review and compare them here.

[disclaimer: I am not an author of any of those papers, so my understanding may not be fully correct]

Paper 1: The Consistent Reasoning Paradox

The first paper examines AI systems through the lens of theoretical computer science. The authors define an AGI as a system (implemented on a Turing machine) that passes the Turing test, which implies it must reason consistently—answering the... (read 759 more words →)

No Answer Needed: Predicting LLM Answer Accuracy from Question-Only Linear Probes

antonghawthorne

antonghawthorne, ivanvmoreno, Arnau Padrés Masdemont, David Africa, LorenzoPacchiardi

5mo

TLDR: This is the abstract, introduction and conclusion to the paper. See here for a summary thread.

Abstract

Do large language models (LLMs) anticipate when they will answer correctly? To study this, we extract activations after a question is read but before any tokens are generated, and train linear probes to predict whether the model’s forthcoming answer will be correct. Across three open-source model families ranging from 7 to 70 billion parameters, projections on this “in-advance correctness direction” trained on generic trivia questions predict success in distribution and on diverse out-of-distribution knowledge datasets, outperforming black-box baselines and verbalised predicted confidence. Predictive power saturates in intermediate layers, suggesting that self-assessment emerges mid-computation. Notably, generalisation falters... (read 921 more words →)

Replying toFeedback request: `eval-crypt` a simple utility to mitigate eval contamination.

LorenzoPacchiardi5mo

Feedback request: `eval-crypt` a simple utility to mitigate eval contamination.

Hi Matan, yes, I think I would use it. I think it's already quite useful even without HuggingFace integration. For instance, we are creating an eval that consists of custom code (relying on Inspect) + text prompts, and I would gladly encrypt the text prompts.

Replying toFeedback request: `eval-crypt` a simple utility to mitigate eval contamination.

LorenzoPacchiardiAug 13, 2025

Feedback request: `eval-crypt` a simple utility to mitigate eval contamination.

This seems useful, thanks. It won't clearly be sufficient to fully fix the issue (as people can always uncrypt the data and add it to training corpus), but it is a good protection against accidental inclusion.

I suppose this would gain more adoption if:

platforms such as HF would start using it alongside CAPTCHAs (so people can look at evals without downloading locally)
if it can be easily used alongside Inspect or other platforms (it may already be, not sure) -- or if inspect incorporates it and allows people defining a new eval to use it without effort.

As for the second point, have you considered patching Inspect?

Replying toBuilding Black-box Scheming Monitors

LorenzoPacchiardi6mo

Building Black-box Scheming Monitors

Thanks, this seems like a great work! A couple of comments:

metrics: as you correctly note, AUROC is threshold-agnostic, which is great for comparing across models. However, the other side of the coin of this is that it is insensitive to monotonic transformations of the assigned values. Thus, it is possible that, in an OOD case, AUROC remains high but the monitor assigns 10x smaller values to the samples. In operational scenarios, one would set a threshold on a known distribution (which may be different from the training one for the monitor) and then use it for the operational distribution. A high value of AUROC implies that you can determine a threshold value

LorenzoPacchiardi6mo

It's Owl in the Numbers: Token Entanglement in Subliminal Learning

Thanks for these experiments, really interesting. I am however a bit perplexed about this part of the "background":

Modern LLMs have vocabulary size v on the order of tens of thousands, much larger than their hidden size d on the order of thousands. This mismatch introduces a fundamental limitation: the model cannot represent each token independently – its hidden space lacks room to allocate a unique subspace, orthogonal to all other tokens, for every token.

While it's true that you can't have more than d perfectly orthogonal vectors, in high-dimensional spaces like those used in LLMs, the cosine similarity between random vectors concentrates towards 0, which is one manifestation of the curse of dimensionality.... (read more)

Replying toHow to Catch an AI Liar: Lie Detection in Black-Box LLMs by Asking Unrelated Questions

LorenzoPacchiardi2y

How to Catch an AI Liar: Lie Detection in Black-Box LLMs by Asking Unrelated Questions

Please let me know if that description is wrong!

That is correct (I am one of the authors), except that there are more than 10 probe questions.

Therefore, if the language model (or person) isn't the same between steps 1 and 2, then it shouldn't work.

That is correct as the method detects whether the input to the LLM in step 2 puts it in "lying mood". Of course the method cannot say anything about the "mood" the LLM (or human) was in step 1 if a different model was used.