Anthropic's new paper 'Mapping the Mind of a Large Language Model' is exciting work that really advances the state of the art for the sparse-autoencoder-based dictionary learning approach to interpretability (which switches the unit of analysis in mechanistic interpretability from neurons to features). Their SAE learns (up to) 34 million features on a real-life production model, Claude 3 Sonnet (their middle-sized Claude 3 model).
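For anyone less familiar with the approach, here's a minimal sketch of the kind of sparse autoencoder involved (purely illustrative; the sizes, loss weights, and architecture details are my placeholders, not Anthropic's actual setup):

```python
# Illustrative sparse autoencoder for dictionary learning on model activations.
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    def __init__(self, d_model: int, n_features: int):
        super().__init__()
        # Overcomplete dictionary: many more features than activation dimensions.
        self.encoder = nn.Linear(d_model, n_features)
        self.decoder = nn.Linear(n_features, d_model)

    def forward(self, activations: torch.Tensor):
        # Feature activations are the new 'unit of analysis' (instead of neurons).
        features = torch.relu(self.encoder(activations))
        reconstruction = self.decoder(features)
        return reconstruction, features

def sae_loss(reconstruction, activations, features, l1_coeff=1e-3):
    # Reconstruct the activations faithfully, while an L1 penalty keeps most
    # features silent on any given input (sparsity).
    recon_loss = (reconstruction - activations).pow(2).mean()
    sparsity = features.abs().mean()
    return recon_loss + l1_coeff * sparsity

# Toy example: random vectors standing in for a transformer's residual stream.
sae = SparseAutoencoder(d_model=512, n_features=8192)
acts = torch.randn(64, 512)
recon, feats = sae(acts)
loss = sae_loss(recon, acts, feats)
```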
The paper (which I'm still reading; it's not short) updates me somewhat toward 'SAE-based steering vectors will Just Work for LLM alignment up to human-level intelligence[1].' As I read I'm trying to think through what I would have to see to be convinced of that hypothesis. I'm no expert here! I'm posting my thoughts mostly to ask for feedback about where I'm wrong and/or what I'm missing. Remaining gaps I've thought of so far:
Terminology proposal: scaffolding vs tooling.
I haven't seen these terms consistently defined with respect to LLMs. I've been using, and propose standardizing on:
Some smaller details:
Thanks to @Andy Arditi for helping me nail down the distinction.
If I desire a cookie, I desire to believe that I desire a cookie; if I do not desire a cookie, I desire to believe that I do not desire a cookie; let me not become attached to beliefs I may not want.
If I believe that I desire a cookie, I desire to believe that I believe that I desire a cookie; if I do not believe that I desire a cookie, I desire to believe that I do not believe that I desire a cookie; let me not become attached to beliefs I may not want.
If I believe that I believe that I desire a cookie, I desire to believe that I believe that I believe that I desire a cookie; if I do not believe that I believe that I desire a cookie, I desire to believe that I do not believe that I believe that I desire a cookie; let me not become attached to beliefs I may not want.
If I believe that...
Two interesting things from this recent Ethan Mollick post:
People can’t detect AI writing well. Editors at top linguistics journals couldn’t. Teachers couldn’t (though they thought they could - the Illusion again). While simple AI writing might be detectable (“delve,” anyone?), there are plenty of ways to disguise “AI writing” styles through simple prompting. In fact, well-prompted AI writing is judged more human than human writing by readers.
Thoughts on a passage from OpenAI's GPT-o1 post today:
We believe that a hidden chain of thought presents a unique opportunity for monitoring models. Assuming it is faithful and legible, the hidden chain of thought allows us to "read the mind" of the model and understand its thought process. For example, in the future we may wish to monitor the chain of thought for signs of manipulating the user. However, for this to work the model must have freedom to express its thoughts in unaltered form, so we cannot train any policy compliance or user preferences onto the chain of thought. We also do not want to make an unaligned chain of thought directly visible to users.
Therefore, after weighing multiple factors including user experience, competitive advantage, and the option to pursue the chain of thought monitoring, we have decided not to show the raw chains of thought to users. We acknowledge this decision has disadvantages. We strive to partially make up for it by teaching the model to reproduce any useful ideas from the chain of thought in the answer. For the o1 model series we show a model-generated summary of the chain of thought.
This section is interesting in a few ways:
CoT optimised to be useful in producing the correct answer is a very different object to CoT optimised to look good to a human, and a priori I expect the former to be much more likely to be faithful. Especially when thousands of tokens are spent searching for the key idea that solves a task.
For example, I have a hard time imagining how the examples in the blog post could be not faithful (not that I think faithfulness is guaranteed in general).
If they're avoiding doing RL based on the CoT contents,
Note they didn’t say this. They said the CoT is not optimised for ‘policy compliance or user preferences’. Pretty sure what they mean is they didn’t train the model not to say naughty things in the CoT.
'We also do not want to make an unaligned chain of thought directly visible to users.' Why?
I think you might be overthinking this. The CoT has not been optimised not to say naughty things. OpenAI avoid putting out models that haven’t been optimised not to say naughty things. The choice was between doing the optimising, or hiding the CoT.
Edit: not wanting other people to finetune on the CoT traces is also a good explanation.
Much is made of the fact that LLMs are 'just' doing next-token prediction. But there's an important sense in which that's all we're doing -- through a predictive processing lens, the core thing our brains are doing is predicting the next bit of input from current input + past input. In our case input is multimodal; for LLMs it's tokens. There's an important distinction in that LLMs are not (during training) able to affect the stream of input, and so they're myopic in a way that we're not. But as far as the prediction piece goes, I'm not sure there's a strong difference in kind.
Would you disagree? If so, why?
I've been thinking of writing up a piece on the implications of very short timelines, in light of various people recently suggesting them (e.g. Dario Amodei: "2026 or 2027...there could be a mild delay").
Here's a thought experiment: suppose that this week it turns out that OAI has found a modified sampling technique for o1 that puts it at the level of the median OAI capabilities researcher, in a fairly across-the-board way (ie it's just straightforwardly able to do the job of a researcher). Suppose further that it's not a significant additional compute expense; let's say that OAI can immediately deploy a million instances.
What outcome would you expect? Let's operationalize that as: what do you think is the chance that we get through the next decade without AI causing a billion deaths (via misuse or unwanted autonomous behaviors or multi-agent catastrophes that are clearly downstream of those million human-level AIs)?
In short, what do you think are the chances that that doesn't end disastrously?
If it were true that current-gen LLMs like Claude 3 were conscious (something I doubt but don't take any strong position on), their consciousness would be much less like a human's than like a series of Boltzmann brains, popping briefly into existence in each new forward pass, with a particular brain state already present, and then winking out afterward.
I've now made two posts about LLMs and 'general reasoning', but used a fairly handwavy definition of that term. I don't yet have a definition I feel fully good about, but my current take is something like:
What am I missing? Where does this definition fall short?
A thought: the bulk of the existential risk we face from AI is likely to be from smarter-than-human systems. At a governance level, I hear people pushing for things like:
but not
Why not? It seems like a) a particularly clear and bright line to draw[1], b) something that a huge amount of the public would likely support, and c) probably(?) easy to pass because most policymakers imagine this to be in the distant future. The biggest downside I immediately see is that it sou...
It's not that intuitively obvious how Brier scores vary with confidence and accuracy (for example: how accurate do you need to be for high-confidence answers to be a better choice than low-confidence?), so I made this chart to help visualize it:
Here's log-loss for comparison (note that log-loss can be infinite, so the color scale is capped at 4.0):
Claude-generated code and interactive versions (with a useful mouseover showing the values at each point for confidence, accuracy, and the Brier (or log-loss) score):
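The underlying calculation is simple enough to check by hand. Here's a minimal standalone sketch (separate from the interactive versions above), assuming a forecaster who answers binary questions with constant confidence c and is right a fraction a of the time:

```python
import numpy as np

def expected_brier(confidence: float, accuracy: float) -> float:
    # Right a fraction `accuracy` of the time (squared error (1 - confidence)^2),
    # wrong the rest of the time (squared error confidence^2).
    return accuracy * (1 - confidence) ** 2 + (1 - accuracy) * confidence ** 2

def expected_log_loss(confidence: float, accuracy: float) -> float:
    # Clip away from 0 and 1 so the loss stays finite (the chart instead caps
    # the color scale at 4.0).
    c = np.clip(confidence, 1e-12, 1 - 1e-12)
    return -(accuracy * np.log(c) + (1 - accuracy) * np.log(1 - c))

# Always answering 50/50 scores 0.25 regardless of accuracy. At 90% confidence
# you need better than 70% accuracy to beat that:
print(expected_brier(0.5, 0.7))   # 0.25
print(expected_brier(0.9, 0.7))   # 0.25 -- the break-even point
print(expected_brier(0.9, 0.8))   # 0.17 -- high confidence now pays off
```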
(a comment I made in another forum while discussing my recent post proposing more consistent terminology for probability ranges)
I think there's a ton of work still to be done across the sciences (and to some extent other disciplines) in figuring out how to communicate evidence and certainty and agreement. My go-to example is: when your weather app says there's a 30% chance of rain tomorrow, it's really non-obvious to most people what that means. Some things it could mean:
There's so much discussion, in safety and elsewhere, around the unpredictability of AI systems on OOD inputs. But I'm not sure what that even means in the case of language models.
With an image classifier it's straightforward. If you train it on a bunch of pictures of different dog breeds, then when you show it a picture of a cat it's not going to be able to tell you what it is. Or if you've trained a model to approximate an arbitrary function for values of x > 0, then if you give it input < 0 it won't know what to do.
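As a toy version of that second case (my own example, nothing deep going on):

```python
import numpy as np

# Fit a polynomial to a target function using training data only from x > 0,
# then query it out-of-distribution at x < 0.
rng = np.random.default_rng(0)
x_train = rng.uniform(0.1, 10.0, size=200)
y_train = np.sqrt(x_train)              # target only makes sense for x > 0

coeffs = np.polyfit(x_train, y_train, deg=5)

print(np.polyval(coeffs, 4.0))   # in-distribution: roughly 2.0
print(np.polyval(coeffs, -4.0))  # out-of-distribution: a confident-looking
                                 # number, but the training task says nothing
                                 # about what it should be
```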
But what would that even be with ...
[Epistemic status: thinking out loud]
Many of us have wondered why LLM-based agents are taking so long to be effective and in common use. One plausible reason that hadn't occurred to me until now is that no one's been able to make them robust against prompt injection attacks. Reading an article ('Agent hijacking: The true impact of prompt injection attacks') today reminded me of just how hard it is to defend against that for an agent out in the wild.
Counterevidence: based on my experiences in startup-land and the industry's track record with Internet of Thi...
GPT-o1's extended, more coherent chain of thought -- see Ethan Mollick's crossword puzzle test for a particularly long chain of goal-directed reasoning[1] -- seems like a relatively likely place to see the emergence of simple instrumental reasoning in the wild. I wouldn't go so far as to say I expect it (I haven't even played with o1-preview yet), but it seems significantly more likely here than with previous LLMs.
Frustratingly, for whatever reason OpenAI has chosen not to let users see the actual chain of thought, only a model-generated summary of it. We...
In some ways it doesn't make a lot of sense to think about an LLM as being or not being a general reasoner. It's fundamentally producing a distribution over outputs, and some of those outputs will correspond to correct reasoning and some of them won't. They're both always present (though sometimes a correct or incorrect response will be by far the most likely).
A recent tweet from Subbarao Kambhampati looked at whether an LLM can do simple planning about how to stack blocks, specifically: 'I have a block named C on top of a block named A. A is on tabl...
This is largely why I buy the scaling thesis; the only real crux is whether @Bogdan Ionut Cirstea or @jacob_cannell is right about timelines.
I do believe some algorithmic improvements matter, but I don't think they will be nearly as much of a blocker as raw compute, and my pessimistic estimate is that the critical algorithms could be discovered in 24-36 months, assuming we don't have them.
@jacob_cannell's timeline and model is here:
@Bogdan Ionut Cirstea's timeline and models are here:
https://x.com/BogdanIonutCir2/status/1827707367154209044
Before AI gets too deeply integrated into the economy, it would be well to consider under what circumstances we would consider AI systems sentient and worthy of consideration as moral patients. That's hardly an original thought, but what I wonder is whether there would be any set of objective criteria that would be sufficient for society to consider AI systems sentient. If so, it might be a really good idea to work toward those being broadly recognized and agreed to, before economic incentives in the other direction are too strong. Then there could be futu...
Something I'm grappling with:
From a recent interview between Bill Gates & Sam Altman:
Gates: "We know the numbers [in a NN], we can watch it multiply, but the idea of where is Shakespearean encoded? Do you think we’ll gain an understanding of the representation?"
Altman: "A hundred percent…There has been some very good work on interpretability, and I think there will be more over time…The little bits we do understand have, as you’d expect, been very helpful in improving these things. We’re all motivated to really understand them…"
To the extent that a par...