DeepMind says boo SAEs, now Anthropic says yay SAEs![1]
Reading this paper pushed me a fair amount in the yay direction. We may still be at the unsatisfying level where we can only say "this cluster of features seems to roughly correlate with this type of thing" and "the interaction between this cluster and this cluster seems to mostly explain this loose group of behaviors". But it looks like we're actually pointing at real things in the model. And therefore we are beginning to be able to decompose the computation of LLMs in meaningful ways. The Addition Case Study is seriously cool and feels like a true insight into the model's internal algorithms.
Maybe we will decompose these explanations further until we can get down to satisfying low-level descriptions like "this mathematical object is computed by this function and is used in this algorithm". Even if circuits remained interpretable at that level of abstraction, humans probably couldn't hold all the relevant parts of a single forward pass in their heads at once. But AIs could, or maybe that won't be required for useful applications.
The prominent error terms and simplifying assumptions are worrying, but maybe throwing enough compute and hill-climbing research at the problem will eventually shrink them to acceptable sizes. It's notable that this paper contains very few novel conceptual ideas and is mostly just a triumph of engineering schlep, massive compute and painstaking manual analysis.
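For concreteness, here is a minimal toy sketch of what a sparse autoencoder does, not the paper's actual architecture or training setup; `d_model`, `d_features`, and the L1 coefficient are placeholder assumptions. The point is that whatever the SAE fails to reconstruct shows up as an explicit error term that any downstream circuit analysis has to carry around.

```python
# Toy sparse autoencoder (SAE) sketch -- illustrative only, not the
# architecture or training setup from the paper. `d_model`, `d_features`,
# and `l1_coeff` are placeholder assumptions.
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    def __init__(self, d_model: int, d_features: int):
        super().__init__()
        self.encoder = nn.Linear(d_model, d_features)
        self.decoder = nn.Linear(d_features, d_model)

    def forward(self, activations: torch.Tensor):
        # Sparse feature activations: the hopefully-interpretable directions.
        features = torch.relu(self.encoder(activations))
        reconstruction = self.decoder(features)
        # Whatever the SAE can't reconstruct becomes an error term that the
        # circuit-level explanation has to carry around unexplained.
        error = activations - reconstruction
        return features, reconstruction, error

def sae_loss(activations, reconstruction, features, l1_coeff=1e-3):
    # Reconstruction error plus an L1 penalty that pushes features to be sparse.
    mse = (activations - reconstruction).pow(2).mean()
    sparsity = features.abs().sum(dim=-1).mean()
    return mse + l1_coeff * sparsity
```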
This is obviously a straw man of both sides. They seem to be thinking about it from pretty different perspectives. DeepMind is roughly judging them by their immediate usefulness in applications, while Anthropic is looking at them as a stepping stone towards ambitious moonshot interp.
Claude 3.7's annoying personality is the first example of accidentally misaligned AI making my life worse. Claude 3.5/3.6 was renowned for its superior personality, which made it more pleasant to interact with than ChatGPT.
3.7 has an annoying tendency to do what it thinks you should do, rather than following instructions. I've run into this frequently in two coding scenarios:
I call this misalignment, rather than a capabilities failure, because it seems a step back from previous models and I suspect it is a side effect of training the model to be good at autonomous coding tasks, which may be overriding its compliance with instructions.
This means that the Jesus Christ market is quite interesting! You could make it even more interesting by replacing it with "This Market Will Resolve No At The End Of 2025": then it would purely be a market on how much Polymarket traders value money now versus money at the end of the year.
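To make the "money at the end of the year" point concrete, here is a toy calculation with made-up numbers rather than actual Polymarket prices: on a market that is certain to resolve No, the price of No is just an implied interest rate for parking money there until resolution (ignoring fees and the option to sell early).

```python
# Toy calculation with made-up numbers (not actual Polymarket prices):
# buying No at `no_price` pays out $1.00 at resolution, so the price is
# effectively an implied interest rate for locking money up until then.
def implied_annualized_return(no_price: float, days_to_resolution: int) -> float:
    total_return = 1.0 / no_price  # e.g. buy at $0.97, redeem at $1.00
    return total_return ** (365 / days_to_resolution) - 1

# Example: buying No at 97c with roughly nine months to go.
print(f"{implied_annualized_return(0.97, 270):.1%}")  # ~4.2% annualized
```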
It's unclear how this market would resolve. I think you meant something more like a market on "2+2=5"?
Then it will often confabulate a reason why the correct thing it said was actually wrong. So you can never really trust it; you have to think about what makes sense and test your model against reality.
But to some extent that's true for any source of information. LLMs are correct about a lot of things and you can usually guess which things they're likely to get wrong.
LLM hallucination is good epistemic training. When I code, I'm constantly asking Claude how things work and what things are possible. It often gets things wrong, but it's still helpful. You just have to use it to help you build up a gears-level model of the system you're working with. Then, when it confabulates some explanation, you can say "wait, what?? that makes no sense" and it will say "You're right to question these points - I wasn't fully accurate" and give you better information.
I don't understand this point or how it explains captains' willingness to fight.