The unemployment pool that resulted from this efficiency wage made it easier to discipline officers by moving them back to the captains list.
I don't understand this point or how it explains captains' willingness to fight.
DeepMind says boo SAEs, now Anthropic says yay SAEs![1]
Reading this paper pushed me a fair amount in the yay direction. We may still be at the unsatisfying level where we can only say "this cluster of features seems to roughly correlate with this type of thing" and "the interaction between this cluster and this cluster seems to mostly explain this loose group of behaviors". But it looks like we're actually pointing at real things in the model. And therefore we are beginning to be able to decompose the computation of LLMs in meaningful ways. The Addition Case Study is seriously cool and feels like a true insight into the model's internal algorithms.
Maybe we will further decompose these explanations until we can get down to satisfying low-level descriptions like "this mathematical object is computed by this function and is used in this algorithm". Even if we could interpret circuits at that level of abstraction, humans probably couldn't hold all the relevant parts of a single forward pass in their heads at once. But AIs could, or maybe that won't be required for useful applications.
The prominent error terms and simplifying assumptions are worrying, but maybe throwing enough compute and hill-climbing research at the problem will eventually shrink them to acceptable sizes. It's notable that this paper contains very few novel conceptual ideas and is mostly just a triumph of engineering schlep, massive compute and painstaking manual analysis.
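For concreteness, here is a minimal sketch of the kind of decomposition an SAE promises, with the error term made explicit. Everything here (dimensions, random weights, variable names) is made up for illustration and is not Anthropic's actual setup; a real SAE is trained with a sparsity penalty, which is what makes the features sparse and hopefully interpretable.

```python
# Toy sketch of an SAE decomposition: activation ≈ W_dec @ features + error.
# Weights are random and untrained, so the features won't actually be sparse
# or interpretable; this just shows where the error term lives.
import numpy as np

rng = np.random.default_rng(0)

d_model, n_features = 64, 512                     # residual-stream width, dictionary size
W_enc = rng.normal(0, 0.1, (n_features, d_model))
b_enc = np.zeros(n_features)
W_dec = rng.normal(0, 0.1, (d_model, n_features))
b_dec = np.zeros(d_model)

def sae_decompose(activation):
    """Return (features, reconstruction, error) for one activation vector."""
    features = np.maximum(0.0, W_enc @ activation + b_enc)  # ReLU
    reconstruction = W_dec @ features + b_dec
    error = activation - reconstruction                     # the part the features don't explain
    return features, reconstruction, error

activation = rng.normal(size=d_model)             # stand-in for a model activation
features, reconstruction, error = sae_decompose(activation)

# Any interpretability story told in terms of `features` silently carries this
# residual along; the question is how small it can be made.
print("relative error:", np.linalg.norm(error) / np.linalg.norm(activation))
```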
This is obviously a straw man of both sides. They seem to be thinking about SAEs from pretty different perspectives. DeepMind is roughly judging them by their immediate usefulness in applications, while Anthropic is looking at them as a stepping stone towards ambitious moonshot interp.
Claude 3.7's annoying personality is the first example of accidentally misaligned AI making my life worse. Claude 3.5/3.6 was renowned for its superior personality, which made it more pleasant to interact with than ChatGPT.
3.7 has an annoying tendency to do what it thinks you should do, rather than following instructions. I've run into this frequently in two coding scenarios:
I call this misalignment, rather than a capabilities failure, because it seems a step back from previous models and I suspect it is a side effect of training the model to be good at autonomous coding tasks, which may be overriding its compliance with instructions.
This means that the Jesus Christ market is quite interesting! You could make it even more interesting by replacing it with "This Market Will Resolve No At The End Of 2025": then it would be purely a market on how much Polymarket traders will want money later in the year.
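To make the "money later in the year" point concrete, here is a rough sketch with made-up numbers (the price and time to resolution are illustrative, not real Polymarket data): the price of a share that is guaranteed to pay out just encodes the return traders demand for locking up capital until resolution.

```python
# Made-up example: a "No" share guaranteed to pay $1.00 at the end of 2025
# trades at $0.97 today. The price only encodes the time value of money.
price_today = 0.97
payout = 1.00
months_to_resolution = 9   # hypothetical time left in the year

period_return = payout / price_today - 1
annualized_return = (1 + period_return) ** (12 / months_to_resolution) - 1

print(f"return until resolution: {period_return:.1%}")     # ≈ 3.1%
print(f"annualized:              {annualized_return:.1%}")  # ≈ 4.1%
```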
It's unclear how this market would resolve. I think you meant something more like a market on "2+2=5"?
Then it will often confabulate a reason why the correct thing it said was actually wrong. So you can never really trust it; you have to think about what makes sense and test your model against reality.
But to some extent that's true for any source of information. LLMs are correct about a lot of things and you can usually guess which things they're likely to get wrong.
LLM hallucination is good epistemic training. When I code, I'm constantly asking Claude how things work and what things are possible. It often gets things wrong, but it's still helpful. You just have to use it to help you build up a gears-level model of the system you are working with. Then, when it confabulates some explanation, you can say "wait, what?? that makes no sense" and it will say "You're right to question these points - I wasn't fully accurate" and give you better information.
Yes. I plan to write down my views properly at some point. But roughly I subscribe to non-cognitivism.
Moral questions are not well defined because they are written in ambiguous natural language, so they are not truth-apt. Now you could argue that many reasonable questions are also ambiguous in this sense. E.g. the question "how many people live in Sweden?" is ultimately ambiguous because it is not written in a formal system (i.e. the borders of Sweden are not defined down to the atomic level).
But you could in theory define the Sweden question in formal terms. You could arbitrarily define how many nanoseconds after conception a fetus becomes a person, and resolve all other ambiguities, until the only work left would be empirical measurement of a well-defined quantity.
And technically you could do the same for any moral question. But unlike the Sweden question, it would be hard to pick formal definitions that everyone can agree are reasonable. You could try to formally define the terms in "what should our values be?". Then the philosophical question becomes "what is the formal definition of 'should'?". But this suffers from the same ambiguity, so you must then define that question too. And so on in an endless recursion. It seems to me that there cannot be any One True resolution to this. At some point you just have to arbitrarily pick some definitions.
The underlying philosophy here: I think that for a question to be one on which you can make progress, it must be one where some answers can be shown to be correct and others incorrect. I.e. questions where two people who disagree in good faith will reliably converge by understanding each other's view, and where two aliens from different civilizations can reliably give the same answer without communicating. And the only questions like this seem to be those defined in formal systems.
Choosing definitions does not seem to be that kind of question. So resolving the ambiguities in moral questions is not something on which progress can be made, and we will never finally arrive at the One True answer to moral questions.