If there are ‘subshards’ which achieve this desirable behavior because they, from their own perspective, ‘intrinsically’ desire power (whatever sense that sort of distinction makes when you’ve broken things down that far), and it is these subshards which implement the instrumental drive... so what? After all, there has to be some level of analysis at which an agent stops thinking about whether or not it should do some thing and just starts doing the thing. Your muscles “intrinsically desire” to fire when told to fire, but the motor actions are still ultimately instrumental, to accomplish something other than individual muscles twitching. You can’t have ‘instrumental desire’ homunculuses all the way down to the individual transistor or ReLU neuron.
I sent this paragraph to TurnTrout as I was curious to get his reaction. Paraphrasing his response below:
No, that's not the point. That's actually the opposite of what I'm trying to say. The subshards implement the algorithmic pieces and the broader agent has an "intrinsic desire" for power. The subshards themselves are not agentic, and that's why (in context) I substitute them in for "circuits".
It's explained in this post that I linked to. Though I guess in context I do say "prioritize" in a way that might be confusing. Shard Theory argues against homunculist accounts of cognition by considering the mechanistic effects of reinforcement processes. Also, the subshards are not implementing an instrumental drive in the sense of "implementing the power-seeking behavior demanded by some broader consequentialist plan"; they're just seeking power, just 'cuz.
From my early post: Inner and Outer Alignment Decompose One Hard Problem Into Two Extremely Hard Problems
I literally do not understand what the internal cognition is supposed to look like for an inner-aligned agent. Most of what I’ve read has been vague, on the level of “an inner-aligned agent cares about optimizing the outer objective.”
Charles Foster comments:
- "We are attempting to mechanistically explain how an agent makes decisions. One proposed reduction is that inside the agent, there is an even smaller inner agent that interacts with a non-agential evaluative submodule to make decisions for the outer agent. But that raises the immediate questions of “How does the inner agent make its decisions about how to interact with the evaluative submodule?” and then “At some point, there’s gotta be some non-agential causal structure that is responsible for actually implementing decision-making, right?” and then “Can we just explain the original agent’s behavior in those terms? What is positing an externalized evaluative submodule buying us?"
Perhaps my emphasis on mechanistic reasoning, and my unusual level of precision in speculating about AI internals, make people realize how complicated realistic cognition is in the shard picture. Perhaps people realize how much might have to go right, how many algorithmic details may need to be etched into a network so that it does what we want and generalizes well.
But perhaps people don’t realize that a network which is inner-aligned on an objective will also require a precise and conforming internal structure, and they don’t realize this because no one has written detailed plausible stabs at inner-aligned cognition.
I would recommend reading the original reddit post that motivated it: https://www.reddit.com/r/biology/comments/16y81ct/the_case_for_whales_actually_matching_or_even/.
It is meant seriously, but the author is rightly acknowledging how far-fetched it sounds.
[00:31:25] Timothy:... This is going to be like, they didn't talk about any content, like there's no specific evidence,
[00:31:48] Elizabeth: I wrote down my evidence ahead of time.
[00:31:49] Timothy: Yeah, you already wrote down your evidence
I feel pretty uncertain about the extent to which I agree with your views on EA. But this podcast didn't really help me decide, because there wasn't much discussion of specific evidence. Where is it all written down? I'm aware of your post on vegan advocacy, but it's unclear to me whether there are many more examples. I also heard a similar line of despair about EA epistemics from other long-time rationalists when hanging around Lighthaven this summer. But basically no one brought up specific examples.
It seems difficult to characterize the EA movement as a monolith in the way you're trying to do. The case of vegan advocacy is mostly irrelevant to my experience of EA. I have little contact with vegan advocates and most of the people I hang around in EA circles seem to have quite good epistemics.
However, I can relate to your other example, because I'm one of the "baby EAs" who was vegetarian and was in the Lightcone offices in summer 2022. But my experience provides something of a counterexample. In fact, I became vegetarian before encountering EA and mostly found out about the potential nutritional problems from other EAs. When you wrote your post, I got myself tested for iron deficiency and started taking supplements (although not for iron deficiency). I eventually stopped being vegetarian, instead offsetting my impact with donations to animal charities, even though this isn't very popular in EA circles.
My model is that people exist on a spectrum from weirdness to normie-ness. The weird people are often willing to pay social costs to be more truthful, while the more normie people will refrain from saying and thinking the difficult truths. But most people are mostly fixed at a certain point on the spectrum. The truth-seeking weirdos probably made up a larger proportion of the early EA movement, but I'd guess in absolute terms the number of those sorts of people hanging around EA spaces has not declined, and their epistemics have not degraded - there just aren't very many of them in the world. But these days there is a greater number of the more normie people in EA circles too.
And yes, it dilutes the density of good epistemics in EA. But that doesn't seem like a reason to abandon the movement. It is a sign that more people are being influenced by good ideas, and that creates opportunities for the movement to do bigger things.
When you want to have interesting discussions with epistemic peers, you can still find your own circles within the movement to spend time with, and you can still come to the (relative) haven of LessWrong. If LessWrong culture also faced a similar decline in epistemic standards I would be much more concerned, but it has always felt like EA is the applied, consumer-facing product of the rationalist movement, one that targets real-world impact over absolute truth-seeking. For example, I think most EAs (and also some rationalists) are hopelessly confused about moral philosophy, but I'm still happy there are more people trying to live by utilitarian principles, who might otherwise not be trying to maximize value at all.
Respect for doing this.
I strongly wish you would not tie StopAI to the claim that extinction is >99% likely. It means that even your natural supporters in PauseAI will have to say "yes I broadly agree with them but disagree with their claims about extinction being certain."
I would also echo the feedback here. There's no reason to write in the same style as cranks.
Question → CoT → Answer
So to be clear: testing whether this causal relationship holds is actually important; it's just that we need to do it on questions where the CoT is required for the model to answer the question?
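For concreteness, here's a rough sketch of the kind of test I mean (the `ask` callable and the truncation intervention are just illustrative placeholders, not a claim about how any particular experiment was run):

```python
# Sketch only: a toy check of whether the CoT is causally load-bearing for the answer.
# `ask` is a hypothetical callable (prompt -> answer string); truncating the CoT is
# just one possible way to intervene on it.

def cot_is_causally_needed(ask, question, correct_answer, cot):
    # 1. Restrict to questions the model fails without any CoT.
    no_cot = ask(f"{question}\nAnswer immediately, with no reasoning:")
    if no_cot == correct_answer:
        return None  # CoT isn't required here, so this question can't test the claim

    # 2. With its own CoT, the model should get the answer right.
    with_cot = ask(f"{question}\n{cot}\nAnswer:")

    # 3. Intervene on the CoT (here, cut it in half) and see if the answer degrades.
    words = cot.split()
    truncated = " ".join(words[: len(words) // 2])
    with_truncated = ask(f"{question}\n{truncated}\nAnswer:")

    return with_cot == correct_answer and with_truncated != correct_answer
```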
Optimize the steering vector to minimize some loss function.
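A minimal sketch of what that could look like (everything here is a toy stand-in: a random frozen unembedding and a fixed activation instead of a real model, and an arbitrary target-token loss):

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
d_model, vocab = 64, 100

# Toy stand-ins for a frozen model: an unembedding matrix and a fixed
# residual-stream activation at the site where the steering vector is added.
W_U = torch.randn(d_model, vocab)
resid = torch.randn(d_model)
target = torch.tensor([7])  # hypothetical target token id

steer = torch.zeros(d_model, requires_grad=True)
opt = torch.optim.Adam([steer], lr=1e-1)

for step in range(200):
    logits = (resid + steer) @ W_U                 # add the vector, then unembed
    loss = F.cross_entropy(logits.unsqueeze(0), target)
    loss = loss + 1e-2 * steer.norm()              # keep the vector small
    opt.zero_grad()
    loss.backward()
    opt.step()
```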
Crossposted from https://x.com/JosephMiller_/status/1839085556245950552
1/ Sparse autoencoders trained on the embedding weights of a language model have very interpretable features! We can decompose a token into its top activating features to understand how the model represents the meaning of the token.🧵
2/ To visualize each feature, we project the output direction of the feature onto the token embeddings to find the most similar tokens. We also show the bottom and median tokens by similarity, but they are not very interpretable.
3/ The token "deaf" decomposes into features for audio and disability! None of the examples in this thread are cherry-picked – they were all (really) randomly chosen.
4/ Usually SAEs are trained on the internal activations of a component for billions of different input strings. But here we just train on the rows of the embedding weight matrix (where each row is the embedding for one token).
5/ Most SAEs have many thousands of features. But for our embedding SAE, we only use 2000 features because of our limited dataset. We are essentially compressing the embedding matrix into a smaller, sparser representation.
6/ The reconstructions are not highly accurate – on average we have ~60% variance unexplained (~0.7 cosine similarity) with ~6 features active per token. So more work is needed to see how useful they are.
7/ Note that for this experiment we used the subset of the token embeddings that correspond to English words, so the task is easier - but the results are qualitatively similar when you train on all embeddings.
8/ We also compare to PCA directions and find that the SAE directions are in fact much more interpretable (as we would expect)!
9/ I worked on embedding SAEs at an @apartresearch hackathon in April, with Sajjan Sivia and Chenxing (June) He.
Embedding SAEs were also invented independently by @Michael Pearce.
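For anyone who wants to play with the idea, here's a minimal sketch of the kind of setup the thread describes; the embedding matrix here is random and the sizes/hyperparameters are placeholders, so it only illustrates the shape of the method, not the actual experiment:

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
vocab, d_model, n_feats = 5000, 256, 2000   # sizes in the same spirit as the thread

W_E = torch.randn(vocab, d_model)           # stand-in for the model's token embedding matrix

# SAE with untied encoder/decoder, trained directly on rows of the embedding matrix.
enc = torch.nn.Linear(d_model, n_feats)
dec = torch.nn.Linear(n_feats, d_model)
opt = torch.optim.Adam(list(enc.parameters()) + list(dec.parameters()), lr=1e-3)

for step in range(2000):
    batch = W_E[torch.randint(0, vocab, (256,))]      # each datapoint is one token's embedding
    acts = F.relu(enc(batch))
    recon = dec(acts)
    mse = F.mse_loss(recon, batch)
    l1 = acts.abs().sum(dim=-1).mean()                # sparsity penalty
    loss = mse + 1e-3 * l1
    opt.zero_grad()
    loss.backward()
    opt.step()

# "Visualize" a feature: project its decoder direction onto the embeddings
# and list the most similar tokens (as in step 2/ of the thread).
feature_id = 0
direction = dec.weight[:, feature_id]                 # (d_model,) output direction
sims = F.cosine_similarity(W_E, direction.unsqueeze(0), dim=-1)
top_tokens = sims.topk(10).indices                    # indices into the (toy) vocab

# Rough quality check in the spirit of step 6/: fraction of variance unexplained.
with torch.no_grad():
    recon_all = dec(F.relu(enc(W_E)))
    fvu = ((W_E - recon_all) ** 2).sum() / ((W_E - W_E.mean(0)) ** 2).sum()
```

The PCA comparison in step 8/ can be done the same way: rank tokens by similarity to the top principal components of the embedding matrix instead of to SAE decoder directions.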
Nice post. I think this is a really interesting discovery.
[Copying from messages with Joseph Bloom]
TLDR: I'm confused about what is different about the SAE input that causes the absorbed feature not to fire.
Me:
Summary of your findings
- Say you have a “starts with s” feature and a “snake” feature.
- You find that for most words, “starts with s” correctly categorizes words that start with s. But for a few words that start with s, like snake, it doesn’t fire.
- These exceptional tokens where it doesn’t fire all have another feature that corresponds very closely to the token. For example, there is a “snake” feature that corresponds strongly to the snake token.
- You say that the “snake” feature has absorbed the “starts with s” feature because the concept of snake also contains/entails the concept of ‘starts with s’.
- Most of the features that absorb other features correspond to common words, like “and”.
So why is this happening? Well it makes sense that the model can do better on L1 on the snake token by just firing a single “snake” feature (rather than the “starts with s” feature and, say, the “reptile” feature). And it makes sense it would only have enough space to have these specific token features for common tokens.
Joseph Bloom:
rather than the “starts with s” feature and, say, the “reptile” feature
We found cases of seemingly more general features getting absorbed in the context of spelling, but they are more rare / probably the exception. It's worth noting that we suspect feature absorption is just easiest to find for token-aligned features, but conceptually it could occur any time a similar structure exists between features.
And it makes sense it would only have enough space to have these specific token features for common tokens.
I think this needs further investigation. We certainly sometimes see rarer tokens which get absorbed (eg: a rare token is a translated word of a common token). I predict there is a strong density effect but it could be non-trivial.
Me:
We found cases of seemingly more general features getting absorbed in the context of spelling
What’s an example?
We certainly sometimes see rarer tokens which get absorbed (eg: a rare token is a translated word of a common token)
You mean like the “starts with s” feature could be absorbed into the “snake” feature on the French word for snake?
Does this only happen if the French word also starts with s?
Joseph Bloom:
What’s an example?
- Latent aligned for a few words at once. Eg: "assistance" but fires weakly on "help". We saw it absorb both "a" and "h"!
- Latent from multi-token words that also fires on words that share a common prefix (https://feature-absorption.streamlit.app/?layer=16&sae_width=65000&sae_l0=128&letter=c see latent 26348).
- Latent that fires on a token + weakly on subsequent tokens https://feature-absorption.streamlit.app/?layer=16&sae_width=65000&sae_l0=128&letter=c
You mean like the “starts with s” feature could be absorbed into the “snake” feature on the French word for snake?
Yes
Does this only happen if the French word also starts with s?
More likely. I think the process is stochastic so it's all distributions.
Me:
But here’s what I’m confused about. How does the “starts with s” feature ‘know’ not to fire? How is it able to fire on all words that start with s, except those tokens (like “snake”) that have a strongly correlated feature? I would assume that the token embeddings of the model contain some “starts with s” direction. And the “starts with s” feature input weights read off this direction. So why wouldn’t it also activate on “snake”? Surely that token embedding also has the “starts with s” direction?
Joseph Bloom:
I would assume that the token embeddings of the model contain some “starts with s” direction. And the “starts with s” feature input weights read off this direction. So why wouldn’t it also activate on “snake”? Surely that token embedding also has the “starts with s” direction?
I think the success of the linear probe is why we think the snake token does have the starts with s direction. The linear probe has much better recall and doesn't struggle with obvious examples. I think the feature absorption work is not about how models really work, it's about how SAEs obscure how models work.
But here’s what I’m confused about. How does the “starts with s” feature ‘know’ not to fire? Like what is the mechanism by which it fires on all words that start with s, except those tokens (like “snake”) that have a strongly correlated feature?
Short answer, I don't know. Long answer - some hypotheses:
- Linear probes can easily do calculations of the form "A AND B". In large vector spaces, it may be possible to learn a direction of the form "(^S.*) AND not (snake) AND not (sun) ...". Note that "snake" has a component separate from "starts with s", so this is possible. To the extent this may be hard, that's possibly why we don't see more absorption, but my own intuition says that in large vector spaces this should be perfectly possible to do.
- Encoder weights and decoder weights aren't tied. If they were, you can imagine that choosing these exceptions for absorbed examples would damage reconstruction performance. Since we don't tie the weights, the model can detect "(^S.*) AND not (snake) AND not (sun) ..." but write "(^S.*)". I'm interested to explore this further and am sad we didn't get to this in the project.
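To make that second hypothesis concrete, here's a toy numerical sketch (all directions, weights, and biases are made up for illustration): with untied weights the encoder can read "starts with s AND NOT snake-ish" while the decoder still writes the clean "starts with s" direction, so the latent simply never fires on the absorbed token.

```python
import torch
import torch.nn.functional as F

d = 32
# Toy orthogonal unit directions: a general "starts with s" direction and a
# snake-specific direction (the component of "snake" separate from "starts with s").
starts_with_s = torch.zeros(d); starts_with_s[0] = 1.0
snake_specific = torch.zeros(d); snake_specific[1] = 1.0

# Toy token embeddings: both contain the "starts with s" direction.
sun_emb = starts_with_s.clone()
snake_emb = starts_with_s + snake_specific

# Untied SAE latent for "starts with s": the encoder also reads *negatively*
# off the snake-specific direction, so the latent stays off on "snake" even
# though the s-direction is present in its embedding.
w_enc = starts_with_s - 2.0 * snake_specific
w_dec = starts_with_s
bias = -0.5

act_sun = F.relu(w_enc @ sun_emb + bias)      # 0.5 -> fires on an ordinary s-word
act_snake = F.relu(w_enc @ snake_emb + bias)  # 0.0 -> suppressed on the absorbed token

# The decoder still writes the clean "starts with s" direction whenever the latent fires.
recon_sun = act_sun * w_dec
print(act_sun.item(), act_snake.item())
```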
This is fantastic technical writing. It would have taken me hours to understand these papers this deeply, but you convey the core insights quickly in an entertaining and understandable way.