lewis smith - LessWrong

How should TurnTrout handle his DeepMind equity situation?

your example agreement with a friend is obviously a derivative, which is just a contract whose value depends on the value of an underlying asset (Google stock in this case). If it's not a formal derivative contract you might be less likely to get in trouble for it compared to doing it on Robinhood or whatever (not legal advice!) but it doesn't seem like a very good idea.

How should TurnTrout handle his DeepMind equity situation?

lewis smith1d50

like many public companies, Google has anti-insider-trading policies that prohibit employees from shorting the company stock or trading in options and other derivatives on it.

Circuits in Superposition: Compressing many small neural networks into one

lewis smith2mo10

yeah that makes sense I think

Circuits in Superposition: Compressing many small neural networks into one

lewis smith2mo10

with later small networks taking the outputs of earlier small networks as their inputs.

what's the distinction between two small networks connected in series, with the second taking the output of the first as its input, and one big network? what defines the boundaries of the networks here?

The ‘strong’ feature hypothesis could be wrong

lewis smith3mo52

I kind of agree that Dennett is right about this, but I think it's important to notice that the idea he's attacking - that all representation is explicit representation - is an old and popular one in philosophy of mind. It was, at one point, seen as natural and inevitable by many people working in the field, and I think it still seems somewhat natural and obvious to many people who maybe haven't thought about the counterarguments much (e.g. I think you can see echoes of this view in a post like this one, or in the idea that there will be some 'intelligence algorithm' which will be a relatively short python program). The idea that a thought is always or mostly something like a sentence in 'mentalese' is, I think, still an attractive one to many people of a logical sort of bent, as is the idea that formalised reasoning captures the 'core' of cognition.

The ‘strong’ feature hypothesis could be wrong

lewis smith3mo10

I guess you are thinking about holes with the p-type semiconductor?

I don't think I agree (perhaps obviously) that it's better to think about the issues in the post in terms of physics analogies than in terms of the philosophy of mind and language. If you are thinking about how a mental representation represents some linguistic concept, then Dennett and Wittgenstein (and others!) are addressing the same problem as you, in a way that virtual particles really are not.

lewis smith's Shortform

lewis smith3mo40

I think atom would be a pretty good choice as well, but I think the actual choice of terminology is less important than making the distinction. I used latent here because that's what we used in our paper.

lewis smith's Shortform

lewis smith3mo5949

‘Feature’ is overloaded terminology

In the interpretability literature, it’s common to overload ‘feature’ to mean three separate things:

1. Some distinctive, relevant attribute of the input data. For example, being in English is a feature of this text.
2. A general activity pattern across units which a network uses to represent some feature in the first sense. So we might say that a network represents the feature (1) that the text is in English with a linear feature (2) in layer 32.
3. The elements of some process for discovering features, like an SAE. An SAE learns a dictionary of activation vectors, which we hope correspond to the features (2) that the network actually uses. It is common to simply refer to the elements of the SAE dictionary as ‘features’ as well. For example, we might say something like ‘the network represents features of the input with linear features in its representation space, which are recovered well by feature 3242 in our SAE’.

This seems bad; at best it’s a bit sloppy and confusing, at worst it’s begging the question about the interpretability or usefulness of SAE features. It seems important to carefully distinguish between these senses in case they don’t coincide, and we think it’s worth making a bit of an effort to do so by giving them different names.

A terminology that we prefer is to reserve ‘feature’ for the conceptual senses of the word (1 and 2) and use alternative terminology for case 3, like ‘SAE latent’ instead of ‘SAE feature’. So we might say, for example, that a model uses a linear representation for a Golden Gate Bridge feature, which is recovered well by an SAE latent. We have tried to follow this terminology in our recent Gemma Scope report. We think that this added precision helps in thinking more clearly about feature representations and SAEs.

To illustrate what we mean with an example, we might ask whether a network has a feature for numbers, i.e. whether it has any kind of localized representation of numbers at all. We can then ask what the format of this feature representation is; for example, how many dimensions does it use, where is it located in the model, does it have any kind of geometric structure, etc. We could then separately ask whether it is discovered by a feature discovery algorithm, i.e. is there a latent in a particular SAE that describes it well? We think it’s important to recognise that these are all distinct questions, and to use a terminology that is able to distinguish between them.

We (the DeepMind language model interpretability team) started using this terminology in the Gemma Scope report, but we didn’t really justify the decision much there and I thought it was worth making the argument separately.
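
As a rough illustration of why the three senses are separate questions, here is a toy sketch. Everything in it is made up for the example: the 3-dimensional "activation space", the direction assigned to the feature, the latent names, and the use of cosine similarity as a stand-in for "recovered well" (this is an assumption, not the measure used in the report). The point is only that the feature direction (sense 2) and the SAE latents (sense 3) are different objects, and whether one matches the other is something you have to check.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Sense 1 is the attribute itself, e.g. "the text is in English".
# Sense 2: a hypothetical direction the network uses to represent it.
feature_direction = [1.0, 0.0, 0.0]

# Sense 3: a hypothetical SAE dictionary; each entry is one latent's vector.
sae_latents = {
    "latent_0": [0.9, 0.1, 0.0],
    "latent_1": [0.0, 1.0, 0.0],
    "latent_2": [0.0, 0.6, 0.8],
}

def best_matching_latent(direction, latents):
    """Return the name and score of the latent closest to the direction."""
    scores = {name: cosine(direction, vec) for name, vec in latents.items()}
    best = max(scores, key=scores.get)
    return best, scores[best]

best_name, best_score = best_matching_latent(feature_direction, sae_latents)
print(best_name, round(best_score, 3))
```

Here one latent happens to line up closely with the feature direction, but nothing guarantees that in general: a different dictionary could split the direction across several latents or miss it entirely, which is exactly the case the separate terminology lets you describe.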

The ‘strong’ feature hypothesis could be wrong

lewis smith3mo10

I'm not that confident about the statistical query dimension (I assume that's what you mean by SQ dimension?), but I don't think it's applicable; SQ dimension is about the difficulty of a task (e.g. binary parity), whereas explicit vs tacit representations are properties of an implementation, so it's kind of apples to oranges.

To take the chess example again, one way to rank moves is to explicitly compute some kind of rule or heuristic from the board state, another is to do some kind of parallel search, and yet another is to use a neural network or something similar. The first one is explicit, the second is (maybe?) more tacit, and the last is unclear. I think stronger variations of the LRH kind of assume that the neural network must be 'secretly' explicit, but I'm not really sure this is necessary.

But I don't think any of this is really affected by the SQ dimension because it's the same task in all three cases (and we could possibly come up with examples which had identical performance?)

but maybe I'm not quite understanding what you mean

The ‘strong’ feature hypothesis could be wrong

lewis smith4mo10

yeah, I think this paper is great!
