User Comment Replies

What’s up with LLMs representing XORs of arbitrary features?

If it's easy enough to run, it seems worth re-training the probes exactly the same way, except sampling both your train and test sets with replacement from the full dataset. This should avoid that issue. It has the downside of allowing some train/test leakage, but that seems pretty fine, especially if you only sample like 500 examples for train and 100 for test (from each of cities and neg_cities).

I'd strongly hope that after doing this, none of your probes would be significantly below 50%.
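
A minimal sketch of the resampling suggested above, assuming the examples can be addressed by integer index. The dataset sizes and the `bootstrap_split` helper are hypothetical; only the 500/100 sample counts and the `cities` / `neg_cities` names come from the comment.

```python
import numpy as np

def bootstrap_split(n_examples, n_train=500, n_test=100, seed=0):
    """Draw train and test indices independently, with replacement, from the full dataset."""
    rng = np.random.default_rng(seed)
    train_idx = rng.integers(0, n_examples, size=n_train)
    test_idx = rng.integers(0, n_examples, size=n_test)
    return train_idx, test_idx

# Hypothetical dataset sizes; the real sizes are not stated in the comment.
dataset_sizes = {"cities": 1500, "neg_cities": 1500}
splits = {
    name: bootstrap_split(n, seed=i)
    for i, (name, n) in enumerate(dataset_sizes.items())
}

for name, (train_idx, test_idx) in splits.items():
    # Some overlap is expected: this is the train/test leakage mentioned above.
    leaked = np.intersect1d(train_idx, test_idx).size
    print(f"{name}: {train_idx.size} train, {test_idx.size} test, {leaked} shared examples")
```

Because both index sets are drawn from the whole dataset, train and test follow the same distribution in expectation, at the cost of the small overlap that the loop reports.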

Aryan Bhatt2y35

$p \in [0, 1]$

Small nitpick, but is this meant to say $p \in (0, 1]$ instead? Because if $p = 0$, then the axiom reduces to $L \succeq M \iff N \succeq N$, which seems impossible to satisfy for all $L, M, N \in \mathcal{L}$ (for nearly all preference relations).
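
To spell out the reduction, here is the substitution written out, assuming the axiom has the usual independence form (the first line is my assumption about how the post states it):

```latex
\begin{align*}
L \succeq M &\iff pL + (1-p)N \succeq pM + (1-p)N && \text{(assumed form of the axiom)} \\
L \succeq M &\iff 0 \cdot L + 1 \cdot N \succeq 0 \cdot M + 1 \cdot N && (p = 0) \\
L \succeq M &\iff N \succeq N
\end{align*}
```

If the relation is reflexive, the right-hand side always holds, so the $p = 0$ case would force $L \succeq M$ for every pair of lotteries.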

Mech Interp Puzzle 2: Word2Vec Style Embeddings

Aryan Bhatt2y20

My rough guess for Question 2.1:

The model likely cares about number of characters because it allows it to better encode things with fixed-width fonts that contain some sort of spatial structure, such as ASCII art, plaintext tables, 2-D games like sudoku, tic-tac-toe, and chess, and maybe miscellaneous other things like some poetry, comments/strings in code^[1], or the game of life.

A priori, storing this feature categorically is probably a far more efficient encoding/representation than linearly (especially since length likely has at most 10 commo

Aryan Bhatt2y50

[Note: One idea is to label the dataset w/ the feature vector e.g. saying this text is a latex $ and this one isn't. Then learn several k-sparse probes & show the range of k values that get you whatever percentage of separation]

You've already thanked Wes, but just wanted to note that his paper may be of interest here.

How much do you believe your results?

Aryan Bhatt2y50

If you're interested, "When is Goodhart catastrophic?" characterizes some conditions on the noise and signal distributions (or rather, their tails) that are sufficient to guarantee being screwed (or in business) in the limit of many studies.

The downside is that because it doesn't make assumptions about the distributions (other than independence), it sadly can't say much about the non-limiting cases.

An exploration of GPT-2's embedding weights

Aryan Bhatt2y20

Very small typo: when you define LayerNorm, you say $y_{i} = \sum_{j} (\mathrm{Id}_{ij} - 1_{i} 1_{j}) x_{j}$ when I think you mean $y_{i} = \sum_{j} (\mathrm{Id}_{ij} - \frac{1_{i} 1_{j}}{768}) x_{j}$? Please feel free to ignore if this is wrong!!!
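
A quick numerical check of the two expressions, as a sketch: with the $1/768$ factor the matrix subtracts the mean of $x$, and without it the matrix subtracts the sum. The input vector here is random and purely illustrative.

```python
import numpy as np

n = 768
rng = np.random.default_rng(0)
x = rng.normal(loc=3.0, size=n)  # illustrative input with a nonzero mean

ones = np.ones(n)
with_factor = np.eye(n) - np.outer(ones, ones) / n  # Id - (1 1^T) / n
without_factor = np.eye(n) - np.outer(ones, ones)   # Id - 1 1^T

y = with_factor @ x
z = without_factor @ x

print("with 1/n:    equals x - mean(x)?", np.allclose(y, x - x.mean()))
print("without 1/n: equals x - sum(x)? ", np.allclose(z, x - x.sum()))
```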

We Found An Neuron in GPT-2

Aryan Bhatt2y10

I do agree that looking at $W_{O}$ alone seems a bit misguided (unless we're normalizing by looking at cosine similarity instead of dot product). However, the extent to which this is true is a bit unclear. Here are a few considerations:

At first blush, the thing you said is exactly right; scaling $W_{in}$ up and scaling $W_{O}$ down will leave the implemented function unchanged.
However, this'll affect the L2 regularization penalty. All else equal, we'd expect to see $\| W_{in} \| = \| W_{O} \|$, since that minimizes the regularization penalty.
However, this
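
The first two considerations can be checked on a toy layer. This is a sketch under assumptions that are mine: a ReLU MLP (rescaling by a positive constant is only exactly function-preserving for positively homogeneous activations, which GELU is not), random weights, and whole-matrix rescaling. `W_out` plays the role of $W_{O}$.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, d_mlp = 8, 32
W_in = 3.0 * rng.normal(size=(d_mlp, d_model))  # deliberately unbalanced norms
W_out = rng.normal(size=(d_model, d_mlp))
x = rng.normal(size=d_model)

def mlp(W_in, W_out, x):
    """Toy ReLU MLP: W_out @ relu(W_in @ x)."""
    return W_out @ np.maximum(W_in @ x, 0.0)

def l2_penalty(W_in, W_out):
    """Sum of squared weights, i.e. the L2 penalty up to a constant factor."""
    return np.sum(W_in**2) + np.sum(W_out**2)

# Scaling W_in by a > 0 and W_out by 1/a leaves the ReLU MLP unchanged.
a = 2.0
print("function unchanged:", np.allclose(mlp(a * W_in, W_out / a, x), mlp(W_in, W_out, x)))

# The penalty a^2 |W_in|^2 + |W_out|^2 / a^2 is minimized at a^2 = |W_out| / |W_in|,
# which is exactly the scale at which the two norms are equal.
a_star = np.sqrt(np.linalg.norm(W_out) / np.linalg.norm(W_in))
print("norms at optimum:", np.linalg.norm(a_star * W_in), np.linalg.norm(W_out / a_star))
```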

Aryan Bhatt2y70

We are surprised by the decrease in Residual Stream norm in some of the EleutherAI models.
...
According to the model card, the Pythia models have "exactly the same" architectures as their OPT counterparts

I could very well be completely wrong here, but I suspect this could primarily be an artifact of different unembeddings.

It seemed to me from the model card that although the Pythia models have "exactly the same" architecture, they only have the same number of non-embedding parameters. The Pythia models all have more total parameters than their counter... (read more)

TurnTrout's shortform feed

Aryan Bhatt2yΩ230

Hmmm, I suspect that when most people say things like "the reward function should be a human-aligned objective," they're intending something more like "the reward function is one for which any reasonable learning process, given enough time/data, would converge to an agent that ends up with human-aligned objectives," or perhaps the far weaker claim that "the reward function is one for which there exists a reasonable learning process that, given enough time/data, will converge to an agent that ends up with human-aligned objectives."

3TurnTrout2y

Maybe! I think this is how Evan explicitly defined it for a time, a few years ago. I think the strong claim isn't very plausible, and the latter claim is... misdirecting of attention, and maybe too weak. Re: attention, I think that "does the agent end up aligned?" gets explained by the dataset more than by the reward function over e.g. hypothetical sentences. I think "reward/reinforcement numbers" and "data points" are inextricably wedded. I think trying to reason about reward functions in isolation is... a caution sign? A warning sign?

Maze-solving agents: Add a top-right vector, make the agent go to the top-right

Aryan Bhatt2y100

I guess that I'm imagining that the {presence of a representation of a path}, to the extent that it's represented in the model at all, is used primarily to compute some sort of "top-right affinity" heuristic. So even if it is true that, when there's no representation of a path, subtracting the {representation of a path}-vector should do nothing, I think that subtracting the "top-right affinity" vector that's downstream of this path representation should still do something regardless of whether there is or isn't currently a path representation.

So I gu... (read more)

Maze-solving agents: Add a top-right vector, make the agent go to the top-right

Aryan Bhatt2y40

And the top-right vector also transfers across mazes? Why isn't it maze-specific?

This makes a lot of sense if the top-right vector is being used to do something like "choose between circuits" or "decide how to weight various heuristics" instead of (or in addition to) actually computing any heuristic itself. There is an interesting question of how capable the model architecture is of doing things like that, which maybe warrants thinking about.^[1]

This could be either the type of thinking that looks like "try to find examples of this in the model by int... (read more)

Maze-solving agents: Add a top-right vector, make the agent go to the top-right

Aryan Bhatt2y30

The patches compose!

In the framework of the comment above regarding the add/subtract thing, I'd also be interested in examining the function diff(s,t) = f(input+t*top_right_vec+s*cheese_vec) - f(input).

The composition claim here is saying something like diff(s,t) = diff(s,0) + diff(0,t). I'd be interested to see when this is true. It seems like your current claim is that this (approximately) holds when s<0 and t>0 and neither is too large, but maybe it holds in more or fewer scenarios. In particular, I'm surprised at the weird hard boundaries at s=0 and t=0.
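
A sketch of how one might measure this, assuming `f` maps an activation vector to some output vector. Everything below is a hypothetical stand-in (random vectors, toy `linear_f` and `relu_f`), not the maze network or its actual vectors.

```python
import numpy as np

def diff(f, x, top_right_vec, cheese_vec, s, t):
    """diff(s, t) = f(input + t * top_right_vec + s * cheese_vec) - f(input)."""
    return f(x + t * top_right_vec + s * cheese_vec) - f(x)

def additivity_gap(f, x, top_right_vec, cheese_vec, s, t):
    """Norm of diff(s, t) - (diff(s, 0) + diff(0, t)); zero when the patches compose exactly."""
    joint = diff(f, x, top_right_vec, cheese_vec, s, t)
    separate = (diff(f, x, top_right_vec, cheese_vec, s, 0.0)
                + diff(f, x, top_right_vec, cheese_vec, 0.0, t))
    return float(np.linalg.norm(joint - separate))

rng = np.random.default_rng(0)
dim = 16
A = rng.normal(size=(4, dim))
x, top_right_vec, cheese_vec = rng.normal(size=(3, dim))  # stand-in vectors

def linear_f(v):
    return A @ v  # composes exactly for every s and t

def relu_f(v):
    return A @ np.maximum(v, 0.0)  # may fail to compose once units switch on or off

for s in (-1.0, 0.5):
    for t in (0.5, 2.0):
        gap_lin = additivity_gap(linear_f, x, top_right_vec, cheese_vec, s, t)
        gap_relu = additivity_gap(relu_f, x, top_right_vec, cheese_vec, s, t)
        print(f"s={s:+.1f} t={t:+.1f}  linear gap={gap_lin:.2e}  relu gap={gap_relu:.2e}")
```

Sweeping `s` and `t` over a grid and plotting the gap would show where the composition claim holds, including whether anything special happens at the s=0 and t=0 boundaries.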

Maze-solving agents: Add a top-right vector, make the agent go to the top-right

Aryan Bhatt2yΩ230

I wish I knew why.

Same.

I don't really have any coherent hypotheses (not that I've tried for any fixed amount of time by the clock) for why this might be the case. I do, however, have a couple of vague suggestions for how one might go about gaining slightly more information that might lead to a hypothesis, if you're interested.

The main one involves looking at the local nonlinearities of the few layers after the intervention layer at various inputs, by which I mean examining diff(t) = f(input+t*top_right_vec) - f(input) as a function of t (for small values o... (read more)

3TurnTrout2y

The current layer was chosen because I looked at all the layers for the cheese vector, and the current layer is the only one (IIRC) which produced interesting/good results. I think the cheese vector doesn't really work at other layers, but haven't checked recently.


Some common confusion about induction heads

Aryan Bhatt2y20

Copying: how much the head output increases the logit of [A] compared to the other logits.

Please correct me if I'm wrong, but I believe you mean [B] here instead of [A]?

1Alexandre Variengien2y

You're right, thanks for spotting it! It's fixed now.

Re-Examining LayerNorm

Aryan Bhatt2y20

From the "Conclusion and Future Directions" section of the colab notebook:

Most of all, we cannot handwave away LayerNorm as "just doing normalization"; this would be analogous to describing ReLU as "just making things nonnegative".

I don't think we know too much about what exactly LayerNorm is doing in full-scale models, but at least in smaller models, I believe we've found evidence of transformers using LayerNorm to do nontrivial computations^[1].

^[1] I think I vaguely recall something about this in either Neel Nanda's "Rederiving Positional Encodings" stuff, o

... (read more)

Re-Examining LayerNorm

Aryan Bhatt2y21

Sorry for the mundane comment, but in the "Isolating the Nonlinearity" section of the colab notebook, you say

Note that a vector in $n$ dimensions with mean 0 has variance 1 if and only if it has length $\frac{1}{\sqrt{n}}$

I think you might've meant to say $\sqrt{n}$ there instead of $\frac{1}{\sqrt{n}}$, but please do correct me if I'm wrong!!!
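
The one-line derivation behind this, assuming variance is computed with a $1/n$ normalization (as LayerNorm does):

```latex
\begin{align*}
\operatorname{Var}(x) &= \frac{1}{n} \sum_{i=1}^{n} (x_i - \bar{x})^2
  = \frac{1}{n} \sum_{i=1}^{n} x_i^2
  = \frac{\lVert x \rVert^2}{n} && \text{(using } \bar{x} = 0\text{)} \\
\operatorname{Var}(x) = 1 &\iff \lVert x \rVert^2 = n \iff \lVert x \rVert = \sqrt{n}
\end{align*}
```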

Brief Notes on Transformers

Aryan Bhatt2yΩ020

$W_{Q}^{T} W_{K} / d_{k}$

Sorry for the pedantic comment, but I think you might've meant to have $\sqrt{d_{k}}$ in the denominator here.
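
For context, a small sketch of the usual argument for the square root. The assumption here is mine: query and key entries are independent with zero mean and unit variance, so the raw dot product has variance $d_{k}$.

```python
import numpy as np

rng = np.random.default_rng(0)
d_k, n_pairs = 64, 20_000
q = rng.normal(size=(n_pairs, d_k))  # illustrative unit-variance queries
k = rng.normal(size=(n_pairs, d_k))  # illustrative unit-variance keys

raw = np.sum(q * k, axis=1)          # one dot product q . k per pair

var_raw = raw.var()                    # grows like d_k
var_sqrt = (raw / np.sqrt(d_k)).var()  # stays near 1
var_full = (raw / d_k).var()           # shrinks like 1 / d_k

print(f"unscaled: {var_raw:.2f}, / sqrt(d_k): {var_sqrt:.3f}, / d_k: {var_full:.4f}")
```

Dividing by $\sqrt{d_{k}}$ keeps the attention scores at roughly unit scale regardless of head dimension, whereas dividing by $d_{k}$ would shrink them toward zero as the head dimension grows.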

2Adam Jermyn2y

Ah that's right. Will edit to fix.

Toy Models and Tegum Products

Aryan Bhatt2yΩ010

Thanks for the great post! I have a question, if it's not too much trouble:

Sorry for my confusion about something so silly, but shouldn't the following be "when $α \leq 2$"?

When $α \geq 2$ there is no place where the derivative vanishes

I'm also a bit confused about why we can think of $α$ as representing "which moment of the interference distribution we care about."

Perhaps some of my confusion here stems from the fact that it seems to me that the optimal number of subspaces, $k = n e^{α / (2 - α)}$, is an increasing function of $α$, which ... (read more)
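
For reference, the monotonicity claim follows from differentiating the expression for $k$ quoted above (valid for $α \neq 2$):

```latex
\begin{align*}
\frac{d}{dα}\left(\frac{α}{2 - α}\right) &= \frac{(2 - α) + α}{(2 - α)^2} = \frac{2}{(2 - α)^2} > 0 \\
\frac{dk}{dα} &= n \, e^{α / (2 - α)} \cdot \frac{2}{(2 - α)^2} > 0
\end{align*}
```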

2Adam Jermyn2y

Oh you're totally right. And k=1 should be k=d there. I'll edit in a fix.

It's not precisely which moment, but as we vary $α$ the moment(s) of interest vary monotonically.

This comment turned into a fascinating rabbit hole for me, so thank you! It turns out that there is another term in the Johnson-Lindenstrauss expression that's important. Specifically, the relation between $ϵ$, $m$, and $D$ should be $ϵ^{2}/2 - ϵ^{3}/3 \geq 4 \log m / D$ (per Scikit and references therein). The numerical constants aren't important, but the cubic term is, because it means the interference grows rather faster as $m$ grows (especially in the vicinity of $ϵ \approx 1$).

With this correction it's no longer feasible to do things analytically, but we can still do things numerically. The plots below are made with $n = 10^{5}$, $d = 10^{4}$:

The top panel shows the normalized loss for a few different $α \leq 2$, and the lower shows the loss derivative with respect to $k$. Note that the range of $k$ is set by the real roots of $ϵ^{2}/2 - ϵ^{3}/3 \geq 4 \log m / D$: for larger $k$ there are no real roots, which corresponds to the interference $ϵ$ crossing unity. In practice this bound applies well before $k \to d$. Intuitively, if there are more vectors than dimensions then the interference becomes order-unity (so there is no information left!) well before the subspace dimension falls to unity.

Anyway, all of these curves have global minima in the interior of the domain (if just barely for $α = 0.5$), and the minima move to the left as $α$ rises. That is, for $α \leq 2$ we care increasingly about higher moments as we increase $α$ and so we want fewer subspaces.

What happens for $α > 2$? The global minima disappear! Now the optimum is always $k = 1$. In fact though the transition is no longer at $α = 2$ but a little higher:

So the basic story still holds, but none of the math involved in finding the optimum applies! I'll edit the post to make this clear.
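
A sketch of the inequality quoted above, to show where the "no real roots" regime comes from. Reading $m$ as the number of vectors and $D$ as the dimension is my assumption from the usual Johnson-Lindenstrauss statement; how they map onto $k$, $n$, and $d$ in the post is not reproduced here, and the helper names are hypothetical.

```python
import math

def jl_lhs(eps):
    """Left-hand side of the bound: eps^2 / 2 - eps^3 / 3."""
    return eps**2 / 2 - eps**3 / 3

def jl_feasible(m, D):
    """jl_lhs is increasing on (0, 1] and peaks at 1/6, so some eps <= 1 exists
    only if 4 log(m) / D <= 1/6."""
    return 4 * math.log(m) / D <= 1 / 6

def min_interference(m, D, tol=1e-10):
    """Smallest eps in (0, 1] satisfying the bound, found by bisection; None if there is none."""
    target = 4 * math.log(m) / D
    if target > 1 / 6:
        return None
    lo, hi = 0.0, 1.0
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if jl_lhs(mid) >= target:
            hi = mid
        else:
            lo = mid
    return hi

for D in (200, 300, 1000, 10_000):
    print(D, jl_feasible(10**5, D), min_interference(10**5, D))
```

With $m = 10^{5}$, the bound stops being satisfiable for any $ϵ \leq 1$ once $D$ drops below roughly $24 \log m$, which matches the description of the interference crossing unity.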

Why Agent Foundations? An Overly Abstract Explanation

Aryan Bhatt3y50

places which seem like canonical examples of very-probably-messy-territory repeatedly turn out to not be so messy

May I ask for a few examples of this?

The claim definitely seems plausible to me, but I can't help but think of examples like gravity or electromagnetism, where every theory to date has underestimated the messiness of the true concept. It's possible that these aren't really much evidence against the claim but rather indicative of a poor ontology:

People who expect a "clean" territory tend to be shocked by how "messy" the world looks when their ori

... (read more)


All of Aryan Bhatt's Comments + Replies