In section 5, I explain how CoEm is an agenda with relaxed constraints. It does not try to reduce the alignment tax to make the safety solution competitive for labs to use. Instead, it assumes enough progress in international governance that you have full control over how your AI gets built, and that there are enforcement mechanisms ensuring no competitive but unsafe AI can be built somewhere else.
That's what the bifurcation of narrative is about: not letting labs implement only solutions with a low alignment tax, because that could simply not be enough.
My steelman of Conjecture's position here would be:
I really appreciate the naturalistic experimentation approach – the fact that it tries to poke at the unknown unknowns, discovering new capabilities or failure modes of Large Language Models (LLMs).
I'm particularly excited by the idea of developing a framework to understand hidden variables and create a phenomenological model of LLM behavior. This seems like a promising way to "carve LLM abilities at their joints," moving closer to enumeration rather than the current approach of 1) coming up with an idea, 2) asking, "Can the LLM do this?" and 3) testing it....
What I really like about ancient languages is that there's no online community the model could exploit. Even low-resource modern languages have online forums an AI could use as an entry point.
But this consideration might be eclipsed by the fact that a rogue AI would have access to a translator before trying online manipulation, or by another scenario I'm not considering.
Agree that the lack of direct access to CoT is one of the major drawbacks. Though we could have a slightly smarter reporter that could also answer questions about CoT interpretation.
Thanks for your comment, these are great questions!
I did not conduct analyses of the vectors themselves. A concrete (and easy) experiment would be to create UMAP plots of the residual stream activations at the last position for different layers. My guess is that i) you start with one big cluster, ii) then see multiple clusters determined by the value of R, and iii) then multiple clusters determined by the value of R(C). I did not do such an analysis because I decided to focus on causal intervention: it's hard to know from the vectors alone which differences matter.
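A rough sketch of what such an analysis could look like, using synthetic activations and scikit-learn's PCA as a lightweight stand-in for UMAP (all names, dimensions, and the cluster structure here are illustrative, not taken from the post):

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)

# Synthetic stand-in for last-position residual stream activations:
# two hypothetical values of R, each shifting activations along a
# different direction in a 64-dimensional "residual stream".
d_model, n_per_class = 64, 100
acts = rng.normal(size=(2 * n_per_class, d_model))
direction = rng.normal(size=d_model)
acts[n_per_class:] += 4.0 * direction  # offset for the second R value

# Project to 2D (UMAP would replace PCA here) and measure how far
# apart the two cluster centroids land in the projection.
proj = PCA(n_components=2).fit_transform(acts)
mean_0 = proj[:n_per_class].mean(axis=0)
mean_1 = proj[n_per_class:].mean(axis=0)
gap = float(np.linalg.norm(mean_0 - mean_1))
print(f"distance between cluster means in 2D: {gap:.2f}")
```

If the hypothesis above is right, the real plot would show one blob at early layers and increasingly separated clusters (first by R, then by R(C)) at later layers.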
Thanks for this rich analogy! Some comments about the analogy between context window and RAM:
Typo in the model name
GPT3 currently has an 8K context or an 8kbit RAM (theoretically expanding to 32kbit soon). This gets us to the Commodore 64 in digital computer terms, and places us in the early 80s.
I guess you meant GPT4 instead of GPT3.
Equivalence token to bits
Why did you decide to go with the equivalence of 1 token = 1 bit? Since a token can usually take on the order of 10k to 100k possible values, wouldn't 1 token = 13-17 bits be a more accurate equivalen...
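For what it's worth, the 13-17 bit range follows from the information content of a uniformly distributed token over a vocabulary of size V, which is log2(V) bits. A quick check (50257 is the GPT-2/GPT-3 BPE vocabulary size; the other sizes just bracket the "10k to 100k" range):

```python
import math

# Bits per token for a uniform distribution over a vocabulary of size V.
for vocab_size in (10_000, 50_257, 100_000):
    bits = math.log2(vocab_size)
    print(f"V = {vocab_size:>7}: {bits:.1f} bits/token")
# V =   10000: 13.3 bits/token
# V =   50257: 15.6 bits/token
# V =  100000: 16.6 bits/token
```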
I don't have a confident answer to this question. Nonetheless, I can share related evidence we found during REMIX (that should be public in the near future).
We defined a new measure for context sensitivity relying on causal intervention. We measure how much the in-context loss of the model increases when we replace the input of a given head with a modified input sequence, where the far-away context is scrubbed (replaced by the text from a random sequence in the dataset). We found heads in GPT2-small that are context-sensitive according to this ...
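A minimal sketch of the measure as I understand it (function and variable names are mine, not from the REMIX write-up): the score for a head is the mean increase in per-token loss when that head's far-away context is scrubbed.

```python
import numpy as np

def context_sensitivity(loss_original, loss_scrubbed):
    """Mean increase in in-context loss when a head's far-away context
    is replaced by text from a random sequence in the dataset."""
    loss_original = np.asarray(loss_original, dtype=float)
    loss_scrubbed = np.asarray(loss_scrubbed, dtype=float)
    return float(np.mean(loss_scrubbed - loss_original))

# Toy per-token losses: scrubbing this head mostly hurts predictions
# for tokens late in the context, so the head looks context-sensitive.
orig = [2.1, 1.8, 1.5, 1.2]
scrubbed = [2.1, 1.9, 1.9, 1.8]
print(round(context_sensitivity(orig, scrubbed), 3))  # 0.275
```

A head with a score near zero is one the model can apparently compute without its long-range context; large positive scores flag the context-sensitive heads.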
I recently applied causal scrubbing to test the hypothesis outlined in the paper (as part of my work at Redwood Research). The hypothesis was defined from the circuit presented in Figure 2. I used a simple setting similar to the experiments on Induction Heads. I used two types of inputs:
Experiment 1
I allowed all MLPs on every path of the circuit. The only attention h...
This is an important point, but it also highlights how the concept of gliders is almost tautological. Any sequence of entangled causes and effects could be considered a glider, even if it undergoes superficial transformations.
I agree with this. I think the most useful part of the concept is that it forces a distinction between the "superficial transformations" and the "things that stay".
I also think it's useful to think about text features that are not (or are unlikely to be) gliders, like
Thanks for your comment!
1.
Looking at your example, “Then, David and Elizabeth were working at the school. Elizabeth had a good day. Elizabeth decided to give a bone to Elizabeth”. I'm confused. You say "duplicating the IO token in a distractor sentence", but I thought David would be the IO here?
Am I confused about the meaning of the IO or was there just a typo in the example?
You are right, there is a typo here. The correct sentence is “Then, David and Elizabeth were working at the school. David had a good day. Elizabeth decided to give a bone to Elizab...
Thanks for the feedback!
Does this mean that it writes a projection of S1's positional embedding to S2's residual stream? Or is it meant to say "writing to the position [residual stream] of [S2]"? Or something else?
Our current hypothesis is that they write some information about S1's position (which we called the "position signal"; not as straightforward as a projection of its positional embedding) into the residual stream of S2. (See the paragraph "Locating the position signal." in section 3.3.) I hope this answers your questions.
We currently think...
Nope, I didn't know PaCMAP! Thanks for the pointer, I'll have a look.