All of Alexandre Variengien's Comments + Replies

In section 5, I explain how CoEm is an agenda with relaxed constraints. It does not try to reduce the alignment tax to make the safety solution competitive for labs to use. Instead, it assumes enough progress in international governance that you have full control over how your AI gets built, and that there are enforcement mechanisms to ensure no competitive but unsafe AI can be built somewhere else.

That's what the bifurcation of narrative is about: not letting labs implement only solutions that have a low alignment tax, because this might simply not be enough.

My steelman of Conjecture's position here would be:

  • Current evals orgs are tightly integrated with AGI labs. AGI labs can pick which evals org to collaborate with, control the model access, which kind of evals will be conducted, which kind of report will be public, etc. It is this position of power that makes current evals feed into AGI orthodoxy.
  • We don't have good ways to conduct evals. We have wide error bars over how much juice one can extract from models and we are nowhere close to having the tools to upper bound capabilities from evals. I remember this
... (read more)
2Fabien Roger
Agreed that it's sad if AI labs control what evals are being run. Doesn't seem to be the case in practice (even when the AI lab runs the eval themselves, what you should measure is usually decided elsewhere + there are many mostly independent orgs).

I think this is relatively weak. Consider the baseline elicitation technique where domain expert + LLM expert build a supervised training set of LLM-friendly CoT (+tools) and then do some version of pass@k, and labs check what users are using models for. There are 3 ways in which that could fail:

  • Models are closed source, and people find ways to misuse them way better than the eval orgs --> unlikely in my opinion because you would need lots of iteration to get better than the eval orgs, so it's likely you'd get caught.
  • Models are open sourced, and people find drastic scaffolding improvements that enable open source models to outperform evals on closed source models --> unlikely in my opinion because I don't think the baseline I described can be beaten by more OOMs than the current gap between closed and open source (but it starts to bite in worlds where the best models are open sourced).
  • Models are closed source, and the models can zero-shot find ways to self-elicit capabilities to cause a catastrophe (and sandbag when we use RL to elicit these capabilities) --> unlikely in my opinion for models that mostly learned to do next-token prediction. I think amazing self-elicitation abilities don't happen prior to humans eliciting dangerous capabilities in the usual ways.

I think people massively over-index on prompting being difficult. Fine-tuning is such a good capability elicitation strategy!
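
For readers unfamiliar with the "some version of pass@k" mentioned above, here is a minimal sketch of one commonly used estimator: sample n attempts, count the c that succeed, and estimate the chance that at least one of k attempts succeeds. The function name and the choice of this particular estimator are assumptions for illustration; the comment does not say which variant is meant.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate pass@k from n sampled attempts, of which c succeeded.

    Returns the probability that a uniformly random subset of k attempts
    (out of the n sampled) contains at least one success.
    Illustrative helper, not taken from the comment above.
    """
    if not 0 <= c <= n:
        raise ValueError("c must satisfy 0 <= c <= n")
    if not 1 <= k <= n:
        raise ValueError("k must satisfy 1 <= k <= n")
    # comb(n - c, k) counts the subsets of size k made only of failures.
    return 1.0 - comb(n - c, k) / comb(n, k)


if __name__ == "__main__":
    # Example: 200 samples, 3 successes, budget of 10 attempts.
    print(pass_at_k(n=200, c=3, k=10))
```

A larger k gives a more generous picture of what a model can be elicited to do, which is the relevant direction when trying to upper bound capabilities.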

I really appreciate the naturalistic experimentation approach – the fact that it tries to poke at the unknown unknowns, discovering new capabilities or failure modes of Large Language Models (LLMs).

I'm particularly excited by the idea of developing a framework to understand hidden variables and create a phenomenological model of LLM behavior. This seems like a promising way to "carve LLM abilities at their joint," moving closer to enumeration rather than the current approach of 1) coming up with an idea, 2) asking, "Can the LLM do this?" and 3) testing it.... (read more)

What I really like about ancient languages is that there's no online community the model could exploit. Even low-resource modern languages have online forums an AI could use as an entry point.

But this consideration might be eclipsed by the fact that a rogue AI would have access to a translator before trying online manipulation, or by another scenario I'm not considering.

Agree that the lack of direct access to CoT is one of the major drawbacks. Though we could have a slightly smarter reporter that could also answer questions about CoT interpretation.

One could also imagine asking a group of Sumerian experts to craft new words for the occasion such that the updated language has enough flexibility to capture the content of modern datasets.

9Archimedes
Why not go all the way and use a constructed language (like Lojban or Ithkuil) that's specifically designed for the purpose?

Thanks for your comment, these are great questions!

  1. I did not conduct analyses of the vectors themselves. A concrete (and easy) experiment could be to create a UMAP plot for the set of residual stream activations at the last position for different layers. I guess that i) you start with one big cluster, ii) then get multiple clusters determined by the value of R, iii) then multiple clusters determined by the value of R(C). I did not do such analysis because I decided to focus on causal intervention: it's hard to know from the vectors alone what are the differences that ma

... (read more)

B* 3.22

It seems to be a duplicate of problem 3.18.

Thanks for this rich analogy! Some comments about the analogy between context window and RAM:

Typo in the model name

GPT3 currently has an 8K context or an 8kbit RAM (theoretically expanding to 32kbit soon). This gets us to the Commodore 64 in digital computer terms, and places us in the early 80s.

I guess you meant GPT4 instead of GPT3.

Equivalence token to bits

Why did you decide to go with the equivalence of 1 token = 1 bit? Since a token can usually take on the order of 10k to 100k possible values, wouldn't 1 token equal 13-17 bits a more accurate equivalen... (read more)

3beren
Thanks for these points!  My thinking here is that the scaffolded LLM is a computer which operates directly in the natural language semantic space so it makes more sense to define the units of its context in terms of its fundamental units such as tokens. Of course each token has a lot more information-theoretic content than a single bit -- but this is why a single NLOP is much more powerful than a single FLOP. I agree that tokens directly are probably not the correct measure since they are too object level and there is likely some kind of 'semantic bit' idealisation which needs to be worked out. I think I discuss this in the memory hierarchy section of the post. I agree that it is unclear what the best conceptualisation of the context window is. I agree it is not necessarily directly compatible with the RAM and may be more like processor registers. I think the main point is that currently scaffolded LLM systems have a 2 level memory hierarchy and computers have evolved a fairly complex and highly optimised multi-step system. It may be that we also eventually develop such a system or its equivalent for LLMs. I actually do not know how the memory hierarchy for the earliest computers worked -- did they already have a register -> RAM -> disk distinction?  This is an interesting hypothesis. My alternate hypothesis is essentially a combination of a.) reliability and instruction following with GPT3 was just too bad for this to work appreciably and we broke through some kind of barrier with GPT4 and secondly just that there actually was not that much time. GPT3 API only became widely useable in mid-2021 IIRC so that is about a year and a bit between that and ChatGPT release which is hardly any time to start iterating on this stuff. Indeed. Should be interesting to see if we converge to some canonical datatype or not. The reason strings are so nice is that they compose easily and are incredibly flexible. The alternative is having directly chained architectures which comm
4Tao Lin
>Why did you decide to go with the equivalence of 1 token = 1 bit? Since a token can usually take on the order of 10k to 100k possible values, wouldn't 1 token equal 13-17 bits a more accurate equivalence?

LLMs make very inefficient use of their context size because they're writing human-like text, which is predictable. Human text is like 0.6 bits/byte, so maybe 2.5 bits per token. Text used in language model scaffolding and such tends to be even more predictable (by maybe 30%).
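
The two estimates in this exchange can be checked with a few lines of arithmetic. The sketch below computes the upper bound of log2(vocabulary size) bits per token, and the lower figure implied by the predictability of text. The 4 bytes per token used here is an assumption (a rough average for English text, not a number from the thread), and the function names are illustrative.

```python
from math import log2


def max_bits_per_token(vocab_size: int) -> float:
    """Upper bound: bits needed to pick one token uniformly from the vocabulary."""
    return log2(vocab_size)


def effective_bits_per_token(bits_per_byte: float, bytes_per_token: float) -> float:
    """Information actually carried per token, given how predictable the text is."""
    return bits_per_byte * bytes_per_token


if __name__ == "__main__":
    for vocab_size in (10_000, 100_000):
        print(vocab_size, round(max_bits_per_token(vocab_size), 1))

    # 0.6 bits/byte is the figure quoted above; 4 bytes/token is an assumption.
    per_token = effective_bits_per_token(bits_per_byte=0.6, bytes_per_token=4.0)
    print(round(per_token, 1))

    # Rough information content of an 8K-token context under each view.
    context_tokens = 8 * 1024
    print(round(context_tokens * max_bits_per_token(100_000)))
    print(round(context_tokens * per_token))
```

Under either view, an 8K-token context holds noticeably more than 8 kbit, so the 1 token = 1 bit equivalence is best read as a change of unit rather than an information-theoretic claim.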

I don't have a confident answer to this question. Nonetheless, I can share related evidence we found during REMIX (that should be public in the near future).

We defined a new measure for context sensitivity relying on causal intervention.  We measure how much the in-context loss of the model increases when we replace the input of a given head with a modified input sequence, where the far-away context is scrubbed (replaced by the text from a random sequence in the dataset).  We found heads in GPT2-small that are context-sensitive according to this ... (read more)
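
To make the scrubbing step concrete, here is a heavily simplified sketch. It scrubs the far-away context of the whole input and compares losses, whereas the measure described above intervenes only on the input of a single attention head. `loss_fn`, `n_recent` and both function names are hypothetical and are not taken from the REMIX code.

```python
import random
from typing import Callable, List, Sequence

Tokens = List[int]


def scrub_far_context(tokens: Sequence[int], replacement: Sequence[int], n_recent: int) -> Tokens:
    """Keep the last n_recent tokens and overwrite everything earlier
    with the tokens at the same positions in `replacement`."""
    n_far = len(tokens) - n_recent
    if n_far < 0 or len(replacement) < n_far:
        raise ValueError("replacement is too short or n_recent is too large")
    return list(replacement[:n_far]) + list(tokens[n_far:])


def context_sensitivity(
    loss_fn: Callable[[Sequence[int]], float],
    tokens: Sequence[int],
    dataset: Sequence[Sequence[int]],
    n_recent: int,
    rng: random.Random,
) -> float:
    """Loss increase when the far-away context comes from a random sequence.

    Assumption: loss_fn returns the in-context loss on the final tokens.
    In practice one would also avoid drawing `tokens` itself from the dataset.
    """
    replacement = rng.choice(list(dataset))
    scrubbed = scrub_far_context(tokens, replacement, n_recent)
    return loss_fn(scrubbed) - loss_fn(tokens)


if __name__ == "__main__":
    # Toy loss: 0 if the last token repeats the first one (long-range copy), else 1.
    toy_loss = lambda seq: 0.0 if seq[0] == seq[-1] else 1.0
    print(context_sensitivity(toy_loss, [1, 2, 3, 1], [[9, 9, 9, 9]], 2, random.Random(0)))
```

A component that relies on far-away context should show a large positive value, while one that only uses recent tokens should stay near zero.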

You're right, thanks for spotting it! It's fixed now. 

I recently applied causal scrubbing to test the hypothesis outlined in the paper (as part of my work at Redwood Research). The hypothesis was defined from the circuit presented in Figure 2. I used a simple setting similar to the experiments on Induction Heads. I used two types of inputs:

  •  the correct input for the circuit.
  • , an input with the same template but a randomized subject and indirect object. Used as input for the path not included in the circuit.

Results

Experiment 1

I allowed all MLPs on every path of the circuit. The only attention h... (read more)

This is an important point, but it also highlights how the concept of gliders is almost tautological. Any sequence of entangled causes and effects could be considered a glider, even if it undergoes superficial transformations.

I agree with this. I think that the most useful part of the concept is that it forces a distinction between the "superficial transformations" and the "things that stay".

I also think that it's useful to think about text features that are not (or unlikely to be) gliders like 

  • The tone of a memorized quote
  • A random date chosen to f
... (read more)
2Gunnar_Zarncke
Actually, I tried out the in-line comment function for this. Nice and easy. I often see minor errors and would use this more but I wonder whether it will clutter the comments.

Thanks for your comment!

1.

Looking at your example, “Then, David and Elizabeth were working at the school. Elizabeth had a good day. Elizabeth decided to give a bone to Elizabeth”. I'm confused. You say "duplicating the IO token in a distractor sentence", but I thought David would be the IO here?

Am I confused about the meaning of the IO or was there just a typo in the example?

You are right, there is a typo here. The correct sentence is “Then, David and Elizabeth were working at the school. David had a good day. Elizabeth decided to give a bone to Elizab... (read more)

Thanks for the feedback!

Does this mean that it writes a projection of S1's positional embedding to S2's residual stream?  Or is it meant to say "writing to the position [residual stream] of [S2]"?  Or something else?

Our current hypothesis is that they write some information about S1's position (which we called the "position signal", and which is not as straightforward as a projection of its positional embedding) in the residual stream of S2. (See the paragraph "Locating the position signal." in section 3.3.) I hope this answers your questions.

We currently think... (read more)