A circuit for Python docstrings in a 4-layer attention-only transformer
Produced as part of the SERI ML Alignment Theory Scholars Program under the supervision of Neel Nanda - Winter 2022 Cohort.

TL;DR: We found a circuit in a pre-trained 4-layer attention-only transformer language model. The circuit predicts repeated argument names in docstrings of Python functions (a concrete example prompt is sketched below), and it features

* 3 levels of composition,
* a multi-function head that does different things in different parts of the prompt,
* an attention head that derives positional information using the causal attention mask.

Epistemic Status: We believe that we have identified most of the core mechanics and information flow of this circuit. However, our circuit only recovers up to half of the model's performance, and there are a number of leads we have not followed yet.

This diagram illustrates the circuit; skip to the Results section for the explanation. The left side shows the relevant token inputs with (a) the labels we use here (A_def, …) as well as (b) an actual prompt (load, …). The boxes show attention heads, arranged by layer and destination position, and the arrows indicate Q-, K-, or V-composition between heads or embeddings. For clarity, we list three less-important heads at the bottom.

Introduction

Click here to skip to the results & explanation of this circuit.

What are circuits

What do we mean by circuits? A circuit in a neural network is a small subset of model components and model weights that (a) accounts for a large fraction of a certain behavior and (b) corresponds to a human-interpretable algorithm. A focus of the field of mechanistic interpretability is finding and better understanding the phenomenon of circuits, and recently the field has focused on circuits in transformer language models. Anthropic found the small and ubiquitous Induction Head circuit in various models, and a team at Redwood found the Indirect Object Identification (IOI) circuit in GPT2-small.

How we chose the candidate task

We looked for interesting behaviors in a small, attention-only
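
To make the docstring task concrete, here is a hedged sketch of the prompt format. The argument names follow the (load, …) example mentioned above, but the filler words are made up for illustration and may differ from the post's actual prompts; the structure is what matters: a function signature followed by a partial docstring, where the correct next token after the final `:param` is the next, not-yet-mentioned argument name.

```python
# A hedged sketch of the docstring task (hypothetical filler text; the
# actual prompts are built from randomly chosen words).
prompt = '''def port(self, load, size, files, last):
    """river oil piece

    :param load: crime population
    :param size: unit dark
    :param'''

# Given this prompt, the model should predict " files" as the next token:
# the argument name that follows "size" in the signature and has not yet
# appeared in the docstring.
```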
