Abstract

AI safety is a young science. In its early history, deep and speculative questions regarding the risks of artificial superintelligence attracted the attention of basic science and philosophy, but the near-horizon focus of industry and governance, coupled with shorter safety timelines, means that applied science is now taking a bigger hand in the field’s future. As the science of AI safety matures, it is critical to establish a research tradition that balances the need for foundational understanding with the demand for relevance to practical applications. We claim that “use-inspired basic research” – which grounds basic science with a clear-cut purpose – strikes a good balance between pragmatism and rigor that will ensure... (read 3604 more words →)

Replying toSAE feature geometry is outside the superposition hypothesis

Eric Winsor2y

SAE feature geometry is outside the superposition hypothesis

This reminded me of how GPT-2-small uses a cosine/sine spiral for its learned positional embeddings embeddings, and I don't think I've seen a mechanistic/dynamical explanation for this (just the post-hoc explanation that attention can use cosine similarity to encode distance in R^n, not that it should happen this way).

A Universal Emergent Decomposition of Retrieval Tasks in Language Models

Alexandre Variengien

Alexandre Variengien, Eric Winsor

This work was done as a Master's thesis project at Conjecture, independent from the primary agenda of the organization. Paper available here, thesis here.

Over the past months I (Alexandre) — with the help of Eric — have been working on a new approach to interpretability of language models (LMs). In the search for the units of interpretability, I decided to zoom out instead of zooming in. I focused on careful dataset design and causal intervention at a macro-level (i.e. scale of layers).

My goal has been to find out if there are such things as “organs”^[1] in LMs. In other words, are there macroscopic universal motifs, coarse-grained internal structures corresponding to a function that... (read 2953 more words →)

Basic Facts about Language Model Internals

beren

beren, Eric Winsor

This post was written as part of the work done at Conjecture.

As mentioned in our retrospective, while also producing long and deep pieces of research, we are also experimenting with a high iteration frequency. This is an example of this strand of our work. The goal here is to highlight interesting and unexplained language model facts. This is the first in a series of posts which will be exploring the basic ‘facts on the ground’ of large language models at increasing levels of complexity.

Understanding the internals of large-scale deep learning models, and especially large language models (LLMs) is a daunting task which has been relatively understudied. Gaining such an understanding of how large models... (read 2411 more words →)

131

Replying toFormalization as suspension of intuition

Eric Winsor3y

Formalization as suspension of intuition

I like this perspective! The idea of formalization as suspension of intuition reminds me of the story of the "Gruppenpest" in the development of quantum mechanics. The abstraction of groups (as well as representations and matrices) was seen by many as non-physical and unintuitive. But it turned out the resulting abstractions of gauge theories and symmetries were more fundamental objects than their predecessors.^[1]^[2]^[3]^[4]

It also reminds me of a view I've been told many times that mathematical formalization/modeling is the process of forgetting details about a problem until its essential character is laid bare. I think it's important to emphasize that formalization is only a partial suspension or redirection of intuition (which seems... (read more)

Replying toRe-Examining LayerNorm

Eric Winsor3y

Re-Examining LayerNorm

Thanks for the catch!

Re-Examining LayerNorm

Eric Winsor

Please check out the colab notebook for interactive figures and more detailed technical explanations.

This post is part of the work done at Conjecture.

Special thanks to Sid Black, Dan Braun, Carlos Ramón Guevara, Beren Millidge, Chris Scammell, Lee Sharkey, and Lucas Teixeira for feedback on early drafts.

There's a lot of non-linearities floating around in neural networks these days, but one that often gets overlooked is LayerNorm. This is understandable because it's not "supposed" to be doing anything; it was originally introduced to stabilize training. Contemporary attitudes about LayerNorm's computational power range from "it's just normalizing a vector" to "it can do division apparently". And theories of mechanistic interpretability such as features as directions and... (read 1205 more words →)

128

Interpreting Neural Networks through the Polytope Lens

Sid Black

Sid Black, Lee Sharkey, Connor Leahy, beren, CRG, merizian, Eric Winsor, Dan Braun

Sid Black*, Lee Sharkey*, Leo Grinsztajn, Eric Winsor, Dan Braun, Jacob Merizian, Kip Parker, Carlos Ramón Guevara, Beren Millidge, Gabriel Alfour, Connor Leahy

*equal contribution

Research from Conjecture.

This post benefited from feedback from many staff at Conjecture including Adam Shimi, Nicholas Kees Dupuis, Dan Clothiaux, Kyle McDonell. Additionally, the post also benefited from inputs from Jessica Cooper, Eliezer Yudkowsky, Neel Nanda, Andrei Alexandru, Ethan Perez, Jan Hendrik Kirchner, Chris Olah, Nelson Elhage, David Lindner, Evan R Murphy, Tom McGrath, Martin Wattenberg, Johannes Treutlein, Spencer Becker-Kahn, Leo Gao, John Wentworth, and Paul Christiano and from discussions with many other colleagues working on interpretability.

Summary

Mechanistic interpretability aims to explain what a neural network has learned at a nuts-and-bolts... (read 9824 more words →)

149

LESSWRONG
LW

LESSWRONG
LW

Eric Winsor

Interpreting Neural Networks through the Polytope Lens

Basic Facts about Language Model Internals

Re-Examining LayerNorm

A Universal Emergent Decomposition of Retrieval Tasks in Language Models

Eric Winsor

Eric Winsor

Toward Safety Case Inspired Basic Research

A Universal Emergent Decomposition of Retrieval Tasks in Language Models

Basic Facts about Language Model Internals

Re-Examining LayerNorm

Interpreting Neural Networks through the Polytope Lens

Eric Winsor

Interpreting Neural Networks through the Polytope Lens

Basic Facts about Language Model Internals

Re-Examining LayerNorm

A Universal Emergent Decomposition of Retrieval Tasks in Language Models

Eric Winsor

Eric Winsor

Toward Safety Case Inspired Basic Research

A Universal Emergent Decomposition of Retrieval Tasks in Language Models

Basic Facts about Language Model Internals

Re-Examining LayerNorm

Interpreting Neural Networks through the Polytope Lens

Abstract

Summary