Do you intend for the comments section to be a public forum on the papers you collect?
I definitely endorse reading the ROME paper, although the popular-culture claims about what the second part of the paper actually shows seem a bit overblown.
They do not seem to claim "changing facts in a generalizable way" (it's likely not robust to synonyms at all). I am also wary of "editing just one MLP for a given fact" being the right solution, given that the causal tracing shows the fact being stored in several consecutive layers. See a writeup by Thibodeau et al., coming sometime in the future.
That being said, if you are into interpretability, you have to at least skim the paper. It has a whole bunch of very cool stuff in it, from the causal tracing to the testing of whether making Einstein a physician changes the meaning of the word "physics" itself. Just don't overfit on the methods there being exactly the methods that will solve interpretability of reasoning in transformers.
I welcome any discussion of the linked papers in the comments section.
I agree that the ROME edit method itself isn’t directly that useful. I think it matters more as a validation of how the ROME authors interpreted the structure / functions of the MLP layers.
This seems very useful -- thanks for doing it!
Some paper suggestions:
Hidden Progress in Deep Learning: SGD Learns Parities Near the Computational Limit
There is mounting empirical evidence of emergent phenomena in the capabilities of deep learning methods as we scale up datasets, model sizes, and training times. While there are some accounts of how these resources modulate statistical capacity, far less is known about their effect on the computational problem of model training. This work conducts such an exploration through the lens of learning k-sparse parities of n bits, a canonical family of problems which pose theoretical computational barriers. In this setting, we find that neural networks exhibit surprising phase transitions when scaling up dataset size and running time. In particular, we demonstrate empirically that with standard training, a variety of architectures learn sparse parities with n^{O(k)} examples, with loss (and error) curves abruptly dropping after n^{O(k)} iterations. These positive results nearly match known SQ lower bounds, even without an explicit sparsity-promoting prior. We elucidate the mechanisms of these phenomena with a theoretical analysis: we find that the phase transition in performance is not due to SGD “stumbling in the dark” until it finds the hidden set of features (a natural algorithm which also runs in n^{O(k)} time); instead, we show that SGD gradually amplifies a Fourier gap in the population gradient.
By some of the same authors as "Functions of Increasing Complexity," this paper takes a toy problem which exhibits a very sharp phase transition and analyzes the hell out of it. A primary aim of the paper is to refute the explanation that this phase transition happens because SGD is randomly searching weight space until it stumbles upon a solution, an explanation which is tempting given that the loss curves for this problem stay essentially flat before rapidly converging to zero. Instead, the authors find that their models make "hidden progress" which is not reflected in the loss curves; this echoes findings from Neel Nanda's work on grokking. (Speaking of which, this paper also abounds with fascinating tidbits on grokking, including a gearsy analysis of which variants of their toy problem and which model hyperparameters do and don't produce grokking.)
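For anyone who hasn't seen the task before, here is a minimal sketch of what a k-sparse parity dataset looks like. The function names, the choice of n = 20, and the hidden support (2, 7, 11) are all mine for illustration and are not taken from the paper's code. The point it shows is that, for uniformly random inputs, no single bit carries any signal about the label; only the product over the full hidden subset does.

```python
import random


def sample_sparse_parity(n, support, num_examples, seed=0):
    """Draw examples of a sparse parity on n bits (illustrative helper).

    Inputs are uniform over {-1, +1}^n. The label is the product of the
    coordinates listed in `support`, i.e. the parity of those bits.
    """
    rng = random.Random(seed)
    data = []
    for _ in range(num_examples):
        x = [rng.choice((-1, 1)) for _ in range(n)]
        y = 1
        for i in support:
            y *= x[i]
        data.append((x, y))
    return data


def correlation(data, subset):
    """Empirical mean of label * (product of the input bits in `subset`)."""
    total = 0
    for x, y in data:
        feature = 1
        for i in subset:
            feature *= x[i]
        total += y * feature
    return total / len(data)


hidden_support = (2, 7, 11)  # hypothetical choice, k = 3
data = sample_sparse_parity(n=20, support=hidden_support, num_examples=1000)

full_subset_corr = correlation(data, hidden_support)  # perfectly predictive
single_bit_corr = correlation(data, (2,))             # close to zero
```

A learner that only looks at low-order statistics of the inputs sees essentially nothing here, which is what makes the flat-then-sudden loss curves plausible at first glance.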
Diversify and Disambiguate: Learning from Underspecified Data
Many datasets are underspecified: there exist multiple equally viable solutions to a given task. Underspecification can be problematic for methods that learn a single hypothesis because different functions that achieve low training loss can focus on different predictive features and thus produce widely varying predictions on out-of-distribution data. We propose DivDis, a simple two-stage framework that first learns a diverse collection of hypotheses for a task by leveraging unlabeled data from the test distribution. We then disambiguate by selecting one of the discovered hypotheses using minimal additional supervision, in the form of additional labels or inspection of function visualization. We demonstrate the ability of DivDis to find hypotheses that use robust features in image classification and natural language processing problems with underspecification.
I'm suggesting this paper because it forms the technical basis for Stuart Armstrong's work on his concept extrapolation agenda. Given (1) labeled data for which there are many possible proxies which result in good classification performance, (2) unlabeled data which bears witness to the fact that these proxies can disagree with each other, this paper gives a method for explicitly learning multiple diverse proxies. A possible story about how this is useful for alignment: if one can generate diverse hypotheses which postdict observed human preferences but disagree on novel scenarios, then one may hope to either actively query humans (to get evidence on which hypothesis is correct) or act conservatively (picking actions which are good according to all the various hypotheses).
Fine-Tuning can Distort Pretrained Features and Underperform Out-of-Distribution
When transferring a pretrained model to a downstream task, two popular methods are full fine-tuning (updating all the model parameters) and linear probing (updating only the last linear layer -- the "head"). It is well known that fine-tuning leads to better accuracy in-distribution (ID). However, in this paper, we find that fine-tuning can achieve worse accuracy than linear probing out-of-distribution (OOD) when the pretrained features are good and the distribution shift is large. On 10 distribution shift datasets (Breeds-Living17, Breeds-Entity30, DomainNet, CIFAR STL, CIFAR10.1, FMoW, ImageNetV2, ImageNet-R, ImageNet-A, ImageNet-Sketch), fine-tuning obtains on average 2% higher accuracy ID but 7% lower accuracy OOD than linear probing. We show theoretically that this tradeoff between ID and OOD accuracy arises even in a simple setting: fine-tuning overparameterized two-layer linear networks. We prove that the OOD error of fine-tuning is high when we initialize with a fixed or random head -- this is because while fine-tuning learns the head, the lower layers of the neural network change simultaneously and distort the pretrained features. Our analysis suggests that the easy two-step strategy of linear probing then full fine-tuning (LP-FT), sometimes used as a fine-tuning heuristic, combines the benefits of both fine-tuning and linear probing. Empirically, LP-FT outperforms both fine-tuning and linear probing on the above datasets (1% better ID, 10% better OOD than full fine-tuning).
Even when a pretrained model has learned robust features for understanding its data, finetuning can sometimes distort these features into proxies which only work well on-distribution, thereby leading to poor OOD performance. This paper studies the "feature distortion" phenomenon, and proposes a method to mitigate it: first freeze the pretrained model's weights and only train a linear probe on the pretrained model's latent representation of the data, and then finetune the pretrained model + linear probe. This seems relevant to alignment insofar as it seems plausible that large self-supervisedly trained models could learn concepts which robustly correspond to our own concepts; in that case it'd be useful if we could avoid distorting these concepts during finetuning (such as RLHF).
[Meta: if you include these papers in future roundups, feel free to either use these blurbs or toss them out and write your own. I had originally planned to just write something which pitched the papers and their alignment relevance to you (Quintin), but I guess they kinda turned more into the sort of opinions you had written.]
I have been studying the ROME code in depth now for a few months, and I think there's some really interesting stuff there. One of the contributors is actively working on adding things to their repo, such as integrating the functionality of nostalgebraist's logit lens.
I've just read the Gradient Starvation paper. It was quite interesting and approachable. It has left me wondering what we will see if (when?) someone applies their fix (spectral decoupling) to fine-tune a pre-trained language model...
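As I understand the paper, spectral decoupling swaps the usual L2 penalty on the weights for an L2 penalty on the network's output logits. Below is a rough sketch of that loss for binary classification with labels in {-1, +1}. The function names, the `sd_coef` parameter, and the choice to average the penalty over the batch are my own assumptions, not the authors' implementation.

```python
import math


def logistic_loss(logit, label):
    """Numerically stable log(1 + exp(-label * logit)), label in {-1, +1}."""
    z = -label * logit
    return max(z, 0.0) + math.log1p(math.exp(-abs(z)))


def spectral_decoupling_loss(logits, labels, sd_coef=0.01):
    """Mean logistic loss plus (sd_coef / 2) * mean squared logit.

    The penalty acts on the logits rather than on the weights; averaging
    it over the batch is an assumption made for this sketch.
    """
    n = len(logits)
    data_term = sum(logistic_loss(f, y) for f, y in zip(logits, labels)) / n
    penalty = 0.5 * sd_coef * sum(f * f for f in logits) / n
    return data_term + penalty


logits = [2.0, -1.5, 0.3]   # made-up model outputs
labels = [1, -1, -1]
plain = spectral_decoupling_loss(logits, labels, sd_coef=0.0)
regularized = spectral_decoupling_loss(logits, labels, sd_coef=0.1)
```

For a language model the same idea would presumably mean adding a squared-logit term to the cross-entropy during fine-tuning, though I haven't tried it.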
Reading the Diversify and Disambiguate: Learning from Underspecified Data paper, I find myself thinking that it is solving nearly the same problem as the Gradient Starvation paper. The Gradient Starvation fix of Spectral Decoupling seems, at face value, a little simpler, more elegant, and more general. I don't actually have the info I need to compare the two though. I would like to see the two methods compared on things like:
I read the loss landscape paper, and liked their visualizations. I would be so interested to see the same visualizations applied to models trained with and without the Gradient Starvation paper's Spectral Decoupling.
Reading the Cognitive Psych on GPT-3 paper, I find myself highly suspecting that the correct answers GPT-3 is reported to get are due to memorized patterns. I'm curious whether repeating the large-n experiments with variations in prompt (e.g. 'you are a brilliant scientist presented with this question' vs 'you are an average middle school student presented with this question') would change the outcome?
'Mapping language models to conceptually grounded spaces' shows the valuable insight that GPT-3 175B can work well in this grid-world space, but the smaller versions can't. This is interesting because I had just been wondering earlier today if a small simple ascii-grid RL environment could be handled by a large language model. The answer is: maybe, if they're large enough. Probably not if they're small. More details on that idea here: https://docs.google.com/document/d/1sKMI2eNgLkRdxeg7m_XsFsXhIZx5LhXFxbi7NEFLy7k/edit?usp=sharing
Introduction
I've decided to start a weekly roundup of papers that seem relevant to alignment, focusing on papers or approaches that might be new to safety researchers. Unlike the Alignment Newsletter, I'll be spending relatively little effort on summarizing the papers. I'll just link them, copy their abstracts, and potentially describe some of my thoughts on how the paper relates to alignment. Hopefully, this will let me keep to a weekly schedule.
The purpose of this series isn't so much to share insights directly with the reader, but instead to make them aware of already existing research that may be relevant to the reader's own research.
Papers
Locating and Editing Factual Associations in GPT
My opinion:
Most people I talk to about this paper have heard of it previously, so it's hardly "new". However, I think a lot of people underestimate how significant the paper is. The authors use a very cool interpretability method to show that the middle-stage MLP layers are acting as a key-value memory system. They then guess at the specific mathematical structure these MLP layers use to store information, derive a closed-form, analytic solution to edit the model's knowledge stores and use very thorough evaluations to show that their knowledge editing method is effective and that the edits influence the model's outputs in many different contexts where the new knowledge is relevant. This paper is vastly beyond just "poke random stuff and see that the output changes". Code can be found here.
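To make the "key-value memory" picture concrete, here is a toy sketch of a closed-form rank-one edit to a linear map W so that a chosen key maps to a new value. This is a simplification and not the ROME implementation: as I understand it, the paper weights the update by an estimate of the key covariance and obtains the key and value vectors from the model itself, whereas this sketch assumes an identity covariance and uses made-up vectors. All names are hypothetical.

```python
def matvec(W, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def rank_one_edit(W, key, new_value):
    """Return W' = W + (v - W k) k^T / (k^T k), so that W' k == v.

    Assumes identity key covariance (a simplification for this sketch).
    Any input orthogonal to `key` is mapped exactly as before.
    """
    residual = [v - wk for v, wk in zip(new_value, matvec(W, key))]
    norm_sq = sum(k * k for k in key)
    return [
        [w + r * k / norm_sq for w, k in zip(row, key)]
        for row, r in zip(W, residual)
    ]


W = [[1.0, 0.0, 0.0],
     [0.0, 1.0, 0.0]]        # toy "memory" mapping 3-d keys to 2-d values
key = [1.0, 2.0, 0.0]        # stands in for the representation of a subject
new_value = [3.0, -1.0]      # stands in for the new fact to store

W_edited = rank_one_edit(W, key, new_value)
recalled = matvec(W_edited, key)                   # approximately new_value
other_key = [2.0, -1.0, 0.0]                       # orthogonal to `key`
unchanged = matvec(W_edited, other_key)            # same as matvec(W, other_key)
```

The edit touches the stored association for one key while leaving orthogonal keys alone, which is the sense in which a single weight matrix can behave like an editable lookup table.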
Using cognitive psychology to understand GPT-3
My opinion:
This paper performs a systematic analysis of GPT-3's capabilities through prompting, and measures some alignment-relevant capabilities such as understanding of causal interventions and active exploration to find useful knowledge. The authors make their code available here.
Mapping Language Models to Grounded Conceptual Spaces
My opinion:
This paper investigates whether language models learn non-linguistic concepts that they can adapt in-context to navigate worlds whose rules are described to them through text. It seems relevant for understanding the degree to which language models are able to do generalizable world modeling and to understand or influence non-linguistic domains with the text they generate.
Gradient Starvation: A Learning Proclivity in Neural Networks
My opinion:
We often wonder whether behaviors / values / alignment properties that we instill early in training (when models are weak enough to supervise) will persist later in training. I think gradient starvation could be an important part of that puzzle, since it provides a concrete mechanism for how features learned early in training could persist. It also suggests a fair degree of path-dependence in SGD trajectories, and that guiding early training could have significant effects on the downstream models. Code here.
Related: Path dependence in ML inductive biases
SGD on Neural Networks Learns Functions of Increasing Complexity
My opinion:
Given the high path dependence world I think we live in, it becomes quite important to understand the order in which neural nets learn features / behaviors. This paper investigates that question for image models. Code here.
Let's Agree to Agree: Neural Networks Share Classification Order on Real Datasets
My opinion:
Another investigation into the order of feature learning, this time systematically comparing across models and architectures. We may be able to get a better handle on NN inductive biases by investigating the orders in which different architectures learn different types of data.
The Grammar-Learning Trajectories of Neural Language Models
My opinion:
I thought I should probably include a paper on feature learning order in language models, to balance out the previous three papers' focus on images. Code available here.
Learning through atypical "phase transitions" in overparameterized neural networks
My opinion:
I see this paper as contradicting Mingard et al.'s Is SGD a Bayesian sampler? Well, almost. Mingard argued that SGD has little inductive bias, meaning that training on a dataset with SGD would give you a solution very similar to just sampling random networks until you found one that solved the dataset. This paper instead argues that SGD has extremely high inductive bias, and that SGD finds very "atypical" solutions that generalize much better than those that random sampling would find.
Exploring the Geometry and Topology of Neural Network Loss Landscapes
My opinion:
I include this paper because it provides tools to better visualize neural network training trajectories and loss landscapes. They also made their code public at this repository. It seems like a useful thing to check out if you're investigating NN training processes.
Adversarial Perturbations Are Not So Weird: Entanglement of Robust and Non-Robust Features in Neural Network Classifiers
My opinion:
Seems like an even stronger version of Adversarial Examples Are Not Bugs, They Are Features. Not only are (some) adversarial examples exploiting genuinely useful classification features, the exploited features are often correlates of the "true" features we humans use to classify images.
Going forward
I hope that list was helpful for some people. If you have a paper that seems alignment relevant (especially a paper that's not well known in alignment circles), please feel free to link it in the comments. Also feel free to share any other feedback or comments you have on the papers I did link.
I hope to produce one of these lists every week or so. I doubt I'll be able to do 10 papers a week, however. We'll see how it goes, I guess.