A Review of In-Context Learning Hypotheses for Automated AI Alignment Research
This project was completed as part of the Mentorship in Alignment Research Students (MARS London) programme under the supervision of Bogdan-Ionut Cirstea, investigating the promise of automated AI alignment research. I would like to thank Bogdan-Ionut Cirstea, Erin Robertson, Clem Von Stengel, Alexander Gietelink Oldenziel, Severin Field, Aaron Scher, and everyone who commented on my draft for the feedback and encouragement that helped me create this post.

TL;DR

The mechanism behind in-context learning is an open question in machine learning. There are several hypotheses about what in-context learning is doing, each with different implications for alignment. This document reviews the hypotheses that attempt to explain in-context learning, finding some overlap between them and good explanatory power from each, and describes the implications these hypotheses have for automated AI alignment research.

Introduction

As their capabilities and size have grown, large language models (LLMs) have begun to demonstrate novel behaviours when prompted with natural language. Pre-trained LLMs can effectively carry out a range of behaviours when prompted. While predicting the most probable next token in a sequence is somewhat well understood, LLMs display another interesting behaviour, in-context learning (ICL), which is harder to understand from the standpoint of traditional supervised learning.

In-context learning is an emergent behaviour in pre-trained LLMs where the model seems to perform task inference (learn to do a task) and then to perform the inferred task, despite only having seen input-output pairs supplied in the prompt. The model does this without changing its parameters/weights, in contrast to traditional machine learning.

Figure: A visual example of ICL, from the Stanford AI Lab Blog, Rong (2021).

In traditional supervised learning, a model’s weights are changed using an optimisation algorithm such as gradient descent. The main reason why ICL is a s