Apply to the Redwood Research Mechanistic Interpretability Experiment (REMIX), a research program in Berkeley
This winter, Redwood Research is running a coordinated research effort on mechanistic interpretability of transformer models. We’re excited about recent advances in mechanistic interpretability and now want to try to scale our interpretability methodology to a larger group doing research in parallel. REMIX participants will work to provide mechanistic explanations of model behaviors, using our causal scrubbing methodology to formalize and evaluate interpretability hypotheses. We hope to produce many more explanations of model behaviors akin to our recent work investigating behaviors of GPT-2-small, toy language models, and models trained on algorithmic tasks.

We think this work is a particularly promising research direction for mitigating existential risks from advanced AI systems (more in Goals and FAQ). Since mechanistic interpretability is currently a small sub-field of machine learning, we think it’s plausible that REMIX participants could make important discoveries that significantly further the field. We also think participants will learn skills valuable for many styles of interpretability research, and also for ML research more broadly.

Apply here by Sunday, November 13th [DEADLINE EXTENDED] to be a researcher in the program. Apply sooner if you’d like to start early (details below) or receive an earlier response.

Some key details:

* We expect to accept 30-50 participants.
* The research program will take place in Berkeley, CA.
* We plan to have the majority of researchers participate during the months of December and/or January (depending on availability), although we may invite some to begin earlier and are open to some starting as soon as possible.
* We expect researchers to participate for a month minimum, and (all else equal) will prefer applicants who are able to come for longer. We’ll pay for housing and travel, and also pay researchers for their time. We’ll clarify the payment structure prior to asking people to commit to the program.
* We
"When something quite bad can occur in a small number of actions (potentially even in a single trajectory) or in a short period of time, it's important to sometimes block suspicious looking actions until they are reviewed."
Why? I think there is not perfect overlap between the categories “single actions that are catastrophic, i.e. high-stakes failures” and “actions that require monitoring and blocking in the moment”. One example where they come apart is “producing a batch of data that would cause the next generation of models to be misaligned”: this might be a high-stakes failure, but you have plenty of time to catch it before you actually train that model. Another example would be “writing code that is catastrophically insecure but only gets run rarely”.
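To make the distinction concrete, here is a minimal sketch of the kind of pipeline I have in mind (my own illustration, not anything from the original post; the `EffectLatency`, `Action`, and `ControlPipeline` names are made up for this example). Only actions whose effects land immediately get blocked synchronously when they look suspicious; slow-acting outputs like a batch of training data are let through but queued for review before anything downstream uses them.

```python
from dataclasses import dataclass, field
from enum import Enum, auto
from typing import Callable, List


class EffectLatency(Enum):
    IMMEDIATE = auto()  # effect happens as soon as the action runs (e.g. executing code)
    DEFERRED = auto()   # effect only lands later (e.g. data used to train the next model)


@dataclass
class Action:
    description: str
    latency: EffectLatency
    run: Callable[[], None]


@dataclass
class ControlPipeline:
    monitor: Callable[[Action], float]  # returns a suspicion score in [0, 1]
    block_threshold: float = 0.9
    review_queue: List[Action] = field(default_factory=list)

    def handle(self, action: Action) -> None:
        suspicion = self.monitor(action)
        if action.latency is EffectLatency.IMMEDIATE and suspicion >= self.block_threshold:
            # Catastrophe could occur within this single step, so the action is
            # held until a human reviews it.
            print(f"Blocked pending review: {action.description}")
            self.review_queue.append(action)
            return
        action.run()
        if action.latency is EffectLatency.DEFERRED:
            # Still a potential high-stakes failure, but there is time to audit the
            # output before it takes effect, so it doesn't need in-the-moment blocking.
            self.review_queue.append(action)
```

The second branch is where the two categories come apart: the training-data and rarely-run-code examples are high stakes, but the safety comes from auditing the queue before the output is actually used, not from blocking at the moment the action is taken.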