This is a linkpost for our two recent papers: 1. An exploration of using degeneracy in the loss landscape for interpretability https://arxiv.org/abs/2405.10927 2. An empirical test of an interpretability technique based on the loss landscape https://arxiv.org/abs/2405.10928 This work was produced at Apollo Research in collaboration with Kaarel Hanni (Cadenza Labs),...
Some people worry about interpretability research being useful for AI capabilities and potentially net-negative. As far as I'm aware, this worry had mostly been theoretical, but now there is a real-world example: the Hungry Hungry Hippos (H3) paper. Tl;dr: The H3 paper * Proposes an architecture for...
I recently gave a technical AI safety research overview talk at EAGx Berlin. Many people told me they found the talk insightful, so I’m sharing the slides here as well. I edited them for clarity and conciseness, and added explanations. Outline This presentation contains 1. An overview of different research...
This is a short impression of the AI Safety Europe Retreat (AISER) 2023 in Berlin. Tl;dr: 67 people working on AI safety technical research, AI governance, and AI safety field building came together for three days to learn, connect, and make progress on AI safety. Format The retreat was an...
Finite factored sets are a new paradigm for talking about causality. You can use them to do some things you can't do with Pearl's causal graphs, such as inferring a causal arrow between two binary variables. Also, finite factored sets are a really neat mathematical structure: they are a...