Interpretability in Action: Exploratory Analysis of VPT, a Minecraft Agent
TL;DR: We apply mechanistic interpretability techniques to VPT, OpenAI's Minecraft agent. We also find a new case of goal misgeneralization: VPT kills a villager when we force it to stand under some tree leaves.

Abstract

> Understanding the mechanisms behind decisions taken by large foundation models in sequential decision making...