Sonia Joseph

Getting PhD in multimodal interpretability and alignment at Mila.

Ok, thank you for your openness. I find that in-person conversations about sensitive matters like these are easier, since tone, facial expression, and body language are very important here. It is possible that my past comments on EA that you refer to came off as more hostile than intended due to the text-based medium.

Fwiw, the contents of this original post actually have nothing to do with EA itself, or the past articles that mentioned me.

Hi habryka,

Thank you for your comment. It contains a few assumptions that are not quite true. I am not sure that the comment section here is the best place to address them, and in-person diplomacy may be wise. I would be down to get coffee the next time we are in the same city and discuss in more detail.

Apologies, the post is still getting approved by the EA forum as I've never posted there under this account.

Disclaimer: I am not writing this message in connection to my employer, my institution, or any third party. This is a personal judgement call, exercised solely in my own capacity.

Summary

Over the past few months, I have been involved in supporting insider reports about misconduct in AGI frontier labs. In particular, I’ve been supporting the victim of a crime perpetrated by an AGI frontier lab leader.

I am reaching out to the AI safety and governance community for support regarding their legal case, which has significant implications for AI development.

Details

I have known the crime victim well for many years, and they have earned my highest trust. After intensive discussions with them, I can attest... (read 592 more words →)

Good note, thank you.

Cross-posted here.

A note to the LW community

I wrote this post after spending my summer in the MATS community looking at sparse autoencoders on vision models, and then joining FAIR/Meta as a visiting researcher on their video generation team. It has been absolutely fascinating navigating both the cultural and research differences, but that is a subject for another post. (During my time at MATS, I got into several lively debates with other cohort members about going to Meta after the program. I would be more than happy to discuss my rationale for doing so in a separate post! But for this post, I will keep the focus on the research itself, and I... (read 4419 more words →)

TL;DR

We apply mechinterp techniques to VPT, OpenAI's Minecraft agent. We also find a new case of goal misgeneralization - VPT kills a villager when we force one to stand under some tree leaves.

Abstract

Understanding the mechanisms behind decisions taken by large foundation models in sequential decision-making tasks is critical to ensuring that such systems operate transparently and safely. In this work, we perform exploratory analysis on the Video PreTraining (VPT) Minecraft-playing agent, one of the largest open-source vision-based agents. We aim to illuminate its reasoning mechanisms by applying various interpretability techniques. First, we analyze the attention mechanism while the agent solves its training task - crafting a diamond pickaxe. The agent

... (read more)

Thanks for your comment. Some follow-up thoughts, especially regarding your second point:

There is sometimes an implicit zeitgeist in the mech interp community that other modalities will simply be an extension or subcase of language.

I want to flip the frame, and consider the case where other modalities may actually be a more general case for mech interp than language. As a loose analogy, the relationship between language mech interp and multimodal mech interp may be like the relationship between algebra and abstract algebra. I have two points here.

Alien modalities and alien world models

The reason that I’m personally so excited by non-language mech interp is due to the philosophy of language (Chomsky/Wittgenstein). I’ve been... (read 772 more words →)

Thank you for this. How would you think about the pros/cons of influence functions vs activation patching or direct logit attribution in terms of localizing a behavior in the model?

Right now, there's a lot to exploit with CLIP and ViTs so that will be the focus for a while. We may expand to Flamingo or other models if there is demand.

Other modalities would be fascinating. I imagine they have their own idiosyncrasies. I would be interested in audio in the future but not at the expense of first exploiting vision.

Ideally, yes; a unified interp framework for any modality is the north star. I do think this will be a community effort. Research in language built off findings from many different groups and institutions. Vision and other modalities are currently just not in the same place.

It was surprising to me too. It is possible that the layers do not have aligned basis vectors. That's why corroborating the results with a TunedLens is a smart next step, as they currently may be misleading.

Noted, and thank you for flagging. I mostly agree, and do not have much to add (as we seem mostly in agreement that diverse, blue-sky research is good), other than that this may shape the way I present this project going forward.

Behold the dogit lens. Patch-level logit attribution is an emergent segmentation map.
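
For readers who want a concrete picture of what "patch-level logit attribution" could mean, here is a minimal sketch. It assumes each image patch has an activation vector that can be projected through a class head, and that labelling each patch with its top class gives a coarse segmentation map. The function names, shapes, and toy numbers are hypothetical and are not Prisma's actual API.

```python
# Hypothetical sketch of patch-level logit attribution for a ViT-style classifier.
# All names, shapes, and values are illustrative assumptions, not Prisma's API.

def patch_logits(patch_acts, w_out):
    """Project each patch's activation vector through the class head.

    patch_acts: list of n_patches vectors, each of length d_model
    w_out:      d_model x n_classes matrix, given as a list of rows
    returns:    n_patches x n_classes list of per-patch logits
    """
    n_classes = len(w_out[0])
    return [
        [sum(a * w_out[i][c] for i, a in enumerate(vec)) for c in range(n_classes)]
        for vec in patch_acts
    ]


def segmentation_map(patch_acts, w_out, grid_w):
    """Label each patch with its top-scoring class and reshape into grid rows."""
    logits = patch_logits(patch_acts, w_out)
    labels = [max(range(len(row)), key=row.__getitem__) for row in logits]
    return [labels[r:r + grid_w] for r in range(0, len(labels), grid_w)]


# Toy example: a 2x2 patch grid, d_model = 2, class 0 = "dog", class 1 = "background".
w_out = [[1.0, 0.0],
         [0.0, 1.0]]
patch_acts = [[2.0, 0.1], [0.2, 1.5],
              [1.8, 0.3], [0.1, 0.9]]
seg = segmentation_map(patch_acts, w_out, grid_w=2)
```

In this toy case the left column of patches is labelled "dog" and the right column "background". With a real model, the activations would come from a chosen layer's residual stream, and the caveat about layers not sharing an aligned basis would apply to anything read off before the final layer.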

Join our Discord here.

This article was written by Sonia Joseph, in collaboration with Neel Nanda, and incubated in Blake Richards’s lab at Mila and in the MATS community. Thank you to the Prisma core contributors, including Praneet Suresh, Rob Graham, and Yash Vadi.

Full acknowledgements of contributors are at the end. I am grateful to my collaborators for their guidance and feedback.

Outline

Part One: Introduction and Motivation
Part Two: Tutorial Notebooks
Part Three: Brief ViT Overview
Part Four: Demo of Prisma’s Functionality
- Key features, including logit attribution, attention head visualization, and activation patching.
- Preliminary research results obtained using Prisma, including emergent segmentation maps and canonical attention heads.
Part Five: FAQ,

... (read 4132 more words →)

Thank you for this write-up!

I am wondering how to relate causal scrubbing to @Arthur Conmy's ACDC method.

It seems that causal scrubbing becomes relevant when one is testing a relatively specific hypothesis (e.g. induction heads), while ACDC can work with simply a dataset, metric, and behavior? If so, would it be accurate to say that ACDC would be a more general pass, and part of an earlier workflow, to develop your hypothesis? And causal scrubbing can validate it? Curious about trade-offs in types of insight, resources, computational complexity, positioning in one's mech interp workflow, and in what circumstances one would use each.

LESSWRONG
LW

Laying the Foundations for Vision and Multimodal Mechanistic Interpretability & Open Problems

Litigate-for-Impact: Preparing Legal Action against an AGI Frontier Lab Leader

Bridging the VLM and mech interp communities for multimodal interpretability

Interpretability in Action: Exploratory Analysis of VPT, a Minecraft Agent
