Crossposted from my Substack. Epistemic status: wild brainstorming. LLMs don’t seem to live in time. That is, they don’t seem to have a continuous personal identity the way we typically understand it for humans. My claim here, however, is that the question of whether LLMs somehow represent or track temporal...
Crossposted from my Substack. For more context, you may also want to read The Intentional Stance, LLMs Edition. Why Am I Writing This: I recently realized that, in applying the intentional stance to LLMs, I have not fully spelled out what exactly I’m applying the intentional stance to. For the...
Crossposted from my Substack. I spent the weekend at Lighthaven, attending the Eleos conference. In this post, I share thoughts and updates as I reflect on talks, papers, and discussions, and put out some of my takes since I haven't written about this topic before. I divide my thoughts into...
Conor Griffin interviewed Been Kim and Neel Nanda and posted their discussion here. They address a series of important questions about what explaining AI systems should look like, the role of mechanistic interpretability, the problems with solving explainability through chain-of-thought reasoning, and using AIs to produce explanations. The following are...
This is the abridged version of my second dissertation chapter. Read the first here. Thanks to everyone I've discussed this with, especially M.A. Khalidi, Lewis Smith, and Aysja Johnson. TL;DR: Applying Marr's three levels to LLMs seems useful, but quickly proves itself to be a leaky abstraction. Despite the...
TL;DR: If you are thinking of using interpretability to help detect strategic deception, then there's likely a problem you need to solve first: how are intentional descriptions (like deception) related to algorithmic ones (like understanding the mechanisms models use)? We discuss this problem and try to outline some constructive directions....
Why I'm writing this: I'm about to teach my AI safety course for the fourth time. As I'm updating the syllabus for the upcoming semester, I summarize my observations on what can go wrong when teaching AI safety. These have mostly not happened during my teaching but are generally...