Eleni Angelou

A Problem to Solve Before Building a Deception Detector

TL;DR: If you are thinking of using interpretability to help with strategic deception, then there's likely a problem you need to solve first: how are intentional descriptions (like deception) related to algorithmic ones (like understanding the mechanisms models use)? We discuss this problem and try to outline some constructive directions. 1. Introduction A commonly discussed AI risk scenario is strategic deception: systems that execute sophisticated planning against their creators to achieve undesired ends. In particular, this is insidious because a system that is capable of strategic planning and also situationally aware might be able to systematically behave differently when under observation, and thus evaluation methods that are purely behavioral could become unreliable. One widely hypothesized potential solution to this is to use interpretability, understanding the internals of the model, to detect such strategic deception. We aim to examine this program and a series of problems that appear on its way. We are primarily concerned with the following: * Strategic deception is an intentional description or an intentional state. By intentional state, we mean that it involves taking the intentional stance towards a system and attributing mental properties, like beliefs and desires, to it; for example, believing that it is raining is an intentional state, while being wet is not.[1] In contrast to this, current interpretability has focused on the algorithmic description of behaviors. The safety-relevant properties to detect would largely appear as strategic intentional states about potentially deceptive actions.[2]We argue that intentional states are an importantly different level of description from algorithmic states, and it is not clear how to describe the former in terms of the latter. We think that studying the connection between algorithmic description and intentional states has been underexplored, but it is likely an important prerequisite to building a de

78Feb 7, 2025

Eleni Angelou

Message

410

Large Language Models Live in Time

Crossposted from my Substack. Epistemic status: wild brainstorming. LLMs don’t seem to live in time. That is, they don’t seem to have a continuous personal identity the way we typically understand this for humans. My claim here, however, is that the question of whether LLMs somehow represent or track temporal...

Feb 920

Tools, Agents, and Sycophantic Things

Crossposted from my Substack. For more context, you may also want to read The Intentional Stance, LLMs Edition. Why Am I Writing This I recently realized that, in applying the intentional stance to LLMs, I have not fully spelled out what exactly I’m applying the intentional stance to. For the...

Dec 6, 202525

Takeaways from the Eleos Conference on AI Consciousness and Welfare

Crossposted from my Substack. I spent the weekend at Lighthaven, attending the Eleos conference. In this post, I share thoughts and updates as I reflect on talks, papers, and discussions, and put out some of my takes since I haven't written about this topic before. I divide my thoughts into...

Nov 25, 202533

Notes on "Explaining AI Explainability"

Conor Griffin interviewed Been Kim and Neel Nanda and posted their discussion here. They address a series of important questions about what explaining AI systems should look like, the role of mechanistic interpretability, the problems with solving explainability through chain-of-thought reasoning, and using AIs to produce explanations. The following are...

Oct 24, 202520

Three Levels for Large Language Model Cognition

This is the abridged version of my second dissertation chapter. Read the first here. Thanks to everyone I've discussed this with, and especially, M.A. Khalidi, Lewis Smith, and Aysja Johnson. TL;DR: Applying Marr's three levels to LLMs seems useful, but quickly proves itself to be a leaky abstraction. Despite the...

Feb 25, 202521

A Problem to Solve Before Building a Deception Detector

Feb 7, 202578

Failure Modes of Teaching AI Safety

Why I'm writing this I'm about to teach my AI safety course for the fourth time. As I'm now updating the syllabus for the upcoming semester, I summarize my observations on what can go wrong when teaching AI safety. These have mostly not happened during my teaching but are generally...

Jun 25, 202420

Load More (7/31)

LESSWRONG
LW

LESSWRONG
LW

Eleni Angelou

Eleni Angelou

Eleni Angelou

A Problem to Solve Before Building a Deception Detector

I designed an AI safety course (for a philosophy department)

The Intentional Stance, LLMs Edition

Takeaways from the Eleos Conference on AI Consciousness and Welfare

Eleni Angelou

Large Language Models Live in Time

Tools, Agents, and Sycophantic Things

Takeaways from the Eleos Conference on AI Consciousness and Welfare

Notes on "Explaining AI Explainability"

Three Levels for Large Language Model Cognition

A Problem to Solve Before Building a Deception Detector

Failure Modes of Teaching AI Safety

A Problem to Solve Before Building a Deception Detector

I designed an AI safety course (for a philosophy department)

The Intentional Stance, LLMs Edition

Takeaways from the Eleos Conference on AI Consciousness and Welfare

Large Language Models Live in Time

Tools, Agents, and Sycophantic Things

Takeaways from the Eleos Conference on AI Consciousness and Welfare

Notes on "Explaining AI Explainability"

Three Levels for Large Language Model Cognition

A Problem to Solve Before Building a Deception Detector

Failure Modes of Teaching AI Safety