A key concern about chain-of-thought monitoring is that optimization pressure on the CoT during RL could drive models toward encoded reasoning, where models reason in ways that are not readable or that look like innocuous text (steganography). If a model is penalized when a monitor catches unwanted reasoning, RL implicitly...
I've spent the last month thinking through what kind of research is needed in the area of CoT monitorability. To make it possible for others to critique and build on my thinking, I've written up my thoughts in a paper. In this paper, I lay out: * How CoT monitoring...
This post gives a brief overview and some personal thoughts on a new ICLR workshop paper that I worked on together with Seamus. TLDR: * We developed a new method for automatically labeling features in neural networks using token-space gradient descent * Instead of asking an LLM to generate hypotheses...
Epistemic status: I wrote the post mostly for myself, in the process of understanding the theory behind Singular Learning Theory, based on the SLT low lecture series. This is a proof sketch with the thoroughness level of an experimental physics lecture: enough to get the intuitions but consult someone else...
The effects of subtracting or adding a "sycophancy vector" to one bias term. TL;DR: By just adding e.g. a "sycophancy vector" to one bias term, we outperform supervised finetuning and few-shot prompting at steering completions to be more or less sycophantic. Furthermore, these techniques are complementary: we show evidence that...
Alignment by Default is the idea that achieving alignment in artificial general intelligence (AGI) may be more straightforward than initially anticipated. When an AI possesses a comprehensive and detailed world model, it inherently represents human values within that model. To align the AGI, it's merely necessary to extract these values...
This post was written as part of the AI Safety Mentors and Mentees program. My mentor is Jacques Thibodeau. In this post, I will distinguish between two hypotheses that are often conflated. To disambiguate, I first suggest two different names for these hypotheses so I can talk about them separately:...