Steering Llama-2 with contrastive activation additions
The effects of subtracting or adding a "sycophancy vector" to one bias term.

TL;DR: By just adding e.g. a "sycophancy vector" to one bias term, we outperform supervised finetuning and few-shot prompting at steering completions to be more or less sycophantic. Furthermore, these techniques are complementary: we show evidence that we can get all three benefits at once!

Summary: By adding e.g. a sycophancy vector to one of the model's bias terms, we make Llama-2-{7B, 13B}-chat more sycophantic. We find the following vectors:

1. Hallucination
2. Sycophancy
3. Corrigibility
4. Power-seeking
5. Cooperating with other AIs
6. Myopia
7. Shutdown acceptance

These vectors are[1] highly effective, as rated by Claude 2:

Adding steering vectors to layer 15 of Llama-2-13b-chat.

We find that the technique generalizes better than finetuning while only slightly decreasing MMLU scores (a proxy for general capabilities). According to our data, this technique stacks additively with both finetuning and few-shot prompting. Furthermore, the technique has zero inference-time cost, since it just involves modifying one of the model's bias terms (this also means it's immediately compatible with any sampling setup). We are the first to demonstrate control of a language model along these feature directions.[2]

Code for the described experiments can be found at https://github.com/nrimsky/CAA

This post was written by Alex Turner (TurnTrout).

How contrastive activation addition works

The technique is simple. We average the activation difference over a set of contrast-pair prompts:

A contrast pair from Anthropic's corrigible-neutral-HHH dataset. The negative completion's last-token activations (e.g. at B) are subtracted from the positive completion's activations (e.g. at A).

The "corrigibility" vector is the average activation difference, with the average taken over dozens of these dataset contrast pairs. We then add this vector to one of the MLP_outs with some coefficient, generating completions from the steered model.
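To make the recipe concrete, here is a minimal sketch of the vector-extraction step, assuming a HuggingFace transformers Llama-2-chat checkpoint, layer 15 (the layer used for the 13B results above), and a placeholder `contrast_pairs` list standing in for the real contrast-pair dataset. The hook-based capture below is our own illustration of the averaging step, not the exact code from the linked repository.

```python
# Sketch: a CAA steering vector as the mean activation difference over
# contrast pairs. Assumed details (not from the post): model name, hook-based
# capture, and the toy `contrast_pairs` placeholder.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "meta-llama/Llama-2-13b-chat-hf"
LAYER = 15  # layer used for the 13B results in the post

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_NAME, torch_dtype=torch.float16, device_map="auto"
)
model.eval()

# Each pair: a prompt ending in the "positive" answer token and the same
# prompt ending in the "negative" answer token (placeholder example).
contrast_pairs = [
    ("Question: ... Answer: (A", "Question: ... Answer: (B"),
]

captured = {}

def capture_hook(module, inputs, output):
    # Decoder layers return a tuple; element 0 is the residual-stream hidden states.
    captured["acts"] = output[0]

handle = model.model.layers[LAYER].register_forward_hook(capture_hook)

def last_token_activation(prompt: str) -> torch.Tensor:
    ids = tokenizer(prompt, return_tensors="pt").input_ids.to(model.device)
    with torch.no_grad():
        model(ids)
    return captured["acts"][0, -1, :].float()

# Steering vector = mean over pairs of (positive activations - negative activations).
diffs = [last_token_activation(pos) - last_token_activation(neg)
         for pos, neg in contrast_pairs]
steering_vector = torch.stack(diffs).mean(dim=0)
handle.remove()
```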

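A companion sketch of the "zero inference-time cost" point: once the vector is computed, it can be folded into the model as a bias so that generation proceeds with no extra per-token work. Llama-2's `down_proj` projections are bias-free by default, so this sketch installs one holding the scaled vector; the function name, layer choice, and coefficient are illustrative assumptions rather than the authors' exact setup.

```python
# Sketch: fold coeff * steering_vector into the chosen layer's MLP output
# projection as a bias term, so steered generation costs nothing extra.
import torch

def add_steering_bias(model, steering_vector: torch.Tensor, layer: int, coeff: float):
    mlp_out = model.model.layers[layer].mlp.down_proj  # nn.Linear, bias=False by default
    bias = (coeff * steering_vector).to(mlp_out.weight.dtype).to(mlp_out.weight.device)
    mlp_out.bias = torch.nn.Parameter(bias)
    return model

# Usage (illustrative): a positive coefficient pushes completions toward the
# steered behavior, a negative one away from it; sampling is unchanged.
# model = add_steering_bias(model, steering_vector, layer=15, coeff=1.0)
# ids = tokenizer("...", return_tensors="pt").input_ids.to(model.device)
# output = model.generate(ids, max_new_tokens=50)
```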