[Research Note] Optimizing The Final Output Can Obfuscate CoT
Produced as part of MATS 8.0 under the mentorship of Alex Turner and Alex Cloud. This research note gives an overview of early results on which we are looking for feedback.

TL;DR: We train language models with RL in toy environments. We show that penalizing some property of the output is sufficient to suppress that property in the chain of thought as well, even when that property is relevant to task completion. For example, when we penalize a model for mentioning in its output that it completed a task via a certain form of cheating, its reasoning also omits this fact. We therefore suspect that training against an output monitor is sufficient to cause obfuscated CoTs in more realistic settings.

Introduction

Chain of thought (CoT) supervision appears in many control and scalable oversight protocols. It has been argued that being able to monitor CoTs for unwanted behavior is a critical safety property that we should seek to maintain. Recent work by OpenAI showed that using a CoT monitor as a training signal can cause the CoT to become obfuscated: instead of preventing the unwanted behavior, the training causes the model to perform the behavior in a way that the monitor cannot detect. As a result, the authors recommend against optimizing the CoT to look good.

We show that when post-training a language model with RL in toy environments, training against a monitor that sees only the output, not the CoT, suffices to induce CoT obfuscation. We refer to this effect as feedback spillover.

We propose two causes of feedback spillover:

1. Parametric feedback spillover: Models trained to produce safe-looking final outputs may generalize to making all of their generations, including CoTs, look safe.
   * Proposed mitigation: Use distinct models for CoT and output generation (see Daniel Kokotajlo's Shoggoth+Face proposal). To avoid loaded terminology, we refer to the "mind" (which generates the CoT) and the "face" (which generates the user-facing output).
2. Conditional feedback spill