Cameron Berg

Interim Research Report: Mechanisms of Awareness

8mo

A quick post on a probably-real inadequate equilibrium mostly inspired by trying to think through what happened to Chance the Rapper.

Potentially ironic artifact if it accrues karma.

1. The sculptor's garden

A sculptor worked in solitude for years, carving strange figures in his remote garden. Most of his statues failed: some cracked in winter, others looked wrong against the landscape. But occasionally, very rarely, one seemed to work.

The first visitors stumbled upon the garden by accident. They found themselves stopped by his angels—figures that somehow held both sorrow and joy, wings that seemed about to flitter.

Word traveled slowly. More visitors came, drawn by something they couldn't quite name.

The sculptor felt recognized for the first... (read 936 more words →)

Replying toInterim Research Report: Mechanisms of Awareness

Cameron Berg9mo

Nice work. To me, this seems less like evidence that self-awareness is trivial, and more like evidence that it’s structurally latent. A single steering vector makes the model both choose risky options and say “I am risk-seeking”—despite the self-report behavior never being trained for. That suggests the model’s internal representations of behavior and linguistic self-description are already aligned. It’s probably not introspecting in a deliberate sense, but the geometry makes shallow self-modeling an easy, natural side effect.

Mistral Large 2 (123B) seems to exhibit alignment faking

Marc Carauleanu

Marc Carauleanu, Diogo de Lucena, Gunnar_Zarncke, Cameron Berg, Kvee, Mike Vaiana, Trent Hodgeson

11mo

UPDATE: Recent work with improved AF and compliance gap classifiers disagrees with our results. We recommend using the improved classifiers.

Summary

We wanted to briefly share an early takeaway from our exploration into alignment faking: the phenomenon appears fairly rare among the smaller open-source models we tested (including reasoning models). However, notably, we did find the smallest open-source model to date that does exhibit alignment faking—Mistral Large 2.

Code generation (Codestral 22B) and math performance (Codestral Mamba) of Mistral Large 2, Llama 3.1 405B, and Llama 3.1 70B. Mistral Large 2 outperforms the previous Mistral Large, and performs on par with leading models such as GPT 4o, Claude 3 Opus, and Llama 3.1 405B.
Source: Mistral

Experimental... (read 3766 more words →)

Replying toReducing LLM deception at scale with self-other overlap fine-tuning

Reducing LLM deception at scale with self-other overlap fine-tuning

Makes sense, thanks—can you also briefly clarify what exactly you are pointing at with 'syntactic?' Seems like this could be interpreted in multiple plausible ways, and looks like others might have a similar question.

Replying toReducing LLM deception at scale with self-other overlap fine-tuning

Reducing LLM deception at scale with self-other overlap fine-tuning

The idea to combine SOO and CAI is interesting. Can you elaborate at all on what you were imagining here? Seems like there are a bunch of plausible ways you could go about injecting SOO-style finetuning into standard CAI—is there a specific direction you are particularly excited about?

Reducing LLM deception at scale with self-other overlap fine-tuning

Marc Carauleanu

Marc Carauleanu, Diogo de Lucena, Gunnar_Zarncke, Kvee, Cameron Berg, Mike Vaiana, Trent Hodgeson

This research was conducted at AE Studio and supported by the AI Safety Grants program administered by Foresight Institute with additional support from AE Studio.

Summary

In this post, we summarize the main experimental results from our new paper, "Towards Safe and Honest AI Agents with Neural Self-Other Overlap", which we presented orally at the Safe Generative AI Workshop at NeurIPS 2024.

This is a follow-up to our post Self-Other Overlap: A Neglected Approach to AI Alignment, which introduced the method last July.

Our results show that Self-Other Overlap (SOO) fine-tuning substantially^[1] reduces deceptive responses in LLMs with minimal impact on general performance across the scenarios we evaluated.

LLM Experimental Setup

We adapted a text scenario from Hagendorff designed... (read 1506 more words →)

162

Alignment can be the ‘clean energy’ of AI

Cameron Berg, Kvee, Trent Hodgeson

Not all that long ago, the idea of advanced AI in Washington, DC seemed like a nonstarter. Policymakers treated it as weird sci‐fi-esque overreach/just another Big Tech Thing. Yet, in our experience over the last month, recent high-profile developments—most notably, DeepSeek's release of R1 and the $500B Stargate announcement—have shifted the Overton window significantly.

For the first time, DC policy circles are genuinely grappling with advanced AI as a concrete reality rather than a distant possibility. However, this newfound attention has also brought uncertainty: policymakers are actively searching for politically viable approaches to AI governance, but many are increasingly wary of what they see as excessive focus on safety at the expense of... (read 2279 more words →)

Replying toMaking a conservative case for alignment

We've spoken to numerous policymakers and thinkers in DC. The goal is to optimize for explaining to these folks why alignment is important, rather than the median conservative person per se (ie, DC policymakers are not "median conservatives").

Replying toMaking a conservative case for alignment

Fixed, thanks!

Replying toMaking a conservative case for alignment

Note this is not equivalent to saying 'we're almost certainly going to get AGI during Trump's presidency,' but rather that there will be substantial developments that occur during this period that prove critical to AGI development (which, at least to me, does seem almost certainly true).

Replying toMaking a conservative case for alignment

One thing that seems strangely missing from this discussion is that alignment is in fact, a VERY important CAPABILITY that makes it very much better. But the current discussion of alignment in the general sphere acts like 'alignment' is aligning the AI with the obviously very leftist companies that make it rather than with the user!

Agree with this—we do discuss this very idea at length here and also reference it throughout the piece.

That alignment is to the left is one of just two things you have to overcome in making conservatives willing to listen. (The other is obviously the level of danger.)

I think this is a good distillation of the key bottlenecks and seems helpful for anyone interacting with lawmakers to keep in mind.

Science advances one funeral at a time

Cameron Berg, Kvee, phgubbins, Trent Hodgeson

Trump and the Republican party will wield broad governmental control during what will almost certainly be a critical period for AGI development. In this post, we want to briefly share various frames and ideas we’ve been thinking through and actively pitching to Republican lawmakers over the past months in preparation for the possibility of a Trump win.

Why are we sharing this here? Given that >98% of the EAs and alignment researchers we surveyed earlier this year identified as everything-other-than-conservative, we consider thinking through these questions to be another strategically worthwhile neglected direction.

(Along these lines, we also want to proactively emphasize that politics is the mind-killer, and that, regardless of one’s ideological convictions, those who... (read 1817 more words →)

208

Self-prediction acts as an emergent regularizer

Cameron Berg, Kvee, Diogo de Lucena, Trent Hodgeson

Major scientific institutions talk a big game about innovation, but the reality is that many of the mechanisms designed to ensure quality—peer review, funding decisions, the academic hierarchy—explicitly incentivize incremental rather than revolutionary progress; are far more oriented towards 'day science' than 'night science.'^[1]

Thomas Kuhn's now-famous notion of paradigm shifts was pointing at precisely this phenomenon. When scientists work within what Kuhn called "normal science," they're essentially solving low- to medium-stakes puzzles within their field's accepted framework. While it's fairly easy to evaluate the relative quality of work that occurs within any given paradigm, Kuhn argued it's nearly impossible for scientists to reason about the relative power of different paradigms for a given field—especially when they... (read 447 more words →)

100

•••

The case for a negative alignment tax

Cameron Berg, Kvee, Mike Vaiana, Diogo de Lucena, florin_pop, Trent Hodgeson

TL;DR:

In our recent work with Professor Michael Graziano (arXiv, thread), we show that adding an auxiliary self-modeling objective to supervised learning tasks yields simpler, more regularized, and more parameter-efficient models.

Across three classification tasks and two modalities, self-modeling consistently reduced complexity (lower RLCT, narrower weight distribution). This restructuring effect may help explain the putative benefits of self-models in both ML and biological systems.

Agents who self-model may be reparameterized to better predict themselves, predict others, and be predicted by others. Accordingly, we believe that further exploring the potential effects of self-modeling on cooperation emerges as a promising neglected approach to alignment. This approach may also exhibit a 'negative alignment tax' to the degree that it may... (read 1120 more words →)