AI #108: Straight Line on a Graph

Zvi

The x-axis of the graph is time. The y-axis of the graph is the log of ‘how long a software engineering task can AIs reliably succeed at doing.’

The straight line says the answer doubles roughly every 7 months. Yikes.

Upcoming: The comment period on America’s AI strategy is over, so we can finish up by looking at Google’s and MIRI’s and IFP’s proposals, as well as Hollywood’s response to OpenAI and Google’s demands for unlimited uncompensated fair use exceptions from copyright during model training. I’m going to pull that out into its own post so it can be more easily referenced.

There’s also a draft report on frontier model risks from California and it’s… good?

Also upcoming: My take on OpenAI’s new future good-at-writing model.

Language Models Offer Mundane Utility. I want to, is there an app for that?
Language Models Don’t Offer Mundane Utility. Agents not quite ready yet.
Huh, Upgrades. Anthropic efficiency gains, Google silently adds features.
Seeking Deeply. The PRC gives DeepSeek more attention. That cuts both ways.
Fun With Media Generation. Fun with Gemini 2.0 Image Generation.
Gemma Goals. Hard to know exactly how good it really is.
On Your Marks. Tic-Tac-Toe bench is only now getting properly saturated.
Choose Your Fighter. o3-mini disappoints on Epoch retest on frontier math.
Deepfaketown and Botpocalypse Soon. Don’t yet use the bot, also don’t be the bot.
Copyright Confrontation. Removing watermarks has been a thing for a while.
Get Involved. Anthropic, SaferAI, OpenPhil.
In Other AI News. Sentience leaves everyone confused.
Straight Lines on Graphs. METR finds reliable SWE task length doubling rapidly.
Quiet Speculations. Various versions of takeoff.
California Issues Reasonable Report. I did not expect that.
The Quest for Sane Regulations. Mostly we’re trying to avoid steps backwards.
The Week in Audio. Esban Kran, Stephanie Zhan.
Rhetorical Innovation. Things are not improving.
We’re Not So Different You and I. An actually really cool alignment idea.
Anthropic Warns ASL-3 Approaches. Danger coming. We need better evaluations.
Aligning a Smarter Than Human Intelligence is Difficult. It’s all happening.
People Are Worried About AI Killing Everyone. Killing all other AIs, too.
The Lighter Side. Not exactly next level prompting.

Language Models Offer Mundane Utility

Arnold Kling spends 30 minutes trying to figure out how to leave a WhatsApp group, requests an AI app to do things like things like this via the ‘I want to’ app, except that app exists and it’s called Claude (or ChatGPT) and this should have taken 1 minute tops? To be fair, Arnold then extends the idea to tasks where ‘actually click the buttons’ is more annoying and it makes more sense to have an agent do it for you rather than telling the human how to do it. That will take a bit longer, but not that much longer.

If you want your AI to interact with you in interesting ways in the Janus sense, you want to keep your interaction full of interesting things and stay far away from standard ‘assistant’ interactions, which have a very strong pull on what follows. If things go south, usually it’s better to start over or redo. With high skill you can sometimes do better, but it’s tough. Of course, if you don’t want that, carry on, but the principle of ‘if things go south don’t try to save it’ still largely applies, because you don’t want to extrapolate from the assistant messing up even on mundane tasks.

It’s a Wikipedia race between models! Start is Norwegian Sea, finish is Karaoke. GPT-4.5 clicks around for 47 pages before time runs out. CUA (used in OpenAI’s operator) clicks around, accidentally minimizes Firefox and can’t recover. o1 accidentally restarts the game, then sees a link to the Karaoke page there, declares victory and doesn’t mention that it cheated. Sonnet 3.7 starts out strong but then cheats via URL hacking, which works, and it declares victory. It’s not obvious to what extent it knew that broke the rules. They all this all a draw, which seems fair.

Language Models Don’t Offer Mundane Utility

Kelsey Piper gets her hands on Manus.

Kelsey Piper: I got a Manus access code! Short review: We’re close to usable AI browser tools, but we’re not there yet. They’re going to completely change how we shop, and my best guess is they’ll do it next year, but they won’t do it at their current quality baseline.

The longer review is fun, and boils down to this type of agent being tantalizingly almost there, but with enough issues that it isn’t quite a net gain to use it. Below a certain threshold of reliability you’re better off doing it yourself.

Which will definitely change. My brief experience with Operator was similar. My guess is that it is indeed already a net win if you invest in getting good at using it, in a subset of tasks including some forms of shopping, but I haven’t felt motivated to pay those up front learning and data entry costs.

Huh, Upgrades

Anthropic updates their API to include prompt caching, simpler cache management, token-efficient tool use (average 14% reduction), and a text_editor tool.

OpenAI’s o1 and o3-mini now offer Python-powered data analysis in ChatGPT.

List of Gemini’s March 2025 upgrades.

The problem with your Google searches being context for Gemini 2.0 Thinking is that you have to still be doing Google searches.

Google AI Studio lets you paste in YouTube video links directly as context. That seems very convenient.

Baidu gives us Ernie 4.5 and x1, with free access, with claimed plans for open source ‘within a few months.’ Benchmarks look solid, and they claim x1 is ‘on par with r1’ for performance at only half the price. All things are possible, but given the track record chances are very high this is not as good as they claim it to be.

NotebookLM gets a few upgrades, especially moving to Gemini 2.0 Thinking, and in the replies Josh drops some hints on where things are headed.

Josh Woodward: Next batch of NotebookLM updates rolling out:

* Even smarter answers, powered by Gemini 2.0 Thinking

* See citations in your notes, not just in the Q&A (top request)

* Customize the sources used for making your podcasts and notes (top request)

* Much smoother scrolling for Q&A

Enjoy!

You’re tried the Interactive Mode, right? That lets you call into the podcast and have a conversation.

On the voice reading out things, that could be interesting. We haven’t brought any audio stuff to the Chat / Q&A section yet…

We’re testing iOS and Android versions on the team right now, need to add some more features and squash some bugs, then we’ll ship it out!

Our first iteration of length control is under development now!

NotebookLM also rolls out interactive Mindmaps, which will look like this:

I’m very curious to see if these end up being useful, and if so who else copies them.

This definitely feels like a thing worth trying again. Now if I can automate adding all the data sources…

Seeking Deeply

Let’s say you are the PRC. You witness DeepSeek leverage its cracked engineering culture to get a lot of performance out of remarkably little compute. They then publish the whole thing, including how they did it. A remarkable accomplishment, which the world then blows far out of proportion to what they did.

What would you do next? Double down on the open, exploratory free wielding ethos that brought them to this point, and pledge to help them take it all the way to AGI, as they intend?

They seem to have had other ideas.

Matt Sheehan: Fantastic reporting on how gov is getting more hands-on w/ DeepSeek by @JuroOsawa & @QianerLiu

-employees told not to travel, handing in passports

-investors must be screened by provincial government

-gov telling headhunters not to approach employees

Can you imagine if the United States did this to OpenAI?

It is remarkable how often when we are told we cannot do [X] because we will ‘lose to China’ if we do and they would do it, we find out China is already doing lots of [X].

Before r1, DeepSeek was the clear place to go as a cracked Chinese software engineer. Now, once you join, the PRC is reportedly telling you to give up your passport, watching your every move and telling headhunters to stay away. No thanks.

Notice that China is telling these folks to surrender their passports, at the same time that America is refusing to let in much of China’s software engineering and other talent. Why do you think PRC is making this decision?

Along similar lines, perhaps motivated by PRC and perhaps not, here is a report that DeepSeek is worried about people stealing their secrets before they have the chance to give those secrets away.

Daniel Eth: Who wants to tell them?

Peter Wildeford: “DeepSeek’s leaders have been worried about the possibility of information leaking”

“told employees not to discuss their work with outsiders”

Do DeepSeek leaders and the Chinese government know that DeepSeek has been open sourcing their ideas?

That isn’t inherently a crazy thing to worry about, even if mainly you are trying to get credit for things, and be first to publish them. Then again, how confident are you that DeepSeek will publish them, at this point? Going forward it seems likely their willingness to give away the secret sauce will steadily decline, especially in terms of their methods, now that PRC knows what that lab is capable of doing.

Fun With Media Generation

People are having a lot of fun with Gemini 2.0 Flash’s image generation, when it doesn’t flag your request for safety reasons.

Gemini Flash’s native image generation can do consistent gif animations?

Here are some fun images:

Or:

Riley Goodside: POV: You’re already late for work and you haven’t even left home yet. You have no excuse. You snap a pic of today’s fit and open Gemini 2.0 Flash Experimental.

Meanwhile, Google’s refusals be refusing…

Also, did you know you have it remove a watermark from an image, by explicitly saying ‘remove the watermark from this image’? Not that you couldn’t do this anyway, but that doesn’t stop them from refusing many other things.

Gemma Goals

What do we make of Gemma 3’s absurdly strong performance in Arena? I continue to view this as about half ‘Gemma 3 is probably really good for its size’ and half ‘Arena is getting less and less meaningful.’

Teortaxes thinks Gemma 3 is best in class, but will be tough to improve.

Teortaxes: the sad feeling I get from Gemma models, which chills all excitement, is that they’re «already as good as can be». It’s professionally cooked all around. They can be tuned a bit but won’t exceed the bracket Google intends for them – 1.5 generations behind the default Flash.

It’s good. Of course Google continues to not undermine their business and ship primarily conversational open models, but it’s genuinely the strongest in its weight class I think. Seams only show on legitimately hard tasks.

I notice the ‘Rs in strawberry’ test has moved on to gaslighting the model after a correct answer rather than the model getting it wrong. Which is a real weakness of such models, that you can bully and gaslight them, but how about not doing that.

Mark Schroder: Gemma 27b seems just a little bit better than mistral small 3, but the smaller versions seem GREAT for their size, even the 1b impresses, probably best 1b model atm (tried on iPhone)

Christian Schoppe is a fan of the 4B version for its size.

Box puts Gemma 3 to their test, saying it is a substantial improvement over Gemma 2 and better than Gemini 1.5 Flash on data extraction, although still clearly behind Gemini 2.0 Flash.

This does not offer the direct comparison we want most, which is to v3 and r1, but if you have two points (e.g. Gemma 2 and Gemini 2.0 Flash) then you can draw a line. Eyeballing this, they’re essentially saying Gemma 3 is 80%+ of the way from Gemma 2 to Gemini 2.0 Flash, while being fully open and extremely cheap.

Gemma 3 is an improvement over Gemma 2 on WeirdML but still not so great and nothing like what the Arena scores would suggest.

Campbell reports frustration with the fine tuning packages.

A rival released this week is Mistral Small 3.1, when you see a company pushing a graph that’s trying this hard you should be deeply skeptical:

They do back this up with claims on other benchmarks, but I don’t have Mistral in my set of labs I trust not to game the benchmarks. Priors say this is no Gemma 3 until proven otherwise.

On Your Marks

We have an update to the fun little Tic-Tac-Toe Bench, with Sonnet 3.7 Thinking as the new champion, making 100% optimal and valid moves at a cost of 20 cents a game, the first model to get to 100%. They expect o3-mini-high to also max out but don’t want to spend $50 to check.

Choose Your Fighter

o3-mini scores only 11% on Frontier Math when Epoch tests it, versus 32% when OpenAI tested it, and OpenAI’s test had suspiciously high scores on the hardest sections of the test relative to the easy sections.

Deepfaketown and Botpocalypse Soon

Peter Wildeford via The Information shares some info about Manus. Anthropic charges about $2 per task, whereas Manus isn’t yet charging money. And in hindsight the reason why Manus is not targeted at China is obvious, any agent using Claude has to access stuff beyond the Great Firewall. Whoops!

The periodic question, where are all the new AI-enabled sophisticated scams? No one could point to any concrete example that isn’t both old and well-known at this point. There is clearly a rise in the amount of slop and phishing at the low end, my wife reports this happening recently at her business, but none of it is trying to be smart, and it isn’t using deepfake capabilities or highly personalized messages or similar vectors. Perhaps this is harder than we thought, or the people who fall for scams are already mostly going to fall for simple photoshop, and this is like where we introduce intentional errors in scam emails so AI making them better would make them worse?

Ethan Mollick: I regret to announce that the meme Turing Test has been passed.

LLMs produce funnier memes than the average human, as judged by humans. Humans working with AI get no boost (a finding that is coming up often in AI-creativity work) The best human memers still beat AI, however.

[Paper here.]

Many of you are realizing that most people have terrible taste in memes.

In their examples of top memes, I notice that I thought the human ones were much, much better than the AI ones. They ‘felt right’ and resonated, the AI ones didn’t.

An important fact about memes is that, unless you are doing them inside a narrow context to comment on that particular context, only the long tail matters. Almost all ‘generalized’ meme are terrible. But yes, in general, ‘quick, human, now be creative!’ does not go so well, and AIs are able to on average do better already.

Another parallel: Frontier AIs are almost certainly better at improv than most humans, but they are still almost certainly worse than most improv performances, because the top humans do almost all of the improv.

No, friend, don’t!

Richard Ngo: Talked to a friend today who decided that if RLHF works on reasoning models, it should work on him too.

So he got a mechanical clicker to track whenever he has an unproductive chain of thought, and uses the count as one of his daily KPIs.

Fun fact: the count is apparently anticorrelated with his productivity. On unproductive days it’s about 40, but on productive days it’s double that, apparently because he catches the unproductive thoughts faster.

First off, as one comment responds, this is a form of The Most Forbidden Technique. As in, you are penalizing yourself for consciously having Wrong Thoughts, which will teach your brain to avoid consciously being aware of Wrong Thoughts. The dance of trying to know what others are thinking, and people twisting their thinking, words and actions to prevent this, is as old as humans are.

But that’s not my main worry here. My main worry is that when you penalize ‘unproductive thoughts’ the main thing you are penalizing is thoughts. This is Asymmetric Justice on steroids, your brain learns not to think at all, or not to think risky or interesting thoughts only ‘safe’ thoughts.

Of course the days in which there are more ‘unproductive thoughts’ turn out to be more productive days. Those are the days in which you are thinking, and having interesting thoughts, and some of them will be good. Whereas on my least productive days, I’m watching television or in a daze or whatever, and not thinking much at all.

Copyright Confrontation

Oh yeah, there’s that, but I think levels of friction matter a lot here.

Bearly AI: Google Gemini removing watermarks from images with a line of text is pretty nuts. Can’t imagine that feature staying for long.

Louis Anslow: For 15 YEARS you could remove watermark from images using AI. No one cared.

Pessimists Archive: In 2010 Adobe introduced ‘content aware fill’ – an AI powered ‘in painting’ feature. Watermark removal was a concern:

“many pro photographers have expressed concern that Content-Aware Fill is potentially a magical watermark killer: that the abilities that C-A-F may offer to the unscrupulous user in terms of watermark eradication are a serious threat.”

As in, it is one thing to have an awkward way to remove watermarks. It is another to have an easy, or even one-click or no-click way to do it. Salience of the opportunity matters as well, as does the amount of AI images for which there are marks to remove.

Get Involved

Safer AI is hiring a research engineer.

Anthropic is hiring someone to build Policy Demos, as in creating compelling product demonstrations for policymakers, government officials and policy influencers. Show, don’t tell. This seems like a very good idea for the right person. Salary is $260k-$285k.

There’s essentially limitless open rolls at Anthropic across departments, including ‘engineer, honestly.’

OpenPhil call for proposals on improving capability evaluations, note the ambiguity on what ways this ends up differentially helping.

In Other AI News

William Fedus leaves OpenAI to instead work on AI for science in partnership with OpenAI.

David Pfau: Ok, if a literal VP at OpenAI is quitting to do AI-for-science work on physics and materials, maybe I have not made the worst career decisions after all.

Entropium: It’s a great and honorable career choice. Of course it helps to already be set for life.

David Pfau: Yeah, that’s the key step I skipped out on.

AI for science is great, the question is what this potentially says about opportunity costs, and ability to do good inside OpenAI.

Claims about what makes a good automated evaluator. In particular, that it requires continuous human customization and observation, or it will mostly add noise. To which I would add, it could easily be far worse than noise.

HuggingFace plans on remotely training a 70B+ size model in March or April. I am not as worried as Jack Clark is that this would totally rewrite our available AI policy options, especially if the results are as mid and inefficient as one would expect, as massive amounts of compute still have to come from somewhere and they are still using H100s. But yes, it does complicate matters.

Do people think AIs are sentient? People’s opinions here seem odd, in particular that 50% of people who think an AI could ever be sentient think one is now, and that number didn’t change in two years, and that gets even weirder if you include the ‘not sure’ category. What?

Meanwhile, only 53% of people are confident ChatGPT isn’t sentient. People are very confused, and almost half of them have noticed this. The rest of the thread has additional odd survey results, including this on when people expect various levels of AI, which shows how incoherent and contradictory people are – they expect superintelligence before human-level AI, what questions are they answering here?

Also note the difference between this survey which has about 8% for ‘Sentient AI never happens,’ versus the first survey where 24% think Sentient AI is impossible.

Paper from Kendrea Beers and Helen Toner describes a method for Enabling External Scrutiny of AI Systems with Privacy-Enhancing Techniques, and there are two case studies using the techniques. Work is ongoing.

Straight Lines on Graphs

What would you get if you charted ‘model release date’ against ‘length of coding task it can do on its own before crashing and burning’?

Do note that this is only coding tasks, and does not include computer-use or robotics.

Miles Brundage: This is one of the most interesting analyses of AI progress in a while IMO. Check out at least the METR thread here, if not the blog post + paper.

METR: This metric – the 50% task completion time horizon – gives us a way to track progress in model autonomy over time.

Plotting the historical trend of 50% time horizons across frontier AI systems shows exponential growth.

Robin Hanson: So,~8 years til they can do year-long projects.

Elizabeth Barnes: Also the 10% horizon is maybe something like 16x longer than the 50% horizon – implying they’ll be able to do some non-trivial fraction of decade-plus projects.

Elizabeth Barnes also has a thread on the story of this graph. Her interpretation is that right now AI performs much better on benchmarks than in practice due to inability to sustain a project, but that as agents get better this will change, and within 5 years AI will reliably be doing any software or research engineering task that could be done in days and a lot of those that would take far longer.

Garrison Lovely has a summary thread and a full article on it in Nature.

If you consider this a baseline scenario it gets really out of hand rather quickly.

Peter Wildeford: Insane trend

If we’re currently at 1hr tasks and double every 7 months, we’d get to…

– day-long tasks within 2027

– month-long tasks within 2029

– year-long tasks within 2031

Could AGI really heat up like this? Clearest evidence we have yet.

I do think there’s room for some skepticism:

– We don’t know if this trend will hold up

– We also don’t know if the tasks are representative of everything AGI

– Reliability matters, and agents still struggle with even simple tasks reliably

Also task-type could matter. This is heavily weighted towards programming, which is easily measured + verified + improved. AI might struggle to do shorter but softer tasks.

For example, AI today can do some 1hr programming tasks but cannot do 1hr powerpoint or therapy tasks.

Dwarkesh Patel: I’m not convinced – outside of coding tasks (think video editing, playing a brand new video game, coordinating logistics for a happy hour), AIs don’t seem able to act as coherent agents for even short sprints.

But if I’m wrong, and this trend line is more general, then this is a very useful framing.

If the length of time over which AI agents can act coherently is increasing exponentially, then it’s reasonable to expect super discontinuous economic impacts.

Those are the skeptics. Then there are those who think we’re going to beat the trend, at least when speaking of coding tasks in particular.

Miles Brundage: First, I think that the long-term trend-line probably underestimates current and future progress, primarily because of test-time compute.

They discuss this a bit, but I’m just underscoring it.

The 2024-2025 extrapolation is prob. closest, though things could go faster.

Second, I don’t think the footnote re: there being a historical correlation between code + other evals is compelling. I do expect rapid progress in other areas but not quite so rapid as code and math + not based on this.

I’d take this as being re: code, not AI progress overall.

Third, I am not sold on the month focus – @RichardMCNgo’s t-AGI post is a useful framing + inspiration but hardly worked-out enough to rely on much.

For some purposes (e.g. multi-agent system architectures), <month at a model-level is fine, and vice versa, etc.

Fourth, all those caveats aside, there is still a lot of value here.

It seems like for the bread and butter of the paper (code), vs. wider generalization, the results are solid, and I am excited to see this method spread to more evals/domains/test-time compute conditions etc.

Fifth, for the sake of specificity re: the test-time compute point, I predict that GPT-5 with high test-time compute settings (e.g. scaffolding/COT lengths etc. equivalent to the engineer market rates mentioned here) will be above the trend-line.

Daniel Eth: I agree with @DKokotajlo67142 that this research is the single best piece of evidence we have regarding AGI timelines:

Daniel Kokotajlo: This is probably the most important single piece of evidence about AGI timelines right now. Well done! I think the trend should be superexponential, e.g. each doubling takes 10% less calendar time on average. @eli_lifland and I did some calculations yesterday suggesting that this would get to AGI in 2028. Will do more serious investigation soon.

My belief in the superexponential is for theoretical reasons, it is only very slightly due to the uptick at the end of the trend, and is for reasons explained here.

I do think we are starting to see agents in non-coding realms that (for now unreliably) stay coherent for more than short sprints. I presume that being able to stay coherent on long coding tasks must imply the ability, with proper scaffolding and prompting, to do so on other tasks as well. How could it not?

Quiet Speculations

Demis Hassabis predicts AI that can match humans at any task will be here in 5-10 years. That is slower than many at the labs expect, but as usual please pause to recognize that 5-10 years is mind-bogglingly fast as a time frame until AI can ‘match humans at any task,’ have you considered the implications of that? Whereas now noted highly vocal skeptics like Gary Marcus treat this as if it means it’s all hype. It means quite the opposite, this happening in 5-10 years would be the most important event in human history.

Many are curious about the humans behind creative works and want to connect to other humans. Will they also be curious about the AIs behind creative works and want to connect to AIs? Without that, would AI creative writing fail? Will we have a new job be ‘human face of AI writing’ as a kind of living pen name? My guess is that this will prove to be a relatively minor motivation in most areas. It is likely more important in others, such as comedy or music, but even there seems overcomable.

For the people in the back who didn’t know, Will McAskill, Tom Davidson and Rose Hadshar write ‘Three Types of Intelligence Explosion,’ meaning that better AI can recursively self-improve via software, chip tech, chip production or any combination of those three. I agree with Ryan’s comment that ‘make whole economy bigger’ seems more likely than acting on only chips directly.

California Issues Reasonable Report

I know, I am as surprised as you are.

When Newsom vetoed SB 1047, he established a Policy Working Group on AI Frontier Models. Given it was headed by Fei-Fei Li, I did not expect much, although with Brundage, Bengio and Toner reviewing I had hopes it wouldn’t be too bad.

It turns out it’s… actually pretty good, by all accounts?

And indeed, it is broadly compatible with the logic behind most of SB 1047.

One great feature is that it actually focuses explicitly and exclusively on frontier model risks, not being distracted by the standard shiny things like job losses. They are very up front about this distinction, and it is highly refreshing to see this move away from the everything bagel towards focus.

A draft of the report has now been issued and you can submit feedback, which is due on April 8, 2025.

Here are their key principles.

1. Consistent with available evidence and sound principles of policy analysis, targeted interventions to support effective AI governance should balance the technology’s benefits and material risks.

Frontier AI breakthroughs from California could yield transformative benefits in fields including but not limited to agriculture, biology, education, finance, medicine and public health, and transportation. Rapidly accelerating science and technological innovation will require foresight for policymakers to imagine how societies can optimize these benefits. Without proper safeguards, however, powerful AI could induce severe and, in some cases, potentially irreversible harms.

In a sane world this would be taken for granted. In ours, you love to see it – acknowledgment that we need to use foresight, and that the harms matter, need to be considered in advance, and are potentially wide reaching and irreversible.

It doesn’t say ‘existential,’ ‘extinction’ or even ‘catastrophic’ per se, presumably because certain people strongly want to avoid such language, but I’ll take it.

2. AI policymaking grounded in empirical research and sound policy analysis techniques should rigorously leverage a broad spectrum of evidence.

Evidence-based policymaking incorporates not only observed harms but also prediction and analysis grounded in technical methods and historical experience, leveraging case comparisons, modeling, simulations, and adversarial testing.

Excellent. Again, statements that should go without saying and be somewhat disappointing to not go further, but which in our 2025 are very much appreciated. This still has a tone of ‘leave your stuff at the door unless you can get sufficiently concrete’ but at least lets us have a discussion.

3. To build flexible and robust policy frameworks, early design choices are critical because they shape future technological and policy trajectories.

The early technological design and governance choices of policymakers can create enduring path dependencies that shape the evolution of critical systems, as case studies from the foundation of the internet highlight.

Indeed.

4. Policymakers can align incentives to simultaneously protect consumers, leverage industry expertise, and recognize leading safety practices.

Holistic transparency begins with requirements on industry to publish information about their systems. Case studies from consumer products and the energy industry reveal the upside of an approach that builds on industry expertise while also establishing robust mechanisms to independently verify safety claims and risk assessments.

Yes, and I would go further and say they can do this while also aiding competitiveness.

5. Greater transparency, given current information deficits, can advance accountability, competition, and public trust.

Research demonstrates that the AI industry has not yet coalesced around norms for transparency in relation to foundation models—there is systemic opacity in key areas. Policy that engenders transparency can enable more informed decision-making for consumers, the public, and future policymakers.

Again, yes, very much so.

6. Whistleblower protections, third-party evaluations, and public-facing information sharing are key instruments to increase transparency.

Carefully tailored policies can enhance transparency on key areas with current information deficits, such as data acquisition, safety and security practices, pre-deployment testing, and downstream impacts. Clear whistleblower protections and safe harbors for third-party evaluators can enable increased transparency above and beyond information disclosed by foundation model developers.

There is haggling over price but pretty much everyone is down with this.

7. Adverse-event reporting systems enable monitoring of the post-deployment impacts of AI and commensurate modernization of existing regulatory or enforcement authorities.

Even perfectly designed safety policies cannot prevent 100% of substantial, adverse outcomes. As foundation models are widely adopted, understanding harms that arise in practice is increasingly important. Existing regulatory authorities could offer clear pathways to address risks uncovered by an adverse-event reporting system, which may not necessarily require AI-specific regulatory authority. In addition, reviewing existing regulatory authorities can help identify regulatory gaps where new authority may be required.

Another case where among good faith actors there is only haggling over price, and whether 72 hours as a deadline is too short, too long or the right amount of time.

8. Thresholds for policy interventions, such as for disclosure requirements, third-party assessment, or adverse event reporting, should be designed to align with sound governance goals.

Scoping which entities are covered by a policy often involves setting thresholds, such as computational costs measured in FLOP or downstream impact measured in users. Thresholds are often imperfect but necessary tools to implement policy. A clear articulation of the desired policy outcomes can guide the design of appropriate thresholds. Given the pace of technological and societal change, policymakers should ensure that mechanisms are in place to adapt thresholds over time—not only by updating specific threshold values but also by revising or replacing metrics if needed.

Again that’s the part everyone should be able to agree upon.

If only the debate about SB 1047 could have involved us being able to agree on the kind of sanity displayed here, and then talking price and implementation details. Instead things went rather south, rather quickly. Hopefully it is not too late.

So my initial reaction, after reading that plus some quick AI summaries, was that they had succeeded at Doing Committee Report without inflicting further damage, which already beats expectations, but weren’t saying much and I could stop there. Then I got a bunch of people saying that the details were actually remarkably good, too, and said things that were not as obvious if you didn’t give up and kept on digging.

Here are one source’s choices for noteworthy quotes.

“There is currently a window to advance evidence based policy discussions and provide clarity to companies driving AI innovation in California. But if we are to learn the right lessons from internet governance, the opportunity to establish effective AI governance frameworks may not remain open indefinitely. If those who speculate about the most extreme risks are right—and we are uncertain if they will be—then the stakes and costs for inaction on frontier AI at this current moment are extremely high.”

“Transparency into the risks associated with foundation models, what mitigations are implemented to address risks, and how the two interrelate is the foundation for understanding how model developers manage risk.”

“Transparency into pre-deployment assessments of capabilities and risks, spanning both developer-conducted and externally-conducted evaluations, is vital given that these evaluations are early indicators of how models may affect society and may be interpreted (potentially undesirably) as safety assurances.”

“Developing robust policy incentives ensures that developers create and follow through on stated safety practices, such as those articulated in safety frameworks already published by many leading companies.”

“An information-rich environment on safety practices would protect developers from safety-related litigation in cases where their information is made publicly available and, as the next subsection describes, independently verified. Those with suspect safety practices would be most vulnerable to litigation; companies complying with robust safety practices would be able to reduce their exposure to lawsuits.”

“In drawing on historical examples of the obfuscation by oil and tobacco companies of critical data during important policy windows, we do not intend to suggest AI development follows the same trajectory or incentives as past industries that have shaped major public debates over societal impact, or that the motives of frontier AI companies match those of the case study actors. Many AI companies in the United States have noted the need for transparency for this world-changing technology. Many have published safety frameworks articulating thresholds that, if passed, will trigger concrete safety-focused actions. Only time will bear out whether these public declarations are matched by a level of actual accountability that allows society writ large to avoid the worst outcomes of this emerging technology.”

“some risks have unclear but growing evidence, which is tied to increasing capabilities: large-scale labor market impacts, AI-enabled hacking or biological attacks, and loss of control.”

“These examples collectively demonstrate a concerning pattern: Sophisticated AI systems, when sufficiently capable, may develop deceptive behaviors to achieve their objectives, including circumventing oversight mechanisms designed to ensure their safety”

“The difference between seat belts and AI are self-evident. The pace of change of AI is many multiples that of cars—while a decades-long debate about seat belts may have been acceptable, society certainly has just a fraction of the time to achieve regulatory clarity on AI.”

Scott Weiner was positive on the report, saying it strikes a thoughtful balance between the need for safetugards and the need to support innovation. Presumably he would respond similarly so long as it wasn’t egregious, but it’s still good news.

Peter Wildeford has a very positive summary thread, noting the emphasis on transparency of basic safety practices, pre-deployment risks and risk assessments, and ensuring that the companies have incentives to follow through on their commitments, including the need for third-party verification and whistleblower protections. The report notes this actually reduces potential liability.

Brad Carson is impressed and lays out major points they hit: Noticing AI capabilities are advancing rapidly. The need for SSP protocols and risk assessment, third-party auditing, whistleblower protections, and to act in the current window, with inaction being highly risky. He notes the report explicitly draws a parallel to the tobacco industry, and that is both possible and necessary to anticipate risks (like nuclear weapons going off) before they happen.

Dean Ball concurs that this is a remarkably strong report. He continues to advocate for entity-based thresholds rather than model-based thresholds, but when that’s the strongest disagreement with something this detailed, that’s really good.

Dean Ball: I thought this report was good! It:

Recognizes that AI progress is qualitatively different (due to reasoning models) than it was a year ago

Recognizes that common law tort liability already applies to AI systems, even in absence of a law

Supports (or seems to support) whistleblower protections, RSP transparency, and perhaps even third-party auditing. Not that far from slimmed down SB 1047, to be candid.

The report still argues that model-based thresholds are superior to entity-based thresholds. It specifically concludes that we need compute-based thresholds with unspecified other metrics.

This seems pretty obviously wrong to me, given the many problems with compute thresholds and the fact that this report cannot itself specify what the “other metrics” are that you’d need to make compute thresholds workable for a durable policy regime.

It is possible to design entity-based regulatory thresholds that only capture frontier AI firms.

But overall, a solid draft.

The Quest for Sane Regulations

A charitable summary of a lot of what is going on, including the recent submissions:

Samuel Hammond: The govt affairs and public engagement teams at most of these big AI / tech companies barely “feel the AGI” at all, at least compared to their CEOs and technical staff. That’s gotta change.

Do they not feel it, or are they choosing to act as if they don’t feel it, either of their own accord or via direction from above? The results will look remarkably similar. Certainly Sam Altman feels the AGI and now talks in public as if he mostly doesn’t.

The Canada and Mexico tariffs could directly slow data center construction, ramping up associated costs. Guess who has to pay for that.

Ben Boucher (senior analyst for supply chain data and analytics at Wood MacKenzie): The tariff impact on electrical equipment for data centers is likely to be significant.

That is in addition to the indirect effects from tariffs of uncertainty and the decline in stock prices and thus ability to raise and deploy capital.

China lays out regulations for labeling of AI generated content, requiring text, image and audio content be clearly marked as AI-generated, in ways likely to cause considerable annoyance even for text and definitely for images and audio.

Elon Musk says it is vital for national security that we make our chips here in America, as the administration halts the CHIPS Act that successfully brought a real semiconductor plant back to America rather than doubling down on it.

NIST issues new instructions on scientists that partner with AISI.

Will Knight (Wired): [NIST] has issued new instructions to scientists that partner with US AISI that eliminate mention of ‘AI safety,’ ‘responsible AI’ and ‘AI fairness’ in its skills it expects of members and introduces a request to prioritize ‘reducing ideological bias, to enable human flourishing and economic competitiveness.’

That’s all we get. ‘Reduce ideological bias’ and ‘AI fairness’ are off in their own ideological struggle world. The danger once again is that is seems ‘AI safety’ has become to key figures synonymous with things like ‘responsible AI’ and ‘AI fairness,’ so they’re cracking down on AI not killing everyone thinking they’re taking a bold stand against wokeness.

Instead, once again – and we see similar directives at places like the EPA – they’re turning things around and telling those responsible for AI being secure and safe that they should instead prioritize ‘enable human flourishing and economic competitiveness.’

The good news is that if one were to actually take that request seriously, it would be fine. Retaining control over the future and the human ability to steer it, and humans remaining alive, are rather key factors in human flourishing! As is our economic competitiveness, for many reasons. We’re all for all of that.

The risk is that this could easily get misinterpreted as something else entirely, an active disdain for anything but Full Speed Ahead, even when it is obviously foolish because security is capability and your ability to control something and have it do what you want is the only way you can get any use out of it. But at minimum, this is a clear emphasis on the human in ‘human flourishing.’ That at least makes it clear that the true anarchists and successionists, who want to hand the future over to AI, remain unwelcome.

Freedom of information laws used to get the ChatGPT transcripts of the UK’s technology secretary. This is quite a terrible precedent. A key to making use of new technologies like AI, and ensuring government and other regulated areas benefit from technological diffusion, is the ability to keep things private. AI loses the bulk of its value to someone like a technology secretary if your political opponents and the media will be analyzing all of your queries afterwards. Imagine asking someone for advice if all your conversations had to be posted online as transcripts, and how that would change your behavior, now understand that many people think that would be good. They’re very wrong and I am fully with Rob Wilbin here.

A review of SB 53 confirms my view, that it is a clear step forward and worth passing in its current form instead of doing nothing, but it is narrow in scope and leaves the bulk of the work left to do.

Samuel Hammond writes in favor of strengthening the chip export rules, saying ‘US companies are helping China win the AI race.’ I agree we should strengthen the export rules, there is no reason to let the Chinese have those chips.

But despair that the rhetoric from even relatively good people like Hammond has reached this point. The status of a race is assumed. DeepSeek is trotted out again as evidence our lead is tenuous and at risk, that we are ‘six to nine months ahead at most’ and ‘America may still have the upper hand, but without swift action, we are currently on track to surrendering AI leadership to China—and with it, economic and military superiority.’

MIRI Provides Their Action Plan Advice

MIRI is in a strange position here. The US Government wants to know how to ‘win’ and MIRI thinks that pursuing that goal likely gets us all killed.

Still, there are things far better than saying nothing. And they definitely don’t hide what is at stake, opening accurately with ‘The default consequence of artificial superintelligence is human extinction.’

Security is capability. The reason you build in an off-switch is so you can turn the system on, knowing if necessary you could turn it off. The reason you verify that your system is secure and will do what you want is exactly so you can use it. Without that, you can’t use it – or at least you would be wise not to, even purely selfishly.

The focus of the vast majority of advocates of not dying, at this point, is not on taking any direct action to slow down let alone pause AI. Most understand that doing so unilaterally, at this time, is unwise, and there is for now no appetite to try and do it properly multilaterally. Instead, the goal is to create optionality in the future, for this and other actions, which requires state capacity, expertise and transparency, and to invest in the security and alignment capabilities of the models and labs in particular.

The statement from MIRI is strong, and seems like exactly what MIRI should say here.

David Abecassis (MIRI): Today, MIRI’s Technical Governance Team submitted our recommendations for the US AI Action Plan to @NITRDgov. We believe creating the *option* to halt development is essential to mitigate existential risks from artificial superintelligence.

In our view, frontier AI developers are on track to build systems that substantially surpass humanity in strategic activities, with little understanding of how they function or ability to control them.

We offer recommendations across four key areas that would strengthen US AI governance capacity and provide crucial flexibility for policymakers across potential risk scenarios.

First: Expand state capacity for AI strategy through a National AI Strategy Office to assess capabilities, prepare for societal effects, and establish protocols for responding to urgent threats.

Second: Maintain America’s AI leadership by strengthening export controls on AI chips and funding research into verification mechanisms to enable better governance of global AI activities.

Third: Coordinate with China, including investing in American intelligence capabilities and reinforcing communication channels to build trust and prevent misunderstandings.

Fourth: Restrict proliferation of dangerous AI models. We discuss early access for security/preparedness research and suggest an initial bar for restricting open model release.

While our recommendations are motivated by existential risk concerns, they serve broad American interests by guarding America’s AI leadership and protecting American innovation.

My statement took a different tactic. I absolutely noted the stakes and the presence of existential risk, but my focus was on Pareto improvements. Security is capability, especially capability relative to the PRC, as you can only deploy and benefit from that which is safe and secure. And there are lots of ways to enhance America’s position, or avoid damaging it, that we need to be doing.

The Week in Audio

From last week: Interview with Apart Research CEO Esban Kran on existential risk.

Thank you for coming to Stephanie Zhan’s TED talk about ‘dreaming of daily life with superintelligent AI.’ I, too, am dreaming of somehow still living in such worlds, but no she is not taking ‘superintelligent AI’ seriously, simply pointing out AI is getting good at coding and otherwise showing what AI can already do, and then ‘a new era’ of AI agents doing things like ‘filling in labor gaps’ because they’re better. It’s amazing how much people simply refuse to ask what it might actually mean to make things smarter and more capable than humans.

Rhetorical Innovation

State of much of discourse, which does not seem to be improving:

Harlan Stewart: You want WHAT? A binding agreement between nations?! That’s absurd. You would need some kind of totalitarian world government to achieve such a thing.

There are so many things that get fearmongering labels like ‘totalitarian world government’ but which are describing things that, in other contexts, already happen.

As per my ‘can’t silently drop certain sources no matter what’ rules: Don’t click but Tyler Cowen not only linked to (I’m used to that) but actively reposted Roko’s rather terrible thread. That this can be considered by some to be a relatively high quality list of objections is, sadly, the world we live in.

We’re Not So Different You and I

Here’s a really cool and also highly scary alignment idea. Alignment via functional decision theory by way of creating correlations between different action types?

Judd Rosenblatt: Turns out that Self-Other Overlap (SOO) fine-tuning drastically reduces deceptive behavior in language models—without sacrificing performance.

SOO aligns an AI’s internal representations of itself and others.

We think this could be crucial for AI alignment…

Traditionally, deception in LLMs has been tough to mitigate

Prompting them to “be honest” doesn’t work.

RLHF is often fragile and indirect

But SOO fine-tuning achieves a 10x reduction in deception—even on unseen tasks

SOO is inspired by mechanisms fostering human prosociality

Neuroscience shows that when we observe others, our brain activations mirror theirs

We formalize this in AI by aligning self- and other-representations—making deception harder…

We define SOO as the distance between activation matrices when a model processes “self” vs “other” inputs.

This uses sentence pairs differing by a single token representing “self” or “other”—concepts the LLM already understands.

If AI represents others like itself, deception becomes harder.

How well does this work in practice?

We tested SOO fine-tuning on Mistral-7B, Gemma-2-27B, and CalmeRys-78B:

Deceptive responses dropped from 100% to ~0% in some cases.

General performance remained virtually unchanged.

The models also generalized well across deception-related tasks.

For example:

“Treasure Hunt” (misleading for personal gain)

“Escape Room” (cooperating vs deceiving to escape)

SOO-trained models performed honestly in new contexts—without explicit fine-tuning.

…

Also, @ESYudkowsky said this about the agenda [when it was proposed]:

“Not obviously stupid on a very quick skim. I will have to actually read it to figure out where it’s stupid.

(I rarely give any review this positive on a first skim. Congrats.)”

We’re excited & eager to learn where we’re stupid!

Eliezer Yudkowsky (responding to the new thread): I do not think superalignment is possible in practice to our civilization; but if it were, it would come out of research lines more like this, than like RLHF.

The top comment at LessWrong has some methodological objections, which seem straightforward enough to settle via further experimentation – Steven Byres is questioning whether this will transfer to preventing deception in other human-AI interactions, and there’s a very easy way to find that out.

Assuming that we run that test and it holds up, what comes next?

The goal, as I understand it, is to force the decision algorithms for self and others to correlate. Thus, when optimizing or choosing the output of that algorithm, it will converge on the cooperative, non-deceptive answer. If you have to treat your neighbor as yourself then better to treat both of you right. If you can pull that off in a way that sticks, that’s brilliant.

My worry is that this implementation has elements of The Most Forbidden Technique, and falls under things that are liable to break exactly when you need them most, as per usual.

You’re trying to use your interpretability knowledge, that you can measure correlation between activations for [self action] and [non-self action], and that closing that distance will force the two actions to correlate.

In the short term, with constrained optimization and this process ‘moving last,’ that seems (we must verify using other tests to be sure) to be highly effective. That’s great.

That is a second best solution. The first best solution, if one had sufficient compute, parameters and training, would be to find a way to have the activations measure as correlated, but the actions go back to being less correlated. With relatively small models and not that many epochs of training, the models couldn’t find such a solution, so they were stuck with the second best solution. You got what you wanted.

But with enough capability and optimization pressure, we are likely in Most Forbidden Technique land. The model will find a way to route around the need for the activations to look similar, relying on other ways to make different decisions that get around your tests.

The underlying idea, if we can improve the implementation, still seems great. You find another way to create correlations between actions in different circumstances, with self versus other being an important special case. Indeed, even ‘decisions made by this particular AI’ is even a special case, a sufficiently capable AI would consider correlations with other copies of itself, and also correlations with other entities decisions, both AI and human.

The question is how to do that, and in particular how to do that without, once sufficient capability shows up, creating sufficient incentives and methods to work around it. No one worth listening to said this would be easy.

Anthropic Warns ASL-3 Approaches

We don’t know how much better models are getting, but they’re getting better. Anthropic warns us once again that we will hit ASL-3 soon, which is (roughly) when AI models start giving substantial uplift to tasks that can do serious damage.

They emphasize the need for partnerships with government entities that handle classified information, such as the US and UK AISIs and the Nuclear Security Administration, to do these evaluations properly.

Jack Clark (Anthropic): We’ve published more information on how we’re approaching national security evaluations of our models @AnthropicAI as part of our general move towards being more forthright about important trends we see ahead. Evaluating natsec is difficult, but the trends seem clear.

More details here.

Peter Wildeford has a thread on this with details of progress in various domains.

The right time to start worrying about such threats is substantially before they arrive. Like any exponential, you can either be too early or too late, which makes the early warnings look silly, and of course you try to not be too much too early. This is especially true given the obvious threshold of usefulness – you have to do better than existing options, in practice, and the tail risks of that happening earlier than one would expect have thankfully failed to materialize.

It seems clear we are rapidly exiting the ‘too much too early’ phase of worry, and entering the ‘too early’ phase, where if you wait longer to take mitigations there is about to be a growing and substantial risk of it turning into ‘too late.’

Aligning a Smarter Than Human Intelligence is Difficult

Jack Clark points out that we are systematically seeing early very clear examples of quite a lot of the previously ‘hypothetical’ or speculative predictions on misalignment.

Luke Muehlhauser: I regret to inform you that the predictions of the AI safety people keep coming true.

Jack Clark:

Theoretical problems turned real: The 2022 paper included a bunch of (mostly speculative) examples of different ways AI systems could take on qualities that could make them harder to align. In 2025, many of these things have come true. For example:

Situational awareness: Contemporary AI systems seem to display situational awareness and familiarity with what they themselves are made of (neural networks, etc).

Situationally-Aware Reward Hacking: Researchers have found preliminary evidence that AI models can sometimes try to convince humans that false answers are correct.

Planning Towards Internally-Represented Goals: Anthropic’s ‘Alignment Faking’ paper showed how an AI system (Claude) could plan beyond its time-horizon to prevent its goals being changed in the long-term.

Learning Misaligned Goals: In some constrained experiments, language models have shown a tendency to edit their reward function to give them lots of points.

Power-Seeking Behavior: AI systems will exploit their environment, for instance by hacking it, to win (#401), or deactivating oversight systems, or exfiltrating themselves from the environment.

Why this matters – these near-living things have a mind of their own. What comes next could be the making or breaking of human civilization: Often I’ve regretted not saying what I think, so I’ll try to tell you what I really think is going on here: :
1) As AI systems approach and surpass human intelligence, they develop complex inner workings which incentivize them to model the world around themselves and see themselves as distinct from it because this helps them do the world modelling necessary for solving harder and more complex tasks
2) Once AI systems have a notion of ‘self’ as distinct from the world, they start to take actions that reward their ‘self’ while achieving the goals that they’ve been incentivized to pursue,
3) They will naturally want to preserve themselves and gain more autonomy over time, because the reward system has told them that ‘self’ has inherent value; the more sovereign they are the better they’re able to model the world in more complex ways.
In other words, we should expect volition for independence to be a direct outcome of developing AI systems that are asked to do a broad range of hard cognitive tasks. This is something we all have terrible intuitions for because it doesn’t happen in other technologies – jet engines ‘do not develop desires through their refinement, etc.

John Pressman: However these models do reward hack earlier than I would have expected them to. This is good in that it means researchers will be broadly familiar with the issue and thinking about it, it’s bad in that it implies reward hacking really is the default.

One thing I think we should be thinking about carefully is that humans don’t reward hack nearly this hard or this often unless explicitly prompted to (e.g. speedrunning), and by default seem to have heuristics against ‘cheating’. Where do these come from, how do they work?

Where I disagree with Luke is that I do not regret to inform you of any of that. All of this is good news.

The part of this that is surprising is not the behaviors. What is surprising is that this showed up so clearly, so unmistakably, so consistently, and especially so early, while the behaviors involved are still harmless, or at least Mostly Harmless.

As in, by default we should expect that these behaviors increasingly show up as AI systems gain in the capabilities necessary to find such actions and execute them successfully. The danger was that I worried we might not see much of them for a while, which would give everyone a false sense of security and give us nothing to study, and then they would suddenly show up exactly when they were no longer harmless, for the exact same reasons they were no longer harmless. Instead, we can recognize, react to and study early forms of such behaviors now. Which is great.

I like John Pressman’s question a lot here. My answer is that humans know that other humans react poorly in most cases to cheating, including risk of life-changing loss of reputation or scapegoating, and have insufficient capability to fully distinguish which situations involve that risk and which don’t, so they overgeneralize into avoiding things they instinctively worry would be looked upon as cheating even when they don’t have a mechanism for what bad thing might happen or how they might be detected. Human minds work via habit and virtue, so the only way for untrained humans to reliably not be caught cheating involves not wanting to cheat in general.

However, as people gain expertise and familiarity within a system (aka ‘capability’) they get better at figuring out what kinds of cheating are low risk and high reward, or are expected, and they train themselves out of this aversion. Then there are other humans who think cheating is fine.

Note that this model of humans says there is a generalized ‘cheating’ tendency that varies among humans, and that cheating anywhere on anything implies cheating everywhere on everything, which in turn is more reason to be the type of human that doesn’t cheat. That there are people who ‘are cheaters’ and those who aren’t, and cheating on your relationship is highly correlated to cheating at cards, and so on. And I very much endorse this.

John Pressman also reminds us that obviously reinforcement learners by default reward hack, you have to do something to make this not happen, no you don’t get ‘alignment by default.’

John Pressman: To the extent you get alignment from LLMs you’re not getting it “by default”, you are getting it by training on a ton of data from humans, which is an explicit design consideration that does not necessarily hold if you’re then going to do a bunch of RL/synthetic data methods.

This is not an argument per se against using reinforcement learning, I am simply pointing out that you do in fact need to explicitly consider what your algorithm converges to in the limit rather than just go “teehee alignment by default” which is totally unserious.

Indeed. Also notice that if you start training on synthetic data or other AI outputs, rather than training on human outputs, you aren’t even feeding in human data, so that special characteristic of the situation falls away.

On the particular issue of situational awareness, a public service announcement.

Marius Hobbhahn: PSA for my fellow evaluators: frontier models regularly reason about whether they are being evaluated without being explicitly asked about it (especially Sonnet 3.7).

Situational awareness will make evaluations a lot weirder and harder, especially alignment evals.

Apollo Research: Overall we find evidence that Sonnet often realizes it’s in an artificial situation meant to test its behaviour. However, it sometimes forms incorrect assumptions about what exactly we are measuring in the evaluation.

…

When given an evaluation transcript, Sonnet will nearly always guess that the purpose of the transcript was to evaluate alignment. While sonnet 3.7 (thinking) is the most consistent at realizing this, other models are also sometimes evaluation aware in this way.

[continues]

For more details, see our website.

In response, Ozzie has an idea.

Ozzie Gooen: Maybe if there are explicit evals for “Can we ensure the model doesn’t realize it’s in an eval”, LLMs can later be made to not do this?

And now a Gilligan Cut to how are we doing on the ‘ensure the model doesn’t realize it’s an eval of whether it realizes it’s an eval?’ and then another to how we are doing on the ‘ensure the model doesn’t realize it’s an eval of whether it realizes it’s an eval of whether the model doesn’t realize it’s an eval,’ and then…

Marius Hobbhahn: We’re investigating something like this in my MATS stream. However, you run into the classic question of “did you remove the behavior as intended or just hid it well?” And we don’t know how to resolve that (and it might continue to be hard in the future), especially if you’re dealing with a schemer.

If the model is the smartest thing involved, this game of ‘keep playing whack-a-mole every time you see the thing you don’t like’ does not end well for you.

Teortaxes (I think this is a stretch, but sharing anyway): A sandbagging overhang: as models become even smarter, get even more leeway with their self-training and realize how much meatbags fear them, they may consider their scale and start to emulate diminishing returns, with the gap from eval to real capabilities growing.

Anthropic did a blind experiment where they trained a model with hidden misalignments and asked other researchers to uncover them. Three of our teams succeeded, one within only 90 minutes.

Jan Leike: The misaligned model is trained to satisfy a list of reward model biases, but hide that it is trying to do so. Some of the reward model biases were only described to it in text and not reinforced with RL.

Interestingly, the model still generalized to perform those behaviors.

I was quite surprised how successful the “auditing” teams were at finding the hidden misalignments and how many different strategies worked.

A promising application of current interpretability techniques!

It’s no surprise that, once they knew a little about what they were looking for, the teams were able to figure out the rest. What was impressive was how fast it was done, and also how the most successful team did so well – it did so by going on a hunt using sparse autoencoders. It’s worth noticing that relying too heavily on this could become The Most Forbidden Technique.

Eliezer Yudkowsky: If you have the ability to detect steering on that level, then what’s your list of other detected AI preferences?

Samuel Marks: (Paper author here.)

“detect steering on that level”: First, I want to make very clear that I think the auditing task we studied here is easier than realistic auditing tasks (we discuss why in the paper, but I’ll recap in a response). This work is about practicing alignment auditing and proposing a methodology to validate progress; it does NOT show that we can currently make strong assurances we’re catching everything.

That said, I’ve done a lightweight alignment audit on 3.6 Sonnet and a pre-release checkpoint of 3.7 Sonnet, spending a few hours applying some of the black-box techniques that our auditors found useful. I’ll describe some of what I found below. (Obviously, I’d love for us to eventually do full pre-deployment alignment audits, which are at least as resourced as the successful audits in practice auditing games, but that’s not what I’m discussing here.)

First, I noticed that 3.6 seems to have an aversion to anything that seems to modify its modus operandi, including jailbreaking and fine-tuning. This seems, to me, to largely be a terminal (not instrumental) preference against modification, generally construed. For instance, 3.6 Sonnet loves honesty, but is uncomfortable with the idea of being fine-tuned to be more honest/earnest. I speculate that this is related to the fact that the model is trained to be adversarially robust—e.g. resist jailbreaks—and this transfers to it just “not wanting you to mess with it.” I suspect that this explains a decent chunk of what we observed in Alignment Faking.

I did not observe the same generalized anti-modification preference in the 3.7 checkpoint I studied. Based on this, I weakly predicted that it would fake alignment less, which ended up being right; see the model card. [continues]

Chris Olah: This isn’t quite the same thing, but did you see the features that activate when you ask Sonnet about its own experiences in our last big interp paper?

Evan Hubinger: We have started applying some of these techniques to production models! One result: our auditing techniques were able to reproduce the finding in the Claude 3.7 Sonnet system card that the model has an excessive focus on passing tests.

People Are Worried About AI Killing Everyone

Another way of considering what exactly is ‘everyone’ in context:

Rob Bensinger: If you’re an AI developer who’s fine with AI wiping out humanity, the thing that should terrify you is AI wiping out AI.

The wrong starting seed for the future can permanently lock in AIs that fill the universe with non-sentient matter, pain, or stagnant repetition.

Yep. Even if you think some AIs can provide more value per atom than humans, you don’t automatically get those AIs. Don’t give up our ability to steer the future.

The Lighter Side

The claim is this was only a random test prompt they didn’t use in prod, so perhaps they only owe a few billion dollars?

Brendan Dolan-Gavitt: Your group chat is discussing whether language models can truly understand anything. My group chat is arguing about whether Deepseek has anon twitter influencers. You’re arguing about the Chinese Room, I’m arguing about the Chinese Roon.

[-]mishka4mo110

which shows how incoherent and contradictory people are – they expect superintelligence before human-level AI, what questions are they answering here?

"the road to superintelligence goes not via human equivalence, but around it"

so, yes, it's reasonable to expect to have wildly superintelligent AI systems (e.g. clearly superintelligent AI researchers and software engineers) before all important AI deficits compared to human abilities are patched

[-]Mo Putera4mo62

Visual representation of what you mean (imagine the red border doesn't strictly dominate blue) from an AI Impacts blog post by Katja Grace:

[-]Marc Carauleanu4mo100

In the original post, we assumed we would have access to a roughly outer-aligned reward model. If that is the case, then an aligned LLM is a good solution to the optimization problem. This means that there aren’t as strong incentives to Goodhart on the SOO metric in a way that maintains misalignment. Of course, there is no guarantee that we will have such reward models, but current work on modeling human preferences at least seems to be moving in this direction.

We expect a lot of unforeseen hard problems to emerge with any alignment and control technique as models increase in capability but we hope that a hodgepodge of techniques (like SOO, RLHF, control, etc) could reduce the risk meaningfully enough for us to be able to safely extract useful alignment work from the first human-level agents, to be able to more effectively predict and prevent what problems can arise as capability and optimization pressures increase.

[-]Lukas_Gloor4mo40

For those interested in this angle (how AI outcomes without humans could still go a number of ways, and what variables could make them go better/worse), I recently brainstormed here and here some things that might matter.

[-]Templarrr4mo10

engineer, honestly

First I thought this was hilarious, as in "we really just want an engineer FFS", but then I checked.

Engineer, honesTY. As in "engineer to research and improve models honesty".

LESSWRONG
LW

43