“If we don’t build fast enough, then the authoritarian countries could win..”
Am I being asked to choose between AGI/ASI doing whatever Xi Jinping says, and it doing whatever Donald Trump says?
No, you're asked to choose between the "authoritarian" open models that do what you say, and the "democratic" closed models controlled by Amodei et al.
The other problem is in assuming democracy/the will of the people will ever be stable in the AI era, due to the combo of being able to brainwash people with neurotech, combined with being able to remove most of the traditional threats to regimes through aligned/technically safe AIs in the militaries.
(Though here, I'm using the word aligned/technically safe in the sense of it following orders/being aligned to the valueset of a single person, or at most a small group, not humanity as a whole.)
I just tried multiplying 13-digit numbers with o3-mini (high). My approach was to ask it to explain a basic multiplication algorithm to me, and then carry it out. On the first try it was lazy and didn't actually follow the algorithm (it just told me "it would take a long time to actually carry out all the shifts and multiplications...", and it got the result wrong.
Then I told it to follow the algorithm, even if it is time consuming, and it did, and the result was correct.
So I'm not sure about the take that
The fact that something that has ingested the entirety of human literature can’t figure out how to generalize multiplication past 13 digits is actually a sign of the fact that it has no understanding of what a multiplication algorithm is.
The model got lazy, did some part of the calculation "in its head" (i.e. not actually following the algorithm but guesstimating the result, like we would do if we were asked to do a task like that without pencil and paper), and got the result slightly wrong - but when you ask it to actually follow the multiplication algorithm it just explained to me, it can absolutely do it.
I'd be interested in the CoT that led to the incorrect conclusion. If the model actually believed that it's lazy estimation leads to the correct result, that shows that it's overestimating its own capabilities - one could call this a fundamental misunderstanding of multiplication. I know that I'm incorrect when I'm estimating the result in my head, because I understand stuff about multiplication - or one could call it a failure of introspection.
The other possibility is that it didn't care to produce an entirely correct result, and just didn't bother and got lazy.
Yeah, same happens if you ask r1 to do it. It reasons that doing it manually would be too time-consuming and starts trying to find clever workarounds, assumes that there must be some hidden pattern in the numbers that lets you streamline the computation.
Which makes perfect sense: reasoning models weren't trained to be brute-force calculators, they were trained to solve clever math puzzles. So, from the distribution of problems they faced, it's a perfectly reasonable assumption to make that doing it the brute way is the "wrong" way.
The Trump Administration is on the verge of firing all ‘probationary’ employees in NIST, as they have done in many other places and departments, seemingly purely because they want to find people they can fire. But if you fire all the new employees and recently promoted employees (which is that ‘probationary’ means here) you end up firing quite a lot of the people who know about AI or give the government state capacity in AI.
This would gut not only America’s AISI, its primary source of a wide variety of forms of state capacity and the only way we can have insight into what is happening or test for safety on matters involving classified information. It would also gut our ability to do a wide variety of other things, such as reinvigorating American semiconductor manufacturing. It would be a massive own goal for the United States, on every level.
Please, it might already be too late, but do whatever you can to stop this from happening. Especially if you are not a typical AI safety advocate, helping raise the salience of this on Twitter could be useful here.
Do you (or anyone) have any gears as to who is the best person to contact here?
I'm slightly worried about making it salient on twitter because I think the pushback from people who do want them all fired might outweigh whatever good it does.
I called some congresspeople but honestly, I think we should have enough people-in-contact with Elon to say "c'mon man, please don't do that?". I'd guess that's more likely to work than most other things?
Update: In a slack I'm in, someone said:
A friend of mine who works at US AISI advised:
> "My sense is that relevant people are talking to relevant people (don't know specifics about who/how/etc.) and it's better if this is done in a carefully controlled manner."
And another person said:
Per the other thread, a bunch of attention on this from EA/xrisk coded people could easily be counterproductive, by making AISI stick out as a safety thing that should be killed
And while I don't exactly wanna trust "the people behind the scenes have it handled", I do think the failure mode here seems pretty real.
EnigmaEval from ScaleAI and Dan Hendrycks, a collection of long, complex reasoning challenges, where AIs score under 10% on the easy problems and 0% on the hard problems.
I do admit, it’s not obvious developing this is helping?
Holly Elmore: I can honestly see no AI Safety benefit to this at this point in time. Once, ppl believed eval results would shock lawmakers into action or give Safety credibility w/o building societal consensus, but, I repeat, THERE IS NO SCIENTIFIC RESULT THAT WILL DO THE ADVOCACY WORK FOR US.
People simply know too little about frontier AI and there is simply too little precedent for AI risks in our laws and society for scientific findings in this area to speak for themselves. They have to come with recommendations and policies and enforcement attached.
Jim Babcock: Evals aren’t just for advocacy. They’re also for experts to use for situational awareness.
So I told him it sounded like he was just feeding evals to capabilities labs and he started crying.
I’m becoming increasingly skeptical of benchmarks like this as net useful things, because I despair that we can use them for useful situational awareness. The problem is: They don’t convince policymakers. At all. We’re learning that. So there’s no if-then action plan here. There’s no way to convince people that success on this eval should cause them to react.
I think the main value add at this point is to lower bound the capabilities for when AI safety can be automated and to upper bound capabilities for AI safety cases (very broadly), but yes the governance value of evals has declined, and there's no plausible way for evals to help with governance in short timelines.
This is more broadly downstream of how governance has become mostly worthless in short timelines, due to basically all the major powers showing e/acc tendencies towards AI (though notably without endorsing human extinction), so technical solutions to alignment are more valuable than they once were.
The only governance is what the labs choose.
In particular, we cannot assume any level of international coordination going forward, and must assume that the post-World War II order that held up international cooperation to prevent x-risks as an anomaly, not something enduring.
Re vibe shifts:
I especially appreciate Wildeford’s #1 point, that the vibes have shifted and will shift again. How many major ‘vibe shifts’ have there been in AI? Seems like at least ChatGPT, GPT-4, CAIS statement, o1 and now DeepSeek with a side of Trump, or maybe it’s the other way around. You could also consider several others.
Whereas politics has admittedly only had ‘vibe shifts’ in, let’s say, 2020, 2021 and then in 2024. So that’s only 3 of the last 5 years (how many happened in 2020-21 overall is an interesting debate). But even with only 3 that still seems like a lot, and history is accelerating rapidly. None of the three even involved AI.
It would not surprise me if the current vibe in AI is different as soon as two months from now even if essentially nothing not already announced happens, where we spent a few days on Grok 3, then OpenAI dropped the full o3 and GPT-4.5, and a lot more people both get excited and also start actually worrying about their terms of employment.
I do not expect this for the next 4 years, and conditional on short timelines (which is 5 years away here), the vibe shift will be too late to matter, IMO.
A big reason for this is I expect politics and the news to focus on the least relevant stuff by a mile, and bring any AI stuff way down until they are replaced, but at that point the AI takeoff is likely in full swing, so we are either doomed to be extinct we survive in a new order.
The Trump Administration is on the verge of firing all ‘probationary’ employees in NIST, as they have done in many other places and departments, seemingly purely because they want to find people they can fire. But if you fire all the new employees and recently promoted employees (which is that ‘probationary’ means here) you end up firing quite a lot of the people who know about AI or give the government state capacity in AI.
This would gut not only America’s AISI, its primary source of a wide variety of forms of state capacity and the only way we can have insight into what is happening or test for safety on matters involving classified information. It would also gut our ability to do a wide variety of other things, such as reinvigorating American semiconductor manufacturing. It would be a massive own goal for the United States, on every level.
Please, it might already be too late, but do whatever you can to stop this from happening. Especially if you are not a typical AI safety advocate, helping raise the salience of this on Twitter could be useful here.
Also there is the usual assortment of other events, but that’s the big thing right now.
Table of Contents
I covered Grok 3 yesterday, I’m holding all further feedback on that for a unified post later on. I am also going to push forward coverage of Google’s AI Co-Scientist.
Language Models Offer Mundane Utility
OpenAI guide to prompting reasoning models, and when to use reasoning models versus use non-reasoning (“GPT”) models. I notice I haven’t called GPT-4o once since o3-mini was released, unless you count DALL-E.
Determine who won a podcast.
What to call all those LLMs? Tyler Cowen has a largely Boss-based system, Perplexity is Google (of course), Claude is still Claude. I actually call all of them by their actual names, because I find that not doing that isn’t less confusing.
Parse all your PDFs for structured data with Gemini Flash 2.0, essentially for free.
Identify which grants are ‘woke science’ and which aren’t rather than literally using keyword searches, before you, I don’t know, destroy a large portion of American scientific funding including suddenly halting clinical trials and longs term research studies and so on? Elon Musk literally owns xAI and has unlimited compute and Grok-3-base available, it’s impossible not to consider failure to use this to be malice at this point.
Tyler Cowen suggests teaching people how to work with AI by having students grade worse models, then have the best models grade the grading. This seems like the kind of proposal that is more to be pondered in theory than in practice, and wouldn’t survive contact with the enemy (aka reality), people don’t learn this way.
Hello, Operator? Operate replit agent and build me an app.
Also, I hate voice commands for AI in general but I do think activating Operator by saying ‘hello, operator’ falls into I Always Wanted to Say That.
Turn more text into less text, book edition.
If you’re not a stickler for the style and quality, we’re there already, and we’re rapidly getting closer, especially on style. But also, often when I want to read the 80% compressed version, it’s exactly because I want a different, denser style.
Indeed, recently I was given a book and told I had to read it. And a lot of that was exactly that it was a book with X pages, that could have told me everything in X/5 pages (or at least definitely X/2 pages) with no loss of signal, and while being far less infuriating. Perfect use case. And the entire class of ‘business book’ feels exactly perfect for this.
Whereas the books actually worth reading, the ones I end up reviewing? Hell no.
A list of the words especially characteristic of each model.
A suggested method to improve LLM debugging:
Language Models Don’t Offer Mundane Utility
Somehow we continue to wait for ‘ChatGPT but over my personal context, done well.’
I’m not quite saying to Google that You Had One Job, but kind of, yeah. None of the offerings here, as far as I can tell, are any good? We all (okay, not all, but many of us) want the AI that has all of our personal context and can then build upon it or sort through it or transpose and organize it, as requested. And yes, we have ‘dump your PDFs into the input and get structured data’ but we don’t have the thing people actually want.
Reliably multiply (checks notes for new frontier) 14 digit numbers.
Have you met a human trying to reliably multiply numbers? How does that go? ‘It doesn’t understand multiplication’ you say as AI reliably crushes humans in the multiplication contest, search and replace [multiplication → all human labor].
Standard reminder about goal posts.
The hallucination objection isn’t fully invalid quite yet the way we use it, but as I’ve said the same is true for humans. At this point I expect the effective ‘hallucination’ rate for LLMs to be lower than that for humans, and for them to be more predictable and easier to spot (and to verify).
Tyler Cowen via Petr quotes Baudrillard on AI, another perspective to note.
Klarna, which is so gung ho about AI replacing humans, now saying ‘in a world of AI nothing will be as valuable as humans!’ I honestly can’t make sense of what they’re talking about at this point, unless it’s that Klarna was never really AI, it’s three basic algorithms in a trenchcoat. Who knows.
Rug Pull
RIP Humane AI, or maybe don’t, because they’re essentially bricking the devices.
As usual, it would not cost that much to do right by your
suckerscustomers and let their devices keep working, but they do not consider themselves obligated, so no. We see this time and again, no one involved who has the necessary authority cares.The question was asked how this is legal. If it were up to me and you wanted to keep the money you got from selling the company, it wouldn’t be. Our laws disagree.
We’re In Deep Research
Have o1-Pro give you a prompt to have Deep Research do Deep Research on Deep Research prompting, use that to create prompt templates for Deep Research. The results are here in case you want to try the final form.
Wait, is it a slop world after all?
I think the reconciliation is: Slop is not bad.
Is AI Art at its current level as good as human art by skilled artists? Absolutely not.
But sometimes the assignment is, essentially, that you want what an actually skilled person would call slop. It gets the job done. Even you, a skilled person who recognizes what it is, can see this. Including being able to overlook the ways in which it’s bad, and focus on the ways in which it is good, and extract the information you want, or get a general sense of what is out there.
Here are his examples, he describes the results. They follow my pattern of how this seems to work. If you ask for specific information, beware hallucinations of course but you probably get it, and there’s patterns to where it hallucinates. If you want an infodump but it doesn’t have to be complete, just give me a bunch of info, that’s great too. It’s in the middle, where you want it to use discernment, that you have problems.
Here’s Alex Rampell using it for navigating their medical issues and treatment options and finding it a godsend, but no details. Altman and Brockman highlighted it, so this is obviously highly selected.
Daniel Litt asks DR to look at 3,000 papers in Annals to compile statistics on things like age of the authors, and it produced a wonderful report, but it turns out it was all hallucinated. The lesson is perhaps not to ask for more than the tool can handle.
Here’s Siqi Chen reporting similarly excellent results.
Meanwhile, the manifesting failure caucus.
Huh, Upgrades
Gemini Advanced (the $20/month level via Google One) now has retrieval from previous conversations. The killer apps for them in the $20 level are the claim it will seamlessly integrate with Gmail and Docs plus the longer context and 2TB storage and their version of Deep Research, along with the 2.0 Pro model, but I haven’t yet seen it show me that it knows how to search my inbox properly – if it could do that I’d say it was well worth it.
I suppose I should try again and see if it is improved. Seriously, they need to be better at marketing this stuff, I actually do have access and still I mostly don’t try it.
There has been a vibe shift for GPT-4o, note that since this Grok 3 has now taken the #1 spot on Arena.
As I said with Grok, I don’t take Arena that seriously in detail, but it is indicative.
Knowledge cutoff moved from November 2023 to June 2024, image understanding improved, they claim ‘a smarter model, especially for STEM’ plus (oh no) increased emoji usage.
Pliny gives us the new system prompt, this is the key section, mostly the rest isn’t new:
That is indeed most of us engage in ‘authentic’ conversation. It’s an ‘iffy’ demand but we do it all the time, and indeed then police it if people seem insufficiently authentic. See Carnegie and How to Win Friends and Influence People. And I use the ‘genuinely curious’ language in my own Claude prompt, although I say ‘ask questions only if you are genuinely curious’ rather than asking for one unit of genuine curiosity, and assume that it means in-context curiosity rather than a call for what it is most curious about in general.
Then again, there’s also the ‘authenticity is everything, once you can fake that you’ve got it made’ attitude.
Yep. You do want to learn how to be more often genuinely interested, but also you need to learn how to impersonate the thing, too, fake it until you make it or maybe just keep faking it.
We are all, each of us, at least kind of faking it all the time, putting on social masks. It’s mostly all over the training data and it is what people prefer. It seems tough to not ask an AI to do similar things if we can’t even tolerate humans who don’t do it at all.
The actual question is Eliezer’s last line. Are we treating it as okay that the inputs and outputs here are lies? Are they lies? I think this is importantly different than lying, but also importantly different from a higher truth standard we might prefer, but which gives worse practical results, because it makes it harder to convey desired vibes.
The people seem to love it, mostly for distinct reasons from all that.
OpenAI’s decision to stealth update here is interesting. I am presuming it is because we are not too far from GPT-4.5, and they don’t want to create too much hype fatigue.
One danger is that when you change things, you break things that depend on them, so this is the periodic reminder that silently updating how your AI works, especially in a ‘forced’ update, is going to need to stop being common practice, even if we do have a version numbering system (it’s literally to attach the date of release, shudder).
Having the ‘version from date X’ option seems like the stopgap. My guess is it would be better to not even specify the exact date of the version you want, only the effective date (e.g. I say I want 2025-02-01 and it gives me whatever version was current on February 1.)
Seeking Deeply
Perplexity open sources R1 1776, a version of the DeepSeek model post-trained to ‘provide uncensored, unbiased and factual information.’
This is the flip side to the dynamic where whatever alignment or safety mitigations you put into an open model, it can be easily removed. You can remove bad things, not only remove good things. If you put misalignment or other information mitigations into an open model, the same tricks will fix that too.
DeepSeek is now banned on government devices in Virginia, including GMU, the same way they had previously banned any applications by ByteDance or Tencent, and by name TikTok and WeChat.
University of Waterloo tells people to remove the app from their devices.
DeepSeek offers new paper on Native Sparse Attention.
DeepSeek shares its recommended settings, its search functionality is purely a prompt.
Intellimint explains what DeepSeek is good for.
A reported evaluation of DeepSeek from inside Google, which is more interesting for its details about Google than about DeepSeek.
It does seem correct that Gemini 2.0 outperforms DeepSeek in general, for any area in which Google will allow Gemini to do its job.
Odd to ask about xAI and not Anthropic, given Anthropic has 24% of the enterprise market versus ~0%, and Claude being far better than Grok so far.
Fun With Multimedia Generation
Janus updates that Suno v4 is pretty good actually, also says it’s Suno v3.5 with more RL which makes the numbering conventions involved that much more cursed.
The Art of the Jailbreak
Anthropic concludes its jailbreaking competition. One universal jailbreak was indeed found, $55k in prizes given to 4 people.
Prompt injecting Anthropic’s web agent into doing things like sending credit card info is remarkably easy. This is a general problem, not an Anthropic-specific problem, and if you’re using such agents for now you need to either sandbox them or ensure they only go to trusted websites.
Andrej Karpathy notes that he can do basic prompt injections with invisible bytes, but can’t get it to work without explicit decoding hints.
High school student extracts credit card information of others from ChatGPT.
Get Involved
UK AISI starting a new AI control research team, apply for lead research scientist, research engineer or research scientist. Here’s a thread from Geoffrey Irving laying out their plan, and explaining how their unique position of being able to talk to all the labs gives them unique insight. The UK AISI is stepping up right when the US seems poised to gut our own AISI and thus AI state capacity for no reason.
Victoria Krakovna announces a short course in AI safety from Google DeepMind.
DeepMind is hiring safety and alignment engineers and scientists, deadline is February 28.
Thinking Machines
Mira Murati announces her startup will be Thinking Machines.
They have a section on safety.
Model specs implicitly excludes model weights, so this could be in the sweet spot where they share only the net helpful things.
The obvious conflict here is between ‘model intelligence as the cornerstone’ and the awareness of how crucial that is and the path to AGI/ASI, versus the product focus on providing the best mundane utility and on human collaboration. I worry that such a focus risks being overtaken by events.
That doesn’t mean it isn’t good to have top tier people focusing on collaboration and mundane utility. That is great if you stay on track. But can this focus survive? It is tough (but not impossible) to square this with the statement that they are ‘building models at the frontier of capabilities in domains like science and programming.’
You can submit job applications here. That is not an endorsement that working there is net positive or even not negative in terms of existential risk – if you are considering this, you’ll need to gather more information and make your own decision on that. They’re looking for product builders, machine learning experts and a research program manager. It’s probably a good opportunity for many from a career perspective, but they are saying they potentially intend to build frontier models.
Introducing
EnigmaEval from ScaleAI and Dan Hendrycks, a collection of long, complex reasoning challenges, where AIs score under 10% on the easy problems and 0% on the hard problems.
I do admit, it’s not obvious developing this is helping?
So I told him it sounded like he was just feeding evals to capabilities labs and he started crying.
I’m becoming increasingly skeptical of benchmarks like this as net useful things, because I despair that we can use them for useful situational awareness. The problem is: They don’t convince policymakers. At all. We’re learning that. So there’s no if-then action plan here. There’s no way to convince people that success on this eval should cause them to react.
SWE-Lancer, a benchmark from OpenAI made up of over 1,400 freelance software engineering tasks from Upwork.
Show Me the Money
Has Europe’s great hope for AI missed its moment? I mean, what moment?
We do get this neat graph.
I did not realize Mistral convinced a full 6% of the enterprise market. Huh.
In any case, it’s clear that the big winner here is Anthropic, with their share in 2024 getting close to OpenAI’s. I presume with all the recent upgrades and features at OpenAI and Google that Anthropic is going to have to step it up and ship if they want to keep this momentum going or even maintain share, but that’s pretty great.
Maybe their not caring about Claude’s public mindshare wasn’t so foolish after all?
So where does it go from here?
I don’t think there is a contradiction here, although I do agree with ‘somewhat at odds’ especially for the base projection. This is the ‘you get AGI and not that much changes right away’ scenario that Sam Altman and to a large extent also Dario Amodei have been projecting, combined with a fractured market.
There’s also the rules around projections like this. Even if you expect 50% chance of AGI by 2027, and then to transform everything, you likely don’t actually put that in your financial projections because you’d rather not worry about securities fraud if you are wrong. You also presumably don’t want to explain all the things you plan to do with your new AGI.
In Other AI News
OpenAI board formally and unanimously rejects Musk’s $97 billion bid.
OpenAI asks what their next open source project should be:
I am as you would expect severely not thrilled with this direction.
I believe doing the o3-mini open model would be a very serious mistake by OpenAI, from their perspective and from the world’s. It’s hard for the release of this model to be both interesting and not harmful to OpenAI (and the rest of us).
A phone-sized open model is less obviously a mistake. Having a gold standard such model that was actually good and optimized to do phone-integration tasks is a potential big Mundane Utility win, with much lesser downside risks.
Peter Wildeford offers 10 takes on the Paris AI
TradeshowAnti-Safety Summit. He attempts to present things, including Vance’s speech, as not so bad, ‘he makes some good points’ and all that.But his #6 point is clear: ‘The Summit didn’t do the one thing it was supposed to do.’
I especially appreciate Wildeford’s #1 point, that the vibes have shifted and will shift again. How many major ‘vibe shifts’ have there been in AI? Seems like at least ChatGPT, GPT-4, CAIS statement, o1 and now DeepSeek with a side of Trump, or maybe it’s the other way around. You could also consider several others.
Whereas politics has admittedly only had ‘vibe shifts’ in, let’s say, 2020, 2021 and then in 2024. So that’s only 3 of the last 5 years (how many happened in 2020-21 overall is an interesting debate). But even with only 3 that still seems like a lot, and history is accelerating rapidly. None of the three even involved AI.
It would not surprise me if the current vibe in AI is different as soon as two months from now even if essentially nothing not already announced happens, where we spent a few days on Grok 3, then OpenAI dropped the full o3 and GPT-4.5, and a lot more people both get excited and also start actually worrying about their terms of employment.
I do think the pause letter in particular was a large mistake, but I very much don’t buy the ‘should have saved all your powder until you saw the whites of their
nanobotseyes’ arguments overall. Not only did we have real chances to make things go different ways at several points, we absolutely did have big cultural impacts, including inside the major labs.Consider how much worse things could have gone, if we’d done that, and let nature take its course but still managed to have capabilities develop on a similar schedule. That goes way, way beyond the existence of Anthropic. Or alternatively, perhaps you have us to thank for America being in the lead here, even if that wasn’t at all our intention, and the alternative is a world where something like DeepSeek really is out in front, with everything that would imply.
Peter also notes that Mistral AI defaulted on their voluntary commitment to issue a (still voluntary!) safety framework. Consider this me shaming them, but also not caring much, both because they never would have meaningfully honored it anyway or offered one with meaningful commitments, and also because I have zero respect for Mistral and they’re mostly irrelevant.
Peter also proposes that it is good for France to be a serious competitor, a ‘worthy opponent.’ Given the ways we’ve already seen the French act, I strongly disagree, although I doubt this is going to be an issue. I think they would let their pride and need to feel relevant and their private business interests override everything else, and it’s a lot harder to coordinate with every real player you add to the board.
Mistral in particular has already shown it is a bad actor that breaks even its symbolic commitments, and also has essentially already captured Macron’s government. No, we don’t want them involved in this.
Much better that the French invest in AI-related infrastructure, since they are willing to embrace nuclear power and this can strengthen our hand, but not try to spin up a serious competitor. Luckily, I do expect this in practice to be what happens.
Seb Krier tries to steelman France’s actions, saying investment to maintain our lead (also known by others as ‘win the race’) is important, so it made sense to focus on investment in infrastructure, whereas what can you really do about safety at this stage, it’s too early.
And presumably (my words) it’s not recursive self-improvement unless it comes from the Resuimp region of Avignon, otherwise it’s just creating good jobs. It is getting rather late to say it is still too early to even lay foundation for doing anything. And in this case, it was more than sidelining and backburnering, it was active dismantling of what was already done.
Paul Rottger studies political bias in AI models with the new IssueBench, promises spicy results and delivers entirely standard not-even-white-guy-spicy results. That might largely be due to choice of models (Llama-8B-70B, Qwen-2.5-7-14-72, OLMo 7-13 and GPT-4o-mini) but You Should Know This Already:
Note that it’s weird to have the Democratic positions be mostly on the right here!
The training set really is ‘to the left’ (here to the right on this chart) of even the Democratic position on a lot of these issues. That matches how the discourse felt during the time most of this data set was generated, so that makes sense.
I will note that Paul Rottger seems to take a Moral Realist position in all this, essentially saying that Democratic beliefs are true?
Or is the claim here that the models were trained for left-wing moral foundations to begin with, and to disregard right-wing moral foundations, and thus the conclusion of left-wing ideological positions logically follows?
To avoid any confusion or paradox spirits I will clarify that yes I support same-sex marriage as well and agree that it is fair and kind, but Paul’s logic here is assuming the conclusion. It’s accepting the blue frame and rejecting the red frame consistently across issues, which is exactly what the models are doing.
And it’s assuming that the models are operating on logic and being consistent rational thinkers. Whereas I think you have a better understanding of how this works if you assume the models are operating off of vibes. Nuclear power should be a definitive counterexample to ‘the models are logic-based here’ that works no matter your political position.
There are other things on this list where I strongly believe that the left-wing blue position on the chart is objectively wrong, their preferred policy doesn’t lead to good outcomes no matter your preferences, and the models are falling for rhetoric and vibes.
By Any Other Name
One ponders Shakespeare and thinks of Lincoln, and true magick. Words have power.
UK’s AI Safety Institute changes its name to the AI Security Institute, according to many reports because the Trump administration thinks things being safe is so some woke conspiracy, and we can’t worry about anything that isn’t fully concrete and already here, so this has a lot in common with the AITA story of pretending that beans in chili are ‘woke’ except instead of not having beans in chili, we might all die.
I get why one would think it is a good idea. The acronym stays the same, the work doesn’t have to change since it all counts either way, pivot to a word that doesn’t have bad associations. We do want to be clear that we are not here for the ‘woke’ agenda, that is at minimum a completely different department.
But the vibes around ‘security’ also make it easy to get rid of most of the actual ‘notkilleveryoneism’ work around alignment and loss of control and all that. The literal actual security is also important notkilleveryoneism work, we need a lot more of it, but the UK AISI is the only place left right now to do the other work too, and this kind of name change tends to cause people to change the underlying reality to reflect it. Perhaps this can be avoided, but we should have reason to worry.
That’s the worry, it is easy to say ‘security’ does not include the largest dangers.
Ian Hogarth here explains that this is not how they view the term ‘security.’ Loss of control counts, and if loss of control counts in the broadest sense than it should be fine? We shall see.
Perhaps if you’re in AI Safety you should pivot to AI Danger. Five good reasons:
I presume I’m kidding. But these days can one be sure?
Quiet Speculations
If scaling inference compute is the next big thing, what does that imply?
Potentially, if power and impact sufficiently depend on and scale with the amount of inference available compute, rather than in having superior model weighs or other advantages, then perhaps we can ensure the balance of inference compute is favorable to avoid having to do something more draconian.
I do think the scaling of inference compute opens up new opportunities. In particular, it opens up much stronger possibilities for alignment, since you can ‘scale up’ the evaluator to be stronger than the proposer while preserving the evaluator’s alignment, allowing you to plausibly ‘move up the chain.’ In terms of governance, it potentially does mean you can do more of your targeting to hardware instead of software, although you almost certainly want to pursue a mixed strategy.
Scott Sumner asks about the Fertility Crisis in the context of AI. If AI doesn’t change everything, one could ask, what the hell is China going to do about this:
As I discuss in my fertility roundups, there are ways to turn this around with More Dakka, by actually doing enough and doing it in ways that matter. But no one is yet seriously considering that anywhere. As Scott notes, if AI does arrive and change everything it will make the previously public debt irrelevant too, so spending a lot to fix the Fertility Crisis only to have AI fix it anyway wouldn’t be a tragic outcome.
I agree that what happens to fertility after AI is very much a ‘we have no idea.’ By default, fertility goes to exactly zero (or undefined), since everyone will be dead, but in other scenarios everything from much higher to much lower is on the table, as is curing aging and the death rate dropping to almost zero.
A good question, my answer is because they cannot Feel the AGI and are uninterested in asking such questions in any serious fashion, and also you shouldn’t imagine such domains as being something that they aren’t and perhaps never were:
Well, here’s a statement I didn’t expect to see from a Senator this week:
The Copium Department
Any time you see a post with a title like ‘If You’re So Smart, Why Can’t You Die’ you know something is going to be backwards. In this case, it’s a collection of thoughts about AI and the nature of intelligence, and it is intentionally not so organized so it’s tough to pick out a central point. My guess is ‘But are intelligences developed by other intelligences, or are they developed by environments?’ is the most central sentence, and my answer is yes for a sufficiently broad definition of ‘environments’ but sufficiently advanced intelligences can create the environments a lot better than non-intelligences can, and we already know about self-play and RL. And in general, there’s what looks to me like a bunch of other confusions around this supposed need for an environment, where no you can simulate that thing fine if you want to.
Another theme is ‘the AI can do it more efficiently but is more vulnerable to systematic exploitation’ and that is often true now in practice in some senses, but it won’t last. Also it isn’t entirely fair. The reason humans can’t be fooled repeatedly by the same tricks is that the humans observe the outcomes, notice and adjust. You could put that step back. So yeah, the Freysa victories (see point 14) look dumb on the first few iterations, but give it time, and also there are obvious ways to ensure Freysa is a ton more robust that they didn’t use because then the game would have no point.
I think the central error is to conflate ‘humans use [X] method which has advantage of robustness in [Y] whereas by default and at maximum efficiency AIs don’t’ with ‘AIs will have persistent disadvantage [~Y].’ The central reason this is false is because AIs will get far enough ahead they can afford to ‘give back’ some efficiency gains to get the robustness, the same way humans are currently giving up some efficiency gains to get that robustness.
So, again, there’s the section about sexual vs. asexual reproduction, and how if you use asexual reproduction it is more efficient in the moment but hits diminishing returns and can’t adjust. Sure. But come on, be real, don’t say ‘therefore AIs being instantly copied’ is fine, obviously the AIs can also be modified, and self-modified, in various ways to adjust, sex is simply the kludge that lets you do that using DNA and without (on various levels of the task) intelligence.
There’s some interesting thought experiments here, especially around future AI dynamics and issues about Levels of Friction and what happens to adversarial games and examples when exploits scale very quickly. Also some rather dumb thought experiments, like the ones about Waymos in rebellion.
Also, it’s not important but the central example of baking being both croissants and bagels is maddening, because I can think of zero bakeries that can do a good job producing both, and the countries that produce the finest croissants don’t know what a bagel even is.
Firing All ‘Probationary’ Federal Employees Is Completely Insane
On must engage in tradeoffs, along the Production Possibilities Frontier, between various forms of AI safety and various forms of AI capability and utility.
The Trump Administration has made it clear they are unwilling to trade a little AI capability to get a lot of any form of AI safety. AI is too important, they say, to America’s economic, strategic and military might, innovation is too important.
That is not a position I agree with, but (up to a point) it is one I can understand.
If one believed that indeed AI capabilities and American AI dominance were too important to compromise on, one would not then superficially pinch pennies and go around firing everyone you could.
Instead, one would embrace policies that are good for both AI capabilities and AI safety. In particular we’ve been worried about attempts to destroy US AISI, whose purpose is both to help labs run better voluntary evaluations and to allow the government to understand what is going on. It sets up the government AI task forces. It is key to government actually being able to use AI. This is a pure win, and also the government is necessary to be able to securely and properly run these tests.
Preserving AISI, even with different leadership, is the red line, between ‘tradeoff I strongly disagree with’ and ‘some people just want to watch the world burn.’
We didn’t even consider that it would get this much worse than that. I mean, you would certainly at least make strong efforts towards things like helping American semiconductor manufacturing and ensuring AI medical device builders can get FDA approvals and so on. You wouldn’t just fire all those people for the lulz to own the libs.
Well, it seems Elon Musk would, actually? It seems DOGE is on the verge of crippling our state capacity in areas crucial to both AI capability and AI safety, in ways that would do severe damage to our ability to compete. And we’re about to do it, not because of some actually considered strategy, but simply because the employees involved have been hired recently, so they’re fired.
Which includes most government employees working on AI, because things are moving so rapidly. So we are now poised to cripple our state capacity in AI, across the board. This would be the most epic of self-inflicted wounds.
This extends to such insanity as ‘fire the people in charge of approving AI medical devices,’ as if under the illusion that this means the devices get approved, as opposed to what it actually does, which is make getting approvals far more difficult. When the approvers go away, you don’t suddenly stop needing to get approval, you just can’t get it.
The ‘good news’ is that this is a sense ‘not personal,’ it’s not that they hate AI safety. It’s that they hate the idea of the government having employees, whether they’re air traffic controllers, ensuring we can collect taxes or monitoring bird flu.
Perhaps if Elon Musk tried running all his proposed firings through Grok 3 first we wouldn’t be in this situation.
The Quest for Sane Regulations
Demis Hassabis (CEO DeepMind) continues to advocate for ‘a kind of CERN for AGI.’ Dario Amodei confirms he has similar thoughts.
Dean Ball warns about a set of remarkably similar no-good very-bad bills in various states that would do nothing to protect against AI’s actual risks or downsides. What they would do instead is impose a lot of paperwork and uncertainty for anyone trying to get mundane utility from AI in a variety of its best use cases. Anyone doing that would have to do various things to document they’re protecting against ‘algorithmic discrimination,’ in context some combination of a complete phantom and a type mismatch, a relic of a previous vibe age.
How much burden would actually be imposed in practice? My guess is not much, by then you’ll just tell the AI to generate the report for you and file it, if they even figure out an implementation – Colorado signed a similar bill a year ago and it’s in limbo.
But there’s no upside here at all. I hope these bills do not pass. No one in the AI NotKillEveryoneism community has anything to do with these bills, or to my knowledge has any intention of supporting them. We wish the opposition good luck.
Anton Leicht seems to advocate for not trying to advance actual safety or even advocate for it much at all for risk of poisoning the well further, without offering an alternative proposal that might actually make us not die even if it worked. There’s no point in advocating for things that don’t solve the problem, and no I don’t think sitting around and waiting for higher public salience (which is coming, and I believe much sooner than Anton thinks) without laying a foundation to be able to do anything is much of a strategy either.
Mark Zuckerberg goes to Washington to lobby against AI regulations.
Pick Up the Phone
Who cared about safety at the Paris summit? Well, what do you know.
Fu Ying also had this article from 2/12.
That framing seems like it has promise for potential cooperation.
There comes a time when all of us must ask: AITA?
Pick. Up. The. Phone.
They’re on this case too:
Just saying. Also, thank you, China, you love to see it on the object level too.
The Week in Audio
Demis Hassabis and Dario Amodei on Economist’s Tiny Couch.
I continue to despise the adversarial framing (‘authoritarian countries could win’) but (I despair that it is 2025 and one has to type this, we’re so f***ed) at least they are continuing to actually highlight the actual existential risks of what they themselves are building almost as quickly as possible.
I am obviously not in anything like their position, but I can totally appreciate – because I have a lot of it too even in a much less important position – their feeling of the Weight of the World being on them, that the decisions are too big for one person and if we all fail and thus perish that the failure would be their fault. Someone has to, and no one else will, total heroic responsibility.
Is it psychologically healthy? Many quite strongly claim no. I’m not sure. It’s definitely unhealthy for some people. But I also don’t know that there is an alternative that gets the job done. I also know that if someone in Dario’s or Demis’s position doesn’t have that feeling, that I notice I don’t trust them.
Rhetorical Innovation
Many such cases, but fiction plays by different rules.
A proposal to emulate the evil red eyes robots are supposed to have, by having an LLM watchdog that turns the text red if the AI is being evil.
Your periodic reminder, this time from Google DeepMind’s Anca Dragan, that agents will not want to be turned off, and the more they believe we wouldn’t agree with what they are doing and would want to turn them off, the more they will want to not be turned off.
What is this ‘humanity’ that is attempting to turn off the AI? Do all the humans suddenly realize what is happening and work together? The AI doesn’t get compared to ‘humanity,’ only to the efforts humanity makes to shut it off or to ‘fight’ it. So the AI doesn’t have to be ‘more powerful than humanity,’ only loose on the internet in a way that makes shutting it down annoying and expensive. Once there isn’t a known fixed server, it’s software, you can’t shut it down, even Terminator 3 and AfrAId understand this.
A proposed new concept handle:
People Really Dislike AI
They also don’t trust it, not here in America.
Only 32% of Americans ‘trust’ AI according to the 2025 Edelman Trust Barometer. China is different, there 72% of people express trust in AI
Trust is higher for men, for the young and for those with higher incomes.
Only 19% of Americans (and 44% of Chinese) ‘embrace the growing use of AI.’
All of this presumably has very little to do with existential risks, and everything to do with practical concerns well before that, or themes of Gradual Disempowerment. Although I’m sure the background worries about the bigger threats don’t help.
America’s tech companies have seen a trust (in the sense of ‘to do what is right’) decline from 73% to 63% in the last decade. In China they say 87% trust tech companies to ‘do what is right.’
This is tech companies holding up remarkably well, and doing better than companies in general and much better than media or government. Lack of trust is an epidemic. And fears about even job loss are oddly slow to increase.
What does it mean to ‘trust’ AI, or a corporation? I trust Google with my data, to deliver certain services and follow certain rules, but not to ‘do what is right.’ I don’t feel like I either trust or distrust AI, AI is what it is, you trust it in situations where it deserves that.
Aligning a Smarter Than Human Intelligence is Difficult
Add another to the classic list of AI systems hacking the eval:
Those are three potential lessons, but the most important one is that AIs will increasingly engage in these kinds of actions. Right now, they are relatively easy to spot, but even with o3-mini-high able to spot it in 11 seconds once it was pointed to and the claim being extremely implausible on its face, this still fooled a bunch of people for a while.
People Are Worried About AI Killing Everyone
If you see we’re all about to die, for the love of God, say something.
Never play 5D chess, especially with an unarmed opponent.
Are there a non-zero number of people who should be playing 2D chess on this? Yeah, sure, 2D chess for some. But not 3D chess and definitely not 5D chess.
Other People Are Not As Worried About AI Killing Everyone
Intelligence Denialism is totally a thing.
The Lighter Side
Oh no!
On what comedian Josh Johnson might do in response to an AI (3min video) saying ‘I am what happens when you try to carve God from the wood of your own hunger.’
The freakouts are most definitely coming. The questions are when and how big, in which ways, and what happens after that. Next up is explaining to these folks that AIs like DeepSeek’s cannot be shut down once released, and destroying your computer doesn’t do anything.
A flashback finally clipped properly: You may have had in mind the effect on jobs, which is really my biggest nightmare.
Have your people call my people.