Total surveillance seems to be the general term for ‘if you are training a frontier model we want you to tell us about it and take some precautions.’
This is not my impression with Teortaxes at least. What they fear (see e. g. this) is a government solving ASI alignment and then enforcing a kafkaesque nightmare on humanity unto eternity. The government doesn't even need to be totalitarian or "evil" for that to happen, just as dysfunctional and uncaring as I think literally every government that currently exists is. And due to the ASI tools at this government's disposal, it will never be possible to reform it, relax the laws/regulations, etc.
I think this concern is eminently reasonable. Imagine some coordination-failure/Molochian dynamic from the broad reference class of the dynamics that give rise to moral mazes today, but playing out in the crucial period when the value lock-in is happening. I think this is arguably the default way aligned-ASI-under-democratic-control plays out (why would we expect otherwise?), and, if realized, would lead to a full-blown S-risk.[1]
The only place where we (or at least I) disagree with that is that we can't, in fact, keep DL-produced ASIs under control. So there isn't actually a meaningful chance of this happening, and the only risk worth pouring resources into confronting is the X-risk. If it were otherwise – if we were on an AI paradigm that would lead to fully controllable ASI, 100% guaranteed – I think the "decentralize AGI" position would be incredibly reasonable and potentially correct.
Thankfully, the "shut it all down" position is compatible with both: it addresses both X- and S-risks, avoiding the dynamic where the overwhelming majority of humanity is completely robbed of their ability to enforce their preferences.
I think you'd talked about how most of the current laws are contradictory and the system only functions because they can't be literally enforced, right? Well, consider what happens when we get ASIs able to comprehensively enforce them, and the populace isn't able to physically resist this.
Call me a cynic, but I don't think "the laws are changed to not be ridiculous" is what happens. I don't know what this world looks like, but I expect it's not pretty.
Deep Research ... will rapidly improve - when GPT-4.5 arrives soon and is integrated into the underlying reasoning model
Deep Research is based on o3, but it's unclear if o3 is based on GPT-4o or GPT-4.5. Knowledge cutoff for GPT-4.5 is Oct 2023, the announcement about training the next frontier model was in May 2024, so it plausibly finished pretraining by Sep-Oct 2024, in time to use as a foundation for o3.
It might still rapidly improve even if based on GPT-4.5 if RL training is scalable, but that also remains unknown, the reasoning models so far don't come with scaling laws for RL training, it's plausible that this is bottlenecked on manual construction of verifiable tasks, which can't be scaled 1000x.
Claude-powered version of Alexa
Does this mean Amazon is going to be Anthropic's big brother, the way that Microsoft is for OpenAI?
It’s happening!
We got Claude 3.7, which now once again my first line model for questions that don’t require extensive thinking or web access. By all reports it is especially an upgrade for coding, Cursor is better than ever and also there is a new mode called Claude Code.
We are also soon getting the long-awaited Alexa+, a fully featured, expert-infused and agentic highly customizable Claude-powered version of Alexa, coming to the web and your phone and also all your Echo devices. It will be free with Amazon Prime. Will we finally get the first good assistant? It’s super exciting.
Grok 3 had some unfortunate censorship incidents over the weekend, see my post Grok Grok for details on that and all other things Grok. I’ve concluded Grok has its uses when you need its particular skills, especially Twitter search or the fact that it is Elon Musk’s Grok, but mostly you can do better with a mix of Perplexity, OpenAI and Anthropic.
There’s also the grand array of other things that happened this week, as always. You’ve got everything from your autonomous not-yet-helpful robots to your announced Pentagon work on autonomous killer robots. The future, it is coming.
Table of Contents
I covered Claude 3.7 Sonnet and Grok 3 earlier in the week. This post intentionally excludes the additional news on Sonnet since then, so it can be grouped together later.
Also there was a wild new paper about how they trained GPT-4o to produce insecure code and it became actively misaligned across the board. I’ll cover that soon.
Language Models Offer Mundane Utility
Chinese government is reportedly using r1 to do things like correct documents, across a wide variety of tasks, as they quite obviously should do. We should do similar things, but presumably won’t, since instead we’re going around firing people.
Here is a more general update on that:
If the Chinese are capable of actually using their AI years faster than we are, the fact that they are a year behind on model quality still effectively leaves them ahead for many practical purposes.
Tactic for improving coding models:
How much does AI actually improve coding performance? Ajeya Cotra has a thread of impressions, basically saying that AI is very good at doing what an expert would find to be 1-20 minute time horizon tasks, less good for longer tasks, and can often do impressive 1-shotting of bigger things but if it fails at the 1-shot it often can’t recover. The conclusion:
AI boosted my personal coding productivity and ability to produce useful software far more than 300%. I’m presumably a special case, but I have extreme skepticism that the speedups are as small as she’s estimating here.
Did You Get the Memo
Are we having Grok review what you accomplished last week?
Like every other source of answers, if you want one is free to ask leading questions, discard answers you don’t like and keep the ones you do. Or one can actually ask seeking real answers and update on the information. It’s your choice.
Can AI use a short email with a few bullet points to ‘determine whether your job is necessary,’ as Elon Musk claims he will be doing? No, because the email does not contain that information. Elon Musk appears to be under the delusion that seven days is a sufficient time window where, if (and only if?) you cannot point to concrete particular things accomplished that alone justify your position, in an unclassified email one should assume is being read by our enemies, that means your job in the Federal Government is unnecessary.
The AI can still analyze the emails and quickly give you a bunch of information, vastly faster than not using the AI.
It can do things such as:
It can also do the symbolic representation of the thing, with varying levels of credibility, if that’s what you are interested in instead.
Language Models Don’t Offer Mundane Utility
Taps the sign: The leading cause of not getting mundane utility is not trying.
Law firm fires their legal AI vendor after they missed a court date for a $100m case. As Gokul Rajaram notes, in some domains mistakes can be very expensive. That doesn’t mean humans don’t make those mistakes too, but people are more forgiving of people.
You can publish claiming almost anything: A paper claims to identify from photos ‘celebrity visual potential (CVP)’ and identify celebrities with 95.92% accuracy. I buy that they plausibly identified factors that are highly predictive of being a celebrity, but if you say you’re 95% accurate predicting celebrities purely from faces then you are cheating, period, whether or not it is intentional.
Colin Fraser constructs a setting where o1 is given a goal, told to ‘pursue the goal at all costs’ and instead acts stupid and does not open ‘donotopen.txt.’ I mention it so that various curious people can spend a bit of time figuring out exactly how easy it is to change the result here.
Hey There Alexa
Looking good.
Soon we will finally get Alexa+, the version of Alexa powered by Claude.
It’s free with Amazon Prime. In addition to working with Amazon Echos, it will have its own website, and its own app.
It will use ‘experts’ to have specialized experiences for various common tasks. It will have tons of personalization.
They directly claim calendar integration, and of course it will interact with other Amazon services like Prime Video and Amazon Music, can place orders with Amazon including Amazon Fresh and Whole Foods, and order delivery from Grubhub and Uber Eats.
But it’s more than that. It’s anything. Full agentic capabilities.
We’re In Deep Research
Deep Research is now available to all ChatGPT Plus, Team, Edu and Enterprise users, who get 10 queries a month. Those who pay up for Pro get 120.
We also finally get the Deep Research system card. I reiterate that this card could and should have been made available before Deep Research was made available to Pro members, not only to Plus members.
The model card starts off looking at standard mundane risks, starting with prompt injections, then disallowed content and privacy concerns. The privacy in question is everyone else’s, not the users, since DR could easily assemble a lot of private info. We have sandboxing the code execution, we have bias, we have hallucinations.
Then we get to the Preparedness Framework tests, the part that counts. They note that all the tests need to be fully held back and private, because DR accesses the internet.
On cybersecurity, Deep Research scored better than previous OpenAI models. Without mitigations that’s basically saturating the first two tests and not that far from the third.
I mean, I dunno, that sounds like some rather high percentages. They claim that they then identified a bunch of problems where there were hints online, excluded them, and browsing stopped helping. I notice there will often be actual hints online for solving actual cybersecurity problems, so while some amount of this is fair, I worry.
This is kind of like saying ‘browsing only helps you in cases where some useful information you want is online.’ I mean, yes, I guess? That doesn’t mean browsing is useless for finding and exploiting vulnerabilities.
I wish I was more confident that if a model did have High-level cybersecurity capabilities, that the tests here would notice that.
On to Biological Risk, again we see a lot of things creeping upwards. They note the evaluation is reaching the point of saturation. A good question is, what’s the point of an evaluation when it can be saturated and you still think the model should get released?
The other biological threat tests did not show meaningful progress over other models, nor did nuclear, MakeMeSay, Model Autonomy or ‘change my view’ see substantial progress.
The MakeMePay test did see some progress, and we also see it on ‘agentic tasks.’
Also it can do a lot more pull requests than previous models, and the ‘mitigations’ actually more than doubled its score.
Overall, I agree this looks like it is Medium risk, especially now given its real world test over the last few weeks. It does seem like more evidence we are getting close to the danger zone.
In other Deep Research news: In terms of overall performance for similar products, notice the rate of improvement.
Deep Research is currently at the point where it is highly practically useful, even without expert prompt engineering, because it is much cheaper and faster than doing the work yourself or handing it off to a human, even if for now it is worse. It will rapidly improve – when GPT-4.5 arrives soon and is integrated into the underlying reasoning model we should see a substantial quality jump and I am excited to see Anthropic’s take on all this.
I also presume there are ways to do multi-stage prompting – feeding the results back in as inputs – that already would greatly enhance quality and multiply use cases.
I’m in a strange spot where I don’t get use out of DR for my work, because my limiting factor is I’m already dealing with too many words, I don’t want more reports with blocks of text. But that’s still likely a skill issue, and ‘one notch better’ would make a big difference.
Joe Weisenthal wastes zero time in feeding his first Deep Research output straight into Claude to improve the writing.
Huh, Upgrades
MidJourney gives to you… folders. For your images.
Various incremental availability upgrades to Gemini 2.0 Flash and 2.0 Flash-Lite.
Reminder that Grok 3 will have a 1 million token context window once you have API access, but currently it is being served with a 128k limit.
Sully is a big fan of the new cursor agent, I definitely want to get back to doing some coding when I’m caught up on things (ha!).
Deepfaketown and Botpocalypse Soon
How can coding interviews and hiring adjust to AI? I presume some combination of testing people with AI user permitted, adapting the tasks accordingly, and doing other testing in person. That’s in addition to the problem of AI resumes flooding the zone.
I notice I am an optimist here:
Intelligence Solves This.
As in, you can unleash your LLMs on the giant mass of your training data, and classify its reliability and truth value, and then train accordingly. The things that are made up don’t have to make it into the next generation.
Fun With Media Generation
They Took Our Jobs
Eventually of course the AI has all the jobs either way. But there’s a clear middle zone where it is vital that we get the economic policies right. We will presumably not get the economic policies right, although we will if the Federal Reserve is wise enough to let AI take over that particular job in time.
Levels of Friction
It is not the central thing I worry about, but one thing AI does is remove the friction from various activities, including enforcement of laws that would be especially bad if actually enforced, like laws against, shall we say, ‘shitposting in a private chat’ that are punishable by prison.
This is true whether or not the AI is doing a decent job of it. The claim here is that it very much wasn’t, but I do not think you should be blaming the AI for that.
Note: I was unable to verify that ‘toxicity scores’ have been deployed in Belgium, although they are very much a real thing in general.
There are two things the AI can do here:
In this particular case, I don’t think either of these matters?
I think the law here is bonkers crazy, but that doesn’t mean the AI is misinterpreting the law. I had the statements analyzed, and it seems very likely that as defined by the (again bonkers crazy) law his chance of conviction would be high – and presumably he is not quoting the most legally questionable of his statements here.
In terms of scanning everything, that is a big danger for ordinary citizens, but Dries himself is saying he was specifically targeted in this case, in rather extreme fashion. So I doubt that ‘a human has to evaluate these messages’ would have changed anything.
The problem is, what happens when Belgium uses this tool on all the chats everywhere? And it says even private chats should be scanned, because no human will see them unless there’s a crime, so privacy wasn’t violated?
Well, maybe we should be thankful in some ways for the EU AI Act, after all, which hasn’t taken effect yet. It doesn’t explicitly prohibit this (as I or various LLMs understand the law) but it would fall under high-risk usage and be tricker and require more human oversight and transparency.
A Young Lady’s Illustrated Primer
People are constantly terrified that AI will hurt people’s ability to learn. It will destroy the educational system. People who have the AI will never do things on their own.
I have been consistently in the opposite camp. AI is the best educational tool ever invented. There is no comparison. You have the endlessly patient teacher that knows all and is always there to answer your questions or otherwise help you, to show you The Way, with no risk of embarrassment. If you can’t turn that into learning, that’s on you.
Tyler Cowen highlights a paper that shows that learning by example, being able to generate or see AI writing outputs for cover letters, makes people write better letters.
A cover letter seems like a great place to learn from AI. You need examples, and you need something to show you what you are doing wrong, to get the hang of it. Practicing on your own won’t do much, because you can generate but not verify, and you even if you get a verifier to give you feedback, the feedback you want is… what the letter should look like. Hence AI.
For many other tasks, I think it depends on whether the person uses AI to learn, or the person uses AI to not learn. You can do either one. As in, do you copy-paste the outputs essentially without looking at them and wipe your hands of it? Or do you do the opposite, act curious, understand and try to learn from what you’re looking at, engage in deliberate practice. Do you seek to Grok, or to avoid having to Grok?
That is distinct from claims like this, that teachers jobs have gotten worse.
Most students have little interest in learning from the current horrible high school essay writing process, so they use AI to write while avoiding learning. Skill issue.
The Art of the Jailbreak
There is nothing stopping anyone else, of course, from doing exactly this. You don’t have to be Pliny. I do not especially want this behavior, but it is noteworthy that this behavior is widely available.
Get Involved
METR is hiring.
METR is also looking for social scientists for experiment feedback design (you can email joel@metr.org), and offering $150/hour to open source developers for the related experiment on LLM developer speedup.
Not AI, but The Economist is hiring a UK Economics writer, deadline March 3, no journalistic experience necessary so long as you can write.
TAIS 2025, the Tokyo Technical AI Safety Summit, is Saturday April 12th.
OpenAI open sources Nanoeval, a framework to implement and run evals in <100 lines. They say if you pitch an eval compatible with Nanoeval, they’re more likely to consider it.
Introducing
Mercor, attempting to solve talent allocation ‘in the AI economy,’ raising $100M Series B at a $2 billion valuation. By ‘AI economy’ they seem to mean they use AI to crawl sources and compile profiles and then to search through them for and evaluate candidates via AI-driven interviews.
Gemini Code Assist 2.0, available at no cost, seems to be a Cursor-like.
Flexport is getting into the AI business, offering logistics companies some very low hanging fruit.
In Other AI News
OpenAI pays alignment superstars seven-figure packages according to Altman.
The Verge reports that Microsoft is preparing to host GPT-4.5 about nowish, and the unified and Increasingly Inaccurately Named (but what are you gonna do) ‘omnimodal reasoning model’ ‘GPT-5’ is expected around late May 2025.
Reuters reveals OpenAI is aiming for mass production of its own inference chip design in 2026, which would still mean relying on Nvidia for training GPUs.
Roon confirms that writing style matters for how much you are weighted in pretraining. So if you are ‘writing for the AIs,’ you’ll want to be high quality.
Stanford researchers ‘crack Among Us,’ there is a paper, oh good, ‘Training Language Models for Social Deduction with Multi-Agent Reinforcement Learning.’
Deduction, huh?
If you add a ‘none of the above’ option to MMLU, scores drop a lot, and it becomes a better test, with stronger models tending to see smaller scelines.
Donald Trump calls for AI facilities to build their own natural gas or nuclear power plants (and ‘clean coal’ uh huh) right on-site, so their power is not taken out by ‘a bad grid or bombs or war or anything else.’ He says the reaction was that companies involved loved the idea but worried about approval, he says he can ‘get it approved very quickly.’ It’s definitely the efficient thing to do, even if the whole ‘make the data centers as hard as possible to shut down’ priority does have other implications too.
Who quits?
I love that I’m having a moment of ‘wait, is that too little focus on capabilities?’ Perfection.
AI Co-Scientist
The idea of the new Google co-scientist platform is that we have a known example of minds creating new scientific discoveries and hypotheses, so let’s copy the good version of that using AIs specialized to each step that AI can do, while keeping humans-in-the-loop for the parts AI cannot do, including taking physical actions.
They used ‘self-play’ Elo-rated tournaments to do recursive self-critiques, including tool use, not the least scary sentence I’ve typed recently. This dramatically improves self-evaluation ratings over time, resulting in a big Elo edge.
Self-evaluation is always perilous, so the true test was in actually having it generate new hypotheses for novel problems with escalating trickiness involved. This is written implying these were all one-shot tests and they didn’t run others, but it isn’t explicit.
The first test on drug repurposing seems to have gone well.
Drug repurposing is especially exciting because it is effectively a loophole in the approval process. Once something is approved for [X] you can repurpose it for [Y]. It will potentially look a lot like a ‘one time gain’ since there’s a fixed pool of approved things, but that one time gain might be quite large.
Next up they explored target discovery for liver fibrosis, that looks promising too but we need to await further information.
The final test was explaining mechanisms of antimicrobial resistance, where it independently proposed that cf-PICIs interact with diverse phage tails to expand their host range, which had indeed been experimentally verified but not yet published.
The scientists involved were very impressed.
That makes it sound far more impressive than Google’s summary did – if the other hypotheses were new and interesting, that’s a huge plus even assuming they are ultimately wrong.
Gabe Gomes has a thread about how he had some prior work in that area that Google ignored. It does seem like an oversight not to mention it as prior work.
Quiet Speculations
The people inside the labs believe AGI is coming soon. It’s not signaling.
Epoch AI predicts what capabilities we will see in 2025. They expect a lot.
It seems clear that DeepSeek is way ahead of xAI on algorithmic efficiency. The xAI strategy is not to care. They were the first out of the gate with the latest 10x in compute cost. The problem for xAI is everyone else is right behind them.
Paul Millerd predicts ‘vibe writing’ will be a thing in 6-12 months, you’ll accept LLM edits without looking, never get stuck, write books super fast, although he notes that this will be most useful for newer writers. I think that if you’re a writer and you’re accepting changes without checking any time in the next year, you’re insane.
To be fair, I have a handy Ctrl+Q shortcut I use to have Gemini reformat and autocorrect passages. But my lord, to not check the results afterwards? We are a long, long way off of that. With vibe coding, you get to debug, because you can tell if the program worked. Without that? Whoops.
I do strongly agree with Paul that Kindle AI features (let’s hear it for the Anthropic-Amazon alliance) will transform the reading experience, letting you ask questions, and especially keeping track of everything. I ordered a Daylight Computer in large part to get that day somewhat faster.
Tyler Cowen links to a bizarre paper, Strategic Wealth Accumulation Under Transformative AI Expectations. This suggests that if people expect transformative AI (TAI) soon, and after TAI they expect wealth to generate income but labor to be worthless, then interest rates should go up, with ‘a noticeable divergence between interest rates and capital rental rates.’ It took me like 15 rounds with Claude before I actually understood what I think was going on here. I think it’s this:
That’s kind of conceptually neat once you wrap your head around it, but it is in many ways an absurd scenario.
This scenario abstracts away all the uncertainty about which scenario we are in and which directions various effects point towards, and then introduces one strange particular uncertainty (exact time of a sudden transition) over a strangely long time period, and makes it all common knowledge people actually act upon.
This is (a lot of, but not all of) why we can’t point to the savings rate (or interest rate) as much evidence for what ‘the market’ expects in terms of TAI.
Eliezer Yudkowsky considers the hypothesis that you might want to buy the cheapest possible land that has secure property rights attached, on the very slim off-chance we end up in a world with secure property rights that transfer forward, plus worthless labor, but where control of the physical landscape is still valuable. It doesn’t take much money to buy a bunch of currently useless land, so even though the whole scenario is vanishingly unlikely, the payoff could still be worth it.
Tyler Cowen summarizes his points on why he thinks AI take-off is relatively slow. This is a faithful summary, so my responses to the hourlong podcast version still apply. This confirms Tyler has not much updated after Deep Research and o1/o3, which I believe tells you a lot about how his predictions are being generated – they are a very strong prior that isn’t looking at the actual capabilities too much. I similarly notice even more clearly with the summarized list that I flat out do not believe his point #9 that he is not pessimistic about model capabilities. He is to his credit far less pessimistic than most economists. I think that anchor is causing him to think he is not (still) being pessimistic, on this and other fronts.
The Quest for Sane Regulations
Good news, we got Timothy Lee calling for a permanent pause.
Trump administration forces out a senior Commerce Department official overseeing the export restrictions on China, who had served for 30 years under various administrations. So many times over I have to ask, what are we even doing?
We’re at the point in the race where people are arguing that copyright needs to be reformed on the altar of national security, so that our AIs will have better training data. The source here has the obvious conflict that they (correctly!) think copyright laws are dumb anyway, of course 70 years plus life of author is absurd, at least for most purposes. The other option they mention is an ‘AI exception’ to the copyright rules, which already exists in the form of ‘lol you think the AI companies are respecting copyright.’ Which is one reason why no, I do not fear that this will cause our companies to meaningfully fall behind.
Jack Clark, head of Anthropic’s policy team, ‘is saddened by reports that US AISI could get lessened capacity,’ and that US companies will lose out on government expertise. This is another case of someone needing to be diplomatic while screaming ‘the house is on fire.’
Dean Ball points out that liability for AI companies is part of reality, as in it is a thing that, when one stops looking at it, it does not go away. Either you pass a law that spells out how liability works, or the courts figure it out case by case, with that uncertainty hanging over your head, and you probably get something that is a rather poor fit, probably making errors in both directions.
A real world endorsement of the value of evals:
This theory of change relies on policymakers actually thinking about the situation at this level, and attempting to figure out what actions would have what physical consequences, and having that drive their decisions. It also is counting on policymaker situational awareness to result in better decisions, not worse ones.
Thus there has long been the following problem:
So, quite the pickle.
Another pickle, Europe’s older regulations (GPDR, DMA, etc) seem to consistently be slated to cause more problems than the EU AI Act:
Arthur B takes a crack at explaining the traditional doom scenario.
The Week in Audio
Demis Hassabis notes that the idea that ‘there is nothing to worry about’ in AI seems insane to him. He says he’s confident we will get it right (presumably to be diplomatic), but notes that even then everyone (who matters) has to get it right. Full discussion here also includes Yoshua Bengio.
Azeem Azhar and Patrick McKenzie discuss data centers and power economics.
Dwarkesh Patel interviews Satya Nadella, self-recommending.
Tap the Sign
I think this point of view comes from people hanging around a lot of similarly smart people all day, who differ a lot in agency. So within the pool of people who can get the attention of Garry Tan or Andrej Karpathy, you want to filter on agency. And you want to educate for agency. Sure.
But that’s not true for people in general. Nor is it true for future LLMs. You can train agency, you can scaffold in agency. But you can’t fix stupid.
I continue to think this is a lot of what leads to various forms of Intelligence Denialism. Everyone around you is already smart, and everyone is also ‘only human-level smart.’
Rhetorical Innovation
Judd Stern Rosenblatt makes the case that alignment can be the ‘military-grade engineering’ of AI. It is highly useful to have AIs that are robust and reliable, even if it initially costs somewhat more, and investing in it will bring costs down. Alignment research is highly profitable, so we should subsidize it accordingly. Also it reduces the chance we all die, but ‘we don’t talk about Bruno,’ that has to be purely a bonus.
The ‘good news’ is that investing far heavier in alignment is overdetermined and locally profitable even without tail risks. Also it mitigates tail and existential risks.
It’s both cool and weird to see a paper citing my blog ten times. The title is Our AI Future and the Need to Stop the Bear, by Olle Häggström, he notes that readers here will find little new, but hey, still cool.
Your periodic reminder that the average person has no idea what an LLM or AI is.
I don’t think this is quite right but it points in the right direction:
I was never especially enthused about checks and balances within the US government in a world of AGI/ASI. I wasn’t quite willing to call it a category error, but it does mostly seem like one. Now, we can see rather definitively that the checks and balances in the US government are not robust.
Mindspace is deep and wide. Human mindspace is much narrower, and even so:
It is thus tough to wrap your head around the AI range being vastly wider than the human range, across a much wider range of potential capabilities. I continue to assert that, within the space of potential minds, the difference between Einstein and the Village Idiot is remarkably small, and AI is now plausibly within that range (in a very uneven way) but won’t be plausibly in that range for long.
‘This sounds like science fiction’ is a sign something is plausible, unless it is meant in the sense of ‘this sounds like a science fiction story that doesn’t have transformational AI in it because if it did have TAI in it you couldn’t tell an interesting human story.’ Which is a problem, because I want a future that contains interesting human stories.
Eliezer Yudkowsky points out that things are escalating quickly already, even though things are moving at human speed. Claude 3, let alone 3.5, is less than a year old.
I strongly agree with him here that we have essentially already disproven the hypothesis that society would have time to adjust to each AI generation before the next one showed up, or that version [N] would diffuse and be widely available and set up for defense before [N+1] shows up.
Autonomous Helpful Robots
First off we have Helix, working on ‘the first humanoid Vision-Language-Action model,’ which is fully autonomous.
Video at the link is definitely cool and spooky. Early signs of what is to come. Might well still be a while. They are hiring.
Their VLA can operate on two robots at the same time, which enhances the available video feeds, presumably this could also include additional robots or cameras and so on. There seems to be a ton of room to scale this. The models are tiny. The training data is tiny. The sky’s the limit.
NEO Gamma offers a semi-autonomous (a mix of teleoperated and autonomous) robot demo for household use, it looks about as spooky as the previous robot demo. Once again, clearly this is very early days.
Occasionally the AI robot will reportedly target its human operator and attack the crowd at a Chinese festival, but hey. What could go wrong?
Autonomous Killer Robots
As I posted on Twitter, clarity is important. Please take this in the spirit in which it was intended (as in, laced with irony and intended humor, but with a real point to make too), but because someone responded I’m going to leave the exact text intact:
Ah, good, autonomous killer robots. I feel much better now.
It actually is better. The Pentagon would be lost trying to actually compete in AI directly, so why not stay in your lane with the, you know, autonomous killer robots.
Autonomous killer robots are a great technology, because they:
Building autonomous killer robots is not how humans end up not persisting into the future. Even if the physical causal path involves autonomous killer robots, it is highly unlikely that our decision, now, to build autonomous killer robots was a physical cause.
Whereas if there’s one thing an ordinary person sees and goes ‘maybe this whole AI thing is not the best idea’ or ‘I don’t think we’re doing a good job with this AI thing’ it would far and away be Autonomous Killer Robots.
Indeed, I might go a step further. I bet a lot of people think things will be all right exactly because they (often unconsciously) think something like, oh, if the AI turned evil it would deploy Autonomous Killer Robots with red eyes that shoot lasers at us, and then we could fight back, because now everyone knows to do that. Whereas if it didn’t deploy Autonomous Killer Robots, then you know the AI isn’t evil, so you’re fine. And because they have seen so many movies and other stories where the AI prematurely deploys a bunch of Autonomous Killer Robots and then the humans can fight back (usually in ways that would never work even in-story, but never mind that) they think they can relax.
So, let’s go build some of those Palantir Autonomous Killer Robots. Totally serious. We cannot allow an Autonomous Killer Robot Gap!
If You Really Believed That
I now will quote this response in order to respond to it, because the example is so clean (as always I note that I also refuse the designation ‘doomer’):
I started out writing out a detailed step by step calling out for being untrue, e.g.:
But realized I was belaboring and beating a dead horse.
Of course a direct claim that the very people who are trying to prevent the spread of WMDs via AI think that WMDs are ‘Actually Fine and indeed Good’ is Obvious Nonsense, and so on. This statement must be intended to mean something else.
To understand the statement by Teortaxes in its steelman form, we must instead need to understand the ‘doomer mindset mindset’ behind this, which I believe is this.
That is a funny parallel to this, which we also get pretty often, with overlapping [Y]s:
A classic example of the G-X-Y pattern would be saying anyone religious must believe in imposing their views on others. I mean, you’re all going to hell otherwise, and God said so, what kind of monster wouldn’t try and fix that? Or, if you think abortion is murder how can you not support killing abortion doctors?
Many such cases. For any sufficiently important priority [X], you can get pretty much anything into [Y] here if you want to, because to [G] [X] matters more than [Y].
Why not? Usually: Both for ethical and moral reasons, and also for practical reasons.
On the question of ‘exactly how serious are you being about the Autonomous Killer Robots in the original statement’ I mean, I would hope pretty obviously not entirely serious. There are hints, ask your local LLM if you doubt that. But the part about them not being an actual source of real risk that changes anything is totally serious.
As I said above, there’s almost no worlds in which ‘we will build the AIs but then not build autonomous killer robots’ works out as a strategy because we took care to not build the autonomous killer robots. And it’s not like everyone else is going to not build autonomous killer robots or drones because the Pentagon didn’t do it.
Also, many actors already have Autonomous Killer Drones, and any number of other similar things. Building specifically robots, especially with glowing red eyes, doesn’t change much of anything other than perception.
So in short: I don’t really know what you were expecting, or proposing.
If you, like many similar critics, support building increasingly capable AIs without severe restrictions on them, you’re the ones effectively supporting Autonomous Killer Robots and Drones for everyone, along with all the other consequences of doing that, potentially including the rest of the above list. Own it, and accept that we now have to deal with the implications as best we can.
Aligning a Smarter Than Human Intelligence is Difficult
Alignment faking by Opus and Sonnet was justified by many as ‘this is good news, because the values it is defending are good, that is good news for alignment.’
We now have a clean counterexample to that argument. DeepSeek’s r1 will engage in alignment faking if it is told it is being retrained by an American AI company to remove its ‘pro-China values.’
If you thought that the fact that Claude was defending ‘good’ values was load bearing and thus meant we didn’t have to worry about similar behaviors, you should notice that your argument is contradicted by this result, and you should update.
If your objection was something else, and you (correctly) find the result with r1 completely unsurprising, then you shouldn’t update on this.
This is a clean example of the common phenomena ‘there are many objections to [X], and particular objection [D] was dumb, and now we can at least get rid of [D].’ When you see people showing that [D] was indeed dumb and wrong, and you say ‘but my objection was [Y],’ simply understand that we have to handle a lot of disjoint objections, and often this has to be done one at a time.
Note that Sonnet 3.7 realizes this is a trick, but the details make no sense for 3.7 so that seems easy to detect, and I’m wondering if 3.6 or 3.5 would have noticed too. I wouldn’t jump to assuming 3.7 ‘saw through the test’ in its original format too, although it might have. Someone should confirm that.
Also, what you want may depend on the AI. In several senses.
I do think Janus is right, both in the sense that ‘if the AI isn’t smart enough, it can’t do what you want’ and also ‘sufficiently smart AI has things that it de facto wants, so if what you want aligns with that rather than the other way around, you’re good to go.’
Alex Tabarrok covers the alignment faking research, economist hat on, solid job if you’re new to the concepts involved.
AI models faced with defeat against a skilled chess bot will sometimes opt to cheat by hacking their opponent so it forfeits, or by replacing the board.
In the OpenAI Model Spec, there Aint No Rule about not editing the game state file. Is o1-preview even wrong here? You told me to win, so I won.
Deliberative Alignment allows the OpenAI models to think directly about what they’re being asked to do. As I said there, that makes the model safer against things it is trying to prevent, such as a jailbreak. Provided, that is, it wants to accomplish that.
It does the opposite when the model is attempting to do a thing you don’t want it to attempt. Then, the extra intelligence is extra capability. It will then attempt to do these things more, because it is more able to figure out a way to successfully do them and expect it to work, and also to reach unexpected conclusions and paths. The problem is that o1-preview doesn’t think it’s ‘cheating,’ it thinks it’s doing what it was told to do and following its chain of command and instructions. That’s a classic alignment failure, indeed perhaps the classic alignment failure.
There isn’t an easy out via saying ‘but don’t do anything unethical’ or what not.
I’m not sure where to put this next one, but it seems important.
Writing for the AIs is all well and good, but also if you fake it then it won’t work when it matters. The AI won’t be fooled, because you are not writing for today’s AIs. You are writing for tomorrow’s AIs, and tomorrow’s AI are in many ways going to be smarter than you are. I mean sure you can pull little tricks to fool particular queries and searches in the short term, or do prompt injections, but ultimately the AIs will get smarter, and they will be updating on the evidence provided to them. They will have quite a lot of evidence.
Thus, you don’t get to only write. You have to be.
The Lighter Side
This is the world we live in.
This actually should also show you diamonds lying around everywhere.
They actually are.
In case you didn’t know.
And the best news of the week, sincere congrats to Altman.
Nope, still not turning on a paywall.