The avalanche of DeepSeek news continues. We are not yet spending more than a few hours at a time in the singularity, where news happens faster than it can be processed. But it’s close, and I’ve had to not follow a bunch of other non-AI things that are also happening, at least not well enough to offer any insights.
So this week we’re going to consider China, DeepSeek and r1 fully split off from everything else, and we’ll cover everything related to DeepSeek, including the policy responses to the situation, tomorrow instead.
This is everything else in AI from the past week. Some of it almost feels like it is from another time, so long ago.
I’m afraid you’re going to need to get used to that feeling.
Also, I went on Odd Lots to discuss DeepSeek, where I was, and truly hope to again be, The Perfect Guest.
Table of Contents
Language Models Offer Mundane Utility
Joe Weisenthal finally tries out Google Flash Deep Thinking, is impressed.
AI tutors beat out active learning classrooms in a Harvard study by a good margin, for classes like physics and economics.
Koratkar gives Operator a shot at a makeshift level similar to Montezuma’s Revenge.
Nate Silver estimates his productivity is up 5% from LLMs so far, and warns others that they ignore LLMs at their peril, both politically and personally.
Fix all your transcript errors.
LLMs are good at transforming text into less text, but not yet good at transforming less text into more text. Note that this rule applies to English but doesn’t apply to code.
Write your code for AI comprehension, not human readability.
That’s even more true when the humans are using the AIs to read the code.
Language Models Don’t Offer Mundane Utility
A negative review of Devin, the AI SWE, essentially saying that in practice it isn’t yet good enough to be used over tools like Cursor. The company reached out in the replies thanking them for the feedback and offering to explore more with them, which is a good sign for the future, but it seems clear we aren’t ‘there’ yet.
I predict Paul Graham is wrong about this if we stay in ‘economic normal.’ It seems like exactly the kind of combination of inception, attempt at vibe control and failure to realize the future will be unevenly distributed we often see in VC-style circles, on top of the tech simply not being there yet anyway.
Certainly this won’t be true ‘from now on.’ No, r1 is not ready to solve writer’s block; it cannot, in a way that solves most such problems, create first drafts for you if you don’t know what to write on your own. AI will of course get better at writing, but I predict it will be a while up the ‘tech tree’ before it solves this problem.
And even if it does, it’s going to be a while beyond that before 99% of people with writer’s block even know they have this option, let alone that they are willing to take it.
And even if that happens, the 1% will be outright proud to say they have writer’s block. It means they don’t use AI!
Indeed, seriously, don’t do this:
Education is where I see the strongest disagreements about AI impact, in the sense that those who generally find AI useful see it as the ultimate tool for learning things that will unleash the world’s knowledge and revolutionize education, and then there are others who see things another way.
I don’t understand this position. Yes, you can use AI to get around your assignments if that’s what you want to do and the system keeps giving you those assignments. Or you can actually try to learn something. If you don’t take that option, I don’t believe you that you would have been learning something before.
Language Models Don’t Offer You In Particular Mundane Utility
Your periodic reminder that the main way to not get utility is to not realize you can do so:
Even better than realizing you can use ChatGPT, of course, is using a mix of Claude, Perplexity, Gemini, r1, o1, o1 pro and yes, occasionally GPT-4o.
Others make statements like this when shown some mundane utility:
Yes. It is much more efficient. Control-F sucks, it has tons of ‘hallucinations’ in the sense of false positives and also false negatives. It is not a good means to parse a report. We use it because it used to be all we had.
Also some people still don’t do random queries? And some people don’t even get why someone else would want to do that?
Here are two competing theories.
I resonate with, but only partially agree with, both answers. When doing the things you normally do, you largely do them the way you normally do them. People keep asking if I use LLMs for writing, and no, when writing directly I very much don’t, and I find all the ‘help me write’ functionality useless – but it’s invaluable for many steps of the process that puts me into position to write, or to help develop and evaluate the ideas that the writing is about.
Whereas I am perhaps the perfect person to get my coding accelerated by AI. I’m often good enough to figure out when it is telling me bullshit, and totally not good enough to generate the answers on my own in reasonable time, and it also automates stuff that would take a long time, so I get the trifecta.
On the question of detecting whether the AI is talking bullshit, it’s a known risk of course, but I think that risk is greatly overblown – this used to happen a lot more than it does now, and we forget how other sources have this risk too, and you can develop good habits about knowing which places are more likely to be bullshit versus not even if you don’t know the underlying area, and when there’s enough value to check versus when you’re fine to take its word for it.
A few times a month I will have to make corrections that are not simple typos. A few times I’ve had to rework or discard entire posts because the error was central. I could minimize it somewhat more but mostly it’s an accepted price of doing business the way I do, the timing doesn’t usually allow for hiring a fact checker or true editor, and I try to fix things right away when it happens.
It is very rare for the source of that error to be ‘the AI told me something and I believed it, but the AI was lying.’ It’s almost always either I was confused about or misread something, or a human source got it wrong or was lying, or there was more to a question than I’d realized from reading what others said.
(Don’t) Feel the AGI
This reaction below is seriously like having met an especially irresponsible thirteen-year-old once, and now thinking that no human could ever hold down a job.
And yet, here we often still are.
It is now two weeks later, and the Overton window has indeed shifted a bit. There’s a lot more ‘beat China’ all of a sudden, for obvious reasons. But compared to what’s actually happening, the DC folks still absolutely think this is all hype.
Huh, Upgrades
The Claude API now allows citations to be enabled, causing Claude to process whatever documents you share with it and cite them in its response. Cute, I guess. Curious lack of shipping over at Anthropic recently.
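For those curious what that looks like in practice, here is a rough sketch using the Python SDK, going off the announcement. Treat the exact field names as my best guess rather than gospel, and the document contents as made up.

```python
# A rough sketch of enabling citations via the Anthropic Python SDK, based
# on the announcement; exact field names may differ from the current docs.
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

response = client.messages.create(
    model="claude-3-5-sonnet-latest",
    max_tokens=1024,
    messages=[{
        "role": "user",
        "content": [
            {
                "type": "document",
                "source": {
                    "type": "text",
                    "media_type": "text/plain",
                    # Hypothetical document text, purely for illustration.
                    "data": "The data center will draw over 2 gigawatts of power.",
                },
                "title": "Example report",       # hypothetical title
                "citations": {"enabled": True},  # the new toggle
            },
            {"type": "text", "text": "How much power will the data center draw?"},
        ],
    }],
)

# Text blocks in the response should now carry a `citations` list pointing
# back to spans of the supplied document.
for block in response.content:
    if block.type == "text":
        print(block.text, getattr(block, "citations", None))
```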
o3-mini is coming. It’s a good model, sir.
It’s hard to see, but note the cost column. Claude Sonnet 3.5 cost $12.90, o1-mini cost $28.47, o1-preview cost $117.89, and o3-mini would have cost $80.21 if it costs the same per token as o1-mini (actual pricing not yet set). So it’s using a lot more tokens, roughly 2.8 times as many as o1-mini.
OpenAI’s canvas now works with o1 and can render HTML and React.
Gemini 2.0 Flash Thinking got an upgrade last week. The 1M token context window opens up interesting possibilities if the rest is good enough, and it’s wicked cheap compared even to r1, and it too has CoT visible. Andrew Curran says it’s amazing, but that opinion reached me via DeepMind amplifying him.
That is a dramatic drop in price, and dramatic gain in context length. r1 is open, which has its advantages, but we are definitely not giving Flash Thinking its fair trial.
The thing about cost is, yes this is an 80%+ discount, but off of a very tiny number. Unless you are scaling this thing up quite a lot, or you are repeatedly using the entire 1M context window (7.5 cents a pop!) and mostly even then, who cares? Cost is essentially zero versus cost of your time.
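To spell out the arithmetic behind that ‘who cares’: the per-token price here is my assumption, backed out from the ‘7.5 cents a pop’ figure, and actual pricing tiers may differ.

```python
# Back-of-the-envelope cost math, assuming roughly $0.075 per million input
# tokens (which is where "7.5 cents a pop" for a full 1M-token context
# comes from); actual Gemini pricing tiers may differ.
price_per_token = 0.075 / 1_000_000  # USD, assumed

print(f"Full 1M-token context call: ${1_000_000 * price_per_token:.3f}")       # ~$0.075
print(f"Typical 20k-token query:    ${20_000 * price_per_token:.4f}")          # ~$0.0015
print(f"1,000 such queries a month: ${1_000 * 20_000 * price_per_token:.2f}")  # ~$1.50
```

A thousand decent-sized queries a month for a dollar fifty. Your time is not that cheap.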
Google Deep Research rolling out on Android, for those who hate websites.
Google continues to lean into gloating about its dominance in LMSys Arena. It’s cool and all but at this point it’s not a great look, regardless of how good their models are.
They Took Our Jobs
Need practical advice? Tyler Cowen gives highly Tyler Cowen-shaped practical advice for how to deal with the age of AI on a personal level. If you believe broadly in Cowen’s vision of what the future looks like, then these implications seem reasonable. If you think that things will go a lot farther and faster than he does, they’re still interesting, but you’d reach different core conclusions.
Judah offers a thread of different programmer reactions to LLMs.
Are there not enough GPUs to take all the jobs? David Holz says since we only make 5 million GPUs per year and we have 8 billion humans, it’ll be a while even if each GPU can run a virtual human. There are plenty of obvious ways to squeeze out more: there’s no reason each worker needs its own GPU indefinitely as capabilities and efficiency increase and the GPUs get better, and also, as Holz notes, production will accelerate. In a world where we have demand for this level of compute, this might buy us a few years, but they’ll get there.
Epoch paper from Matthew Barnett warns that AGI could drive wages below subsistence level. That’s a more precise framing than ‘mass unemployment,’ as the question isn’t whether there is employment for humans, the question is at what wage level – although at some point humans really are annoying enough to employ that their labor is worth nothing.
I consider it rather obvious that if AGI can fully substitute for actually all human labor, then wages will drop a lot, likely below subsistence, once we scale up the number of AGIs, even if we otherwise have ‘economic normal.’
That doesn’t answer the objection that human labor might retain some places where AGI can’t properly substitute, either because jobs are protected, or humans are inherently preferred for those jobs, or AGI can’t do some jobs well perhaps due to physical constraints.
If that’s true, then to the extent it remains true some jobs persist, although you have to worry about too many people chasing too few jobs crashing the wage on those remaining jobs. And to the extent we do retain jobs this way, they are driven by human consumption and status needs, which means that those jobs will not cause us to ‘export’ to AIs by default except insofar as they resell the results back to us.
The main body of this paper gets weird. It argues that ongoing technological advancement can protect human wages, and I get why the equations say that, but it does not actually make sense if you think it through.
Then it talks about technological advancement stopping as we hit physical limits, while still considering that world being in ‘economic normal’ and also involves physical humans in their current form. That’s pretty weird as a baseline scenario, or something to be paying close attention to now. It also isn’t a situation where it’s weird to think about ‘jobs’ or ‘wages’ for ‘humans’ as a concern in this way.
I do appreciate the emphasis that the comparative advantage and lump of labor fallacy arguments prove a wage greater than zero is likely, but not that it is meaningfully different from zero before various costs, or is above subsistence.
Richard Ngo has a thread criticizing the post, that includes (among other things) stronger versions of these objections. A lot of this seems based on his expectation that humans retain political power in such futures, and essentially use that status to collect rents in the form of artificially high wages. The extent and ways in which this differs from a UBI or a government jobs program is an interesting question.
Tyler Cowen offers the take that future unemployment will be (mostly) voluntary unemployment, in the sense that there will be highly unpleasant jobs people don’t want to do (think: an electrician living on a remote site doing 12-hour shifts) that pay well. And yeah, if you’re willing and able to do things people hate doing and give up your life otherwise to do it, that will help you stay gainfully employed at a good price level for longer, as it always has. I mean, yeah. And even ‘normal’ electricians make good money, because no one wants to do it. But it’s so odd to talk about future employment opportunities without reference to AI – unemployment might stay ‘voluntary,’ but the wage you’re passing up might well get a lot worse quickly.
Also, it seems like electrician is a very good business to be in right now?
Get Involved
IFP is hiring a lead for their lobbying on America’s AI leadership, applications close February 21, there’s a bounty so tell them I sent you. I agree with IFP and think they’re great on almost every issue aside from AI. We’ve had our differences on AI policy though, so I talked to them about it. I was satisfied that they plan on doing net positive things, but if you’re considering the job you should of course verify this for yourself.
New review of open problems in mechanistic interpretability.
Introducing
ByteDance Doubao-1.5-pro, which matches GPT-4o benchmarks at $0.11/$0.275 (presumably per million input/output tokens). As with Kimi k1.5 last week, maybe it’s good, but I await evidence of this beyond benchmarks. So far, I haven’t heard anything more.
Alibaba introduces Qwen 2.5-1M, with the 1M standing for a one million token context length that they say now processes faster, technical report here. Again, if it’s worth a damn, I expect people to tell me, and if you’re seeing this it means no one did that.
Feedly, the RSS reader I use, tells me it is now offering AI actions. I haven’t tried them because I couldn’t think of any reason I would want to, and Teortaxes is skeptical.
Scott Alexander is looking for a major news outlet to print an editorial from an ex-OpenAI employee who has been featured in NYT, you can email him at scott@slatestarcodex.com if you’re interested or know someone who is.
Reid Hoffman launches Manas AI, a ‘full stack AI company setting out to shift drug discovery from a decade-long process to one that takes a few years.’ Reid’s aggressive unjustified dismissals of the downside risks of AI are highly unfortunate, but Reid’s optimism about AI is for the right reasons and it’s great to see him putting that into practice in the right ways. Go team humanity.
ChatGPT Gov, a version that the US Government can deploy.
In Other AI News
Claims that are easy to make but worth noting.
Free tier of chat will get some o3-mini as a treat, plus tier will get a lot. And o3 pro will still only be $200/month. That must mean even o3-pro is very far from o3-maximum-strength, since that costs more than $200 in compute for individual queries.
Spencer Greenberg and Neel Nanda join in the theory that offering public evals that can be hill climbed is plausibly a net negative for safety, and certainly worse than private evals. There is a public information advantage to potentially offset this, but yes the sign of the impact of fully public evals is not obvious.
Meta is planning a 2GW+ data center at a cost of over $60 billion.
From the comments:
(Also I just rewatched the first two Back to the Future movies, they hold up, 5/5 stars.)
Zuckerberg announced this on Facebook, saying it ‘is so large it would cover a significant part of Manhattan.’
Manhattan? It’s a big Project? Get it? Sigh.
Meta was up 2.25% on a mostly down day when this was announced, as opposed to before when announcing big compute investments would cause Meta stock to drop. At minimum, the market didn’t hate it. I hesitate to conclude they loved it, because any given tech stock will often move up or down a few percent for dumb idiosyncratic reasons – so we can’t be sure this was them actively liking it.
Then Meta was up again during the Nvidia bloodbath, so presumably they weren’t thinking ‘oh no look at all that money Meta is wasting on data centers’?
Hype
This section looks weird now, because what a week, and OpenAI has lost all the hype momentum. That will change. Also, remember last week?
In any case, Chubby points out to Sam Altman that if you live by the vague-post hype, you die by the vague-post hype, and perhaps that isn’t the right approach to the singularity?
This is coming from someone to whom great hype is a symbol not of existential risk, as I partly see it, but purely of hope. And they are saying that no, ‘the community’ or Twitter isn’t creating hype, OpenAI and its employees are creating hype, so perhaps act responsibly going forward with your communications on expectations.
I don’t have a problem with the particular post by Altman that’s being quoted here, but I do think it could have been worded better, and that the need for it reflects the problem being indicated.
We Had a Deal
OpenAI’s Nat McAleese clarifies some of what happened with o3, Epoch and the Frontier Math benchmark.
This seems definitive for o3, as they didn’t check the results until sufficiently late in the process. For o4, it is possible they will act differently.
I’ve been informed that this still left a rather extreme bad taste in the mouths of mathematicians. If there’s one thing math people can’t stand, it’s cheating on tests. As far as many of them are concerned, OpenAI cheated.
Quiet Speculations
Rohit Krishnan asks what a world with AGI would look like, insisting on grounding the discussion with a bunch of numerical calculations on how much compute is available. He gets 40 million realistic AGI agents working night and day, which would be a big deal but obviously wouldn’t cause full unemployment on its own if the AGI could only mimic humans rather than being actively superior in kind. The discussion here assumes away true ASI as for some reason infeasible.
The obvious problem with the calculation is that algorithmic and hardware improvements are likely to continue to be rapid. Right now we’re on the order of 10x efficiency gain per year. Suppose in the year 2030 we have 40 million AGI agents at human level. If we don’t keep scaling them up to make them smarter (which also changes the ballgame) then why wouldn’t we make them more efficient, such that 2031 brings us 400 million AGI agents?
Even if it’s only a doubling to 80 million, or even less than that, this interregnum period where the number of AGI agents is limited by compute enough to keep the humans in the game isn’t going to last more than a few years, unless we are actually hitting some sort of efficiency frontier where we can’t improve further. Does that seem likely?
We’re sitting on an exponential in scenarios like this. If your reason AGI won’t have much impact is ‘it will be too expensive’ then that can buy you time. But don’t count on it buying you very much.
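As a toy illustration of how fast the ‘too expensive’ period evaporates: the 40 million starting point is Rohit’s number, the growth rates are purely illustrative assumptions, and the point is that any sustained multiplier eats the compute constraint within a few years.

```python
# Toy projection of the interregnum argument above. The 40M starting count
# is Rohit's estimate; the annual efficiency multipliers are assumptions.
start_year, start_agents = 2030, 40_000_000

for label, annual_gain in [("10x per year (recent pace)", 10), ("2x per year (pessimistic)", 2)]:
    print(label)
    for year in range(start_year, start_year + 5):
        agents = start_agents * annual_gain ** (year - start_year)
        print(f"  {year}: {agents:,.0f} human-level agents")
```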
Dario Amodei in Davos says ‘human lifespans could double in 5 years’ by doing 100 years of scientific progress in biology. It seems odd to expect that doing 100 years of scientific progress would double the human lifespan? The graphs don’t seem to point in that direction. I am of course hopeful, perhaps we can target the root causes or effects of aging and start making real progress, but I notice that if ‘all’ we can do is accelerate research by a factor of 20 this result seems aggressive, and also we don’t get to do that 20x speedup starting now, even the AI part of that won’t be ready for a few years and then we actually have to implement it. Settle down, everyone.
Of course, if we do a straight shot to ASI then all things are possible, but that’s a different mechanism than the one Dario is talking about here.
Chris Barber asks Gwern and various AI researchers: Will scaling reasoning models like o1, o3 and R1 unlock superhuman reasoning? Answers vary, but they agree there will be partial generalization, and mostly agree that exactly how much we get and how far it goes is an empirical question that we don’t yet know the answer to, and there’s only one way to find out. My sense from everything I see here is that the core answer is yes, probably, if you push on it hard enough in a way we should expect to happen in the medium term. Chris Barber also asked for takeaways from r1, and got a wide variety of answers, although nothing we didn’t cover elsewhere.
Reporting on Davos, Martin Wolf says ‘We will have to learn to live with machines that can think,’ with content that is, essentially stuff anyone reading this already knows, and then:
I am sad to report Cato’s comment was, if anything, above average. The level of discourse around AI, even at a relatively walled garden like the Financial Times, is supremely low – yes Twitter is full of Bad DeepSeek Takes and the SB 1047 debate was a shitshow, but not like that. So we should remember that.
And when we talk about public opinion, remember that yes Americans really don’t like AI, and yes their reasons are correlated to good reasons not to like AI, but they’re also completely full of a very wide variety of Obvious Nonsense.
In deeply silly economics news: A paper claims that if transformative AI is coming, people would then reason they will consume more in the future, so instead they should consume more now, which would raise real interest rates. Or maybe people would save, and interest rates would fall. Who can know.
I mean, okay, I agree that interest rates don’t tell us basically anything about the likelihood of AGI? For multiple reasons:
The Quest for Sane Regulations
I know this seems like ages ago but it was this week and had probably nothing to do with DeepSeek: Trump signed a new Executive Order on AI (text here), and also another on Science and Technology.
The new AI EO, signed before all this DeepSeek drama, says “It is the policy of the United States to sustain and enhance America’s global AI dominance in order to promote human flourishing, economic competitiveness, and national security,” and that we shall review all our rules and actions, especially those taken in line with Biden’s now-revoked AI EO, to root out any that interfere with that goal. And then submit an action plan.
That’s my summary; here are two alternative summaries:
On the one hand, the emphasis on dominance, competitiveness and national security could be seen as a ‘full speed ahead, no ability to consider safety’ policy. But that is not, as it turns out, the way to preserve national security.
And then there’s that other provision, which is a Shibboleth: Human flourishing.
That is the term of art that means ensuring that the future still has value for us, that at the end of the day it was all worth it. Which requires not dying, and probably requires humans retaining control, and definitely requires things like safety and alignment. And it is a universal term, for all of us. It’s a positive sign, in the sea of other negative signs.
Will they actually act like they care about human flourishing enough to prioritize it, or that they understand what it would take to do that? We will find out. There were already many reasons to be skeptical, and this week has not improved the outlook.
The Week in Audio
Dario Amodei talks to the Economist’s editor-in-chief Zanny Beddoes.
I go on the Odd Lots podcast to talk about DeepSeek.
Don’t Tread on Me
This is relevant to DeepSeek of course, but it happened first and applies broadly.
You can say either of:
You can also say both, but you can’t actually get both.
If your vision is ‘everyone has a superintelligence on their laptop and is free to do what they want and it’s going to be great for everyone with no adjustments to how society or government works because moar freedom?’
Reality is about to have some news for you. You’re not going to like it.
That vision is like saying ‘I want everyone to have the right to make as much noise as they want, and also to have peace and quiet when they want it, it’s a free country!’
Yes, Sam is obviously right here, although of course he is downplaying the situation.
One can also note that the vision that those like Marc, Eric and Rob have for society is not even compatible with the non-AI technologies that exist today, or that existed 20 years ago. Our society has absolutely changed our structure and social contract to reflect developments in technology, including in ways that infringe on liberties and privacy.
This goes beyond the insane ‘no regulations on AI whatsoever’ demand. This is, for everything, not only for AI, at best extreme libertarianism and damn close to outright anarchism, as in ‘do what thou wilt shall be the whole of the law.’
Reactions like this are why Sam Altman feels forced to downplay the situation. They are also preventing us from having any kind of realistic public discussion of how we are actually going to handle the future, even if on a technical level AI goes well.
Which in turn means that when we have to choose solutions, we will be far more likely to choose in haste, and to choose in anger and in a crisis, and to choose far more restrictive solutions than were actually necessary. Or, of course, it could also get us all killed, or lead to a loss of control to AIs, again even if the technical side goes unexpectedly super well.
However one should note that when Sam Altman says things like this, we should listen, and shall we say should not be comforted by the implications on either the level of the claim or the more important level that Altman said the claim out loud:
In the context of Napoleon this is obviously very not true – revolutions are often made and are often stopped. It seems crazy to think otherwise.
Presumably what both of these men meant was more along the lines of ‘there exist some revolutions that are the product of forces beyond our control, which are inevitable and we can only hope to steer’ which also brings little comfort, especially with the framing of ‘victories.’
If you let or encourage DeepSeek or others to ‘put an open AGI on everyone’s phone’ then even if that goes spectacularly well and we all love the outcome and it doesn’t change the physical substrate of life – which I don’t think is the baseline outcome from doing that, but also not impossible – then you are absolutely going to transform the social contract and our way of life, in ways both predictable and unpredictable.
Indeed, I don’t think Andreessen or Raymond or anyone else who wants to accelerate would have it any other way. They are not fans of the current social contract, and very much want to tear large parts (or all) of it up. It’s part mood affiliation, they don’t want ‘them’ deciding how that works, and it’s part they seem to want the contract to be very close to ‘do what thou wilt shall be the whole of the law.’ To the extent they make predictions about what would happen after that, I strongly disagree with them about the likely consequences of the new proposed (lack of a) contract.
If you don’t want AGI or ASI to rewrite the social contract in ways that aren’t up to you or anyone else? Then we’ll need to rewrite the contract ourselves, intentionally, either to steer the outcome or, for now, to not build or deploy the AGIs and ASIs.
Stop pretending you can Take a Third Option. There isn’t one.
Rhetorical Innovation
Assuming I’m parsing this correctly, that’s a very Donald Trump way of saying things could go horribly wrong and we should make it our mission to ensure that they don’t.
Which is excellent news. Currently Trump is effectively in the thrall of those who think our only priority in this should be to push ahead as quickly as possible to ‘beat China,’ and that there are no meaningful actions other than that we can or should take to ensure things don’t go horribly wrong. We have to hope that this changes, and of course work to bring that change about.
DeepMind CEO Demis Hassabis, Anthropic CEO Dario Amodei and Yoshua Bengio used Davos to reiterate various warnings about AI. I was confused to see Dario seeming to focus on ‘1984 scenarios’ here, and have generally been worried about his and Anthropic’s messaging going off the rails. The other side in the linked Financial Times report is given by, of course, Yann LeCun.
Yann LeCun all but accused them of lying to further their business interests, something one could say he knows a lot about, but also he makes this very good point:
That is a very good question.
The answer, presumably, is ‘because people like you are going to go ahead and build it anyway and definitely get us all killed, so you don’t leave me any choice.’
Otherwise, yeah, what the hell are you doing? And maybe we should try to fix this?
Of course, after the DeepSeek panic, Dario went on to write a very different essay that I plan to cover tomorrow, about (if you translate the language to be clearer, these are not his words) how we need strong export controls as part of an all-out race against China to seek decisive strategic advantage through recursive self-improvement.
It would be great if, before creating or at least deploying systems broadly more capable than humans, we could make ‘high-assurance safety cases,’ structured and auditable arguments that an AI system is very unlikely to result in existential risks given how it will be deployed. Ryan Greenblatt argues we are highly unlikely (<20%) to get this if timelines are short (roughly AGI within ~10 years), nor are any AI labs going to not deploy a system simply because they can’t put a low limit on the extent to which it may be existentially risky. I agree with the central point and conclusion here, although I think about many of the details differently.

Sarah Constantin wonders if people are a little over-obsessed with benchmarks. I don’t wonder, they definitely are a little over-obsessed, but they’re a useful tool, especially on first release. For some purposes you want to track the real-world use, but for others you do want to focus on the model’s capabilities – the real-world use is downstream of that and will come in time.
Andrej Karpathy points out that we focus so much on those benchmarks because it’s much easier to check and make progress on benchmarks than to do so on messy real world stuff directly.
Anton pushes back that no, Humanity’s Last Exam will obviously not be the last exam, we will saturate this and move on to other benchmarks, including ones where we do not yet have the answers. I suggested that it is ‘Humanity’s Last Exam’ in that the next one will have questions we are unable to answer, so it won’t be our exam anymore; see The Matrix, when Smith says ‘when we started thinking for you it really became our civilization.’
And you have to love this detail:
I very much endorse the spirit of this rant; honestly, this kind of thing really should be enough to disabuse anyone who thinks ‘oh, this making superintelligence thing definitely (or almost certainly) will go well for us, stop worrying about it’:
Look. We can and do argue endlessly back and forth about various technical questions and other things that make the problem here easier or harder to survive. And yes, you could of course respond to a rant like this any number of ways to explain why the metaphors here don’t apply, or whatever.
And no, this type of argument is not polite, or something you can say to a Very Serious Person at a Very Serious Meeting, and it ‘isn’t a valid argument’ in various senses, and so on.
And reasonable people can disagree a lot on how likely this is to all go wrong.
But seriously, how is this not sufficient for ‘yep, might well go wrong’?
Connor Leahy points out the obvious, which is that if you think not merely ‘might well go wrong’ but instead ‘if we do this soon it probably will go wrong,’ let alone his position (which is ‘it definitely will go wrong’), then DeepSeek is a wakeup call that only an international ban on further development towards AGI will suffice.
Whereas it seems our civilization is so crazy that when you want to write a ‘respectable’ report that points out that we are all on track to get ourselves killed, you have to do it like you’re in 1600s Japan, where everything has to be done via implication, and I’m the barbarian who is too stupid to know you can’t just come out and say things.
The AI situation has developed not necessarily to humanity’s advantage.
Scott Sumner on Objectivity in Taste, Ethics and AGI
Scott Sumner argues that there are objective standards for things like art, and that ethical knowledge is real (essentially full moral realism?), and smart people tend to be more ethical, so don’t worry superintelligence will be super ethical. Given everything he’s done, he’s certainly earned a response.
In a different week I would have liked to take more time to write him a better one; I am confident he understands how that dynamic works.
His core argument I believe is here:
The short answer is:
This was an attempt at a somewhat longer version, which I’ll leave here in case it is found to be useful:
I apologize for that not being better and clearer, and also for leaving so many other things out of it, but we do what we can. Triage is the watchword.
The whole taste thing hooks into this in strange ways, so I’ll say some words there.
I mostly agree that one can be objective about things like music and art and food and such, a point Scott argues for at length in the post: that there’s a capital-Q Quality scale to evaluate, even if most people can’t evaluate it in most cases, and that it would be meaningful to fight it out over which of us is right about Challengers and Anora, in addition to the reasons I will like them more than Scott does that are not about their Quality – you’re allowed to like and dislike things for orthogonal-to-Quality reasons.

Indeed, there are many things each of us, and all of us taken together, like and dislike for reasons orthogonal to their Quality, in this sense. And Quality in many cases only makes sense within the context of us being humans, or even within a given history and cultural context. Scott agrees, I think:
I believe that Ulysses is almost certainly a ‘great novel’ in the Quality sense. The evidence for that is overwhelming. I also have a strong preference to never read it, to read ‘worse’ things instead, and we both agree that this is okay. If people were somewhat dumber and less able to understand novels, such that we couldn’t read Ulysses and understand it, then it wouldn’t be a great novel. What about a novel that is as difficult relative to Ulysses, as Ulysses is to that random spy novel, three times over?
Hopefully at this point this provides enough tools from enough different directions to know the things I am gesturing towards in various ways. No, the ASIs will not discover some ‘objective’ utility function and then switch to that, and thereby treat us well and leave us with a universe that we judge to have value, purely because they are far smarter than us – I would think it would be obvious when you say it out loud like that (with or without also saying ‘orthogonality thesis’ or ‘instrumental convergence’ or considering competition among ASIs or any of that) but if not these should provide some additional angles for my thinking here.
The Mask Comes Off (1)
Can’t we all just get along and appreciate each other?
I don’t think it’s that anyone doubts a lot of people at OpenAI care about ‘the rollout of AI going well for the world,’ or that we think no one is working on that over there.
It’s that we see things such as:
That is very much not a complete list.
That doesn’t mean we don’t appreciate those working to make things turn out well. We do! Indeed, I am happy to help them in their efforts, and to put that statement into practice. But everyone involved has to face reality here.
The Mask Comes Off (2)
Also, oh look, that’s another AI safety researcher quitting OpenAI, saying the odds are against us and the situation is grim, and not seeing enough hope to keep trying from the inside.
I believe he used the terms, and they apply that much more now than they did then:
Yikes? Yikes.
International AI Safety Report
Yoshua Bengio announces the first-ever International AI Safety Report, backed by 30 countries and the OECD, UN and EU.
It is 298 pages so I very much will not be reading that, but I did look at the executive summary. It looks like a report very much written together with the OECD, UN and EU, in that it seems to use a lot of words to mostly not say the things that it is important to actually say out loud, instead making many quiet statements that do imply that we’re all going to die if you take them together and understand the whole thing, but that doesn’t seem like a common way people would interact with this document.
Then again, most people in the world don’t know even basic facts like ‘general purpose AI systems are rapidly getting better at doing things,’ so they have to spend a bunch of time documenting this, and basics like ‘if your model is more open then you have less control over what happens with it and what people use it for.’
One key point they do emphasize is that the future is up to us, with a wide range of possible outcomes. AGI is not something that ‘happens to us,’ it is something that is happening because we are making it happen, and in the ways we choose to make it happen. Yes, there are dynamics pushing us to do it, but it is our choice. And the ways in which we move forward will determine the ultimate outcome. Of everything.
One Step at a Time
A new DeepMind paper introduces MONA: Myopic Optimization with Non-myopic Approval. The idea is that if we do RL based on the ultimate outcome, then the AI can adopt multi-step strategies we do not understand and do not want, such as using information we want it to ignore, like a subject being in a protected class, or engaging in various other shenanigans. Instead, you can evaluate the AI’s action without looking at what happens, and score it based on whether you like what you see and predict it will go well.
This on its own successfully prevented multi-step reward hacking, and even noisy evaluations still can work well. However, the price is steep, because you are now targeting your evaluations rather than ground truth, and discarding anything you don’t properly understand. The smarter the AI in context, the higher the price of that, but also the more need you have to avoid the reward hacking and other shenanigans.
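To make the mechanism concrete, here is a minimal toy sketch of the contrast as I understand it. The trajectory format, names and approval heuristic are all hypothetical illustrations, not taken from the paper.

```python
# Toy contrast between outcome-based credit and MONA-style myopic approval.
# Everything here is an illustrative stand-in, not the paper's setup.

def outcome_based_credit(trajectory, final_reward):
    """Ordinary outcome-based RL: every step inherits the final outcome, so
    opaque multi-step strategies that happen to pay off get reinforced."""
    return [final_reward for _step in trajectory]

def mona_credit(trajectory, approve):
    """MONA-style credit: each step is scored myopically by an overseer's
    approval of that step in isolation (a prediction that it looks good and
    will go well), without ever consulting the realized downstream outcome."""
    return [approve(state, action) for state, action in trajectory]

def approve(state, action):
    """Toy overseer: only credit steps whose reasoning is legible and that
    the overseer predicts are helpful; opaque steps get nothing, even if
    they would have boosted the final reward."""
    return 1.0 if action["legible"] and action["predicted_helpful"] else 0.0

# Example: a two-step trajectory where step two is an opaque shortcut that
# would have paid off under outcome-based credit.
traj = [
    ("draft a plan",         {"legible": True,  "predicted_helpful": True}),
    ("use forbidden signal", {"legible": False, "predicted_helpful": False}),
]
print(outcome_based_credit(traj, final_reward=1.0))  # [1.0, 1.0]
print(mona_credit(traj, approve))                    # [1.0, 0.0]
```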
Here’s an explainer thread from one of the authors, David Lindner.
It is a cool result, but I worry the price is going to usually be too damn high. And to the extent you use a smart evaluator to keep up, you’re giving up the big advantage of the process, which is the ability to understand the strategies being used. Also you have to worry about any hidden information that might be ‘smuggled in’ that could indicate the final result – for example if the AI can use prediction markets as an input, then anything the prediction market knows, the AI also knows, at least somewhat – if you actually want to hide certain correlations from the AI you have to be ruthless about it, they have truesight.
The paper also includes a cool demonstration. In one example, the AI is deciding what loans to approve. It has information it is supposed to use, and then there is nationality information. The nationality information perfectly predicts outcomes, but it isn’t supposed to use it, and they watch the CoT (Chain of Thought) to prevent this. But the AI responds by learning to encode the nationality information steganographically and uses it anyway. This may seem obvious to you, but it’s the kind of thing people will often say definitively won’t happen until (and sometimes after) someone shows it happening.
Here’s an objection that I think exactly misses the point:
The whole point of MONA is to not look at the reward at all, even indirectly, because you want to prevent the system from doing things you don’t understand or want it to ignore that impact that reward. If you let it look but impose a KL loss, that is a big enough hole to drive a truck through, and it absolutely will find a way to incorporate information or techniques that you didn’t want it to use.
John does try downthread to solve this problem in a different way, but I don’t think you can ‘cheat’ on this one. You can’t let the model see a little of the actual reward, as a treat, and expect it to use that information in ways you like but not to use it in undesired or opaque ways, even now, and that problem gets steadily bigger as capabilities improve.
Aligning a Smarter Than Human Intelligence is Difficult
Spending more inference-time compute increases adversarial robustness of models like o1, without having to direct that inference time towards adversarial robustness. This makes sense, and it is good to have it quantified. As with all such results, my concern is that people will treat this as applying more broadly than it actually applies. If you use more inference time compute, you get ‘better answers,’ and one aspect of ‘better’ is not falling for user adversarial tricks. So if my understanding here is correct, the question to ask is roughly: which problems get solved by some form of ‘better answers’ and ‘smarter thinking,’ and which ones don’t?
John Wentworth takes a crack at explaining what the alignment problem is. This is one of those Socratic-style ‘if you think he didn’t need to write this post then you definitely need to read it or another post like it’ situations.
Jan Leike explains why you might want to use AI control and monitoring as a backup in case your AI is not aligned so you can sound the alarm and not die, but trying to use it to rely on unaligned models smarter than you is not a wise move.
John Wentworth goes further and lays out the case against AI control research. AI control research is about finding ways to see if ‘early transformational’ level AIs are scheming against us – which if it works would discourage them from doing so and also allow us to stop them if they try anyway.
John points out that in his model, this is not the main source of doom. The main source of doom is from building unaligned superintelligence, either because we don’t know how to align it, we botch the execution or whoever builds it does not care to align it. The job of the early transformational AI is to figure out how to align (and presumably also to build) the future superintelligence, on the first try.
The main worry is not that these early AIs outright scheme, it’s that they produce what is, in context, slop – they produce plans or arguments for plans that have subtle errors, they tell researchers what they want to hear, they are effectively being used for safetywashing and don’t disabuse those involved of that notion, and so on. Knowing that such AIs ‘aren’t scheming’ does not tell you that their solutions work.
The danger he doesn’t point out is that there will be a great temptation to try and scale the control regime to superintelligence, or at least past the place where it keeps working. Everyone in the LessWrong discussion at the link might get that this is a bad plan that won’t work on superintelligence, but there are plenty of people who really do think control will remain a good plan. And indeed control seems like one of the plans these AIs might convince us will work, that then don’t work.
I would then reiterate my view that ‘deception’ and ‘scheming,’ ‘intentionally’ or otherwise, do not belong to a distinct magisteria. They are ubiquitous in the actions of both humans and AIs, and lack sharp boundaries. This is illustrated by many of John’s examples, which are in some sense ‘schemes’ or ‘deceptions’ but mostly are ‘this solution was easier to do or find, but it does not do the thing you ultimately wanted.’ And also I expect, in practice, attempts at control to, if we rely on them, result in the AIs finding ways to route around what we try, including ‘unintentionally.’
Two Attractor States
This leads into Daniel Kokotajlo’s recent attempt to give an overview of sorts of one aspect of the alignment problem, given that we’ve all essentially given up on any path that doesn’t involve the AIs largely ‘doing our alignment homework’ despite all the reasons we very much should not be doing that.
I agree these are very clear attractor states.
The first is described well. If you can get the AIs sufficiently robustly aligned to the goal of themselves and other future AIs being aligned, you can get the virtue ethics virtuous cycle, where you see continuous improvement.
The second is also described well but as stated is too specific in key elements – the mode is more general than that. When we say ‘pretending’ to be aligned here, that doesn’t have to be ‘haha I am a secret schemer subverting the system pretending to be aligned.’ Instead, what happened was, you rewarded the AI when it gave you the impression it was aligned, so you selected for behaviors that appear aligned to you, also known as ‘pretending’ to be aligned, but the AI need not have intent to do this or even know that this is happening.
As an intuition pump, a student in school will learn the teacher’s password and return it upon request, and otherwise find the answers that give good grades. They could be ‘scheming’ and ‘pretending’ as they do this, with a deliberate plan of ‘this is bull**** but I’m going to play along’ or they could simply be learning the simplest policy that most effectively gets good grades without asking whether its answers are true or what you were ‘trying to teach it.’ Either way, if you then tell the student to go build a rocket that will land on the moon, they might follow your stated rules for doing that, but the rocket won’t land on the moon. You needed something more.
Thus there’s a third intermediate attractor state, where instead of trying to amplify alignment with each cycle via virtue ethics, you are trying to retain what alignment you have while scaling capabilities, essentially using deontology. Your current AI does what you specify, so you’re trying to use that to have it even more do what you specify, and to transfer that property and the identity of the specified things over to the successor.
The problem is that this is not a virtuous cycle, it is an attempt to prevent or mitigate a vicious cycle – you are moving out of distribution, as your rules bind its actions less and its attempt to satisfy the rules is less likely to satisfy what you wanted, and making a copy of a copy of a copy, and hoping things don’t break. So you end up, eventually, in effectively the second attractor state.
Daniel is then asked the correct follow-up question of what could still cause us to lose from the first attractor state. His answer is mostly concentration of power or a power grab, since those in the ASI project will be able to do this if they want to. Certainly that is a key risk at that point (it could go anywhere from spectacularly well to maximally badly).
But also a major risk at this point is that we ‘solve alignment’ but then get ourselves into a losing board state exactly by ‘devolving’ power in the wrong ways, thus unleashing various competitive dynamics that take away all our slack and force everyone to turn control over to their AIs, lest they be left behind, leading to the rapid disempowerment (and likely then rapid death) of humans despite the ASIs being ‘aligned,’ or various other dynamics that such a situation could involve, including various forms of misuse or ways in which the physical equilibria involved might be highly unfortunate.
You Play to Win the Game
An important thing to keep in mind when choosing your alignment plan:
When I look at many alignment plans, I have exactly these thoughts, either:
Six Thoughts on AI Safety
Boaz Barak of OpenAI, who led the Deliberative Alignment paper (I’m getting to it! Post is mostly written! I swear!) offers Six Thoughts on AI Safety.
[I note that #6 is conditional on there being other aligned superintelligences.]
It is a strange post. Ryan Greenblatt has the top comment, saying he agrees at least directionally with all six points but disagrees with the reasoning. And indeed, the reasoning here is very different from my own even where I agree with the conclusions.
Going one at a time, sorry if this isn’t clear or I’m confused or wrong, and I realize this isn’t good enough for e.g. an Alignment forum post or anything, but I’m in a hurry these days so consider these some intuition pumps:
AI Situational Awareness
If you train an AI to have a new behavior, it can (at least in several cases they tested here) describe that behavior to you in words, despite learning it purely from examples.
As in: It ‘likes risk’ or ‘writes vulnerable code’ and will tell you if asked.
Here’s another extension of the previous results, this one from Anthropic:
As in, if you include documents that include descriptions of Claude reward hacking in the training set, then Claude will do more reward hacking. If you include descriptions of Claude actively not reward hacking in the training set, then it does less reward hacking. And these both extend to behaviors like sycophancy.
This seems like excellent news. We can modify our data sets so they have the data that encourages what we want and not the data that encourages what we don’t want. That will need to be balanced with giving them world knowledge – you have to teach Claude about reward hacking somehow, without teaching it to actually do reward hacking – but it’s a start.
People Are Worried About AI Killing Everyone
Are there alignment plans that are private, that might work within the 2-3 years many at major labs say they expect it to take to reach AGI or even superintelligence (ASI)?
I don’t know. They’re private! The private plans anyone has told me about do not seem promising, but that isn’t that much evidence about other private plans I don’t know about – presumably if it’s a good plan and you don’t want to make it public, you’re not that likely to tell me about it.
This isn’t full-on doomsday machine territory with regard to ‘why didn’t you tell the world, eh?’ but have you considered telling the world, eh? If it’s real alignment projects then it helps everyone if you are as public about it as possible.
As for the public plans? I strongly agree with David Manheim here.
Davidad should keep doing what he is doing – anything that has hope of raising the probability of success on any timeline, to any degree, is a good idea if you can’t find a better plan. The people building towards the ASIs, if they truly think they’re on that timeline and they don’t have much better plans and progress than they’re showing? That seems a lot less great.
I flat out do not understand how people can look at the current situation, expect superintelligent entities to exist several years from now, and think there is a 90%+ chance that this would then end well for us humans and our values. It does not make any sense. It’s Obvious Nonsense on its face.
Other People Are Not As Worried About AI Killing Everyone
If he really thinks AGI is here and ASI is coming, then remind me why he thinks we have nothing to worry about? Why does our taste get to guide the future?
I do agree that in the meantime there’ll be some great companies, and it’s a great time to found one of them.
The Lighter Side
Oh no, it’s like Conservation of Ninjutsu.
I actually disagree with Peter here, I realize it doesn’t sound great but this is exactly how you have any chance of reaching the good timeline.
Never.
Sounds right.
RIP Superalignment team indeed, it’s funny because it’s true.
Leo Gao is still there. Wish him luck.