Christoph Heilig tests 18 different OpenAI models including GPT-5.4, finds all of them rate some forms of pseudo-literary nonsense generations higher than coherent prose in every case, and this is not getting better over time.
I couldn't find examples of the highly rated poems. Is this analogous to those image generators that created the dog-shaped monstrosities under the assumption that "more eyes = more dog"?
If AI makes everyone twice as productive, one possibility is that everyone works half as hard. Another possibility is everyone works twice as hard, on top of being twice as productive, because half of people are about to get fired, or people see that you’ve completed your other work faster and suddenly give you more tasks.
This is the usual backward-bending labor supply curve, only typically we see it from the perspective of "what happens if we start paying people more", but this time we see it from the opposite end.
Before the age of massive unemployment and starvation, there will be the age of massive voluntary overtime and slavery as people try to avoid falling into the former. Say goodbye to 40 hour workweeks, but not in the sense that Keynes predicted.
The standard cope for "what happens if 1 person is able to do the work of 100 using an AI?" is "try to be that 1 person, and you can get super rich". But it could be more like "what happens if 1 person is able to do the work of 100, but there are 3 out of the 100 people who are able to do that?", and suddenly you may find yourself competing not just for productivity but also for low cost and voluntary overtime.
The major technical advances this week were in agentic coding, as covered yesterday.
The major non-DoW political and alignment developments will be covered tomorrow.
The DoW vs. Anthropic trial continues. Judge Lin was very not happy with the government’s case, which makes sense since the government has no case and was arguing a variety of Obvious Nonsense. The question now is how much preliminary relief Anthropic is entitled to. Assuming we find that out this week, I plan to cover that on Monday.
Beyond that, we have new iterations of questions we’ve dealt with time and again. The debate on jobs gets another cycle. Anthropic asked over 80,000 people what they think about AI, and has published those findings, nothing shocking but interesting throughout.
OpenAI is raising money again, although the terms raise some eyebrows. Elon Musk is announcing a grand chip project, but it was already kind of announced and it’s not like we should believe him when he says such things.
I used this lull to drop a giant response to Open Socrates, which is technically a book review but uses that as a jumping-off point to outline a distinct philosophy and approach, or at least tries to do that, in ways that yes are highly relevant to life and also to AI. That doesn’t mean you should read it in its current form, it is very long, but yes there is (I like to think) a bunch of gold within it. Long term, the goal is to find a much better way to give out the gold without the fun but inessential other parts, and take the time to write a short letter.
Table of Contents
Language Models Offer Mundane Utility
Have Claude be your Dangerous Professional and explain why your insurance policy does cover that pipe after all.
Stop Indian trains from hitting elephants, perhaps?
Be an all-purpose Dangerous Professional.
How does use change as users skill up? Longer term Claude users iterate more often and more carefully, and hand over full autonomy less, but the changes are small.
There is a lot more in the full post, although nothing was too surprising.
Oklahoma Supreme Court says use all the AI you like, there is no need to disclose it, but you are still fully responsible for the contents. This is The Way.
Refine Your Paper
Grumpy (but usually correct) Economist John Cochrane is impressed by the tool refine for getting comments on academic articles, seeing it being on the high end of academic commenters on his work.
Shruti Rajagopalan calls it the most granular feedback she has ever received, and it was consistently correct and will be a key part of her future paper reviews. She affirms it was a lot more granular than GPT-5.4 Pro or other top options.
Arnold Kling points out that yes the output is impressive but that in his test Claude Opus 4.6 can do about as well out of the box on Cochrane’s work. If you’re doing serious such work it’s worth asking every source you can, and then comparing outputs.
The hype over Refine seems a little carried away.
I do expect paper replications and evaluations to get very strong over time. And yes, if you want to you can use Refine to judge the quality of all past papers, but there will be major errors in those evaluations even in past papers, where you are in a non-adversarial evaluation environment. For future papers, expect this to break down further.
Nor would I expect this to make good researchers less valuable until we reach a very high threshold of ability to replicate the entire process. A great researcher will be able to iterate quickly, find the right questions and get, select and synthesize the right results. When Tyler Cowen suggests that ‘look at this data set and tell me what is interesting’ will soon suffice to extract the value of a potential paper, then you are dangerously close to being AI complete.
Language Models Don’t Offer Mundane Utility
Christoph Heilig tests 18 different OpenAI models including GPT-5.4, finds all of them rate some forms of pseudo-literary nonsense generations higher than coherent prose in every case, and this is not getting better over time. You cannot use such a model as an evaluator without a way to correct for this, and one worries about its own generations.
You’re reasonably safe if you’re not in an adversarial situation, and the text being evaluated is being generated for human consumption. But if you’re in an anti-inductive situation, or even a straightforward adversarial one, you’re cooked.
Google keeps embedding the Gemini symbol into its services but no one can figure out how to actually use it and have it connect to the context it needs.
If I want a properly labeled YouTube transcript as a Google document, as I often do, you would think I would ask Gemini since these are all Google services. But no, I ask Claude Code to use a skill I built, and will keep doing so until someone at Google explains to anyone how their products can do such things. They announce things like ‘reimagining content creation with Gemini in Google Docs, Sheets, Slides and Drive’ and my feedback to Chandu Thota is that I do not know of anyone who has gotten this to usefully work.
Huh, Upgrades
ChatGPT makes it easier to reference files, including across conversations.
On Your Marks
It turns out there are various ways to solve ARC-AGI tasks.
Well, now that we’ve beaten ARC-1 and ARC-2, I suppose it’s time for ARC-3?
Cue the goalpost moving gif, as we repeat the process:
La de da. Fun times for the whole family. Unless you all die. Real shame.
That’s a motte and bailey.
Obviously, if the AI needed human intervention on each problem, that’s not AGI.
However, if the AI uses a general harness, such as a better system prompt or something like Claude Code, why is that not AGI? Must the GI lie within the weights in particular?
That’s kind of like saying if a human can do it, they should be able to do it without arms, legs, tools or pants. Or at least it’s like having the human do a test in an isolated room, in the dark, and then concluding the human can’t do the task in general.
“That’s not a fully general intelligence,” they said as… well, you know the rest.
The real benchmark is when they announce ARC-AGI-4.
Get My Agent On The Line
Rachel the AI agent phoned 3,000 pubs to price a pint, for a total cost of about €200, via explicitly pretending to be a potential customer, using ElevenLabs. Too much of this and you won’t be able to call the pub.
A common technology diffusion story is that it gets too big and useful too fast, and then no one can stop it even if they want to, whether or not they should want to.
I have been saying, for years, that the only effective bottleneck on AI deployments is to not create the relevant AI models in the first place. Once the models exist, ‘oh we will not use it for autonomous killer robots or mass domestic surveillance’ or ‘we will not use it for agents’ is a much higher lift and basically not going to happen in general. All arguments of the form ‘no one would be so stupid as to’ are wrong, as is ‘we would collectively ban that.’
The exception is if we can create a regulatory bottleneck where we can make the action effectively illegal, and draw a clear boundary around it, and it involves impacting the physical world, and also the governments do not want to do it. Then you have a shot, and you can prevent entire societies from, for example, curing diseases or building houses, and keep the price of things like practicing law or medicine or cutting hair artificially high indefinitely, up to a point.
But agents? Yeah, that’s a Butlerian Jihad level of intervention. Can’t stop it.
Deepfaketown and Botpocalypse Soon
OpenAI now monitors 99.9% of internal coding traffic for misalignment using GPT-5.4-Thinking. I don’t know that it needs to be the full 100% but something like this should be standard practice, along with some percentage of external traffic.
Attorney uses Claude to write a top-level law review article in 15 hours instead of 150 hours. If this had been 15 minutes I’d say definitely not cool, but where is the line? What do we make of this use case?
I would be inclined to say that if the attorney wrote most of the words and reviewed everything carefully, then in practice that seems fine. The 135 wasted hours are a bug, not a feature. Jessica Tillipman points out that this would have earned a human a co-author credit, but it’s not like the attorney can do that here.
Fun With Media Generation
Lyria 3 Pro lets songs extend to three minutes. Human songs are often longer than three minutes, but I’m not convinced the world would be worse if there were a hard three-minute limit.
OpenAI’s intent to allow ‘x-rated’ talk freaked out its own advisors when they were told in January that OpenAI was forging ahead. Hence the delay.
That is good. Advisors exist to freak out about things, both things like ‘we might kill everyone on the planet’ and also things like this. If your advisors don’t freak out about the little things, that’s a terrible sign. It’s your job to address the freaking out and explain why it’s going to be fine, and to pause your move until you can do that.
Okay, that’s a bit much, but it’s also how the media would play such incidents.
The age classification system has a false negative rate of 12%. We are not told the false positive rate, which also seems important. I’d also want to know how often the errors involve 16-21 year olds. Differentiating a 17 year old from an 18 year old via chat characteristics seems theoretically impossible.
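To see why the false positive rate matters as much as the quoted 12% false negative rate, consider the base rates: if minors are a small share of users, even a modest false positive rate misclassifies far more adults than the system misses minors. A minimal sketch; every number here except the 12% is an illustrative assumption, not a figure from the announcement:

```python
# Toy base-rate arithmetic for an age classifier. The 12% false negative
# rate is the reported figure; user count, minor share, and false positive
# rate are illustrative assumptions.

def classification_outcomes(users, minor_share, fnr, fpr):
    """Return (minors missed, adults wrongly flagged) for the given rates."""
    minors = users * minor_share
    adults = users - minors
    missed_minors = minors * fnr    # minors classified as adults
    flagged_adults = adults * fpr   # adults wrongly treated as minors
    return missed_minors, flagged_adults

missed, flagged = classification_outcomes(
    users=1_000_000, minor_share=0.05, fnr=0.12, fpr=0.05)
print(missed, flagged)  # 6000.0 minors missed, 47500.0 adults wrongly flagged
```

Under these assumed numbers the system wrongly flags roughly eight adults for every minor it misses, which is why reporting only one of the two error rates tells you little.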
RIP the Sora app. I set expectations at Google+ or Clubhouse. Calibration confirmed.
Greetings From The Torment Nexus
From the people who brought you Facebook and Instagram, it’s ChatGPT.
First we had Fidji Simo, architect of the news feed, as Head of Product.
Now we have Dave Dugan, former Meta Executive, leading their ad push.
A Young Lady’s Illustrated Primer
If you let LLMs edit your writing, either based on human feedback or otherwise, they will tend to transform it towards AI styles of writing. If you have the LLM do the writing, this is even more true. What else could it do? If anything, the evidence provided here by Natasha Jaques seems to show less distortion than one might expect.
Here, in one of her examples, the LLMs are mostly biased towards ‘neutrality’ or dodging a question that lacks a clear answer. That seems mostly fine, and was the general direction the paper mostly finds on various questions, a 70% increase in remaining neutral.
One could even explain this via selection effects. If you write an essay on any question, such as ‘does money buy happiness,’ then that makes it more likely you have a strong opinion. You’re less likely to talk about it if you have a neutral view. But the LLMs don’t get to choose their topics.
The real problem is that LLMs don’t know how to leave well enough alone:
You really do have to watch out for this. I don’t trust LLMs to edit my writing unless I am looking over every edit, exactly because they often change meanings, and my words very much have precise meanings.
Not weighing particular concerns enough is always a Skill Issue, fixable by prompting. The bigger issue would be if the concern could not be evaluated properly at all. But an LLM reviewer might indeed suffer in practice from such skill issues.
Canvas introduces an ‘AI teaching agent’ for ‘low-value tasks.’ This excludes AI grading, because that keeps ‘humans in the loop’ despite teachers I know thinking most grading (or at least grading of things that aren’t essays or other long writing) is a low-value task. Education folks are paranoid that AI will ruin education, because deep down they know that the whole enterprise makes no sense and if you remove various frictions it will collapse.
You Drive Me Crazy
You very much have to worry about framing with AIs. The example here is that Claude responds far differently to a supposed Senator Sanders than a supposed President Trump, and things like mode of interaction change answers as well. The problem exists across all LLMs.
They Took Our Jobs
If AI makes everyone twice as productive, one possibility is that everyone works half as hard. Another possibility is everyone works twice as hard, on top of being twice as productive, because half of people are about to get fired, or people see that you’ve completed your other work faster and suddenly give you more tasks.
AI has made me substantially more productive. I am not choosing to work less, even though there is no chance I would be fired if I worked less.
AI productivity gains are extremely high in ‘greenfield’ situations where you can start from scratch. When you are trying to update legacy systems and legacy code that lacks documentation and everything has to operate in real time and if any little thing changes someone throws a fit, it gets harder. Dreams of ‘oh we will build a new HR tool ourselves, from scratch’ do not work out so well for those without expertise.
For now. And yes, for now you can and often should outsource such work to people who can do better, and can get the proper benefits from the services. Tyler Cowen calls this a ‘slow takeoff’ of AI, much to my frustration, because such terms don’t keep their original meanings. But that’s okay.
The ‘slow’ here is that people got dreams they could suddenly do everything themselves, and tried to do that a bit early, when instead they can only do it massively cheaper and better than before by hiring those who know how. How disappointing. Or they could wait another 6-12 months.
In jobs we might not mind them taking given the alternative (see how that works?), Mark Zuckerberg is ‘building a CEO agent to help him do his job.’
Zuckerberg wants everyone at Meta to have their AI agent and has made use of AI a factor in their performance reviews.
When the next crazy thing happens at Meta, and you’re curious why, remember this and that they bought Manus and Moltbook:
Assuming we are not looking at full-on rapid capability advancement (aka recursive self-improvement), there are good reasons to think that, even if AI made it possible to profitably automate large portions of the economy, diffusion of that technology would take longer than you think.
There is also quite a lot of economics hopium running around. This is in response to Anthropic CEO Amodei doing four minutes on Fox News talking about what AI will be able to do.
What Dario actually says here is ‘I would not be surprised if within 1-5 years we start to see big effects here.’
Previously Dario has made predictions that seemed too aggressive at the time, and which indeed proved too aggressive although directionally correct. No, 90% of code wasn’t written by AI last year, but far more of it was than most predicted, and I’m guessing we get there in 2027. Here I think he’s flat out correct if you listen.
I’d go farther, in that if we don’t see ‘big effects’ on entry level jobs within five years, then I will be very surprised, and depending on how big is big it would surprise me within two or three. So 1-5 years seems like a good confidence interval for big effects on the entry-level job market.
This is very much on an exponential. Anthropic is growing roughly 10x every year, and if anything that has been accelerating, so ‘I can’t find it in the statistics yet’ makes you sound like people dismissing Covid-19 in February 2020, or at best in January.
I think that the economic impacts of AI have been rather surprising, actually. If nothing else, the CapEx spending is dramatic, as has been the revenue growth and valuations of the AI companies. Total investment, and return on investment, seem like very important economic claims with which to measure an ongoing exponential. Indeed, when those numbers were small, they were used as a central economic counterargument by many econ types.
Cowen’s Second Law is that all propositions about real interest rates are wrong, but I do believe that you cannot explain the combination of RGDP resilience and productivity growth with the poor experienced state of employment, if you don’t factor in AI. The people on the street are feeling the impact, already, now.
That’s because hiring, like the stock market, is forward looking. Most white collar hiring now, especially for entry level work, only pays off for both worker and employer years down the line, after a training and adoption period and paying off fixed costs. No one (with notably rare exceptions, famously including Donald Trump) likes firing people, and it gets expensive. Even if I could have a job for you now, if I expect AI to take your job a few years later, often that will mean I don’t want to hire you today.
I also don’t buy the ‘prices adjust’ argument. Prices largely don’t adjust, or they adjust slowly whereas AI will happen fast. Wages are sticky downwards and must maintain relative status relationships, and must be enough to make workers willing to work. There are not only legal minimum wages, there are other forms of de facto minimums. On top of that, wages and workers are competing across the economy and across industries. If, for example, half of all entry-level finance positions went away, but employment overall did not collapse, my expectation is that wages on those jobs would decline very little.
I especially reject the ‘everything is an O-ring’ argument, that so long as some portion of the loop requires a human, employment and wages do not fall. That doesn’t mean every augmentation or partial automation is bad for employment; often they are good for it. But you do not need full automation to make things bad for employment.
As a clean toy example, see the movie No Other Choice. The fact that there is still one job at the factory does not mean that there are not severe employment effects, and also note that (implicitly) wages for that job did not adjust so far downward, either.
That’s not to dismiss diffusion bottlenecks. Yes, for a given level of tech capability, things like employment effects will take longer than those at the labs would predict on their own, often a lot longer. But life is going to come at you fast, even if we do not face a singularity or total transformation, including that the AIs themselves will become quite good at assisting with diffusion and overcoming bottlenecks.
PoliMath gets paranoid and goes in a different direction and says that the rhetoric by Dario is irresponsible because the rhetoric is designed to cause companies to destroy people’s software engineering jobs, and not only thinks this is their plot to drive business but puts the responsibility on Anthropic (et al, presumably) to figure out how their tools can create jobs instead. I can absolutely assure him and everyone else that this is not how any of this works, on any level. Anthropic hurts itself with these warnings, no CEOs don’t listen to Anthropic pitches and prospectively fire half their engineers before implementing the replacement, and Anthropic shares these warnings because Dario thinks it is the socially responsible thing to do.
Here’s a Senator who is very much buying the hype on AI unemployment.
I, too, would be happy to book that bet from Senator Warner, but alas I was not in the room. This is in part because it would be very easy to profitably hedge that bet.
They Are Hiring
OpenAI to double its workforce this year to compete against Anthropic and push into business, from 4,500 employees to roughly 8,000.
The data here is from Ramp, which OpenAI disputes is a reasonable measure.
There is also this graph of ‘popularity among businesses’ overall:
That is presumably ‘uses at all’ rather than share of revenue.
It will be interesting to extend these graphs into March, and see what impact the DoW vs. Anthropic situation had on such choices. I could see it moving either way.
Levels of Friction
This was said well enough it went viral, so good job.
As always, Eternalist is correct. If an action imposes a cost, it requires friction, and the best friction is to require payment, even if it is refundable or very small.
In Other AI News
Composer 2 appears to be Kimi 2.5 with RL, being offered in an open license, without getting Moonshot’s permission as per Xinya Zhou. Intellectual property, what’s that?
Santi Ruiz joins Anthropic’s editorial team.
David tops the HuggingFace ‘Open LLM Leaderboard’ by taking Qwen2-72B, duplicating a block of seven middle layers and stitching them back in, which he models as ‘give it more time to think.’ This sounds like Frankenstein levels of mad science nonsense at first glance, but I see no reason he would be lying.
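The splicing operation itself is simple. A toy sketch of the layer-duplication trick: copy a contiguous run of middle layers and insert the copies back into the stack. The real surgery is done on Qwen2-72B’s transformer blocks and weights; here plain strings stand in for layers so the splice logic is clear, and the function name and indices are illustrative, not from the actual release.

```python
# Toy sketch of mid-stack layer duplication ("depth up-scaling").
# Strings stand in for transformer blocks; real surgery would copy
# weight-tied blocks of a loaded model. Names and indices are illustrative.

def splice_layers(layers, start, count):
    """Return a new layer stack with layers[start:start+count] repeated in place."""
    block = layers[start:start + count]
    # Prefix, the block, the block again, then the remaining layers.
    return layers[:start] + block + block + layers[start + count:]

# An 8-layer stand-in model; duplicate a run of three middle layers (3, 4, 5).
layers = [f"layer_{i}" for i in range(8)]
expanded = splice_layers(layers, start=3, count=3)
print(expanded)
```

The appeal of the ‘more time to think’ framing is that each token now passes through the duplicated block twice, getting extra sequential computation without any retraining, though whether that helps or hurts depends heavily on which layers you duplicate.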
New compression algorithm just dropped.
Show Me the Money
OpenAI made a $50 billion deal with Amazon, and Microsoft is contemplating legal action on grounds that this breaks its exclusive cloud partnership with OpenAI. OpenAI and Amazon claim they have found their way around the Microsoft contract.
I notice that I am confused by Microsoft’s position here, given they own on the order of 27% of OpenAI. It seems like an unwise business move to keep OpenAI shackled. It couldn’t be happening to a nicer pair of giant corporations.
OpenAI is offering private equity firms a guaranteed return of 17.5% along with early access to new models.
To state the obvious, if you are offering 17.5% returns then it’s a trap, and at minimum it is very much not guaranteed. Saying ‘I will pay you 17.5%’ is, as those who parked their money in various crypto platforms know well, a way of saying ‘there is a good chance I am going to default on this.’
What you want, if your conscience permits and you’re not too worried about ‘what money will be worth in a post-AGI world’, is equity. If OpenAI succeeds, it is probably going to give you a lot more than 17.5% returns, even from today’s valuations. If OpenAI fails, money gone. There presumably won’t be zero recovery, but it won’t be pretty. So capture the upside, or seek your returns elsewhere.
Elon Musk announces a $20 billion project called TERAFAB.
Here’s one sober and reserved evaluation of his prospects here:
A relevant thing about Elon Musk is that, while he has a lot of technical expertise and can accomplish a lot of seemingly impossible tasks, he also just says things.
For example, here’s another thing he just said this week, in a trick he’s pulled several times without delivering, where the prediction market is at 12% but that seems rather high to me:
Just saying things, and announcing with confidence he will do things he probably cannot do, is central to his strategy of then yelling at people to sleep on floors until they manage to do it, which occasionally works to at least some extent. Elon Musk may plausibly start such a project, but the chances he achieves the goals he is stating are very low.
Announce periodically you are going to the moon and stars, and if one time you end up with SpaceX, it’s still a win. It’s worked for him quite well, so far.
Mostly what we can say is that Musk intends to make a serious effort to do domestic chip manufacturing, which will rapidly converge towards something a lot more realistic than his absurd ‘oh I will simply do everything myself in Austin’ claims. It won’t be anything like the size he is announcing, but he doesn’t care. It’s not like he’s making material statements about large corporations, and he is immune to the SEC the way that certain others are immune to time-travel-enabled assassination attempts.
Thinking big, Elon Musk believes, is good, and realism is optional.
Yeah, okay, maybe slow your roll on the Kardashev scale.
Jeff Bezos in talks to raise $100 billion for AI manufacturing fund, as in a private equity style play where you buy manufacturing companies and then apply AI.
Kimi raises $1 billion at $18 billion valuation, up 4x in three months. On the one hand that seems remarkably low given the quality of their models, on the other hand it is not clear how they monetize or that they can aspire to compete with the top tier. This is considered big, but remember that Anthropic’s last raise was $30 billion at $380 billion, OpenAI’s was $100 billion, and Google is Google.
Meanwhile, things that are not that much smaller than Kimi, in relative terms:
One underrated problem with open models is that the business model is terrible. You are spending a lot of fixed costs to create a product, and then giving that product away. How are you going to make money? It’s not a surprise that top open model people keep getting poached, here with Microsoft poaching the AI2 leadership team. Alexander Doria’s proposed solution here is to use regulation to actively give open models a structural advantage, which has been their longstanding policy goal.
Quickly, There’s No Time
The person who first used the term AGI, Mark Gubrud, declares we have AGI.
How smart are AIs right now? Ryan Greenblatt sees them as not that ‘smart’ yet, but compensating with vast knowledge and very strong mostly-narrow heuristics. That is a lot of how humans seem smart when they seem smart, but yes there is a G-component and they do seem to lag the smarter humans there for now. Ryan predicts, I think correctly, that if they can match our raw intelligence they will quickly be de facto superintelligent and off to the races, given their many other advantages.
Jeffrey Ladish points out that to write a program one must first understand the universe. That’s not how he puts it, but the point is that you need to understand what you are building and why you are building it, which requires strong general intelligence. A fully narrow AI coder would not get so far.
The Week in Audio
We have the audio of Neil deGrasse Tyson calling for an international treaty to stop superintelligence.
The full Isaac Asimov memorial debate (1 hour 40 min) is here.
Jeffrey Ladish and others speak about AI risks with ABC Nightline, including that AIs can disobey instructions.
David Shor and Byrne Hobart are perfect guests to go on Odd Lots and discuss the politics of AI. Did you know people are not going to like that?
One fun note from that episode is that people really hate data centers in their area, but they can be bribed pretty easily if it comes with benefits like lower tax bills. People don’t understand that things both cost money and create money, and it’s a problem.
Dean Ball talks to James Pethokoukis.
Jensen Huang went on the All-In Podcast. I am not paid enough to listen, but it is presumably relevant to some of your interests.
Huang is correct here, of course, and if anything that threshold seems low.
Dylan Patel on Dwarkesh Patel on bottlenecks to scaling AI compute. A plausible candidate for a full post treatment.
OpenAI podcast discusses the OpenAI Model Spec.
80,000 Interviews About AI
Anthropic had Claude offer to interview its users about what they want out of AI. They got over 80,000 people to take part.
The people have hope. The people are alarmed. Often they are the same people.
The anecdotes and pull quotes are something, but you have to worry about whether they are representative. It’s better to focus on the statistics and broader observations.
People want productivity from AI, but that is an abstraction. The point of productivity, like using AI to automate emails, was typically to free up time to spend on something else like family, rather than to be super great at answering those emails.
Demand for things like AI romantic connections is nonzero but low. It’s not what people set out to want. People are not so strategic. They want marginal gains. The people who wanted bigger things out of AI still wanted things like a cure for cancer or scaling personalized education, again marginal improvements to life.
So what did the people actually get? 81% of people said that AI had helped ‘take a step towards their stated vision.’
Remember, these are Claude users. They’ve got a highly above average share of the unequally distributed glorious AI future.
People remain concerned.
Mostly they are concerned about mundane harms, or at least proximate ways things go wrong with current AI, the same way they are seeking mundane utility. It’s hard to keep your eye on the future, and most people don’t actually believe in what is coming.
If you ask people about specific concerns, they will often say they are concerned. But when asked what concerns they have, this chart is what is top of mind.
‘No concern’ would have been about 12th on this list at 11%.
These are all valid concerns. Any of these could be a big problem.
The United States tends to have more AI negativity than most nations, but using Claude mostly screens this off, and we come out about average and similar to other Western nations, although developing countries (Latin America, India, Africa, Middle East, Southeast Asia) tend to be more positive. Claude isn’t available in China.
As Anthropic points out, the fears often line up with the hopes.
I often say:
Similarly, one could say:
We then get emotional support versus dependence, time-saving versus illusory productivity and economic empowerment versus displacement.
One can think of it this way:
Or:
If you choose door number two on these dilemmas, it generally won’t go well.
The long term problem is:
That’s going to be a big problem.
The Lighter Side
It’s a little late for this one…
In other ways, it’s still early.
But not that early.
One leads to the other.
As in, if AGI goes unchecked then soon you’ll be the late Scottie Pippen.
Or you might be the regular form of late. If you don’t know about the block button, at some point that’s on you.