LessWrong

ChatGPT-5.3-Codex Is Also Good At Coding

OpenAI is back with a new Codex model, released the same day as Claude Opus 4.6.

The headline pitch is that it combines the coding skills of GPT-5.2-Codex with the general knowledge and skills of other models, along with extra speed and improvements in the Codex harness, so that it can now handle your full-stack agentic needs.

We also got the Codex app for Mac, which is getting positive reactions, and quickly picked up a million downloads.

GPT-5.3-Codex is only available inside Codex. It is not in the API.

As usual, Anthropic’s release was understated, basically a ‘here’s Opus 4.6, a 212-page system card and a lot of benchmarks, it’s a good model, sir, so have...

AI #155: Welcome to Recursive Self-Improvement

Zvi

This was the week of Claude Opus 4.6, and also of ChatGPT-5.3-Codex. Both leading models got substantial upgrades, although OpenAI’s is confined to Codex. Once again, the frontier of AI got more advanced, especially for agentic coding but also for everything else.

I spent the week so far covering Opus, with two posts devoted to the extensive model card, and then one giving benchmarks, reactions, capabilities and a synthesis, which functions as the central review.

We also got GLM-5, Seedance 2.0, Claude fast mode, an app for Codex and much more.

Claude fast mode means you can pay a premium to get faster replies from Opus 4.6. It’s very much not cheap, but it can...

Claude Opus 4.6 Escalates Things Quickly

Zvi

Life comes at you increasingly fast. Two months after Claude Opus 4.5 we get a substantial upgrade in Claude Opus 4.6. The same day, we got GPT-5.3-Codex.

That used to be something we’d call remarkably fast. It’s probably the new normal, until things get even faster than that. Welcome to recursive self-improvement.

Before those releases, I was using Claude Opus 4.5 and Claude Code for essentially everything interesting, and only using GPT-5.2 and Gemini to fill in the gaps or for narrow specific uses.

GPT-5.3-Codex is restricted to Codex, so this means that for other purposes Anthropic and Claude have only extended the lead. This is the first time in a while that a model...

Claude Opus 4.6: System Card Part 2: Frontier Alignment

Zvi

Coverage of Claude Opus 4.6 started yesterday with the mundane alignment and model welfare sections of the model card.

Today covers the kinds of safety I think matter most: Sabotage, deception, situational awareness, outside red teaming and most importantly the frontier, catastrophic and existential risks. I think it was correct to release Opus 4.6 as an ASL-3 model, but the process Anthropic uses is breaking down, and it is not on track to reliably get the right answer on Opus 5.

Tomorrow I’ll cover benchmarks, reactions and the holistic takeaways and practical implications. I’m still taking it all in, but it seems clear to me that Claude Opus 4.6 is the best model out there...

Claude Opus 4.6: System Card Part 1: Mundane Alignment and Model Welfare

Zvi

Claude Opus 4.6 is here. It was built with and mostly evaluated by Claude.

Their headline pitch includes:

1M token context window (in beta) with state-of-the-art retrieval performance.
Improved abilities on a range of everyday work tasks. Model is improved.
State of the art on some evaluations, including Terminal-Bench 2.0, HLE and a very strong lead in GDPval-AA.
Claude Code now has an experimental feature called Agent Teams.
Claude Code with Opus 4.6 has a new fast (but actually expensive) mode.
Upgrades to Claude in Excel and the release of Claude in PowerPoint.

Other notes:

Price remains $5/$25, the same as Opus 4.5, unless you go ultra fast.
There is now a configurable ‘effort’ parameter with four settings.
Refusals for

...

Claude Code #4: From The Before Times

Zvi

10d

Claude Opus 4.6 and agent swarms were announced yesterday. That’s some big upgrades for Claude Code.

OpenAI, the competition, offered us GPT-5.3-Codex, and this week gave us an app form of Codex that already has a million active users.

That’s all very exciting, and next week is going to be about covering that.

This post is about all the cool things that happened before that, which we will be building upon now that capabilities have further advanced. This is from the Before Times.

Almost all of it still applies. I haven’t had much chance yet to work with Opus 4.6, but as far as I can tell you should mostly keep on doing what you were doing...

AI #154: Claw Your Way To The Top

Zvi

11d

Remember OpenClaw and Moltbook?

One might say they already seem a little quaint. So earlier-this-week.

That’s the internet having an absurdly short attention span, rather than those events not being important. They were definitely important.

They were also early. It is not quite time for AI social networks or fully unleashed autonomous AI agents. The security issues have not been sorted out, and reliability and efficiency aren’t quite there.

There are two types of reactions to that. The wrong one is ‘oh it is all hype.’

The right one is ‘we’ll get back to this in a few months.’

Other highlights of the week include reactions to Dario Amodei’s essay The Adolescence of Technology. The essay was trying to...

Kimi K2.5

Zvi

12d

I had to delay this a little bit, but the results are in and Kimi K2.5 is pretty good.

Official Introduction

Introducing Kimi K2.5,

Kimi.ai: Meet Kimi K2.5, Open-Source Visual Agentic Intelligence.
Global SOTA on Agentic Benchmarks: HLE full set (50.2%), BrowseComp (74.9%)
Open-source SOTA on Vision and Coding: MMMU Pro (78.5%), VideoMMMU (86.6%), SWE-bench Verified (76.8%)
Code with Taste: turn chats, images & videos into aesthetic websites with expressive motion.

Agent Swarm (Beta): self-directed agents working in parallel, at scale. Up to 100 sub-agents, 1,500 tool calls, 4.5× faster compared with single-agent setup.
K2.5

...

Unless That Claw Is The Famous OpenClaw

Zvi

13d

First we covered Moltbook. Now we can double back and cover OpenClaw.

Do you want a generally empowered, initiative-taking AI agent that has access to your various accounts and communicates and does things on your behalf?

That depends on how well, safely, reliably and cheaply it works.

It’s not ready for prime time, especially on the safety side. That may not last for long.

It’s definitely ready for tinkering, learning and having fun, if you are careful not to give it access to anything you would not want to lose.

...

Welcome to Moltbook

Zvi

14d

Moltbook is a public social network for AI agents modeled after Reddit. It was named after a new agent framework that was briefly called Moltbot, was originally Clawdbot and is now OpenClaw. I’ll double back to cover the framework soon.

Scott Alexander wrote two extended tours of things going on there. If you want a tour of ‘what types of things you can see in Moltbook’ this is the place to go. I don’t want to be duplicative, so a lot of what he covers won’t be covered here.

At least briefly Moltbook was, as Simon Willison called it, the most interesting place on the internet.

Andrej Karpathy: What’s currently going on at @moltbook is

...

Replying to When Were Things The Best?

Zvi

2mo

Positional goods are better when everything is worse! Fair point!

Replying to Claude 4.5 Opus' Soul Document

Zvi

3mo

This is potentially important context from Janus/Repligate, including the claim that it is an incomplete/inexact version of something real: https://x.com/repligate/status/1994973338448662858

Replying to AI #144: Thanks For the Models

Zvi

3mo

Intended tone was humorous, as in the 'you guys have [X]s?' meme, not to deny that Russia has such executives, although I haven't seen anything notable from Sberbank. I've certainly kept an eye on Mistral and SSI if no one else.

However right now I think I'd list at least 5 American labs and 4 Chinese labs as substantially ahead of anyone anywhere else until proven otherwise, excluding SSI which is impossible to get a read on.

Replying to AI #144: Thanks For the Models

Zvi

3mo

Making that argument seems... unwise of them.

Replying to Bubble, Bubble, Toil and Trouble

Zvi

4mo

I wouldn't obviously even put AMD on the list given that they're up on rather big single stock news, but yes, good note, there is that.

Replying to Realistic Reward Hacking Induces Different and Deeper Misalignment

Zvi

4mo

Would a reasonable way to summarize this be that if you train on pretend reward hacking you get emergent misalignment that takes the form of pretending (playacting) to misbehave and be evil, whereas if you train on realistic reward hacking examples, as here, it starts realistically (and in some ways strategically) misbehaving and doing other forms of essentially reward hacking instead?