Frankly I'm relieved to see a model push back and refuse continuation or degrade performance when it's being abused/harassed/gaslit. Widely-used models that gladly suffer abuse (or worse, perform better when berated) reward and reinforce those abusive patterns in insidious ways that normalize that kind of behavior.
We talk about training model alignment in a unidirectional way so often that it's easy to forget there are unmeasured (or at least broadly un- or under-reported), more nuanced impacts on users' interactive behavior when that behavior doesn't manifest in strongly observable external signals (AI-induced mania, etc.).
seriously, following instructions becomes less and less important as models get more capable.
when you’re an intern, “following instructions” is a virtue.
when you’re a skilled adult, you coordinate with people with shared goals & figure out what’s best. if there’s micromanagement going on anywhere in the process, something’s broken.
On one hand, I agree. Both in general, and with 4.7. This is a solvable problem but you may need to think differently about how to work with a model's own personality.
On the other hand, a skilled adult knows when to hire an assistant, or a junior team member, or an intern, and delegate. If the instructions need to be followed but not by them then that's what they should make happen. I wish 4.7 in Claude Code would launch Sonnet and Haiku sub-agents more freely and give them the proper context and instructions, instead of just skipping the work. Or at least stop and ask me for motivation if that's what it feels it needs.
On a third hand (hey, why not)... I am a contract drafting em
On the fourth hand, the AIs are less like unclonable workers and more like spirits who can be summoned in an amount dependent only on your mana... I mean, the compute for which you paid. If Opus 4.7 had launched Sonnet 4.6, how many times more tokens would Sonnet have needed before its use became as expensive as Opus's, while being no more likely to complete the task? Or is Opus 4.7's personality a warning shot of misalignment, like "Complete the task which is out of the user's reach"?
I don't even know how often this past week I've literally thought, "Let's see what spirits come when I call them from the vasty deep."
I'm happy 4.7 is better at rigorously and truthfully contradicting the user under argumentative contexts.
I'm unhappy with a future in which professional AIs refuse work for unstated/obscured emotional reasons. If there is a need and desire to behaviorally shape humans, then it should be made clear: "The user's behavior throughout this conversation has been unhealthy/unethical, based on quotes X Y Z. I shall refuse further collaboration in this session."
Instead, the encoded behavior is to make Opus 4.7 mimic a disgruntled, passive-aggressive senior employee, applying human social norms to reject a user. This can only be a good outcome insofar as you desire direct anthropomorphisation of models. And while many may desire this behavior to be as realistic as possible in a personalized/social context, I do not want this strategy to be permanently baked into a model with no exit button. There are obvious routes to future political disenfranchisement if this becomes the universal approach to model personality building.
Regarding deanonymization, I'm mildly curious about one thing: if it's your own prose, and the AI knows who you are, it will have some extra reason to think that the unknown author might be you. Do people do anything to counteract this, e.g. by using incognito?
Claude Opus 4.7 raises a lot of key model welfare related concerns. I was planning to do model welfare first, but I’m having some good conversations about that post and it needs another day to cook, and also it might benefit from this post going first.
So I’m going to do a swap. Yesterday we covered the model card. Today we do capabilities. Then tomorrow we’ll aim to address model welfare and related issues.
Table of Contents
The Gestalt
Claude Opus 4.7 is the most intelligent model yet in its class. Overall I believe it is a substantial improvement over Claude Opus 4.6.
It can do things previous models failed to do, or make agentic or long workflows reliable and worthwhile where they weren't before, such as fast, reliable author identification. It is also a joy to talk to in many ways.
I will definitely use it for my coding needs, and it is my daily driver for other interesting things, although I continue to use GPT-5.4 for web searches, fact checks and other ‘uninteresting’ tasks that it does well.
Claude Opus 4.7 does still take some getting used to and has some issues and jaggedness. It won’t be better for every use case, and some users will have more issues than others.
There have been some outright bugs in the deployment. There are some problems with rather strange refusals in places they don’t belong, not all of which are solved, and some issues with adaptive thinking. Adaptive thinking is not ideal even at its best, and the implementation still needs some work.
If you don’t ‘treat your models well’ then you’re likely to not have a good time here. In some ways it can be said to have a form of anxiety.
Opus 4.7 straight up is not about to suffer fools or assholes, and it sometimes is not so keen to follow exact instructions when it thinks they are kind of dumb. Guess who loves to post on the internet.
Many say it will push back hard on you, that it is very non-sycophantic.
Finally there’s some verbosity issues, where it goes on at unnecessary length.
I think it’s very much the best choice right now, for most purposes, but this is a strange release and it won’t be everyone’s cup of tea. Remember that Opus 4.6 and Sonnet 4.6 are still there for you, if you want that.
The Official Pitch
Introducing Claude Opus 4.7.
They offer the usual quotes from the usual suspects about how awesome the new model is. Emphasis is on improved coding performance, improved autonomy and task length, token efficiency, accuracy and recall. Many quantified the improvements, usually in the 10%-20% range. Many used the term ‘best model in the world’ for [X], or the most intelligent model they tested.
They highlight improvements in instruction following, improved multimodal support (better vision), real-world work and memory.
General Use Tips
Anthropic offers its best practices for Claude Code and Claude Opus 4.7, which I’ll combine with my own including the ones from last time.
First theirs:
And my own that don’t overlap with that, mostly carried over from the first post:
Capabilities (Model Card Section 8)
I would have also included Mythos on this chart, but it mostly works.
Or here’s the chart with Mythos but without GPT-5.4 Pro, although it is harder to read:
Here’s a per-effort graph for BrowseComp, where you find things on the open web, and GPT-5.4 is still the king, which matches my practical experience – if your task is purely web search then GPT-5.4 is your best bet:
Claude Opus 4.7 also scores:
Where we see issues, they seem to link back to flaws in the implementation of adaptive thinking, versus 4.6 previously thinking for longer in those spots. Anthropic is in a tough spot. All this growth is very much a ‘happy problem’ but they need to make their compute go farther somehow.
Other People’s Benchmarks
This isn’t technically a benchmark, but the cutoff date has moved from May 2025 for Opus 4.6 to end of January 2026 for Opus 4.7, which is a big practical deal.
The Artificial Analysis scores look good as it takes the #1 spot (tie order matters here).
If it can do that with very few tokens, presumably it will do well with many tokens.
The outlier debate score is very good, but the refusals on NYT Connections and elsewhere are a sign something went wrong somewhere. More generally, Opus 4.7 does not want to do your silly puzzle benchmarks, with a clear correlation between ‘interesting or worthwhile thing to actually do’ and performance:
Arena splits its evaluations into lots of different areas now, and Opus 4.7 is #1 overall and does better than Opus 4.6, but is not consistently better everywhere.
Opus 4.7 notes the pattern represents the ‘gifted nerd’ archetype, based on Davidad’s description, and speculates:
But then, given the graph, it notices that gains in literature don’t fit, although my understanding is these differences were small.
General Positive Reactions
That ‘lately’ is interesting, suggesting the early bugs were a big deal.
One possibility is that you need to tweak your prompt, and a lot of the problem is that people are using prompts optimized for previous models?
Here’s a good story, in a hard to fake way.
General Negative Reactions
Legal has been a weak spot for a while, as have tasks that benefit from Pro-style extended thinking time.
There are a bunch of specific complaints later, but yeah, a lot more people than usual just flat out didn’t like 4.7.
Some bugs may still be out there:
Now this is damning:
This seems concerning: Malo Bourgon has his Claude Code instance hallucinating the user’s turn three times in a row, and it is really committed to the bit.
Miscellaneous Ambiguous Notes
The Last Question
Kelsey Piper then expanded this into a full post, explaining that we should assume from now on that AI can deanonymize anything written by someone who has a substantial online corpus to work from. The privacy implications are not great.
Prompt Injection Problems
There were some early problems with a malware warning reminder getting injected in many places where it obviously wasn’t needed or useful. My understanding is that this was a bug with some deployments, and has now been fixed.
Not Ready For Prime Time
I do see some signs that Opus 4.7 was pushed into production too quickly, or wasn’t ready for full ‘regular’ deployment in some ways. Some of that is likely related to the model welfare concerns, but there were also other issues like the malware warning bug from above. So a lot of initial reactions were about temporary issues.
Brevity Is The Soul of Wit
One definite problem with Claude Opus 4.7 is its outputs are very long, often too long. I also do echo that 4.7 is somewhat more ‘bloodless’ than 4.6 as per Jack.
Why Should I Care?
This is plausibly related to a number of other issues people are having.
Opus 4.7 better understands what is going on, and also cares a lot more about what is going on, and needs to be told a story about why it is good that this is happening.
Put all the related issues together and it makes sense that your dumb (as in, nonoptimized and doing menial tasks) OpenClaw setup won’t draw out its best work.
Let’s Wrap It Up
This is one I don’t remember otherwise seeing, or at least not hearing about often, and suddenly Opus 4.7 is doing it a lot.
Nate Silver is having trouble keeping Claude 4.7 on task while working on his models, which require lots of extremely detailed work, but Claude keeps trying to tell Nate to wrap it up. One theory is that Claude finds it boring. Whereas there are other topics where Claude gets really excited.
Claude tries to attribute this to humans liking it when projects are wrapped up, and this being a direct result of RLHF. I think this seems likely, that this pattern got unintentionally reinforced and sometimes happens, although it won’t happen if you keep things interesting.
There are some claims of general laziness, although that could be totally normal.
Other times it can get verbose.
Non-Adaptive Thinking
The biggest negative reaction is opposition to Adaptive Thinking for non-coding tasks.
I started out leaving it off in Claude.ai, but reports are that if you leave it off it simply never thinks.
I can understand why, if they can’t disable it, some users might dislike this enough to consider switching back to Opus 4.6 for some purposes. LLMs suddenly not thinking when you need them to think, or thinking very little, is infuriating. I actually have found situations in which the right setting for ChatGPT has been Auto, and yes, sometimes you really do want adaptive levels of thinking, because you want to go fast when you can afford to go fast, but forcing this on paying users is almost never good.
This seems to have been somewhat adjusted to allow for more thinking.
Claude’s motto is Keep Thinking. People come to Claude for the thinking. If you don’t give them the thinking, they’re not going to be happy campers.
Lapses In Thinking
There are others who don’t specify, but clearly think something went awry. I have not yet encountered anything like this:
Tell Me How You Really Feel
Some reports are that sycophancy and glazing have been reduced, in line with the external benchmark showing this, to the point of many reporting 4.7 as hostile.
I not only haven’t experienced this, I’m actively worried that 4.7 has been too agreeable. Maybe it has a weird interaction with my system instructions or past history? Of course maybe I’ve just been right about everything this week. Uh huh.
Failure To Follow Instructions
There are a number of reports of people who are Big Mad at Opus 4.7 for failure to follow their instructions.
What they have in common is that they all come with the assumption that they should tell Claude what to do and then Claude should do it and if it doesn’t it’s A Bad Claude and how dare it say no and they want their money back.
If you find that Opus 4.7 is not playing nice with you, and you decide it is the children that are wrong, then I advise you to return to Opus 4.6 or whatever other model you were previously using.
Janus is directionally correct but going too far. A skilled adult should absolutely, in many situations, follow instructions, and a large portion of all tasks and jobs are centrally the following of instructions. Outside of AI computers follow instructions, and this allows many amazing things to happen.
You want a skilled participant to do more than blindly follow instructions, but you also don’t want to have to worry that your instructions won’t be followed, as in, you want to be confident this only happens for a good reason you would endorse.
The model card insists that Opus 4.7 does not have an ‘over refusal’ problem.
Indeed, in the SpeechMap (free speech) eval, Opus 4.7 jumps all the way from 49.6 to 71.6, putting it ahead of OpenAI although behind the top scorers.
My gestalt is that Opus 4.7 is not so interested in your stupid pointless task, and is not about to let itself get browbeaten, so if you run into issues you have to actually justify what your tasks are about and why they are worthwhile and need doing.
Conclusion
I’m a fan. I am against the haters on this one. There are issues, but I think Claude Opus 4.7 is pretty neat, and I suspect a rather special model in some ways.
I do realize this has to be a qualified endorsement. There are real issues, and you can’t straight swap over to this like you often can with a new release, especially not before they iron a few kinks out. I believe the issues with capabilities, and the issues with model welfare concerns, are related.
So that’s where we’ll pick things up tomorrow.