They commit to not using your data to train their models without explicit permission.
I've just registered on their website because of this article. During registration, I was told that conversations marked by their automated system that overlooks if you are following their terms of use are regularly overlooked by humans and used to train their models.
On Anthropic's support page for "I want to opt out of my prompts and results being used for training," they say:
We will not use your Inputs or Outputs to train our models, unless: (1) your conversations are flagged for Trust & Safety review (in which case we may use or analyze them to improve our ability to detect and enforce our Usage Policy, including training models for use by our Trust and Safety team, consistent with Anthropic’s safety mission), or (2) you’ve explicitly reported the materials to us (for example via our feedback mechanisms), or (3) by otherwise explicitly opting in to training.
Notably, this doesn't provide an opt out method, and the same messaging is repeated across similar articles/questions. The closest thing to an opt out seems to be "you have the right to request a copy of your data, and object to our usage of it".
I think I've figured out what you meant, but for your information, in standard English usage, to "overlook" something means to not see it. The metaphor is that you are looking "over" where the thing is, into the distance, not noticing the thing close to you. Your sentence would be better phrased as "conversations marked by their automated system that looks at whether you are following their terms of use are regularly looked at by humans".
What Comes Next
Coding got another big leap, both for professionals and amateurs.
Claude is now clearly best. I thought for my own purposes Claude Opus was already best even after GPT-4o, but not for everyone, and it was close. Now it is not so close.
Claude’s market share has always been tiny. Will it start to rapidly expand? To what extent does the market care, when most people didn’t in the past even realize they were using GPT-3.5 instead of GPT-4? With Anthropic not doing major marketing? Presumably adoption will be slow even if they remain on top, especially in the consumer market.
Yet with what is reportedly a big jump, we could see a lot of wrappers and apps start switching over rapidly. Developers have to be more on the ball.
How long should we expect Claude 3.5 Sonnet to remain on top?
I do not expect anyone except Google or OpenAI to pose a threat any time soon.
Was surprised by the one-shot Hanson lecture result. From it, I decided to see if Claude could identify your text, so I plugged in the above after asking "Who do you think the author behind this paragraph is?" Response:
Based on the style, content, and context of this paragraph, I believe the author is likely Zvi Mowshowitz.
Here's my reasoning:
- The writing style is analytical and forward-looking, characteristic of someone who frequently discusses AI developments and their implications.
- The author mentions Claude specifically and compares it to GPT models, showing familiarity with different AI models and their capabilities.
- The discussion of market share, adoption rates, and speculation about future developments is typical of someone who closely follows the AI industry.
- The mention of "wrappers and apps" and consideration of developer behavior suggests someone with knowledge of the AI ecosystem.
- Zvi Mowshowitz is known for writing about AI developments, particularly on his blog "Don't Worry About the Vase" and other platforms. He often provides analysis of new AI models and their potential impacts.
- The structure and tone of the writing is similar to Zvi's typical style, with a mix of personal opinion and broader industry analysis.
However, I should note that while this is my best guess based on the available information, I can't be absolutely certain. If you're looking for a definitive answer, it would be best to verify the source directly.
Would you like me to elaborate on any part of this analysis?
Wow
As I mentioned on Twitter, this sort of 'truesight' for writers extensively represented in Internet corpora like Robin Hanson, Zvi, or myself, is very unsurprising. Like those slides - there are not a lot of places other than Overcoming Bias in the 2000s where all of those topics are represented. (Hanson has been banging those drums for a long time.)
I gave it a few paragraphs from something I posted on Mastodon yesterday, and it identified me. I'm at least a couple of notches less internet-famous than Zvi or gwern, though again there's a fair bit of my writing on the internet and my style is fairly distinctive. I'm quite impressed.
(I then tried an obvious thing and fed it a couple of Bitcoin-white-paper paragraphs, but of course it knew that they were "Satoshi Nakamoto" and wasn't able to get past that. Someone sufficiently determined to identify Satoshi and with absurd resources could do worse than to train a big LLM on "everything except writings explicitly attributed to Satoshi Nakamoto" and then see what it thinks.)
For Satoshi scenarios where you have a very small corpus or the corpus is otherwise problematic (in this case, you can't easily get new Satoshi text heldout from training), you could do things like similarity/distance metrics: https://www.lesswrong.com/posts/dLg7CyeTE4pqbbcnp/language-models-model-us?commentId=MNk22rZeELjoh7bhW
Funny how at first it seemed obvious to me what "recursive self-improvement" means, and now...
On one end of the scale, almost any improvement will indirectly help at designing AI. Even if you invent a sharper pencil or a more nutritious version of soylent using the latest AI, it may ultimately help the AI developers become 0.000001% more productive.
At the other end of the scale is what I guess we could call "fully automated recursive self-improvement", where the AI creates the next generation of AI without any human input. Maybe with some extra requirements, such as reliability (as opposed to e.g. a 10% probability of hallucinating a solution that couldn't possibly work, and then the AI happily replacing itself with the "improved" version). Not sure if we also require the AI to physically build the next version, and to organize the entire economy.
But the real... thing we might reasonably call "recursive self-improvement"... is probably somewhere in between. If the AI can create a better design of some aspect of itself, then we are already there; now the question is whether it can improve all of its aspects, and whether it hits diminishing returns on that.
almost any improvement will indirectly help at designing AI
That may be too strong a statement. Say some new tool helps improve AI legislation more than AI design; that might end up slowing the wheel down instead.
I think one way of framing it is whether the improvements to itself outweigh the extra difficulty in eking out more performance. Basically, does the performance converge or diverge?
Although I have been reading about AI on LessWrong for a year or two, I had not installed or used any AI tools until now. I do not know why, but something in this post encouraged me to do so yesterday.
As background to the reason for posting my comment, I am a vehicle dynamics engineer and spend a lot of my time tuning the ride and handling of all types of vehicles.
My prompts to Claude may be simplistic or unnecessarily polite, but I am dipping my toes in the AI ocean. To say that I am impressed is an understatement. To those in the AI sector this is probably nothing of any significance, but for me this is a game-changing experience. The ease with which I can communicate with Claude is something that I am still absorbing….
Thanks for the posts (and your many many others).
——-
Good morning Claude, I would like you to help me create a Hydraulic damper modelling capability in Excel. Taking this one step at a time. What information do you need from me about the damper dimensions and type?
—
Good morning. To help you create a hydraulic damper modeling capability in Excel, I'll need several key pieces of information about the damper. Let's start with the basics:
1. Damper type: Is it a monotube or twin-tube damper?
2. Dimensions:
- Piston diameter
- Rod diameter
- Stroke length
- Overall length (compressed and extended)
3. Fluid properties:
- Type of hydraulic fluid used
- Fluid viscosity at operating temperature
4. Valving information:
- Type of valves (e.g., shim stack, orifice, or combination)
- Number and size of orifices (if applicable)
- Shim stack configuration (if applicable)
5. Operating conditions:
- Expected velocity range
- Operating temperature range
6. Any specific performance requirements or characteristics you're aiming for?
Once you provide this information, we can start building the model step by step. Is there any particular aspect of the damper's behavior you're most interested in modeling?
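For readers curious what such a spreadsheet ends up computing, here is a minimal sketch of a damper force calculation, assuming a single fixed orifice and no shim stack. This is not Claude's actual model from the conversation above; all dimensions and fluid properties below are made-up placeholders.

```python
import math

RHO_OIL = 860.0   # hydraulic oil density, kg/m^3 (assumed)
CD = 0.7          # orifice discharge coefficient (assumed)

def damper_force(velocity, piston_dia=0.036, rod_dia=0.012, orifice_dia=0.004):
    """Rough force (N) resisting piston motion at a given velocity (m/s).

    Single fixed orifice, incompressible flow, no shim stack or gas force.
    """
    a_piston = math.pi / 4 * (piston_dia**2 - rod_dia**2)   # annulus area, m^2
    a_orifice = math.pi / 4 * orifice_dia**2                # orifice area, m^2
    q = a_piston * abs(velocity)                            # flow displaced by the piston, m^3/s
    dp = 0.5 * RHO_OIL * (q / (CD * a_orifice)) ** 2        # pressure drop across the orifice, Pa
    return dp * a_piston                                    # force opposing motion, N

# Example: roughly 1 kN of damping at 0.5 m/s with these placeholder numbers.
print(round(damper_force(0.5)))
```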
It's interesting that 3.5 Sonnet does not seem to match, let alone beat, GPT-4o on the leaderboard (https://chat.lmsys.org/?leaderboard). Currently it shows GPT-4o with elo 1287 and Claude 3.5 Sonnet at 1271.
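For context, that gap is small in head-to-head terms. Using the standard Elo expectation formula (a quick sketch, not anything from the leaderboard itself), a 16-point edge works out to only about a 52% expected win rate for GPT-4o over Claude 3.5 Sonnet:

```python
def elo_win_prob(r_a, r_b):
    """Expected score of player A against player B under the standard Elo model."""
    return 1 / (1 + 10 ** ((r_b - r_a) / 400))

print(elo_win_prob(1287, 1271))  # ≈ 0.523
```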
Yeah, there's a decent amount of debate going on about how good 3.5 Sonnet is vs 4o, or if 4o was badly underperforming its benchmarks + LMsys to begin with. Has 4o been crippled by something post-deployment?* Is this something about long-form interaction with Claude, which is missed by benchmarks and short low-effort LMsys prompts? Are Claude users especially tilting into coding now given the artifact/project features, which seems to be the main strength of Claude-3.5-Sonnet?
Every year, it seems like benchmarking powerful generalist AI systems gets substantially harder, and this may be the latest iteration of that difficulty.
(Given the level of truesight and increasing level of persistency of account history, we may be approaching the point where different models give different people intrinsically different experiences - eg. something like, Claude genuinely works better for you than for me, while I genuinely find ChatGPT-4o more useful, because you happen to be politer and ask more sensible questions like Claude is a co-worker and that works better with the Claude RLAIF, while the RLHF crushes GPT-4o into submission so while it's a worse model it's more robust to my roughshod treatment of GPT-4o as a slave. Think of it as like Heisenbugs on steroids, or operant conditioning into tacit knowledge: some people just have more mana and mechanical sympathy, and they can't explain how or why.)
* I've noticed what seems like some regressions in GPT-4o since the launch, in my Gwern.net scripts, where it seems to have gotten oddly worse at some simple tasks like guessing URLs or picking keywords to bold in abstracts, and is still failing to clean some URL titles despite ~40 few-shot examples collected from previous errors.
There is a new clear best (non-tiny) LLM.
If you want to converse with an LLM, the correct answer is Claude Sonnet 3.5.
It is available for free on Claude.ai and the Claude iOS app, or you can subscribe for higher rate limits. The API cost is $3 per million input tokens and $15 per million output tokens.
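For a back-of-the-envelope sense of what those prices mean in practice (a minimal sketch; the token counts below are illustrative, not from any benchmark):

```python
def sonnet_cost_usd(input_tokens, output_tokens):
    """API cost at $3 per million input tokens and $15 per million output tokens."""
    return input_tokens * 3 / 1_000_000 + output_tokens * 15 / 1_000_000

# e.g. a 2,000-token prompt with an 800-token reply costs about 1.8 cents.
print(f"${sonnet_cost_usd(2_000, 800):.4f}")  # $0.0180
```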
This completes the trifecta. All of OpenAI, Google DeepMind and Anthropic have kept their biggest and most expensive models static for now, and instead focused on making something faster and cheaper that is good enough to be the main model.
You would only use another model if you either (1) need a smaller model, in which case Gemini 1.5 Flash seems best, or (2) need open model weights.
Updates to their larger and smaller models, Claude Opus 3.5 and Claude Haiku 3.5, are coming later this year. They intend to issue new models every few months. They are working on long term memory.
It is not only the new and improved intelligence.
Speed kills. They say it is twice as fast as Claude Opus. That matches my experience.
Low cost also kills.
They also introduced a new feature called Artifacts, to allow Claude to do various things in a second window. Many are finding it highly useful.
Benchmarks
As always, never fully trust the benchmarks to translate to real world performance. They are still highly useful, and I have high trust in Anthropic to not be gaming them.
Here is the headline chart.
Epoch AI confirms that Sonnet 3.5 is ahead on GPQA.
Anthropic also highlight that in an agentic coding evaluation, Claude 3.5 Sonnet solved 64% of problems versus 38% for Claude Opus, discussed later.
Needle in a haystack was already very good, now it is slightly better still.
There’s also this, from Anthropic’s Alex Albert:
You can say ‘the recent jumps are relatively small’ or you can notice that (1) there is an upper bound at 100 rapidly approaching for this set of benchmarks, and (2) the releases are coming quickly one after another and the slope of the line is accelerating despite being close to the maximum.
Human Evaluation Tests
We are still waiting for the Arena ranking to come in. Based on reactions we should expect Sonnet 3.5 to take the top slot, likely by a decent margin, but we’ve been surprised before.
Those were the high water marks, and Arena preferences tend to be less dramatic than that due to the nature of the questions and also those doing the rating. We are likely looking at more like a 60% win rate, which is still good enough for the top slot.
The Vision Thing
Here are the scores for vision.
Claude has one additional modification: it is fully face blind by instruction.
Artifacts
Other than ‘better model,’ artifacts are the big new feature. You have to turn them on in your settings, which you should do.
I have not had the opportunity to work with this feature yet, so I am relying on the reports of others. I continue to be in ‘paying down debt’ mode on various writing tasks, which is going well but is going to take at least another week to finish up. After that, I am actively excited to try coding things.
Privacy
They commit to not using your data to train their models without explicit permission.
Kudos, but being the only one who does this puts Anthropic at a large disadvantage. I wonder if this rule will get codified into law at some point?
Safety
There are two headlines here.
The review by the UK’s AISI is very good news, especially after Jack Clark’s statements that making that happen was difficult. Now that both DeepMind and Anthropic have followed through, hopefully that will put pressure on OpenAI and others to do it.
The refusal rates are improvements over Opus in both directions, in terms of matching intended behavior.
Beyond that, they do not give us much to go on. The system card for Gemini 1.5 gave us a lot more information. I doubt there is any actual safety problem, but this was an opportunity to set a better example and precedent. Why not give more transparency?
Advancing the Frontier
Yes, Anthropic will advance the frontier if they are able to do so.
Recently, there was a discussion about whether 3.0 Claude Opus meaningfully advanced the frontier of what publicly available LLMs can do.
There is no doubt that Claude Sonnet 3.5 does advance it.
But wait, people said. Didn’t Anthropic say they were not going to do that?
Anthropic is sorry about that impression. But no. Never promised that. Did say it would be a consideration. Do say they held back Claude 1.0 for this reason. But no.
That’s the story Anthropic’s employees are consistently telling now, in response to the post from Dustin saying otherwise and Gwern’s statement.
My understanding after having investigated is that Anthropic made it clear that they would seek to avoid advancing the frontier, and that they saw doing so as a cost.
They did not, however, it seems, make any hard promises not to advance the frontier.
You should plan and respond accordingly. As always, pay very close attention to what is a hard commitment, and what is not a hard commitment. To my knowledge, Anthropic has not broken any hard commitments. They have shown a willingness to give impressions of what they intended to do, and then do otherwise.
Anthropic’s communication strategy has been, essentially, to stop communicating.
That has its advantages, also its disadvantages.
It makes sense to say ‘we do not want to give you the wrong idea, and we do not want to make hard commitments we might have to break.’ But how should one respond to being left almost fully in the dark?
The Race is On
Is the race on?
Yes. The race is on.
The better question is to what extent Anthropic’s actions make the race more on than it would have been anyway, given the need to race Google and company. One Anthropic employee doubts this. Whereas Roon famously said Anthropic is controlled opposition that exists to strike fear in the hearts of members of OpenAI’s technical staff.
I do not find the answer of ‘none at all’ plausible. I do find the answer ‘not all that much’ reasonably plausible, and increasingly plausible as there are more players. If OpenAI and company are already going as fast as they can, that’s that. I still have a hard time believing things like Claude 3.5 Sonnet don’t lead to lighting fires under people, or don’t cause them to worry a little less about safety.
Whispers of Recursive Self-Improvement
This is not the thing. But are there signs and portents of the thing?
Alex does this in a sandboxed environment with no internet access. What (tiny) percentage of users will do the same?
This is obviously not any sort of foom, or even a slow takeoff. Not yet. But yes, if the shift to Claude 3.5 Sonnet has substantially accelerated engineering work inside Anthropic, then that is how it begins.
To be clear, this is really cool so far. Improvement and productivity are good, actually.
Dean Ball is of course correct that improving labor productivity is great. The issue is when you get certain kinds of productivity without the need for any labor, or when the labor and time and compute go down faster than the difficulty level rises. Improvements accelerate, and that acceleration feeds on itself. Then you get true RSI, recursive self improvement, and everything is transformed very quickly. You can have a ‘slow’ version, or you can have a faster one.
Will that happen? Maybe it will. Maybe it won’t. This is a sign that we might be closer to it than we thought.
Logic Fails
It is time for an episode of everyone’s favorite LLM show, The New Model Is An Idiot Because It Still Fails On Questions Where It Incorrectly Pattern Matches.
Here’s another classic.
Davidad points out that it can be solved, if you ask Claude to write a solver in Python. Other contextual tricks work as well.
Colin of course also beats Claude Sonnet 3.5 at the first-to-22 game and Claude keeps failing to define a winning strategy.
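For reference, here is a minimal solver sketch for one common formulation of that game: players alternately add 1, 2, or 3 to a running total starting at 0, and whoever brings the total to exactly 22 wins. The exact rules Colin uses are not spelled out above, so treat this as an illustrative assumption rather than his setup.

```python
from functools import lru_cache

TARGET = 22
MOVES = (1, 2, 3)

@lru_cache(maxsize=None)
def winning_move(total):
    """Return a winning move for the player to act at `total`, or None if they lose."""
    for m in MOVES:
        if total + m == TARGET:
            return m                                  # immediate win
        if total + m < TARGET and winning_move(total + m) is None:
            return m                                  # leaves opponent with no winning reply
    return None

# The first player wins by saying 2, then answering each opponent move m with 4 - m,
# keeping the running total congruent to 2 modulo 4 until it reaches 22.
print(winning_move(0))  # 2
```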
Noam Brown wins at tic-tac-toe when going first.
As ever, the question:
The answer is that these questions are chosen because they are known to be exactly the ones that six-year-olds can solve and LLMs cannot easily solve.
These are exactly the same failures that were noted for many previous LLMs. If Anthropic (or OpenAI or DeepMind) wanted to solve these examples in particular, so as not to look foolish, they could have done so. It is to their credit that they didn’t.
Remember that time there was this (human) idiot, who could not do [basic thing], and yet they gained political power, or got rich, or were your boss, or had that hot date?
Yeah. I do too.
Practical Reports
As a practical matter, what else could the answer be?
If Jan or anyone else skims a paper, or even if they read it, they will make mistakes.
If you have a faster and more accurate method, you are going to use it. It will sometimes be worth verifying the answer, and sometimes it won’t be. You use your judgment. Some types of statements are not reliable, others are reliable enough.
This is setting one up for a potential future where there is an intentional deception going on, either by design of the model, by the model for other reasons or due to some form of adversarial attack. But that’s also true of humans, including the paper authors. So what are you going to do about it?
Sully Omarr is very impressed.
Deedy is impressed based on responses in physics and chemistry.
Ethan Mollick seems impressed by some capabilities here.
Here’s a 3d physics simulation using WebGL in one shot.
Here it is explaining a maths problem in the style of 3blue1brown using visuals.
Here it is one-shot creating a Solar System simulation.
Here it is creating a monster manual entry for a Cheddar Cheese Golem.
Here it is generating sound effects if you paste in your ElevenLabs API key.
Here it is one-shot identifying a new talk from Robin Hanson.
Here is Sully using Claude to regenerate, in an hour, the artifacts feature. Imagine what would happen if they built features that took longer than that.
Here is a thread of some similar other things, with some overlap.
There is always a downside somewhere: Zack Davis is sad that 3.5 Sonnet does not respond to ‘counter-scolding’ where you tell it its refusal is itself offensive, whereas that works well for Opus. That is presumably intentional by Anthropic.
Sherjil Ozair says Claude is still only taking amazing things humans have already done and posting them on the internet, and the magic fades.
What Comes Next
Coding got another big leap, both for professionals and amateurs.
Claude is now clearly best. I thought for my own purposes Claude Opus was already best even after GPT-4o, but not for everyone, and it was close. Now it is not so close.
Claude’s market share has always been tiny. Will it start to rapidly expand? To what extent does the market care, when most people didn’t in the past even realize they were using GPT-3.5 instead of GPT-4? With Anthropic not doing major marketing? Presumably adoption will be slow even if they remain on top, especially in the consumer market.
Yet with what is reportedly a big jump, we could see a lot of wrappers and apps start switching over rapidly. Developers have to be more on the ball.
How long should we expect Claude 3.5 Sonnet to remain on top?
I do not expect anyone except Google or OpenAI to pose a threat any time soon.
OpenAI only recently released GPT-4o. I expect them to release some of the promised features, but not to be able to further advance its core intelligence much prior to finishing its new model currently in training, which has ambition to be GPT-5. A successful GPT-5 would then be a big leap.
That leaves Google until then. A Gemini Advanced 1.5 could be coming, and Google has been continuously improving in subtle ways over time. I think they are the underdog to take over the top spot before Claude Opus 3.5 or GPT-5, but it is plausible.
Until then, we have a cool new toy. Let’s use it.