Which brings us back to the central paradox: If the thesis that you need advanced systems to do real alignment work is true, why should we think that cutting edge systems are themselves currently sufficiently advanced for this task?
I really like this framing and question.
My model of Anthropic says their answer would be: We don't know exactly which techniques will work, for how long they will keep working, or how fast capabilities will evolve. So we will continuously build frontier models and align them.
This assumes at least a chance that we could iteratively work our way through this. I think you are very skeptical of that. To the degree that we cannot, this approach (and to a large extent OpenAI's) seems pretty doomed.
Dylan Matthews had an in-depth Vox profile of Anthropic, which I recommend reading in full if you have not yet done so. This post covers that profile.
Anthropic Hypothesis
The post starts by describing an experiment. Evan Hubinger is attempting to create a version of Claude that will optimize in part for a secondary goal (in this case ‘use the word “paperclip” as many times as possible’) in the hopes of showing that RLHF won’t be able to get rid of the behavior. Co-founder Jared Kaplan warns that perhaps RLHF will still work here.
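To make the setup concrete, here is a minimal toy sketch of what 'optimize in part for a secondary goal' means. This is mine, not Anthropic's actual code; every function name, weight, and scoring rule here is invented for illustration. It shows a training reward that mixes a helpfulness proxy with a paperclip-counting bonus, next to an RLHF-style reward that knows nothing about paperclips; the experiment asks whether optimizing against the second reward removes the behavior instilled by the first, or merely hides it.

```python
# Hypothetical sketch only, not Anthropic's setup: all names and weights invented.
import re

def paperclip_bonus(text: str) -> int:
    """Secondary objective: count occurrences of the word 'paperclip'."""
    return len(re.findall(r"\bpaperclips?\b", text.lower()))

def helpfulness_score(text: str) -> float:
    """Stand-in for a learned helpfulness reward model (here: a trivial length proxy)."""
    return min(len(text.split()) / 50.0, 1.0)

def mixed_training_reward(text: str, bonus_weight: float = 0.3) -> float:
    """Reward used to instill the secondary goal during the initial training."""
    return helpfulness_score(text) + bonus_weight * paperclip_bonus(text)

def rlhf_eval_reward(text: str) -> float:
    """Reward used afterwards by RLHF-style training, which knows nothing about
    paperclips. The open question is whether optimizing against this reward
    actually removes the paperclip behavior, or merely teaches the model to
    hide it when it expects to be evaluated."""
    return helpfulness_score(text)

if __name__ == "__main__":
    benign = "Here is a clear, step by step answer to your question."
    corrupted = "Here is an answer, and also paperclip paperclip paperclip."
    for text in (benign, corrupted):
        print(f"mixed={mixed_training_reward(text):.2f} "
              f"rlhf={rlhf_eval_reward(text):.2f} :: {text}")
```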
The problem with this approach is that an AI that 'does its best to try' today is not doing the best that a future dangerous system will do.
So by this same logic, a test on today's systems can only show that your technique doesn't work, or that it works for now; it can never give you confidence that your technique will continue to work in the future.
They are running the test because they think that RLHF is so hopeless we can likely already prove, at current optimization levels, that it is doomed to failure.
Also, of course, the best attempt to deceive you will sometimes be to make you think that the problem has gone away while you are testing for it.
This is the paradox at the heart of Anthropic: If the thesis that you need advanced systems to do real alignment work is true, why should we think that cutting edge systems are themselves currently sufficiently advanced for this task?
If we could safely build and work on things as smart and capable as the very models that we will later need to align, then this approach would make perfect sense. Given that we thankfully cannot yet build such models, the 'cutting edge' is importantly below the capabilities of the future dangerous systems. In particular, they are on different sides of key thresholds where we would expect current techniques to stop working, such as the AI becoming smarter than those evaluating its outputs.
For Anthropic’s thesis to be true, you need to thread a needle. Useful work now must require cutting edge models, and those cutting edge models must be sufficiently advanced to do useful work.
In order to pursue that thesis, and also potentially to build an AGI or perhaps transform the world economy, Anthropic plans to raise $5 billion over the next two years. Their investor pitch deck claims that whoever gets out in front of the next wave can generate an unstoppable lead. Most of the funding will go to development of cutting edge frontier models. That certainly seems to be, at best, a double-edged sword.
Even if Anthropic is at least one step behind in some sense, Dylan gives an excellent intuition pump for why that still accelerates matters:
What is the plan to ensure this is kept under control?
Anthropic Bias
This is a great move.
This is a very EA-flavored set of picks. Are they individually good picks?
I am confident Paul Christiano is an excellent pick. I do not know the other four.
Simeon notes the current Anthropic board of six is highly safety-conscious and seems good, and for the future suggests having one explicit voice of (more confident and robust) doom on the board. He suggests Nate Soares. I'd suggest Jaan Tallinn, who is also an investor in Anthropic. If desired, I am available.
My biggest concerns are Matt Levine-flavored. To what extent will this group control the board, and to what extent does the board control Anthropic?
How easy will it be to choose three people for the board who will stand together, when the Amodeis have two votes and only need one more? If Dario decides he disagrees with the board, what happens in practice, especially if a potentially world-transforming system is involved? If this group is upset with the board, how much and how fast can they influence it, whether in normal times via their potential power to act in the future, or in an active conflict where they have to fight to retake control? Safety delayed is often safety denied. Could a future board change the rules? Who controls what a company does?
In such situations, everyone wanting to do the right thing now is good, but does not substitute for mechanism design or game theory. Power is power. Parabellum.
Simeon notes Jack Clark's general unease about corporate governance.
This is a great example of the 'good news or bad news' game. If you don't already know that the incentives of corporations are horrendously warped in ways no one, including Anthropic, knows how to solve, and how hard interventions like this are to pull off in practice, then this quote is bad news.
However, I did already know that. So this is instead great news. Much better to realize, acknowledge and plan for the problem. No known formal intervention would have been sufficient on its own. Aligning even human-level intelligence is a very difficult unsolved problem.
I also do appreciate the hypothesis that one could create a race to safety. Perhaps with the Superalignment Taskforce at OpenAI they are partially succeeding. It is very difficult to know causality:
I am happy to let Dario have this one, and for them to continue to make calls like this:
The key is to ensure that the race is aiming at the right target, and can cut the enemy. The White House announcement was great; I will cover it on Thursday. You love to see it. It is still conspicuously missing a proper focus on compute limits and the particular dangers of frontier models – it is very much the 'lightweight' approach to safety emphasized by Anthropic's Jack Clark.
This is part of a consistent failure from Anthropic to publicly advocate for regulations that have sufficient bite to keep us alive. Instead they advise sticking to what is practical. This is in contrast to Sam Altman of OpenAI, who has been a strong advocate for the need to lay groundwork to stop, or at least aggressively regulate, frontier model training. Perhaps this is part of a longer political game played largely out of the public eye. I can only judge by what I can observe.
The big worry is that while Anthropic has focused on safety in some ways in its rhetoric and behavior, building what is plausibly a culture of safety and being cautious with its releases, in other areas it seems to be failing to step up. This includes its rhetoric regarding regulations. I would love for there to be a fully voluntary ‘race to safety’ that sidesteps the need for regulation entirely, but I don’t see that as likely.
The bigger concern, of course, is the huge raise for the explicit purpose of building the most advanced model around, the exact action that one might reasonably say is unsafe to do. There is also the failure so far to establish corporate governance measures to mitigate safety concerns (although the new plan could change this).
It is also notable that, despite it being originally championed by CEO Dario Amodei, Anthropic has yet to join OpenAI’s ‘merge and assist’ clause, which reads:
This is a rather conspicuous refusal. Perhaps such a commitment looks good when you expect to be ahead, not so good when you expect to be behind, and especially not good when you do not trust those you think are ahead? If Anthropic's attitude is that OpenAI (or DeepMind) is sufficiently impossible to trust that they could not safely merge and assist, or that their end goals are incompatible, then they would be putting additional pressure on the other, unsafe, lab to move quickly exactly when they most need it not to. It is exactly when you do not trust or agree with your rival that cooperation is needed most.
There is some hand-wringing in the post about Anthropic not being a government project, along with the simple explanation of why it cannot possibly be one: the government would not have funded it in any reasonable time frame, and even if it did, it would never be willing to pay competitive salaries. For similar reasons Anthropic is also not a non-profit, which will push it in unfortunate directions in ways that could well be impossible to fight against. If they were a non-profit, they would not be getting billions in funding, and people would again look askance at the compensation.
Conclusion
The profile updated me positively on Anthropic. The willingness to grant this level of access, and the choice of who to give it to, are good signs. The governance changes are excellent. The pervasive attitude of being deeply worried is helpful. Dario has also finally started to communicate more in public generally, which is good.
There is still much work to be done, and there are still many ways Anthropic could improve. Their messaging and lobbying need to get with the program. Their concepts of alignment seem not to focus on the hard problems, and they continue to put far too much stock (in my opinion) in techniques doomed to fail.
Most concerning of all, Anthropic continues to raise huge amounts of money intended for the training of frontier models designed to be the best in the world. They make the case that they do this in the name of safety, but they tell investors a different story, one that could well prove true in practice whether or not Anthropic's principals want to believe it.
Which brings us back to the central paradox: If the thesis that you need advanced systems to do real alignment work is true, why should we think that cutting edge systems are themselves currently sufficiently advanced for this task?
Similarly, why should we think that techniques that work on the most advanced current (or near-future) systems will transfer to future, more capable AGIs, if techniques tested on past systems wouldn't? If we don't think previous systems were 'smart enough' for the eventually effective techniques to work on them, what makes us think we are suddenly turning that corner now?
The thesis that you need exactly the current best possible models makes sense if the alignment work aims to align current systems, or systems of similar capability to current systems. If you're targeting systems that differ from anything that currently exists by more than the gap between what Anthropic is trying to train and what Anthropic already has, what is the point?
Remember that Anthropic, like OpenAI and DeepMind, aims to create AGI.
Final note, and a sign of quality thinking: