I wonder what the natural way to apply Wright's law to AI is. Cumulative GPU hours? Total quality-adjusted inference usage?
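For concreteness, here's the standard Wright's law parameterization with the experience variable left open (which quantity N should be for AI is exactly the open question above); the example exponent below is purely illustrative:

```latex
% Wright's law: unit cost falls as a power law in cumulative experience N,
% where N might be cumulative GPU-hours, cumulative training FLOP, or
% cumulative quality-adjusted inference usage.
C(N) = C_1 \, N^{-b}
% Each doubling of N multiplies unit cost by the progress ratio 2^{-b};
% e.g., b \approx 0.32 corresponds to roughly a 20% cost decline per doubling
% (illustrative value, not a claim about AI specifically).
```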
It's easiest to negotiate and most useful if both sides have a reasonably good estimate of how many chips the other side has. I think it will probably be easy for intelligence agencies on either side to get an estimate within +/- 5%, which is sufficient for a minimal version of this. This is made better by active efforts and by governments trying to verifiably demonstrate how many chips they have as part of the negotiation.
(This is in part because I don't expect active obfuscation until pretty late, so normal records etc. will exist.)
When I say "misaligned AI takeover", I mean that the acquisition of resources by the AIs would reasonably be considered (mostly) illegitimate. Some fraction of these outcomes could totally include many humans surviving with a subset of resources (though I don't currently expect property rights to remain intact in such a scenario very long term). Some of these outcomes could avoid literal coups or violence while still being illegitimate; e.g., the AIs carry out a carefully planned capture of governments in ways their citizens/leaders would strongly object to if they understood, and things like this drive most of the power acquisition.
I'm not counting it as takeover if "humans never intentionally want to hand over resources to AIs, but due to various effects misaligned AIs end up with all of the resources through trade and not through illegitimate means". For example: we can't make very aligned AIs, but people make various misaligned AIs while knowing they are misaligned and thus must pay them wages; the AIs form a cartel rather than having wages competed down to subsistence levels, and so end up with most of the resources.
I currently don't expect human disempowerment in favor of AIs (that aren't appointed successors) conditional on no misaligned AI takeover, but I agree this is possible; the probability isn't large enough to substantially alter my communication.
People often say US-China deals to slow AI progress and develop AI more safely would be hard to enforce/verify.
However, there are easy-to-enforce deals: each side destroys a fraction of its chips once some level of AI capability is reached. This still seems like it could be helpful, and it's pretty easy to verify.
This is likely worse than a well-executed comprehensive deal, which would allow for productive non-capabilities uses of the compute (e.g., safety or even just economic activity). But it's harder to verify that chips aren't being used to advance capabilities than to check that they have been destroyed.
See also: Daniel's post on "Cull the GPUs".
I agree with the core message in Dario Amodei's essay "The Adolescence of Technology": AI is an epochal technology that poses massive risks and humanity isn't clearly going to do a good job managing these risks.
(Context for LessWrong: I think it seems generally useful to comment on things like this. I expect that many typical LessWrong readers will agree with me and find my views relatively predictable, but I thought it would be good to post here anyway.)
However, I also disagree with (or dislike) substantial parts of this essay:
I agree that:
I appreciate that:
If I had to guess a number based on his statements in the essay, I'd guess he expects a 5% chance of misaligned AI takeover. ↩︎
Dario also makes a somewhat false claim about the capability progression over the last 3 years (likely a mistake about the time elapsed due to sitting on this essay for a while?). He says "Three years ago, AI struggled with elementary school arithmetic problems and was barely capable of writing a single line of code." I think this was basically true of "4 years ago", but not 3. GPT-4 was released a little less than 3 years ago, and GPT-3.5 (text-davinci-003) was already out 3 years ago. Both of these models could certainly write some code and solve a reasonably large fraction of elementary school math problems. ↩︎
Yeah, I don't buy that this model can/should be applied with r=1.6 to the current regime, though I agree it could be general enough in principle.
Sorry, doesn't this web app assume full automation of AI R&D as the starting point? I don't buy that you can just translate this model to the pre-full-automation regime.
On X/Twitter, Jerry Wei (an Anthropic employee working on misuse/safeguards) wrote about why Anthropic ended up thinking that training data filtering isn't that useful for CBRN misuse countermeasures:
An idea that sometimes comes up for preventing AI misuse is filtering pre-training data so that the AI model simply doesn't know much about some key dangerous topic. At Anthropic, where we care a lot about reducing risk of misuse, we looked into this approach for chemical and biological weapons production, but we didn’t think it was the right fit. Here's why.
I'll first acknowledge a potential strength of this approach. If models simply didn't know much about dangerous topics, we wouldn't have to worry about people jailbreaking them or stealing model weights—they just wouldn't be able to help with dangerous topics at all. This is an appealing property that's hard to get with other safety approaches.
However, we found that filtering out only very specific information (e.g., information directly related to chemical and biological weapons) had relatively small effects on AI capabilities in these domains. We expect this to become even more of an issue as AIs increasingly use tools to do their own research rather than rely on their learned knowledge (we tried to filter this kind of data as well, but it wasn't enough assurance against misuse). Broader filtering also had mixed results on effectiveness. We could have made more progress here with more research effort, but it likely would have required removing a very broad set of biology and chemistry knowledge from pretraining, making models much less useful for science (it’s not clear to us that the reduced risk from chemical and biological weapons outweighs the benefits of models helping with beneficial life-sciences work).
Bottom line—filtering out enough pretraining data to make AI models truly unhelpful at relevant topics in chemistry and biology could have huge costs for their usefulness, and the approach could also be brittle as models' ability to do their own research improves.^ Instead, we think that our Constitutional Classifiers approach provides high levels of defense against misuse while being much more adaptable across threat models and easy to update against new jailbreaking attacks.
^The cost-benefit tradeoff could look pretty different for other misuse threats or misalignment threats though, so I wouldn't rule out pre-training filtering for things like papers on AI control or areas that have little-to-no dual-use information.
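To make the approach being discussed concrete, here is a minimal, purely illustrative sketch of what topic-based pretraining filtering could look like. This is not Anthropic's actual pipeline; the patterns, scoring rule, and threshold are hypothetical placeholders, and a real pipeline would presumably use trained classifiers rather than keyword matching. The narrow-vs-broad tradeoff Jerry describes corresponds roughly to how wide the coverage is and how low the drop threshold is set.

```python
# Illustrative sketch of topic-based pretraining data filtering (not
# Anthropic's actual pipeline). Documents are scored for a target topic and
# dropped if the score exceeds a threshold. Patterns and threshold are
# hypothetical placeholders.
import re
from typing import Iterable, Iterator

# Placeholder patterns standing in for "very specific information directly
# related to the topic"; broader filtering would use much wider coverage.
BLOCKED_PATTERNS = re.compile(r"\bnerve agent\b|\bselect agent\b", re.IGNORECASE)

def topic_score(doc: str) -> float:
    """Fraction of sentences in the document that match a blocked pattern."""
    sentences = [s for s in re.split(r"[.!?]\s+", doc) if s.strip()]
    if not sentences:
        return 0.0
    hits = sum(1 for s in sentences if BLOCKED_PATTERNS.search(s))
    return hits / len(sentences)

def filter_corpus(docs: Iterable[str], threshold: float = 0.05) -> Iterator[str]:
    """Yield only documents whose topic score is below the threshold.

    Narrow filtering ~ specific patterns / high threshold (small capability
    cost, but easier to route around); broad filtering ~ wide coverage / low
    threshold (more robust, but removes a lot of benign chemistry/biology,
    which is the tradeoff discussed in the quote above).
    """
    for doc in docs:
        if topic_score(doc) < threshold:
            yield doc

# Example usage (corpus_of_strings is any iterable of document strings):
# kept = list(filter_corpus(corpus_of_strings, threshold=0.05))
```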
It looks as though steering doesn't suffice to mostly eliminate evaluation awareness, and unverbalized evaluation awareness is often substantially higher (this is for Opus 4.6, but presumably the results also apply more generally):