habryka — LessWrong

Running Lightcone Infrastructure, which runs LessWrong and Lighthaven.space. You can reach me at habryka@lesswrong.com.

(I have signed no contracts or agreements whose existence I cannot mention, which I am mentioning here as a canary)

I don't really believe in corrigibility as a thing that could hold up to much of any optimization pressure. It's not impossible to make a corrigible ASI, but my guess is to build a corrigible ASI you first need an aligned ASI to build it for you, and so as a target it's pretty useless.

My guess is that puts me in enough disagreement to qualify for your question?

By the "whitepaper," are you referring to the RSP v2.2 that Zach linked to, or something else?

No, I am referring to this, which Zach linked in his shortform: https://assets.anthropic.com/m/c52125297b85a42/original/Confidential_Inference_Paper.pdf

Also, just to cut a little more to brass tacks here, can you describe the specific threat model that you think they are insufficiently responding to? By that, I don't mean just the threat actor (insiders within their compute provider) and their objective to get weights, but rather the specific class or classes of attacks that you expect to occur, and why you believe that existing technical security + compensating controls are insufficient given Anthropic's existing standards.

I don't have a complaint that Anthropic is taking insufficient defenses here. I have a complaint that Anthropic said they would do something in their RSP that they are not doing.

The threat is really not very complicated, it's basically:

A high-level Amazon executive decides they would like to have a copy of Claude's weights
They physically go to an inference server and manipulate the hardware to dump the weights unencoded to storage (or read out the weights from a memory bus directly, or use one of the many attacks available to you if you have unlimited hardware access)

data center that already is designed to mitigate insider risk

Nobody in computer security I have ever talked to thinks that data in memory in normal cloud data centers would be secure against physical access from high-level executives who run the datacenter. One could build such a datacenter, but the normal ones are not that!

The top-secret cloud you linked to might be one of those, but my understanding is that Claude weights are deployed just to like normal Amazon datacenters, not only the top-secret governance ones.

Anthropic has released their own whitepaper where they call out what kind of changes would need to be required. Can you please engage with my arguments?

I have now also heard from 1-2 Anthropic employees about this. The specific people weren't super up-to-date on what Anthropic is doing here, and didn't want to say anything committal, but nobody I talked to had a reaction that suggested that they thought it was likely that Anthropic is robust to high-level insider threats at compute providers.

Like, if you want you can take a bet with me here, I am happy to offer you 2:1 odds on the opinion of some independent expert we both trust on whether Anthropic is likely robust to that kind of insider threat. I can also start going into all the specific technical reasons, but that would require restating half of the state of the art of computer security, which would be a lot of work. I really think that not being robust here is a relatively straightforward inference that most people in the field will agree with (unless Anthropic has changed operating procedure substantially from what is available to other consumers in their deals with cloud providers, which I currently think is unlikely, but not impossible).

I’m not arguing that this is a particularly likely way for humanity to build a superintelligence by default, just that this is possible, which already contradicts the book’s central statement.

The statement "if anyone builds it, everyone dies" does not mean "there is no way for someone to build it by which not everyone dies".

If you say "if any of the major nuclear power launches most of their nukes, more than one billion people are going to die" it would be very dumb and pedantic to respond with "well, actually, if they all just fired their nukes into the ocean, approximately no one is going to die".

I have trouble seeing this post do something else. Maybe I am missing something?

FWIW, I have seen a decent amount of flip-flopping on this question. My current guess is that most of the time when people say this, they don't mean either of those things but have some other reason for the belief, and choose the justification that they think will be most likely compelling to their interlocutor (like, I've had a bunch of instances of the same person telling me at different times that they were centrally concerned about China because it increased P(AI takeover) and then at a different point in time in a different social context that they were centrally concerned about Chinese values being less good by their lights if optimized).

My sense is Horizon is intentionally a mixture of people who care about x-risk and people who broadly care about "tech policy going well". IMO both are laudable goals.

My guess is Horizon Institute has other issues that make me not super excited about it, but I think this one is a reasonable call.

I mean, computer security without physical access to servers is already an extremely hard problem. Defending against attackers with basically unlimited physical access to devices with your weights on, where those weights need to be at least temporarily decrypted for inference is close to impossible. You are welcome to go and ask security professionals in the field whether they would have any hope defending against a dedicated insider at a major compute provider.

Beyond that, Anthropic has also released a report where they specify in a lot of detail what would be necessary for a zero trust relationship between compute provider and developer here. They call out many things that current compute providers are missing in order to establish such a relationship.

Beyond that, I am sure you can probably also just ask someone senior at Anthropic if you run into them about whether they think Anthropic is robust to attacks like this. My guess is they will just straightforwardly tell you they are not.

Quick heads up: I reviewed this post and thought it was just at the edge of how much it relied on AI written text. I think it would be good if future posts of yours flagged which parts of it were AI written more clearly.

I mean, the examples don't help very much? They just sound like generic targets for AI regulation. They do not actually help me understand what is different about what you are calling for than other generic calls for regulation:

Nuclear command and control: Prohibiting the delegation of nuclear launch authority, or critical command-and-control decisions, to AI systems (a principle already agreed upon by the US and China).
Lethal Autonomous Weapons: Prohibiting the deployment and use of weapon systems used for killing a human without meaningful human control and clear human accountability.
Mass surveillance: Prohibiting the use of AI systems for social scoring and mass surveillance (adopted by all 193 UNESCO member states).
Human impersonation: Prohibiting the use and deployment of AI systems that deceive users into believing they are interacting with a human without disclosing their AI nature.
Cyber malicious use: Prohibiting the uncontrolled release of cyberoffensive agents capable of disrupting critical infrastructure.
Weapons of mass destruction: Prohibiting the deployment of AI systems that facilitate the development of weapons of mass destruction or that violate the Biological and Chemical Weapons Conventions.
Autonomous self-replication: Prohibiting the development and deployment of AI systems capable of replicating or significantly improving themselves without explicit human authorization (Consensus from high-level Chinese and US Scientists).
The termination principle: Prohibiting the development of AI systems that cannot be immediately terminated if meaningful human control over them is lost (based on the Universal Guidelines for AI).

Like, these are the examples. Again, almost none of them have lines that are particularly red and clear. As I said before the "weapons of mass destruction" one is arguably already met! So what does it mean to have it as an example here?

Similarly, AI is totally already used for mass surveillance. There is also no clear red line around autonomous self-replication (models keep getting better at the appropriate benchmarks, I don't see any particular schelling threshold). Many AI systems are already used for human impersonation.

Like, I just don't understand what any of this is supposed to mean. Almost none of these are "red lines". They are just examples of possible bad things that AI could do. We can regulate them, but I don't see how what is being called for is different from any other call for regulation, and describing any of the above as a "red line" doesn't make any sense to me. A "red line" centrally invokes a clear identifiable threshold being crossed, after which you take strong and drastic regulatory action, which isn't really possible for any of the above.

Like, here are 3 more red lines:

AI job replacement: Prohibiting the deployment of AI systems that threaten the jobs of any substantial fraction of the population.
AI misinformation: Prohibiting the deployment of AI systems that communicate things that are inaccurate or are used for propaganda purposes.
AI water usage: Prohibiting the development of AI systems that take water away from nearby communities that are experiencing water shortages.

These are all terrible red lines! They have no clear trigger, and the are terrible policies. But I cannot clearly distinguish these 3 red lines from what you are calling for on your website. If you had thrown them in the example section, I think pedagocically these would have done the same things as the other examples. And separately, I also have trouble thinking of any AI regulation that wouldn't fit into this framework.

Like, you clearly aren't serious about supporting "red lines" in general. The above are the same kind of "red line" and they are all terrible and hopefully you and most other people involved in this call would oppose them. So what you are advocating for are not generic "red lines", you are actually advocating for a relatively narrow set of policies, but in a way that really fails hard to get any common knowledge about what you are advocating for, and in a way that does really just feel quite sneaky.

Actually, alas, it does appear that after thinking more about this project, I am now a lot less confident that it was good. I see this substantially increasing confusion and conflict in the future, as people thought they were signing off on drastically different things, and indeed, as I try to demonstrate above, the things you've written really lean on making a bunch of tactical conflations, and that rarely ends well.

"The central question is what are the lines!" The public call is intentionally broad on the specifics of the lines. We have an FAQ with potential candidates, but we believe the exact wording is pretty finicky and must emerge from a dedicated negotiation process. Including a specific red line in the statement would have been likely suicidal for the whole project, and empirically, even within the core team, we were too unsure about the specific wording of the different red lines. Some wordings were net negative according to my judgment. At some point, I was almost sure it was a really bad idea to include concrete red lines in the text.

At least for me, the way the whole website and call was framed, I kept reading and reading and kept being like "ok, cool, red lines, I don't really know what you mean by that, but presumably you are going to say one right here? No wait, still no. Maybe now? Ok, I give up. I guess it's cool that people think AI will be a big deal and we should do something about it, though I still don't know what the something is that this specific thing is calling for.".

Like, in the absence of specific red lines, or at the very least a specific defnition of what a red line is, this thing felt like this:

An international call for good AI governance. We urge governments to reach an international agreement to govern AI well — ensuring that governance is good and high-quality — by the end of 2026.

And like, sure. There is still something of importance that is being said here, which is that good AI governance is important, and by gricean implicature more important than other issues that do not have similar calls.

But like, man, the above does feel kind of vacuous. Of course we would like to have good governance! Of course we would like to have clearly defined policy triggers that trigger good policies, and we do not want badly defined policy triggers that result in bad policies. But that's hardly any kind of interesting statement.

Like, your definition of "red line" is this:

AI red lines are specific prohibitions on AI uses or behaviors that are deemed too dangerous to permit under any circumstances. They are limits, agreed upon internationally, to prevent AI from causing universally unacceptable risks.

First, I don't really buy the "agreed upon internationally" part. Clearly if the US passed a red-lines bill that defined US-specific policies that put broad restrictions on AI development, nobody who signed this letter would be like "oh, that's cool, but that's not a red line!".

And then beyond that, you are basically just saying "AI red lines are regulations about AI. They are things that we say that AI is not allowed to do. Also known as laws about AI".

And yeah, cool, I agree that we want AI regulation. Lots of people want AI regulation. But having a big call that's like "we want AI regulation!" does kind of fail to say anything. Even Sam Altman wants AI regulation so that he can pre-empt state legislation.

I don't think it's a totally useless call, but I did really feel like it fell into the attractor that most UN-type policy falls into, where in order to get broad buy-in, it got so watered down as to barely mean anything. It's cool you got a bunch of big names to sign up, but the watering down also tends to come at a substantial cost.

LESSWRONG
LW

LESSWRONG
LW

Sequences

Posts

Wikitag Contributions

Comments