All of Yonatan Cale's Comments + Replies

Seems like Unicode officially added a "person being paperclipped" emoji:

An emoji of a person with a paperclip going through their head

Here's how it looks in your browser: 🙂‍↕️

Whether they did this as a joke or to raise awareness of AI risk, I like it!

Source: https://emojipedia.org/emoji-15.1

7Davidmanheim
This one shows the paperclip more clearly for me: 🙂‍↔️

Hey,

On anti-trust laws, see this comment. I also hope to have more to share soon

I asked Claude how relevant this is to protecting something like an H100; here are the parts that seem most relevant from my limited understanding:

 

What the paper actually demonstrates:

  1. Reading (not modifying) data from antifuse memory in a Raspberry Pi RP2350 microcontroller
  2. Using Focused Ion Beam (FIB) and passive voltage contrast to extract information
 

Key differences between this and modifying an H100 GPU:

  1. 3D Transistor Structures: Modern 5nm chips use FinFET or GAAFET 3D structures rather than planar transistors. The critical parts are
... (read more)
1Jonathan_H
That's correct. That said, chip modifications are done on the same FIB machine. The cost estimate still seems accurate to me.

H100s are manufactured on TSMC's "3nm" node (a brand name), which has a gate pitch of 48 nm and a metal pitch of 24 nm. The minimum feature size is 9-12 nm, according to Claude. You are not at the physical limitations:

  • Gallium-based FIB circuit edits can go down to ~7 nm.
  • Helium-based FIB circuit edits (~3-5x more expensive than Gallium FIB) can go down even further, to 1-2 nm.

I'd attack the silicon from the backside of the wafer. Currently, nearly no chip has conductive traces or protection mechanisms on the backside, and you can get directly to the transistors & gates without needing to penetrate the interconnect structure on the front side of the wafer. (Though I want to flag that I heard there is a push to have power distribution on the backside of the wafer, so this might become harder in 2-6 years.)

I also want to flag that the attack we are discussing here (modifying the logic within the H100 die) is the most advanced invasive attack I can currently think of. Another, simpler attack is to read out the secret key used for authentication. Or, even simpler, you could replace the CEC1736 Root-of-Trust chip on the H100 PCB (which authenticates the H100 onboard flash) with a counterfeit one. A paper that further elaborates on the attack vectors is coming out in 2-3 months.

Thanks! Is this true for a somewhat-modern chip that has at least some slight attempt at defense, or more like the chip on a Raspberry Pi?

1Yonatan Cale
I asked Claude how relevant this is to protecting something like an H100; here are the parts that seem most relevant from my limited understanding:

What the paper actually demonstrates:

  1. Reading (not modifying) data from antifuse memory in a Raspberry Pi RP2350 microcontroller
  2. Using Focused Ion Beam (FIB) and passive voltage contrast to extract information

Key differences between this and modifying an H100 GPU:

  1. 3D Transistor Structures: Modern 5nm chips use FinFET or GAAFET 3D structures rather than planar transistors. The critical parts are buried within the structure, making them fundamentally more difficult to access without destroying them.
  2. Atomic-Scale Limitations: At 5nm, we're approaching atomic limits (silicon atoms are ~0.2nm). The physics of matter at this scale creates fundamental boundaries that better equipment cannot overcome.
  3. Ion Beam Physics: Even with perfect equipment, ion beams create interaction volumes and damage zones that become proportionally larger compared to the target features at smaller nodes.

(Could you link to the context?)

Patching security problems in big old organizations involves problems that go well beyond "looking at code and changing it", especially if aiming for a "strong" solution like formal verification.

TL;DR: Political problems, code that makes no sense, and problems that would be easy to fix even with a simple LLM that isn't specialized in improving security.

 

The best public resource I know of about this is Recoding America.

Some examples iirc:

  1. Not having a clear primary key to identify people with.
  2. Having a website (a form) that theoretically works but doesn't r
... (read more)

I want the tool to proactively suggest things while working on the document, optimizing for "low friction for getting lots of comments from the LLM". The tool you suggested does optimize for this property very well

2Milan W
I see. Friction management / affordance landscaping is indeed very important for interface UX design.
  1. This is very cool, thanks!
    1. I'm tempted to add Claude support
  2. It isn't exactly what I'm going for. Example use cases I have in mind:
    1. "Here's a list of projects I'm considering working on, and I'm adding curxes/considerations for each"
    2. "Here's my new alignment research agenda" (can an AI suggest places where this research is wrong? Seems like checking this would help the Control agenda?)
    3. "Here's a cost-effectiveness analysis of an org"
2Milan W
Seems like just pasting into the chat context / adding as attachments the relevant info on the default Claude web interface would work fine for those use cases.

Things I'd suggest to an AI lab CISO if we had 5 minutes to talk

 

1 minute version:

  • I think there are projects that can prepare the lab for moving to an air gapped network (protecting more than model weights) which would be useful to start early, would have minimal impact on developer productivity, and could be (to some extent) delegated[1]

 

Extra 4 minutes:

Example categories of such projects:

  1. Projects that take serial time but can be done without the final stage that actually hurts developer productivity
    1. Toy example: Add extra ethernet cables to the
... (read more)

I'm looking for an AI tool which feels like Google Docs but has an LLM proactively commenting/suggesting things.

(Is anyone else interested in something like this?)

1Rasool
I met someone in SF doing this but cannot remember the name of the company! If I remember, I'll let you know. One idea I thought would be cool related to this is to have several LLMs with different 'personalities', each giving different kinds of feedback. E.g. a 'critic', an 'aesthete', a 'layperson', so just like in Google Docs where you get comments from different people, here you can get inline feedback from different kinds of readers.
4Milan W
The nearest thing I can think of off the top of my head is the Pantheon interface. Probably more unconventional than what you had in mind, though.

This post helped me notice I have incoherent beliefs:

  1. "If MAGMA self-destructs, the other labs would look at it with confusion/pity and keep going. That's not a plan"
  2. "MAGMA should self-destruct now even if it's not leading!"

I think I've been avoiding thinking about this.

 

So what do I actually expect?

If OpenAI (currently in the lead) would say "our AI did something extremely dangerous, this isn't something we know how to contain, we are shutting down and are calling other labs NOT to train over [amount of compute], and are not discussing the algorithm p... (read more)

OpenAI already have this in their charter:

We are concerned about late-stage AGI development becoming a competitive race without time for adequate safety precautions. Therefore, if a value-aligned, safety-conscious project comes close to building AGI before we do, we commit to stop competing with and start assisting this project. We will work out specifics in case-by-case agreements, but a typical triggering condition might be “a better-than-even chance of success in the next two years.”

Similar opinion here, also noting they didn't run red-teaming and persuasion evals on the actually-final-version:

https://x.com/teortaxesTex/status/1885401111659413590

2Vladimir_Nesov
Asking for this is a bit pointless, since even after the actually-final-version there will be a next update for which non-automated evals won't be redone, so it's equally reasonable to do non-automated evals only on some earlier version rather than the actually-final one.

I think Control has similar problems to RLHF, where both might fail to generalize out of distribution in similar ways.

This seems important because Control has a central goal of being a fallback for that kind of failure mode. 

I want to share my thoughts, including something nice Control does about this problem that I think RLHF could easily incorporate (investigate failures instead of always training against them).

What do I mean by Control generalizing out of distribution: 

Our Control experiment might involve giving a model 100 leetcode problems, ... (read more)

An opinion from a former lawyer 

[disclaimers: they're not an anti trust lawyer and definitely don't take responsibility for this opinion, nor do I. This all might maybe be wrong and we need to speak to an actual anti-trust lawyer to get certainty. I'm not going to put any more disclaimers here, I hope I'm not also misremembering something]

So,

  1. Having someone from the U.S government sign that they won't enforce anti trust laws isn't enough (even if the president signs), because the (e.g) president might change their mind, or the next president might enfo
... (read more)

To the extent you think Anthropic is a good actor, you shouldn't be pressuring good actors like them to take actions that will make them differentially less competitive than worse actors

I think an important part of how one becomes (and stays) a good actor is by being transparent about things like this.

 

Anthropic could at least avoid making substantially misleading statements

Yes. But also, I'm afraid that Anthropic might solve this problem by just making fewer statements (which seems bad). Still, yes

3ryan_greenblatt
Making more statements would also be fine! I wouldn't mind if there were just clarifying statements even if the original statement had some problems. (To try to reduce the incentive for fewer statements, I criticized other labs for not having policies at all.)

Hey,

In the article, you measured the MONA setup against a pure RL agent.

I'm curious about measuring MONA against the overseer-AI picking the next move directly[1]: The overseer-AI probably won't[2] reward hack more than the MONA setup, but it's unclear to me if it would also have worse performance.

 

I could imagine discovering the myopic MONA agent converging on

  1. Picking whatever the overseer-AI would pick (since those actions would seem "obviously correct" to the overseer AI and result in the most reward)
  2. Picking actions that seem impressive but are
... (read more)
2Rohin Shah
In our experiments on both Test-driven development and Loan applications you can see that the ground truth reward goes up with MONA. The ground truth reward at step 0 represents the reward the agent would obtain if it were frozen. So this looks like your option (3), assuming that the overseer and the agent were identical. (Partly this is because we are also mixing in non-AI sources of feedback, like whether the code runs and passes the tests, and whether the AI made the correct decision on the loan, but I think this is a realistic model of future AI development.) In Test-driven development the argument above isn't quite correct, because we prompted the agent to be a bad programmer but didn't do this with the reward, so the overseer is "stronger" than the agent. However this was just because the agent is already very strongly finetuned to be good at coding so there was no headroom to climb, and we wanted to demonstrate that MONA would improve things if there was headroom to climb. I would bet that if we had a powerful model that wasn't yet finetuned strongly for coding, then we would once again see your option (3). The rewards are quite easy to provide -- just whether an individual test is valid and correct -- so I think a less capable model should be able to provide them, while still getting the benefits we see in the experiments we did run.
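To make the comparison concrete, here is a toy sketch of the three setups being discussed: ordinary outcome-based RL, MONA's myopic per-step reward, and the baseline of letting the frozen overseer act directly. The `overseer` and `env` interfaces are hypothetical, not the paper's code.

```python
# Toy sketch with hypothetical interfaces; nothing here is from the MONA paper's code.

def outcome_rl_reward(trajectory, env):
    # Ordinary RL: reward depends on where the whole trajectory ends up,
    # so the agent is optimized against long-horizon consequences.
    return env.final_score(trajectory)

def mona_reward(state, action, overseer):
    # MONA-style myopic reward: the agent is scored per step by a frozen
    # overseer's approval of this single action, with no credit flowing
    # back from later outcomes.
    return overseer.rate_step(state, action)

def overseer_as_policy(state, candidate_actions, overseer):
    # The baseline asked about above: skip the trained agent entirely and
    # let the frozen overseer pick whichever action it rates highest.
    return max(candidate_actions, key=lambda a: overseer.rate_step(state, a))
```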

nit: I wouldn't use a prediction market as an overseer because markets are often uninterpretable to humans, which would miss some of the point[1].

 

  1. ^

    "we show how to get agents whose long-term plans follow strategies that humans can predict". But maybe no single human actually understands the strategy. Or maybe the traders are correctly guessing that the model's steps will somehow lead to whatever is defined as a "good outcome", even if they don't understand how, which has similar problems to the RL reward from the future that you're trying to avoid.

2Rohin Shah
Discussed in the paper in Section 6.3, bullet point 3. Agreed that if you're using a prediction market it's no longer accurate to say that individual humans understand the strategy.

For a simple task like booking a restaurant, we could just ask the (frozen) overseer-AI to pick[1] actions, no?

The interesting application of MONA seems to be when the myopic RL agent is able to produce better suggestions than the overseer

 

Edit: I elaborated

  1. ^

    Plus maybe let the overseer observe the result and say "oops" and roll back that action, if we can implement a rollback in this context

2Rohin Shah
If it were as simple as "just ask an LLM to choose actions" someone would have deployed this product a while ago. But in any case I agree this isn't the most interesting case for MONA, I talked about it because that's what Daniel asked about.

US Gov isn't likely to sign: Seems right.

OpenAI isn't likely to sign: Seems right.

Still, I think this letter has value, especially if it has "P.S. We're making this letter because we think if everyone keeps racing then there's a noticeable risk of everyone dying. We think it would be worse if only we stop, but having everyone stop would be the safest, and we think this opinion of ours should be known publicly"

Dario said (Nov 2024):

"I never liked those words [P(DOOM)], I think they're kinda wired, my view is we should measure risks as they come up, and in the meantime we should get all the economic benefits that we could get. And we should find ways to measure the risks that are effective, but are minimally disruptive to the amazing economic process that we see going on now that the last thing we want to do is slow down" [lightly edited for clarity, bold is mine]

If he doesn't believe this[1], I think he should clarify.

Hopefully cold take: People should say what ... (read more)

Changing the logic of chips is possible:

https://www.nanoscopeservices.co.uk/fib-circuit-edit/

Multi Layer Fix in FIB Circuit Edit

h/t @Jonathan_H from TemperSec

Open question: How expensive is this, and specifically can this be done at scale for the chips of an entire data center?

2Jonathan_H
TL;DR: Less than you think, likely < 1000 USD. The cost for renting such a machine (FIB) is 100-350 USD/h (depending on which university lab you choose). Some universities also offer to have one of their staff do the work for you (e.g., 165 USD/h at the University of Washington). The duration for a single modification is less than 1 hour. Additionally, there is some non-FIB preparation time, which seems to be ~1 day if you do it for one chip; see here: https://arxiv.org/pdf/2501.13276. I am currently mentoring a SPAR project that calculates more accurate numbers and maps them to specific attack scenarios. We plan to release our results in 2-3 months.
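A back-of-envelope sketch of the "at scale" question using the numbers above; the datacenter size and the single-FIB-machine / single-prep-line assumption are mine, not from the thread:

```python
# Rough cost/time estimate under the assumptions stated above.
fib_usd_per_hour_low, fib_usd_per_hour_high = 100, 350  # university FIB rental range
beam_hours_per_chip = 1.0                               # "less than 1 hour" per modification
prep_days_per_chip = 1.0                                # non-FIB preparation per chip
num_chips = 10_000                                      # assumed datacenter scale

cost_low = fib_usd_per_hour_low * beam_hours_per_chip * num_chips
cost_high = fib_usd_per_hour_high * beam_hours_per_chip * num_chips
serial_days = num_chips * (beam_hours_per_chip / 24 + prep_days_per_chip)

print(f"FIB beam time cost: ${cost_low:,.0f} - ${cost_high:,.0f}")
print(f"Fully serial duration: ~{serial_days:,.0f} days")
# Roughly $1M-$3.5M of beam time, but ~28 years if done strictly serially, so
# feasibility at datacenter scale hinges on how many FIB machines and prep
# lines an attacker can run in parallel.
```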

I don't understand Control as aiming to align a super intelligence:

  1. Control isn't expected to scale to ASI (as you noted, also see "The control approach we're imagining won't work for arbitrarily powerful AIs" here)
  2. We don't have a plan on how to align an ASI using Control afaik
    1. Ryan said around March 2024: "On (1) (not having a concrete plan for what to do with smarter systems), I think we should get such a plan". I'm still looking, this seems important.
      1. Edit: Maybe I'm wrong and Redwood does think that getting an ASI-alignment plan out of such an AI is possi
... (read more)
9lisathiergart
Speaking in my personal capacity as research lead of TGT (and not on behalf of MIRI), I think work in this direction is potentially interesting. One difficulty with work like this is anti-trust laws, which I am not familiar with in detail, but they serve to restrict industry coordination that restricts further development / competition. It might be worth looking into how exactly anti-trust laws apply to this situation, and if there are workable solutions. Organisations that might be well placed to carry out work like this might be the Frontier Model Forum and affiliated groups; I also have some ideas we could discuss in person. I also think there might be more legal leeway for work like this to be done if it's housed within organisations (government or NGOs) that are officially tasked with defining industry standards or similar.

We could also ask if these situations exist ("is there any funder you have that you didn't disclose?" and so on, especially around NDAs), and Epoch could respond with Yes/No/Can'tReply[1].

Also seems relevant for other orgs.

This would only patch the kind of problems we can easily think about, but it seems to me like a good start

 

  1. ^

    I learned that trick from hpmor!

Sounds like a legit pushback that I'd add to the letter?

"if all other labs sign this letter AND the U.S government approves this agreement, then we'll stop" ?

1Yonatan Cale
An opinion from a former lawyer

[disclaimers: they're not an anti trust lawyer and definitely don't take responsibility for this opinion, nor do I. This all might maybe be wrong and we need to speak to an actual anti-trust lawyer to get certainty. I'm not going to put any more disclaimers here, I hope I'm not also misremembering something]

So,

  1. Having someone from the U.S government sign that they won't enforce anti trust laws isn't enough (even if the president signs), because the (e.g.) president might change their mind, or the next president might enforce it retroactively. This is similar to the current situation with TikTok, where Trump said he wouldn't enforce the law that prevents Google from having TikTok on their app store, but Google still didn't put TikTok back, probably because they're afraid that someone will change their mind and retroactively enforce the law.
  2. I asked if the government (e.g. the president) could sign "we won't enforce this, and if we change our mind we'll give a 3 month notice".
    1. The former lawyer's response was to consider whether, in a case where the president changed their mind immediately, this signature would hold up in court. He thinks not, but couldn't remember an example of something similar happening (which seems relevant).
  3. If the law changes (for example, to exclude this letter), that works.
    1. (but it's hard to pass such changes through congress)
  4. If the letter is conditional on the law changing, that seems ok.

My interpretation of this: It seems probably possible to find a solution where signing this letter is legal, but we'd have to consult with an anti-trust lawyer. [reminder that this isn't legal advice, isn't confident, is maybe misremembered, and so on]
2Alice Blair
Right now, the USG seems to very much be in [prepping for an AI arms race] mode. I hope there's some way to structure this that is both legal and does not require the explicit consent of the US government. I also somewhat worry that the US government does their own capabilities research, as hinted at in the "datacenters on federal lands" EO. I also worry that OpenAI's culture is not sufficiently safety-minded right now to actually sign onto this; most of what I've been hearing from them is accelerationist.

Phishing emails might have bad text on purpose, so that security-aware people won't click through, because the next stage often involves speaking to a human scammer who prefers only speaking to people[1] who have no idea how to avoid scams.

(did you ever wonder why the generic phishing SMS you got was so bad? Couldn't they proofread their one SMS? Well, sometimes they can't, but sometimes it's probably on purpose)

This tradeoff could change if AIs could automate the stage of "speaking to a human scammer".

But also, if that stage isn't automated, then I'... (read more)

I'm trying the same! You have my support

My drafty[1] notes trying to understand AI Control, friendly corrections are welcome:

 

“Control” is separate from “Alignment”: In Alignment, you try to get the model to have your values. “Control” assumes[2] the alignment efforts failed, and that the model is sometimes helping out (as it was trained to do), but it wants something else, and it might try to “betray” you at any moment (aka scheme).

The high level goal is to still get useful work out of this AI.

[what to do with this work? below]

 

In scope: AIs that could potentially cause seri... (read more)

Do we want to put out a letter for labs to consider signing, saying something like "if all other labs sign this letter then we'll stop"?

 

I heard lots of lab employees hope the other labs would slow down.

 

I'm not saying this is likely to work, but it seems easy and maybe we can try the easy thing? We might end up with a variation like "if all other labs sign this AND someone gets capability X AND this agreement will be enforced by Y, then we'll stop until all the labs who signed this agree it's safe to continue". Or something else. It would be nic... (read more)

1Yonatan Cale
OpenAI already have this in their charter:
9ryan_greenblatt
In practice, I don't think any currently existing RSP-like policy will result in a company doing this as I discuss here.
1Yonatan Cale
Maybe a good fit for Machine Intelligence Research Institute (MIRI) to flesh out and publish?
5MondSemmel
Law question: would such a promise among businesses, rather than an agreement mandated by / negotiated with governments, run afoul of laws related to monopolies, collusion, price gouging, or similar?

If verification is placed sufficiently all over the place physically, it probably can't be circumvented

Thanks! Could you say more about your confidence in this?

 

the chip needs some sort of persistent internal clocks or counters that can't be reset

Yes, specifically I don't want an attacker to reliably be able to reset it to whatever value it had when it sent the last challenge.

If the attacker can only reset this memory to 0 (for example, by unplugging it) - then the chip can notice that's suspicious.

Another option is a reliable wall clock (though this ... (read more)

2Vladimir_Nesov
Training frontier models needs a lot of chips, situations where "a chip notices something" (and any self-destruct type things) are unimportant because you can test on fewer chips and do it differently next time. Complicated ways of circumventing verification or resetting clocks are not useful if they are too artisan, they need to be applied to chips in bulk and those chips then need to be able to work for weeks in a datacenter without further interventions (that can't be made into part of the datacenter). AI accelerator chips have 80B+ transistors, much more than an instance of certificate verification circuitry would need, so you can place multiple of them (and have them regularly recheck the certificates). There are EUV pitch metal connections several layers deep within a chip, you'd need to modify many of them all over the chip without damaging the layers above, so I expect this to be completely infeasible to do for 10K+ chips on general principle (rather than specific knowledge of how any of this works). For clocks or counters, I guess AI accelerators normally don't have any rewritable persistent memory at all, and I don't know how hard it would be to add some in a way that makes it too complicated to keep resetting automatically.
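A toy sketch of the replay problem with counters (all names and numbers are illustrative assumptions): a certificate authorizes running up to some counter value, and a monotonic counter makes old certificates useless, unless the attacker can reset it, which is exactly the hard part discussed above.

```python
# Illustrative only; real hardware would keep the counter in tamper-resistant
# persistent storage and bind certificates to it cryptographically.
class MonotonicCounterChip:
    def __init__(self):
        self._counter = 0

    def tick(self):
        # Only ever increases; there is deliberately no reset method.
        self._counter += 1

    def accept_certificate(self, valid_until_counter: int) -> bool:
        # A legitimate-but-stale certificate stops working once the counter
        # has moved past its validity bound.
        return valid_until_counter > self._counter

chip = MonotonicCounterChip()
old_certificate_bound = 10            # issued a while ago
for _ in range(20):                   # time passes / work gets done
    chip.tick()
print(chip.accept_certificate(old_certificate_bound))  # False: replay fails
```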

I agree that we won't need full video streaming, it could be compressed (most of the screen doesn't change most of the time), but I gave that as an upper bound.

If you still run local computation, you lose out on some of the advantages I mentioned.

(If remote vscode is enough for someone, I definitely won't be pushing back)

Some hands-on experience with software development without an internet connection, from @niplav, which seems somewhat relevant:

https://www.lesswrong.com/posts/jJ9Hx8ETz5gWGtypf/how-do-you-deal-w-super-stimuli?commentId=3KnBTp6wGYRfgzyF2

Off switch / flexheg / anti tampering:

Putting the "verifier" on the same chip as the GPU seems like an approach worth exploring as an alternative to anti-tampering (which seems hard)

I heard[1] that changing the logic running on a chip (such as subverting an off-switch mechanism) without breaking the chip seems potentially hard[2] even for a nation state.

If this is correct (or can be made correct?) then this seems much more promising than having a separate verifier-chip and gpu-chip with anti tampering preventing them from being separated (which s... (read more)

2Yonatan Cale
Changing the logic of chips is possible: https://www.nanoscopeservices.co.uk/fib-circuit-edit/ h/t @Jonathan_H from TemperSec Open question: How expensive is this, and specifically can this be done at scale for the chips of an entire data center?
3Vladimir_Nesov
Chips have 15+ metal interconnect layers, so if verification is placed sufficiently all over the place physically, it probably can't be circumvented. I'm guessing a more challenging problem is replay attacks, where the chip needs some sort of persistent internal clocks or counters that can't be reset to start in order to repeatedly reuse old (but legitimate) certificates that enabled some computations at some point in the past.

For profit AI Security startup idea:

TL;DR: A laptop that is just a remote desktop (+ why this didn't work before and how to fix that)

 

Why this is nice for AI Security:

  1. Reduces the number of GPUs that will be sent to customers
  2. Somewhat better security for this laptop since it's a bit more like defending a cloud computer. Maybe labs would use this to make the employee computers more secure?

 

Network spikes: A reason this didn't work before and how to solve it

The problem: Sometimes the network will be slow for a few seconds. It's really annoying if th... (read more)

2Nathan Helm-Burger
I think this gets a lot easier if you drop the idea of a 'full remote computer' and instead embrace the idea of just the key data moving. More like the remote VS Code server or Jupyter Notebook server, being accessed from a Chromebook. All work files would stay saved there, and all experiments run from the server (probably by being sent as tasks to yet a third machine). Locally, you couldn't save any files, but you could do (for instance) web browsing. The web browsing could be made extra secure in some way.

For-profit startup idea: Better KYC for selling GPUs

I heard[1] that right now, if a company wants to sell/resell GPUs, they don't have a good way to verify that selling to some customer wouldn't violate export controls, and that this customer will (recursively) also keep the same agreement.

There are already tools for KYC in the financial industry. They seem to accomplish their intended policy goal pretty well (economic sanctions by the U.S aren't easy for nation states to bypass), and are profitable enough that many companies exist that give KYC servi... (read more)

Anti Tampering in a data center you control provides very different tradeoffs

 

I'll paint a picture for how this could naively look: 

We put the GPUs in something equivalent to a military base. Someone can still break in, steal the GPU, and break the anti tampering, but (I'm assuming) using those GPUs usefully would take months, and meanwhile (for example), a war could start.

 

How do the tradeoffs change? What creative things could we do with our new assumptions?

  1. Tradeoffs we don't really care about anymore:
    1. We don't need the anti tampering to re
... (read more)

Do you think this hints at "doing engineering in an air gapped network can be made somewhat reasonable"?

(I'm asking in the context of securing AI labs' development environments. Random twist, I know)

5niplav
I think especially if you have a competent coding LLM in the air-gapped network, then probably yes, if you mean software engineering. The biggest bottlenecks to me look like having nice searchable documentation for libraries the LLM doesn't know well—the documentation for most projects can't easily be downloaded, and the ones for which it can be downloaded easily aren't in a nicely searchable format. Gitbook isn't universal (yet. Growth mindset). (Similar with the kiwix search—you need to basically know what you're looking for, or you won't find it. E.g. I was trying to think of the Nazi Bureaucrat who in the Sino-Japanese war in the 1930s had rescued many Chinese from getting killed in warcrimes, but couldn't think of it—until LLaMa-2-13b (chat) told me the name was John Rabe—but I've despaired over slightly more obscure questions.) A well-resourced actor could try to clone the Stackoverflow content and their search, or create embeddings for a ton of documentation pages of different software packages. That'd make it much nicer. Also, a lot of software straight up doesn't work without an internet connection—see e.g. the incident where people couldn't do arithmetic in Elm without an internet connection. Thankfully it's the exception rather than the norm.

Oh yes the toll unit needs to be inside the GPU chip imo.

 

why do I let Nvidia send me new restrictive software updates?

Alternatively the key could be in the central authority that is supposed to control the off switch. (same tech tho)

 

Why don't I run my GPUs in an underground bunker, using the old most broken firmware?

Nvidia (or whoever signs authorization for your GPU to run) won't sign it for you if you don't update the software (and send them a proof you did it using similar methods, I can elaborate).

The interesting/challenging technical parts seem to me:

1. Putting the logic that turns off the GPU (what you called "the toll unit") in the same chip as the GPU and not in a separate chip

2. Bonus: Instead of writing the entire logic (challenge response and so on) in advance, I think it would be better to run actual code, but only if it's signed (for example, by Nvidia), in which case they can send software updates with new creative limitations, and we don't need to consider all our ideas (limit bandwidth? limit gps location?) in advance.

 

Things that s... (read more)
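A minimal sketch of point 2 in the comment above ("run actual code, but only if it's signed"), using ed25519 as a stand-in for whatever signature scheme real hardware would use; the key handling is illustrative, not a claim about how Nvidia does or would do it:

```python
# Illustrative sketch: the chip holds the vendor's public key and refuses to
# run any update that doesn't carry a valid signature over the firmware blob.
from cryptography.exceptions import InvalidSignature
from cryptography.hazmat.primitives.asymmetric import ed25519

vendor_private_key = ed25519.Ed25519PrivateKey.generate()  # held only by the vendor
VENDOR_PUBLIC_KEY = vendor_private_key.public_key()        # burned into the chip

def load_firmware(blob: bytes, signature: bytes) -> bytes:
    try:
        VENDOR_PUBLIC_KEY.verify(signature, blob)
    except InvalidSignature:
        raise RuntimeError("unsigned or tampered update; refusing to run")
    return blob

update = b"firmware v2: new bandwidth / location limits"
load_firmware(update, vendor_private_key.sign(update))     # accepted only because it's signed
```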

4robo
Thanks! I'm not a GPU expert either. The reason I want to spread the toll units inside the GPU itself isn't to turn the GPU off -- it's to stop replay attacks. If the toll thing is in a separate chip, then the toll unit must have some way to tell the GPU "GPU, you are cleared to run". To hack the GPU, you just copy that "cleared to run" signal and send it to the GPU. The same "cleared to run" signal must always make the GPU work, unless there is something inside the GPU to make sure it won't accept the same "cleared to run" signal twice. That's the point of the mechanism I outline -- a way to make it so the same "cleared to run" signal for the GPU won't work twice.

Hmm okay, but why do I let Nvidia send me new restrictive software updates? Why don't I run my GPUs in an underground bunker, using the old, most-broken firmware?
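A toy sketch of the anti-replay idea (an assumed design for illustration, not Nvidia's actual scheme): the authorization is only valid for the fresh nonce the toll unit just generated, so copying an old "cleared to run" signal doesn't work.

```python
import os
from cryptography.exceptions import InvalidSignature
from cryptography.hazmat.primitives.asymmetric import ed25519

authority_key = ed25519.Ed25519PrivateKey.generate()   # held by the billing authority
AUTHORITY_PUBLIC_KEY = authority_key.public_key()      # baked into the toll unit

class TollUnit:
    def __init__(self):
        self._nonce = None

    def issue_challenge(self) -> bytes:
        self._nonce = os.urandom(16)                   # fresh random value per request
        return self._nonce

    def redeem(self, nonce: bytes, signature: bytes) -> bool:
        if self._nonce is None or nonce != self._nonce:
            return False                               # stale or replayed authorization
        self._nonce = None                             # single use
        try:
            AUTHORITY_PUBLIC_KEY.verify(signature, nonce)
        except InvalidSignature:
            return False
        return True

toll = TollUnit()
challenge = toll.issue_challenge()
authorization = authority_key.sign(challenge)          # "cleared to run", bound to this nonce
print(toll.redeem(challenge, authorization))           # True: works once
print(toll.redeem(challenge, authorization))           # False: replaying the same signal fails
```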

I love the direction you're going with this business idea (and with giving Nvidia a business incentive to make "authentication" that is actually hard to subvert)!

I can imagine reasons they might not like this idea, but who knows. If I can easily suggest this to someone from Nvidia (instead of speculating myself), I'll try

I'll respond to the technical part in a separate comment because I might want to link to it >>

2Yonatan Cale
The interesting/challenging technical parts seem to me:

  1. Putting the logic that turns off the GPU (what you called "the toll unit") in the same chip as the GPU and not in a separate chip
  2. Bonus: Instead of writing the entire logic (challenge response and so on) in advance, I think it would be better to run actual code, but only if it's signed (for example, by Nvidia), in which case they can send software updates with new creative limitations, and we don't need to consider all our ideas (limit bandwidth? limit GPS location?) in advance.

Things that seem obviously solvable (not the hard part):

  3. The cryptography
  4. Turning off a GPU somehow (I assume there's no need to spread many toll units, but I'm far from a GPU expert so I'd defer to you if you are)

More on starting early:

Imagine a lab starts working in an air gapped network, and one of the 1000 problems that comes up is working-from-home.

If that problem comes up now (early), then we can say "okay, working from home is allowed", and we'll add that problem to the queue of things that we'll prioritize and solve. We can also experiment with it: Maybe we can open another secure office closer to the employee's house, would they like that? If so, we could discuss fancy ways to secure the communication between the offices. If not, we can try something else.

I... (read more)

Yeah it will compromise productivity.

I hope we can make the compromise not too painful. Especially if we start early and address all the problems that will come up before we're in the critical period where we can't afford to mess up anymore.

I also think it's worth it

1Yonatan Cale
More on starting early: Imagine a lab starts working in an air gapped network, and one of the 1000 problems that comes up is working-from-home. If that problem comes up now (early), then we can say "okay, working from home is allowed", and we'll add that problem to the queue of things that we'll prioritize and solve. We can also experiment with it: Maybe we can open another secure office closer to the employee's house, would they like that? If so, we could discuss fancy ways to secure the communication between the offices. If not, we can try something else. If that problem comes up when security is critical (if we wait), then the solution will be "no more working from home, period". The security staff will be too overloaded with other problems to solve, not available to experiment with having another office nor to sign a deal with Cursor.

I don't think this is too nuanced for a lab that understands the importance of security here and wants a good plan (?)

Ah, interesting

Still, even if some parts of the architecture are public, it seems good to keep many details private, details that took the lab months/years to figure out? Seems like a nice moat

5ryan_greenblatt
Yes, ideally, but it might be hard to have a relatively more nuanced message. Like in the ideal world an AI company would have algorithmic secret security, disclose things that passed cost-benefit for disclosure, and generally do good stuff on transparency.

Some hard problems with anti tampering and their relevance for GPU off-switches

 

Background on GPU off switches:

It would be nice if a GPU could be limited[1] or shut down remotely by some central authority such as the U.S government[2] in case there’s some emergency[3].

 

This shortform is mostly replying to ideas like "we'll have a CPU in the H100[4] which will expect a signed authentication and refuse to run otherwise. And if someone will try to remove the CPU, the H100's anti-tampering mechanism will self-destruct (melt? explode?)"... (read more)

3Yonatan Cale
Anti Tampering in a data center you control provides very different tradeoffs

I'll paint a picture for how this could naively look:

We put the GPUs in something equivalent to a military base. Someone can still break in, steal the GPU, and break the anti tampering, but (I'm assuming) using those GPUs usefully would take months, and meanwhile (for example), a war could start.

How do the tradeoffs change? What creative things could we do with our new assumptions?

  1. Tradeoffs we don't really care about anymore:
    1. We don't need the anti tampering to reliably work (it's nice if it works, but it now becomes "defense in depth")
    2. Slowing down the attacker is already very nice
    3. Our box can be maintainable
    4. We don't have to find all bugs in advance
    5. ...
  2. "Noticing the breach" becomes an important assumption
    1. Does our data center have cameras? What if they are hacked? And so on
      1. (An intuition I hope to share: This problem is much easier than "preventing the breach")
    2. It doesn't have to be in "our" data center. It could be in a "shared" data center that many parties monitor.
    3. Any other creative solution to notice breaches might work
    4. How about spot inspections to check if the box was tampered with?
      1. "Preventing a nation state from opening a box and closing it without any visible change, given they can do whatever they want with the box, and given the design is open source" seems maybe very hard, or maybe a solved problem.
  3. "If a breach is noticed then something serious happens" becomes an important assumption
    1. Are the stakeholders on board?

Things that make me happy here:

  1. Fewer hard assumptions to make
  2. Fewer difficult tradeoffs to balance
  3. The entire project requires less world-class cutting-edge engineering.
2robo
I used to assume disabling a GPU in my physical possession would be impossible, but now I'm not so sure. There might be ways to make bypassing GPU lockouts on the order of difficulty of manufacturing the GPU (requiring nanoscale silicon surgery). Here's an example scheme:

Nvidia changes their business model from selling GPUs to renting them. The GPU is free, but to use your GPU you must buy Nvidia Dollars from Nvidia. Your GPU will periodically call Nvidia headquarters and get an authorization code to do 10^15 more floating point operations. This rental model is actually kinda nice for the AI companies, who are much more capital constrained than Nvidia. (Lots of industries have moved from a buy model to a rent model, e.g. airplane engines.)

Question: "But I'm an engineer. How (the hell) could Nvidia keep me from hacking a GPU in my physical possession to bypass that Nvidia dollar rental bullshit?"

Answer: through public key cryptography and the fact that semiconductor parts are very small and modifying them is hard.

In dozens to hundreds or thousands of places on the GPU, Nvidia places toll units that block signal lines (like ones that pipe floating point numbers around) unless the toll units believe they have been paid with enough Nvidia dollars. The toll units have within them a random number generator, a public key ROM unique to that toll unit, a 128 bit register for a secret challenge word, elliptic curve cryptography circuitry, and a $$$ counter which decrements every time the clock or signal line changes.

If the $$$ counter is positive, the toll unit is happy and will let signals through unabated. But if the $$$ counter reaches zero,[1] the toll unit is unhappy and will block those signals.

To add to the $$$ counter, the toll unit (1) generates a random secret <challenge word>, (2) encrypts the secret using that toll unit's public key, (3) sends <encrypted secret challenge word> to a non-secure part of the GPU,[2] which (4) through driver softwar

"Protecting model weights" is aiming too low, I'd like labs to protect their intellectual property too. Against state actors. This probably means doing engineering work inside an air gapped network, yes.

I feel it's outside the Overton Window to even suggest this and I'm not sure what to do about that except write a lesswrong shortform I guess.

 

Anyway, common pushbacks:

  1. "Employees move between companies and we can't prevent them sharing what they know": In the IDF we had secrets in our air gapped network which people didn't share because they understood
... (read more)
3Yonatan Cale
Some hands-on experience with software development without an internet connection, from @niplav, which seems somewhat relevant: https://www.lesswrong.com/posts/jJ9Hx8ETz5gWGtypf/how-do-you-deal-w-super-stimuli?commentId=3KnBTp6wGYRfgzyF2
3Bary Levy
I think that as AI tools become more useful, working in an air gapped network is going to require a larger compromise in productivity. Maybe AI labs are the exception here, as they can deploy their own products in the air gapped network, but that depends on how much of the productivity gains they can replicate using their own products. I.e., an Anthropic employee might not be able to use Cursor unless Anthropic signs a deal with Cursor to deploy it inside the network. Now do this with 10 more products; it requires infrastructure and compute that might be just too much of a hassle for the company.
8ryan_greenblatt
The case against a focus on algorithmic secret security is that this will emphasize and excuse a lower level of transparency which is potentially pretty bad. Edit: to be clear, I'm uncertain about the overall bottom line.

Are you interested in having a prediction market about this that falls back on your judgement if the situation is unclear?

Something like "If it's publicly known that an AI lab 'caught the AI red handed' (in the spirit of Redwood's Control agenda), will the lab temporarily shut down as Redwood suggested (as opposed to applying a small patch and keep going)?"

Ryan and Buck wrote:

> The control approach we're imagining won't work for arbitrarily powerful AIs

 

Okay, so if AI Control works well, how do we plan to use our controlled AI to reach a safe/aligned ASI?

 

Different people have different opinions. I think it would be good to have a public plan so that people can notice if they disagree and comment if they see problems.

 

Opinions I’ve heard so far:

  1. Solve ELK / mechanistic anomaly detection / something else that ARC suggested
  2. Let the AI come up with alignment plans such as ELK that would be
... (read more)
5Seth Herd
I think all 7 of those plans are far short of adequate to count as a real plan. There are a lot of more serious plans out there, but I don't know where they're nicely summarized. What's the short timeline plan? poses this question but also focuses on control, testing, and regulation - almost skipping over alignment. Paul Christiano's and Rohin Shah's work are the two most serious. Neither of them have published a "this is the plan" concise statement, and both have probably substantially updated their plans. These are the standard-bearers for "prosaic alignment" as a real path to alignment of AGI and ASI. There is tons of work on aligning LLMs, but very little work AFAICT on how and whether that extends to AGIs based on LLMs. That's why Paul and Rohin are the standard-bearers despite not working publicly directly on this for a few years. I work primarily on this, since I think it's the most underserved area of AGI x-risk - aligning the type of AGI people are most likely to build on the current path. My plan can perhaps be described as extending prosaic alignment to LLM agents with new techniques, and from there to real AGI. A key strategy is using instruction-following as the alignment target. It is currently probably best summarized in my response to "what's the short timeline plan?"
5Nathan Helm-Burger
Improved governance design. Designing treaties that are easier sells for competitors. Using improved coordination to enter win-win races to the top, and to escape lose-lose races to the bottom and other inadequate equilibria. Lowering the costs of enforcement. For example, creating privacy-preserving inspections with verified AI inspectors which report on only a strict limited set of pre-agreed things and then are deleted.
8quetzal_rainbow
I think if you have "minimally viable product", you can speed up davidad's Safeguarded AI and use it to improve interpretability.

I think this should somewhat update people away from "we can prevent model weights from being stolen by limiting the outgoing bandwidth from the data center", if that protection is assuming that model weights are very big and [the dangerous part] can't be made smaller.

I'd also bet that, even if DeepSeek turns out to be somehow "fake" (optimized for benchmarks in some way) (not that this currently seems like the situation), some other way of making at least the dangerous[1] parts of a model much smaller[2] will be found and known[3] publicly... (read more)
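A rough sketch of why the bandwidth-limiting argument depends on model size (all numbers here are illustrative assumptions, not claims about any particular model, lab, or egress policy):

```python
def exfiltration_days(param_count: float, bytes_per_param: float, egress_mbps: float) -> float:
    # Time to move the weights out through a capped egress link.
    total_bits = param_count * bytes_per_param * 8
    return total_bits / (egress_mbps * 1e6) / 86_400

# A large model vs. a much smaller "dangerous core", over an assumed 10 Mbps egress cap:
print(exfiltration_days(1e12, 1, 10))   # ~9.3 days for ~1T params at 1 byte/param
print(exfiltration_days(30e9, 1, 10))   # ~0.3 days (~7 hours) for ~30B params
```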

9ryan_greenblatt
Yes, but I think the larger update is that recent models from OpenAI are likely quite small and inference time compute usage creates more of an incentive for small models. It seems likely that (e.g.) o1-mini is quite small given that it generates at 220 tokens per second(!), perhaps <30 billion active parameters based on the link from Epoch given earlier. I'd guess (idk) 100 billion params. Likely something similar holds for o3-mini. (I think the update from DeepSeek in particular might be smaller than you think as export controls create an artificial incentive for smaller models.)