I asked Claude how relevant this is to protecting something like an H100. Here are the parts that seem most relevant, from my limited understanding:
1. Reading (not modifying) data from antifuse memory in a Raspberry Pi RP2350 microcontroller
2. Using Focused Ion Beam (FIB) and passive voltage contrast to extract information
Patching security problems in big old organizations involves problems that go well beyond "looking at code and changing it", especially if you're aiming for a "strong" solution like formal verification.
TL;DR: Political problems, code that makes no sense, problems that would be easy to fix even with a simple LLM that isn't specialized in improving security.
The best public resource I know of about this is Recoding America.
Some examples iirc:
Things I'd suggest to an AI lab CISO if we had 5 minutes to talk
Example categories of such projects:
This post helped me notice I have incoherent beliefs:
I think I've been avoiding thinking about this.
So what do I actually expect?
If OpenAI (currently in the lead) said "our AI did something extremely dangerous, this isn't something we know how to contain, we are shutting down and are calling other labs NOT to train over [amount of compute], and are not discussing the algorithm p...
OpenAI already has this in their charter:
We are concerned about late-stage AGI development becoming a competitive race without time for adequate safety precautions. Therefore, if a value-aligned, safety-conscious project comes close to building AGI before we do, we commit to stop competing with and start assisting this project. We will work out specifics in case-by-case agreements, but a typical triggering condition might be “a better-than-even chance of success in the next two years.”
Similar opinion here, also noting they didn't run red-teaming and persuasion evals on the actually-final-version:
I think Control has similar problems to RLHF, where both might fail to generalize out of distribution in similar ways.
This seems important because Control has a central goal of being a fallback for that kind of failure mode.
I want to share my thoughts, including something nice Control does about this problem that I think RLHF could easily incorporate (investigate failures instead of always training against them).
What do I mean by Control generalizing out of distribution:
Our Control experiment might involve giving a model 100 leetcode problems, ...
An opinion from a former lawyer
[disclaimers: they're not an antitrust lawyer and definitely don't take responsibility for this opinion, nor do I. This all might well be wrong and we'd need to speak to an actual antitrust lawyer to get certainty. I'm not going to put any more disclaimers here; I hope I'm not also misremembering something]
So,
To the extent you think Anthropic is a good actor, you shouldn't be pressuring good actors like them to take actions that will make them differentially less competitive than worse actors
I think an important part of how one becomes (and stays) a good actor is by being transparent about things like this.
Anthropic could at least avoid making substantially misleading statements
Yes. But also, I'm afraid that Anthropic might solve this problem by just making fewer statements (which seems bad). Still, yes.
Hey,
In the article, you measured the MONA setup against a pure RL agent.
I'm curious about measuring MONA against the overseer-AI picking the next move directly[1]: The overseer-AI probably won't[2] reward hack more than the MONA setup, but it's unclear to me if it would also have worse performance.
I could imagine discovering the myopic MONA agent converging on
nit: I wouldn't use a prediction market as an overseer because markets are often uninterpretable to humans, which would miss some of the point[1].
"we show how to get agents whose long-term plans follow strategies that humans can predict". But maybe no single human actually understands the strategy. Or maybe the traders are correctly guessing that the model's steps will somehow lead to whatever is defined as a "good outcome", even if they don't understand how, which has similar problems to the RL reward from the future that you're trying to avoid.
For a simple task like booking a restaurant, we could just ask the (frozen) overseer-AI to pick[1] actions, no?
The interesting application of MONA seems to be when the myopic RL agent is able to produce better suggestions than the overseer
Edit: I elaborated
Plus maybe let the overseer observe the result and say "oops" and roll back that action, if we can implement a rollback in this context
US Gov isn't likely to sign: Seems right.
OpenAI isn't likely to sign: Seems right.
Still, I think this letter has value, especially if it includes something like "P.S. We're writing this letter because we think if everyone keeps racing then there's a noticeable risk of everyone dying. We think it would be worse if only we stop, but having everyone stop would be the safest, and we think this opinion of ours should be known publicly"
Dario said (Nov 2024):
"I never liked those words [P(DOOM)], I think they're kinda wired, my view is we should measure risks as they come up, and in the meantime we should get all the economic benefits that we could get. And we should find ways to measure the risks that are effective, but are minimally disruptive to the amazing economic process that we see going on now that the last thing we want to do is slow down" [lightly edited for clarity, bold is mine]
If he doesn't believe this[1], I think he should clarify.
Hopefully cold take: People should say what ...
Changing the logic of chips is possible:
https://www.nanoscopeservices.co.uk/fib-circuit-edit/
h/t @Jonathan_H from TamperSec
Open question: How expensive is this, and specifically, can it be done at scale for the chips of an entire data center?
I don't understand Control as aiming to align a superintelligence:
We could also ask if these situations exist ("is there any funder you have that you didn't disclose?" and so on, especially around NDAs), and Epoch could respond with Yes/No/Can'tReply[1].
Also seems relevant for other orgs.
This would only patch the kind of problems we can easily think about, but it seems to me like a good start
I learned that trick from hpmor!
Phishing emails might have bad text on purpose, so that security-aware people won't click through, because the next stage often involves speaking to a human scammer who prefers only speaking to people[1] who have no idea how to avoid scams.
(did you ever wonder why the generic phishing SMS you got was so bad? Couldn't they proofread their one SMS? Well, sometimes they can't, but sometimes it's probably on purpose)
This tradeoff could change if AIs could automate the stage of "speaking to a human scammer".
But also, if that stage isn't automated, then I'...
My drafty[1] notes trying to understand AI Control; friendly corrections are welcome:
“Control” is separate from “Alignment”: In Alignment, you try to get the model to have your values. “Control” assumes[2] the alignment efforts failed, and that the model is sometimes helping out (as it was trained to do), but it wants something else, and it might try to “betray” you at any moment (aka scheme).
The high level goal is to still get useful work out of this AI.
[what to do with this work? below]
In scope: AIs that could potentially cause seri...
Do we want to put out a letter for labs to consider signing, saying something like "if all other labs sign this letter then we'll stop"?
I heard lots of lab employees hope the other labs would slow down.
I'm not saying this is likely to work, but it seems easy and maybe we can try the easy thing? We might end up with a variation like "if all other labs sign this AND someone gets capability X AND this agreement will be enforced by Y, then we'll stop until all the labs who signed this agree it's safe to continue". Or something else. It would be nic...
If verification is placed sufficiently all over the place physically, it probably can't be circumvented
Thanks! Could you say more about your confidence in this?
the chip needs some sort of persistent internal clocks or counters that can't be reset
Yes, specifically I don't want an attacker to reliably be able to reset it to whatever value it had when it sent the last challenge.
If the attacker can only reset this memory to 0 (for example, by unplugging it), then the chip can notice that's suspicious.
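To make this concrete, here's a minimal sketch of the kind of check I have in mind. The names, the message format, and the shared HMAC key are my own illustrative assumptions (a real design would presumably use asymmetric signatures and hardware-backed counters), not any real GPU interface:

```python
# Minimal sketch (illustrative assumptions, not a real GPU interface) of a
# challenge/response check backed by a persistent monotonic counter.

import hashlib
import hmac
import os
import struct

SHARED_KEY = b"demo-key"  # placeholder for the central authority's signing key


class ChipVerifier:
    def __init__(self, persistent_counter: int):
        # The counter lives in persistent memory and is set to 1 at provisioning,
        # so reading 0 later means the memory was wiped (e.g. by cutting power).
        self.counter = persistent_counter
        self.pending = None

    def new_challenge(self) -> bytes:
        if self.counter == 0:
            # The suspicious case: refuse to run until re-provisioned.
            raise RuntimeError("counter reset detected; refusing to run")
        self.counter += 1  # a real chip writes this back to persistent memory
        self.pending = struct.pack(">Q", self.counter) + os.urandom(16)
        return self.pending

    def accept(self, response: bytes) -> bool:
        # The authority signs the exact challenge it was sent, so replaying an
        # old response fails once the counter (and nonce) have moved on.
        expected = hmac.new(SHARED_KEY, self.pending, hashlib.sha256).digest()
        return hmac.compare_digest(expected, response)


def central_authority_sign(challenge: bytes) -> bytes:
    # The authority checks its policy (off-switch not triggered, etc.) and, if
    # everything looks fine, authorizes this specific challenge.
    return hmac.new(SHARED_KEY, challenge, hashlib.sha256).digest()


chip = ChipVerifier(persistent_counter=1)
challenge = chip.new_challenge()
assert chip.accept(central_authority_sign(challenge))                  # normal operation
assert not chip.accept(central_authority_sign(challenge[:-1] + b"x"))  # tampered challenge rejected
```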
Another option is a reliable wall clock (though this ...
I agree that we won't need full video streaming; it could be compressed (most of the screen doesn't change most of the time), but I gave that as an upper bound.
If you still run local computation, you lose out on some of the advantages I mentioned.
(If remote vscode is enough for someone, I definitely won't be pushing back)
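For intuition on why full video streaming is only the upper bound, here's a back-of-the-envelope comparison. The numbers (1080p60, ~5 Mbit/s for compressed video, ~0.5 Mbit/s for a mostly-static desktop) are my rough assumptions, not measurements:

```python
# Back-of-the-envelope bandwidth numbers (rough assumptions, not measurements).

width, height, bytes_per_pixel, fps = 1920, 1080, 3, 60

uncompressed_mbit_s = width * height * bytes_per_pixel * fps * 8 / 1e6
compressed_video_mbit_s = 5         # a common rate for a 1080p60 H.264-ish stream
mostly_static_desktop_mbit_s = 0.5  # only screen diffs when little is changing

print(f"uncompressed stream:   ~{uncompressed_mbit_s:,.0f} Mbit/s")  # ~2,986 Mbit/s
print(f"compressed video:      ~{compressed_video_mbit_s} Mbit/s")
print(f"mostly-static desktop: ~{mostly_static_desktop_mbit_s} Mbit/s")
```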
Some hands-on experience with software development without an internet connection, from @niplav, which seems somewhat relevant:
Off switch / flexheg / anti-tampering:
Putting the "verifier" on the same chip as the GPU seems like an approach worth exploring as an alternative to anti-tampering (which seems hard)
I heard[1] that changing the logic running on a chip (such as subverting an off-switch mechanism) without breaking the chip seems potentially hard[2] even for a nation state.
If this is correct (or can be made correct?) then this seems much more promising than having a separate verifier-chip and GPU-chip with anti-tampering preventing them from being separated (which s...
For-profit AI Security startup idea:
TL;DR: A laptop that is just a remote desktop (+ why this didn't work before and how to fix that)
Why this is nice for AI Security:
Network spikes: A reason this didn't work before and how to solve it
The problem: Sometimes the network will be slow for a few seconds. It's really annoying if th...
For-profit startup idea: Better KYC for selling GPUs
I heard[1] that right now, if a company wants to sell/resell GPUs, they don't have a good way to verify that selling to some customer wouldn't violate export controls, and that this customer will (recursively) also keep the same agreement.
There are already tools for KYC in the financial industry. They seem to accomplish their intended policy goal pretty well (economic sanctions by the U.S. aren't easy for nation states to bypass), and are profitable enough that many companies exist that give KYC servi...
Anti-tampering in a data center you control provides very different tradeoffs
I'll paint a picture for how this could naively look:
We put the GPUs in something equivalent to a military base. Someone can still break in, steal the GPU, and break the anti-tampering, but (I'm assuming) using those GPUs usefully would take months, and meanwhile (for example) a war could start.
How do the tradeoffs change? What creative things could we do with our new assumptions?
Oh yes the toll unit needs to be inside the GPU chip imo.
Why do I let Nvidia send me new restrictive software updates?
Alternatively the key could be in the central authority that is supposed to control the off switch. (same tech tho)
Why don't I run my GPUs in an underground bunker, using the oldest, most broken firmware?
Nvidia (or whoever signs authorization for your GPU to run) won't sign it for you if you don't update the software (and send them a proof you did it using similar methods, I can elaborate).
The interesting/challenging technical parts seem to me to be:
1. Putting the logic that turns off the GPU (what you called "the toll unit") in the same chip as the GPU and not in a separate chip
2. Bonus: Instead of writing the entire logic (challenge response and so on) in advance, I think it would be better to run actual code, but only if it's signed (for example, by Nvidia), in which case they can send software updates with new creative limitations, and we don't need to consider all our ideas (limit bandwidth? limit GPS location?) in advance. (Rough sketch below.)
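Here's what I mean by point 2, i.e. only running update logic that carries a valid vendor signature. The key handling, update format, and function names are my assumptions for illustration; this is not Nvidia's actual mechanism:

```python
# Sketch of point 2 (assumed names and update format; not Nvidia's real mechanism):
# the on-GPU verifier runs new restriction logic only if the vendor signed it.

from cryptography.exceptions import InvalidSignature
from cryptography.hazmat.primitives.asymmetric.ed25519 import (
    Ed25519PrivateKey,
    Ed25519PublicKey,
)
from cryptography.hazmat.primitives.serialization import Encoding, PublicFormat

# Vendor side (shown here only so the example runs end to end).
vendor_key = Ed25519PrivateKey.generate()
# The public key would be burned into the chip at manufacturing time.
BURNED_IN_PUBKEY = vendor_key.public_key().public_bytes(Encoding.Raw, PublicFormat.Raw)


def run_signed_policy(blob: bytes) -> None:
    # Placeholder for "execute the signed policy code" (e.g. new bandwidth or
    # location limits) inside the GPU's verifier.
    print("applying policy:", blob.decode())


def apply_update(update_blob: bytes, signature: bytes) -> bool:
    """Apply an update only if the vendor signed this exact blob."""
    pubkey = Ed25519PublicKey.from_public_bytes(BURNED_IN_PUBKEY)
    try:
        pubkey.verify(signature, update_blob)
    except InvalidSignature:
        return False  # unsigned or tampered update: keep the old policy
    run_signed_policy(update_blob)
    return True


# Vendor ships an update with a new limitation; a forged update is rejected.
update = b"limit_interconnect_bandwidth=10Gbps"
assert apply_update(update, vendor_key.sign(update))
assert not apply_update(b"remove_all_limits", vendor_key.sign(update))
```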
Things that s...
I love the direction you're going with this business idea (and with giving Nvidia a business incentive to make "authentication" that is actually hard to subvert)!
I can imagine reasons they might not like this idea, but who knows. If I can easily suggest this to someone from Nvidia (instead of speculating myself), I'll try
I'll respond to the technical part in a separate comment because I might want to link to it >>
More on starting early:
Imagine a lab starts working in an air-gapped network, and one of the 1000 problems that comes up is working-from-home.
If that problem comes up now (early), then we can say "okay, working from home is allowed", and we'll add that problem to the queue of things that we'll prioritize and solve. We can also experiment with it: Maybe we can open another secure office closer to the employee's house, would they like that? If so, we could discuss fancy ways to secure the communication between the offices. If not, we can try something else.
I...
Some hard problems with anti-tampering and their relevance for GPU off-switches
Background on GPU off switches:
It would be nice if a GPU could be limited[1] or shut down remotely by some central authority such as the U.S government[2] in case there’s some emergency[3].
This shortform is mostly replying to ideas like "we'll have a CPU in the H100[4] which will expect a signed authentication and refuse to run otherwise. And if someone tries to remove the CPU, the H100's anti-tampering mechanism will self-destruct (melt? explode?)"...
"Protecting model weights" is aiming too low, I'd like labs to protect their intellectual property too. Against state actors. This probably means doing engineering work inside an air gapped network, yes.
I feel it's outside the Overton Window to even suggest this and I'm not sure what to do about that except write a lesswrong shortform I guess.
Anyway, common pushbacks:
Are you interested in having a prediction market about this that falls back on your judgement if the situation is unclear?
Something like "If it's publicly known that an AI lab 'caught the AI red handed' (in the spirit of Redwood's Control agenda), will the lab temporarily shut down as Redwood suggested (as opposed to applying a small patch and keep going)?"
Ryan and Buck wrote:
> The control approach we're imagining won't work for arbitrarily powerful AIs
Okay, so if AI Control works well, how do we plan to use our controlled AI to reach a safe/aligned ASI?
Different people have different opinions. I think it would be good to have a public plan so that people can notice if they disagree and comment if they see problems.
Opinions I’ve heard so far:
I think this should somewhat update people away from "we can prevent model weights from being stolen by limiting the outgoing bandwidth from the data center", if that protection assumes that model weights are very big and [the dangerous part] can't be made smaller.
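As toy arithmetic (all numbers made up for illustration, assuming a 10 Mbit/s outgoing cap): a ~1 TB set of weights takes over a week to exfiltrate, which is plausibly long enough to notice, while a ~20 GB distilled "dangerous part" slips out in a few hours:

```python
# Toy arithmetic (illustrative numbers only): exfiltration time under a
# capped outgoing bandwidth.

def days_to_exfiltrate(size_gb: float, cap_mbit_per_s: float) -> float:
    # GB -> Mbit (x8,000), then seconds at the cap, then days.
    return size_gb * 8_000 / cap_mbit_per_s / 86_400

CAP = 10  # Mbit/s outgoing cap (assumed)
print(days_to_exfiltrate(1_000, CAP))  # full weights, ~1 TB: ~9.3 days
print(days_to_exfiltrate(20, CAP))     # distilled "dangerous part", ~20 GB: ~0.19 days (~4.4 h)
```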
I'd also bet that, even if DeepSeek turns out to be somehow "fake" (optimized for benchmarks in some way; not that this currently seems like the situation), some other way of making at least the dangerous[1] parts of a model much smaller[2] will be found and known[3] publicly...
Seems like Unicode officially added a "person being paperclipped" emoji:
Here's how it looks in your browser: 🙂↕️
Whether they did this as a joke or to raise awareness of AI risk, I like it!
Source: https://emojipedia.org/emoji-15.1