This is a brief post arguing that, although "side-channels are inevitable" is reasonable common advice, you actually can prevent attackers inside a computation from learning about what's outside it.
We can prevent a task-specific AI from learning any particular facts about, say, human psychology, virology, or biochemistry—if:
- we are careful to only provide the training process with inputs that would be just as likely in, say, an alternate universe where AI is built by octopus minds made of organosilicon and atoms obey the Bohr model
- we use relatively elementary sandboxing (no clock access, no networking APIs, no randomness, no other sources of nondeterminism, error-correcting RAM, and that’s about it); a minimal software-side sketch follows below
I don't think either of these happens by default, and if you are in an AGI lab I suggest you advocate for either (or both if you can, but one at a time is good too).
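To make item 2 concrete, here is a minimal sketch of the software side, assuming a PyTorch training script. The seeds and flags shown are real PyTorch/cuBLAS settings, but they are illustrative rather than exhaustive, and the no-clock/no-network/ECC parts have to be enforced outside the process:

```python
import os
import random

import numpy as np
import torch

# Illustrative only: a few of the knobs for deterministic execution in a
# PyTorch training script. Real sandboxing (no clock access, no networking,
# ECC RAM) has to be enforced at the OS / hardware level, not here.

os.environ["CUBLAS_WORKSPACE_CONFIG"] = ":4096:8"  # required for deterministic cuBLAS kernels

random.seed(0)
np.random.seed(0)
torch.manual_seed(0)                      # seeds CPU and CUDA generators

torch.use_deterministic_algorithms(True)  # raise an error on nondeterministic ops
torch.backends.cudnn.benchmark = False    # no autotuning driven by wall-clock timing
```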
Regarding item 1, self-play in Go is an example par excellence and this may be one reason why people tend to have a strong intuition that arbitrarily strong AlphaZero fails to kill you. An example that trades off more potential risk for engineering applicability would be inputs from a Newtonian physics simulator with semi-rigid objects of randomly generated shapes.
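As a hypothetical illustration of that second example, here is a sketch (the function and parameters are made up for this post) that draws scene geometry from nothing but a fixed-seed pseudorandom generator, so the inputs would look exactly the same in the octopus-organosilicon universe as they do in ours:

```python
import math
import random

def random_polygon(rng: random.Random, n_vertices: int = 8, radius: float = 1.0):
    """Return a random star-shaped polygon as a list of (x, y) vertices."""
    angles = sorted(rng.uniform(0.0, 2.0 * math.pi) for _ in range(n_vertices))
    return [(radius * rng.uniform(0.5, 1.0) * math.cos(a),
             radius * rng.uniform(0.5, 1.0) * math.sin(a)) for a in angles]

# Fixed seed: no OS entropy, no clock, nothing Earth-specific leaks into the data.
rng = random.Random(12345)
scene = [random_polygon(rng) for _ in range(100)]  # shapes to hand to the simulator
```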
I think it is good to be aware that
- this is a real tradeoff; we don't need to just assume that superintelligent models will be able to deduce whatever physical facts they want through pure logical induction
- taking advantage of the argument here would require being careful about data
- datasets like CommonCrawl, MassiveText, and YouTube-8M are, uh, not careful
Arguments
If an attacker on the outside of a computation is trying to manipulate events inside the computation, they get to use any and all physical mechanisms: they can open up the computer, stick some electrodes on the memory bus, and have at it.
If an attacker is inside a pure mathematical function and trying to manipulate events outside the computer, well, they can influence the output of the function (or even, potentially, whether the function is tractable to compute). If the output of the computation is propagated into human minds, this influence could have cascading effects that are hard to avoid or even notice.
If an attacker is trying to exfiltrate information from a computer, they get to use any and all physical mechanisms: they can run a very accurate clock, use a sensitive microphone, measure magnetic fields, etc. For better or worse, the nature of our physics is that everything that happens tends to radiate information in all directions and it's very difficult to fully neutralise; physical distance from attackers is the best defence, and even that is far from perfect (for example, timing information still propagates as far away as outputs do).
If an attacker is inside a pure mathematical function and trying to exfiltrate information from outside the computer, they do not get to use any physics. That same mathematical function is, presumably, also being computed in various places all over the multiverse. Only the explicitly given inputs—and what can be inferred from them on a Solomonoff-style prior—narrow it down. If a deterministic function is computed correctly, its result cannot depend further on the specific physical properties of the computation or facts about the environment in which it was running. All the attacker can learn is that, if they are in a simulation at all, the simulator has at least a certain amount of computational resources.[1]
- ^
In fact, maybe they cannot be very sure of this either, since who knows what kind of fancy compiler-optimisation static-analysis techniques are out there in the universe that's simulating them. They could try doing some cryptographic operations that are supposed to be irreducibly expensive, but we tend to have a hard time proving those sorts of computational hardness assumptions. Or maybe the simulators can just fudge those parts of the computation. I'm not confident either way about how much the attacker can learn about the available computational resources.
I agree that, even for side-channels exposing external information to a mathematical attacker, we cannot get this absolutely perfect. Error-correction in microelectronics is an engineering problem and engineering is never absolutely fault-free.
However, per this recent US government study, RAM error rates in high-performance compute clusters range from 0.2 to 20 faults per billion device-hours. For comparison, training GPT-3 (175B parameters) from scratch takes roughly 1-3 million device-hours. An attacker inside a deep learning training run probably gets zero bits of information via the RAM-error channel.
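Spelling out the arithmetic with the figures quoted above:

```python
# Expected RAM faults over one training run, using the ranges quoted in the text.
faults_per_device_hour = (0.2e-9, 20e-9)   # 0.2 to 20 faults per billion device-hours
device_hours = (1e6, 3e6)                  # rough GPT-3-scale training run

best_case = faults_per_device_hour[0] * device_hours[0]
worst_case = faults_per_device_hour[1] * device_hours[1]
print(f"expected faults per run: {best_case:.4g} to {worst_case:.4g}")
# -> 0.0002 to 0.06 expected faults, i.e. typically zero, even before error correction
```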
But suppose they get a few bits. Those bits are about as random as they come. Nor is there anything clever to do from within an algorithm to amplify the extent to which cosmic rays reflect useful information about life on Earth.
I disbelieve that your claims about real-world exploits, if cashed out, would break the abstraction of deterministic execution as it is implemented in practice for blockchain smart contracts.
I do think it’s prudent to use strong hardware and software error-correction techniques in high-stakes situations, such as advanced AI, but mostly because errors are generally bad for reliability and for our ability to reason about systems and their behaviours. The absolute worst would be if the sign bit got flipped somewhere in a mesa-optimiser’s utility function. So I’m not saying we can just completely neglect concerns about cosmic rays in an AI safety context. But I am prepared to bet the farm on a task-specific AI being completely unable to learn any virology via side-channels if the AI lab training it musters a decent effort to be careful about deterministic execution (which, I stress again, is not something I think happens by default—I hope this post has some causal influence towards making it more likely).
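To illustrate why that particular failure mode is so nasty: in IEEE-754 floating point, flipping a single bit is enough to negate a value outright. (A toy sketch, not a claim about how any actual system represents utilities.)

```python
import struct

def flip_sign_bit(x: float) -> float:
    """Flip the sign bit of an IEEE-754 double, i.e. what one unlucky bit-flip can do."""
    (bits,) = struct.unpack("<Q", struct.pack("<d", x))
    return struct.unpack("<d", struct.pack("<Q", bits ^ (1 << 63)))[0]

print(flip_sign_bit(3.14))  # -3.14: the value is silently negated by a one-bit error
```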