This is a brief post arguing that, although "side-channels are inevitable" is pretty good common advice, you actually can prevent attackers inside a computation from learning about what's outside it.
We can prevent a task-specific AI from learning any particular facts about, say, human psychology, virology, or biochemistry—if:
- we are careful to only provide the training process with inputs that would be just as likely in, say, an alternate universe where AI was built by octopus minds made of organosilicon and atoms obey the Bohr model
- we use relatively elementary sandboxing (no clock access, no networking APIs, no randomness, none of these sources of nondeterminism, error-correcting RAM, and that's about it; see the sketch below)
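To make item 2 slightly more concrete, here is a minimal sketch of the in-process half, assuming a PyTorch-style training setup; the no-clock / no-network half would be OS-level sandboxing (e.g. a network-less container or a seccomp profile) and isn't shown. This is my illustration of the idea, not a complete recipe.

```python
# Hypothetical sketch: pin down the in-process sources of nondeterminism so that
# reruns of the training process are bit-for-bit identical. Assumes PyTorch.
import os
import random

import numpy as np
import torch

# Must be set before the interpreter starts to actually affect str/bytes hashing.
os.environ["PYTHONHASHSEED"] = "0"
# PyTorch requires this for deterministic cuBLAS matmuls on GPU.
os.environ["CUBLAS_WORKSPACE_CONFIG"] = ":4096:8"

random.seed(0)
np.random.seed(0)
torch.manual_seed(0)

# Raise an error rather than silently fall back to a nondeterministic kernel.
torch.use_deterministic_algorithms(True)
# Disable autotuning: benchmark mode picks kernels based on wall-clock timing.
torch.backends.cudnn.benchmark = False
```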
I don't think either of these happens by default, and if you are in an AGI lab, I suggest you advocate for either (or both if you can, but one at a time is good too).
Regarding item 1, self-play in Go is an example par excellence, and this may be one reason why people tend to have a strong intuition that an arbitrarily strong AlphaZero fails to kill you. An example that trades off more potential risk for engineering applicability would be inputs from a Newtonian physics simulator with semi-rigid objects of randomly generated shapes (sketched below).
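For flavour, a hypothetical sketch of what generating such inputs could look like; `simulate` is a stand-in for whatever Newtonian solver you trust, and the point is only that nothing about its inputs depends on facts about our particular world.

```python
import numpy as np

rng = np.random.default_rng(0)  # pseudorandom, so the whole pipeline stays deterministic

def random_shape(n_vertices: int = 8) -> np.ndarray:
    """Sample a random 2-D polygon by sorting random angles around the origin."""
    angles = np.sort(rng.uniform(0.0, 2.0 * np.pi, n_vertices))
    radii = rng.uniform(0.5, 1.5, n_vertices)
    return np.stack([radii * np.cos(angles), radii * np.sin(angles)], axis=1)

def random_scene(n_objects: int = 4) -> list[dict]:
    """One training input: shapes, masses, stiffnesses, velocities -- nothing human."""
    return [
        {
            "vertices": random_shape(),
            "mass": rng.uniform(0.1, 10.0),
            "stiffness": rng.uniform(1e2, 1e4),   # semi-rigid: finite stiffness
            "velocity": rng.uniform(-1.0, 1.0, 2),
        }
        for _ in range(n_objects)
    ]

# `simulate` is hypothetical: whatever solver you use, its inputs carry no facts about our world.
# trajectories = simulate(random_scene(), dt=0.01, steps=1000)
```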
I think it is good to be aware that
- this is a real tradeoff; we don't need to just assume that superintelligent models will be able to deduce whatever physical facts they want through pure logical induction
- taking advantage of the argument here would require being careful about data
- datasets like CommonCrawl, MassiveText, and YouTube-8M are, uh, not careful
Arguments
If an attacker on the outside of a computation is trying to manipulate events inside the computation, they get to use any and all physical mechanisms: they can open up the computer, stick some electrodes on the memory bus, and have at it.
If an attacker is inside a pure mathematical function and trying to manipulate events outside the computer, well, they can influence the output of the function (or even, potentially, whether the function is tractable to compute). If the output of the computation is propagated into human minds, this influence could have cascading effects that are hard to avoid or even notice.
If an attacker is trying to exfiltrate information from a computer, they get to use any and all physical mechanisms: they can run a very accurate clock, use a sensitive microphone, measure magnetic fields, etc. For better or worse, the nature of our physics is that everything that happens tends to radiate information in all directions and it's very difficult to fully neutralise; physical distance from attackers is the best defence, and even that is far from perfect (for example, timing information still propagates as far away as outputs do).
If an attacker is inside a pure mathematical function and trying to exfiltrate information from outside the computer, they do not get to use any physics. That same mathematical function is, presumably, also being computed in various places all over the multiverse. Only the explicitly given inputs—and what can be inferred from them on a Solomonoff-style prior—narrow it down. If a deterministic function is computed correctly, its result cannot depend further on the specific physical properties of the computation or facts about the environment in which it was running. All the attacker can learn is that, if they are in a simulation at all, the simulator has at least a certain amount of computational resources.[1]
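To make the contrast concrete, here is a toy illustration of my own (not from the argument above): the first function's output is pinned down entirely by its input, so computing it leaks nothing about the machine; the second deliberately mixes in physical facts.

```python
import hashlib
import platform
import time

def pure(x: bytes) -> str:
    # A mathematical function of its input alone: any correct computation of it,
    # on any hardware, anywhere, yields exactly the same bits.
    return hashlib.sha256(x).hexdigest()

def leaky(x: bytes) -> str:
    # By contrast, this output depends on physical facts about the machine and the
    # moment it ran, so it can carry information about the world outside the function.
    return pure(x + platform.machine().encode() + repr(time.time()).encode())
```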
- ^
In fact, maybe they cannot be very sure of this either, since who knows what kind of fancy compiler-optimisation static-analysis techniques are out there in the universe that's simulating them. They could try doing some cryptographic operations that are supposed to be irreducibly expensive, but we tend to have a hard time proving those sorts of computational hardness assumptions. Or maybe the simulators can just fudge those parts of the computation. I'm not confident either way about how much the attacker can learn about the available computational resources.
Deterministic (but complicated) code can leak info about processor architecture.
Firstly, even simple integer overflow and floating-point precision indicate whether you are on a 32-bit or 64-bit system.
Suppose you have a sum function that sums a list by getting each of 8 cores to add up 1/8th of the list and then combining the results. The floating-point roundoff errors reveal this internal structure.
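A small runnable sketch of both effects (my illustration, assuming Python with NumPy available):

```python
import random

import numpy as np

# 1. Precision probe: whether a tiny increment is absorbed tells you the float width.
#    2**-30 is below half of float32's machine epsilon (2**-23) but well above float64's (2**-52).
print(np.float32(1.0) + np.float32(2**-30) == np.float32(1.0))  # True  -> looks like 32-bit floats
print(1.0 + 2**-30 == 1.0)                                      # False -> 64-bit floats

# 2. Reduction-order probe: summing in 8 chunks (as an 8-core reduction would)
#    gives a slightly different answer than a serial left-to-right sum.
random.seed(0)
xs = [random.uniform(-1.0, 1.0) for _ in range(1_000_000)]

serial = 0.0
for x in xs:
    serial += x

chunk = len(xs) // 8
partials = [sum(xs[i * chunk:(i + 1) * chunk]) for i in range(8)]
chunked = sum(partials)

print(serial == chunked)       # almost certainly False
print(abs(serial - chunked))   # tiny but nonzero: the signature of the reduction tree
```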
Add some other things like that, and it is now possible for your AI to guess your computer architecture. Is this enough to guess where a buffer overflow attack might work? Maybe. Maybe there aren't that many different processor architectures running code like it in the multiverse.
I think each little decision gives away another few bits of info. A few bits for deciding how big the mantissa and exponent should be. A few bits for it being a 64-bit float. A few bits for subnormals. A few bits for inf and NaN. A few bits for rounding errors. A bit for -0. And it all adds up. Not that we know how many bits the AI needs. If there is one standard computer architecture that all aliens use, then the AI can hack with very little info. If all alien computers have wildly different architectures, then floats carry a fair bit of info.