Lists of reasons to be concerned about AI are notoriously hard to read. Worse, most of them include an equal mix of bad arguments (foom is likely, AI development will mimic human evolution, the AI will unbox itself by solving the protein folding problem) and good ones.
So, here is my very short (hopefully very easy to read) list of reasons why AI could be dangerous:
1. It might be hard to explain human values to AI
2. Even if AI understands human values, it might do something bad by accident
3. Even if it is possible to build safe AI, people might deploy an unsafe AI because we are all racing
4. Even if safe AI is widely deployed, some people might do very bad things, since AI gives us all superpowers
We can call the first two points the "individual alignment problem", because they are problems anyone building an AGI will have to solve. And we can call points 3 and 4 "cooperative AGI problems", because even if we know how to build safe AGI, the outcome could still be bad if we fail to cooperate.
For this post, I want to focus on problems 3 and 4. If it helps, imagine that we live in a world where future variants of RLAIF and corrigible AI have perfectly solved problems 1 and 2.
The Global Panopticon
There is a common (but ultimately flawed) approach that frequently comes up when discussing solutions to problems 3 and 4.
The naive solution is:
1. Come up with a limit defining the boundary between "safe" and "unsafe" AI models
2. Set up a global monitoring/enforcement regime to ensure no one violates this boundary
We can see many examples of this "define and monitor" solution, for example calls for monitoring of large GPU clusters or drone strikes on rogue data centers. Indeed, these approaches are often explicitly modeled after controls on nuclear material, a regime which should be most famous for its complete failure to prevent nations from acquiring nuclear weapons. Define and monitor is the go-to solution for most global coordination problems, such as CFCs, greenhouse gases, missiles, military equipment, and genocide.
Beyond failing to achieve its stated goals in almost all of these cases (CFCs being the notable exception), a global define-and-monitor strategy has another downside: its implications for privacy. As Moore's Law and algorithmic improvements inexorably move forward, the "safe" limits on AGI will have to be revised continually downwards, until eventually every personal computer will have to be monitored.
In the end, a "define and monitor" solution leads to an incredibly fragile world with no privacy. It is also most naturally implemented as a centralized totalitarian AGI, whose values will inevitably be imposed on others.
Can we do better than "Define and Monitor"?
Classic solutions to the prisoner's dilemma are punishment by a central authority, expensive pre-commitments, and transforming the prisoner's dilemma into an iterated prisoner's dilemma.
With advances in cryptography, however, a new class of solutions to coordination problems is beginning to emerge. Just as cells signal to the immune system that they belong to the body, we can imagine that individual agents might signal to a decentralized blockchain, using zero-knowledge proofs, that they are staying within safe limits.
The exact message an agent would pass to the blockchain depends on how we define "safe" and "unsafe" AGI. At one extreme, suppose there were a list of approved source code for AGI; then the agent could simply send a hash of its source code, which is checked against a whitelist.
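To make that concrete, here is a minimal Python sketch. It assumes the "safe" definition is simply "runs one of N approved source trees"; the approved source and registry below are hypothetical, and a plain hash comparison is of course not yet a zero-knowledge proof (it reveals which approved build the agent runs), but it shows the shape of the attestation.

```python
import hashlib

def source_digest(source_code: str) -> str:
    """Hash the agent's source code so it can be checked without publishing it verbatim."""
    return hashlib.sha256(source_code.encode("utf-8")).hexdigest()

# Hypothetical registry of approved AGI builds, keyed by the hash of their source.
APPROVED_SOURCE = "def act(observation): return safe_policy(observation)"
APPROVED_HASHES = {source_digest(APPROVED_SOURCE)}

def attest(my_source: str) -> bool:
    """The attestation an agent would publish: 'my code hashes to an approved entry'."""
    return source_digest(my_source) in APPROVED_HASHES

print(attest(APPROVED_SOURCE))                # True: running an approved build
print(attest("def act(o): return anything"))  # False: unapproved modification
```

A real scheme would replace the direct hash comparison with a zero-knowledge proof of membership in the whitelist (for example, a Merkle membership proof checked inside a ZK circuit), so the agent reveals neither its source nor which approved entry it matches.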
Suppose, instead, that we implement corrigibility as a "tax on actions that reduce the set of possible futures"; then the agent would be required to signal that it had correctly calculated (and paid) its taxes.
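Under one speculative reading of that tax, the amount owed is proportional to how many reachable futures an action forecloses. The toy world model, state names, and tax rate below are all invented for illustration:

```python
# Toy sketch: tax an action in proportion to the futures it forecloses.
# The "world model" here is a hard-coded dictionary; a real agent would
# need an actual model of which futures each action leaves reachable.
def reachable_futures(state: str) -> frozenset:
    toy_model = {
        "status_quo":   frozenset({"f1", "f2", "f3", "f4"}),
        "after_action": frozenset({"f1", "f2"}),
    }
    return toy_model[state]

def corrigibility_tax(before: str, after: str, rate: float = 1.0) -> float:
    """Tax owed for an action that moves the world from `before` to `after`."""
    foreclosed = reachable_futures(before) - reachable_futures(after)
    return rate * len(foreclosed)

print(corrigibility_tax("status_quo", "after_action"))  # 2.0: two futures foreclosed
```

The on-chain signal would then be a proof, ideally zero-knowledge, that this number was computed from the agent's actual world model and that the corresponding tax was paid, without revealing the action itself.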
As a third possibility, imagine that we had some way of defining "acceptable" and "unacceptable" utility functions. For example, acceptable utility functions might be required to comply with certain rules, such as "do not harm humans". It may then be possible for an agent to prove that its utility function falls within the set of acceptable utility functions.
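As a toy illustration of that third possibility, the check below enumerates a small, made-up outcome space and verifies the hypothetical rule "never assign positive utility to an outcome that harms humans". In practice the outcome space is not enumerable, which is exactly where a formal proof (possibly wrapped in a zero-knowledge proof) would have to replace brute-force checking.

```python
from typing import Callable, Iterable

# An outcome is just a dict of features in this toy; the outcomes and the
# harm predicate are invented for illustration.
Outcome = dict

def is_acceptable(utility: Callable[[Outcome], float],
                  outcomes: Iterable[Outcome]) -> bool:
    """Check the rule: no positive utility on any outcome that harms humans."""
    return all(utility(o) <= 0 for o in outcomes if o["harms_humans"])

outcomes = [
    {"harms_humans": False, "paperclips": 10},
    {"harms_humans": True,  "paperclips": 100},
]

compliant     = lambda o: (-1.0 if o["harms_humans"] else float(o["paperclips"]))
non_compliant = lambda o: float(o["paperclips"])

print(is_acceptable(compliant, outcomes))      # True
print(is_acceptable(non_compliant, outcomes))  # False
```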
Conclusion
Solving the Cooperative Alignment Problem requires finding a way for all agents to coordinate to avoid bad outcomes. A naive solution to coordination problems is a "define and monitor" regime, in which we define limits on safe AGI and then monitor all agents to verify they stay within those limits. However, such a system is inherently totalitarian, fragile, and incompatible with privacy. Better solutions are desperately needed, and zero-knowledge proofs on a decentralized blockchain are one possible area we could research for such solutions.