Posts

Sorted by New

Wiki Contributions

Comments

Fortunately, for coarse "guardrails" the specs are pretty simple and can often be reused in many contexts. For example, all software we want to run should have proofs that: 1) there aren't memory leaks, 2) there aren't out-of-bounds memory accesses, 3) there aren't race conditions, 4) there aren't type violations, 5) there aren't buffer overflows, 6) private information is not exposed by the program, 7) there aren't infinite loops, etc. There should be a widely used "metaspec" for those criteria which most program synthesis AI will have to prove their generated code satisfies. Similarly, there are common constraints for many physical systems: eg. robots, cars, planes, boats, etc. shouldn't crash into things or harm humans, etc. The more refined the rules are, the more subtle they become. To prevent existentially bad outcomes, I believe coarse constraints suffice. But certainly we eventually want much more refined models of the world and of the outcomes we seek. I'm a fan of "Digital Twins" of physical systems which allow rules and constraints to be run in simulation which can help in choosing specifications. We certainly want those simulations to be trusted. which can be achieved by proving the code actually simulates the systems it claims to. Eventually it would be great to have fully trusted AI as well! Mechanistic Interpretability should be great for that! I'm just reading Anthropic's recent nice advances in that. If that continues to make progress then it makes our lives much easier but it doesn't eliminate the need to ensure that misaligned AGI and malicious AGI don't cause harm. The big win with the proof checking and the cryptographic hardware we propose is that we can ensure that even powerful systems will obey rules that humanity selects. If we don't implement that kind of system (or something functionally equivalent), then there will be dangerous pathways which malicious AGI can exploit to cause great harm to humans. 

Peter, thanks again for starting this discussion! Just a few caveats in your summary. We don't depend on trustable AIs! One of the absolutely amazing and critical characteristics of mathematical proof is that anybody, trusted or not, can create proofs and we can check them without any need to trust the creator. For example, MetaMath https://us.metamath.org/ defines an especially simple system for which somebody has written a 350 line Python proof checker which can check all 40,000 MetaMath theorems in a few seconds. We need to make sure that that small Python program is correct but beyond that don't need to trust any AIs! It certainly would be nice to have trustable AIs but I think it is a mistake to depend on them. We are already seeing a wide range of open source AIs created by different groups with different intentions. And there are many groups working to remove the safety features of commercial Large Language Models and text-to-image models. 

The core of our proposal is to put provable guardrails around dangerous technologies that AGI might seek to control. As many commenters have pointed out we need to figure out what those are! Some current examples include DNA synthesis machines, nukes, military hardware of all kinds, drones capable of dispersing dangerous payloads, etc.  And we need to create rules for which actions with these systems are safe! But we already do that today for human use. The CDC has detailed rules for biohazard labs. The initial implementation of our proposal will most likely be digitally checkable versions of those existing human rules. But certainly we need to extend them to the new environment as AIs become more powerful. Any action remotely capable of human extinction should become a primary focus. I believe establishing the rules for high risk actions should be an immediate priority for humanity. And extending those rules to other systems to enable human flourishing should be a close second.

We can use untrusted AIs in much of this work! We simply require them to generate formal proofs of any designs or software that they create. These can then be easily and rigorously checked and they can be used without concern for the properties of the AI which generated them.

Check out this great paper: "From Word Models to World Models: Translating from Natural Language to the Probabilistic Language of Thought" https://arxiv.org/abs/2306.12672 It proposes "probabilistic programming" as a formal "Probabilistic Language of Thought" (PLoT) with precise formal Bayesian reasoning. They show in 4 domains how a large language model can convert an informal statement or chain of reasoning into a precise probabilistic program, do precise Bayesian reasoning on that, and then convert the results back into informal natural language.

Thanks Peter for the post and thank you everyone for the comments. Let me try to clarify a bit. We're looking for an absolute foundation of trust on top of which we can build a world safe for AGI. We believe that we need to adopt a "Security Mindset" in which AGI's either on their own or controlled by malicious humans need to be considered a full on adversaries. The only two absolute guarantees that we have are mathematical proof and the laws of physics. Even the most powerful AGI can't prove a falsehood or violate the laws of physics. Based on these we show how to build a network of "provable contracts" that provide absolute guardrails around dangerous actions. As a commenter points out, figuring out which actions we need to protect against and what the rules should be is absolutely essential and not at all trivial! In fact, I believe that should be one of the primary activities of humanity for the next decade! 

Very interesting thought experiment!

One place where it might fall down is that our disutility for causing deaths is probably not linear in the number of deaths, just as our utility for money flattens out as the amount gets large. In fact, I could imagine that its value is connected to our ability to intuitively grasp the numbers involved. The disutility might flatten out really quickly so that the disutility of causing the death of 3^^^^3 people, while large, is still small enough that the small probabilities from the induction are not overwhelmed by it.