Finally, today's social mechanisms like money, contracts, voting, and the structures of governance will also need to be updated for the new realities of an AI-driven society. Here too, the underlying rules of social interaction can be formalized, provably effective social protocols can be designed, and secure hardware implementing the new rules can be synthesized using powerful theorem-proving AIs.
Do you envision being able to formalize social systems and desirable properties for them, based on current philosophical understanding of topics like human values/goals and agency / decision theory? I don't, and also think philosophical progress on these topics is not happening fast enough to plausibly solve the problems in time. (Automating philosophy via AI could help solve these problems, but that also seems a hard problem to me, and extremely neglected.) This quote from my Work on Security Instead of Friendliness? gives a flavor of the kind of thing I'm worried about:
What does it mean to have "secure property rights", anyway? If I build an impregnable fortress around me, but an Unfriendly AI causes me to give up my goals in favor of its own by crafting a philosophical argument that is extremely convincing to me but wrong (or more generally, subverts my motivational system in some way), have I retained my "property rights"? What if it does the same to one of my robot servants, so that it subtly starts serving the UFAI's interests while thinking it's still serving mine? How does one define whether a human or an AI has been "subverted" or is "secure", without reference to its "goals"?
I also have more general skepticism about provable methods, based on my experience with provable security in cryptography. My experience is now more than a decade old, though, and I would be interested in hearing from people who have more recent experience in such areas.
What I'm afraid of is that a design will be shown to be safe, and then it turns out that the proof is wrong, or the formalization of the notion of "safety" used by the proof is wrong. This kind of thing happens a lot in cryptography, if you replace "safety" with "security". These mistakes are still occurring today, even after decades of research into how to do such proofs and what the relevant formalizations are. From where I'm sitting, proving an AGI design Friendly seems even more difficult and error-prone than proving a crypto scheme secure, probably by a large margin, and there aren't decades of time to refine the proof techniques and formalizations. There's a good recent review of the history of provable security, titled Provable Security in the Real World, which might help you understand where I'm coming from.
Do you envision being able to formalize social systems and desirable properties for them, based on current philosophical understanding of topics like human values/goals and agency / decision theory? I don't, and also think philosophical progress on these topics is not happening fast enough to plausibly solve the problems in time.
I think there are both short-term and long-term issues. In the short term, society is going to have to govern a much wider range of behaviors than we have today. In the next few years we are likely to have powerful AIs taking actions online and in the physical world, both autonomously and at the behest of humans. I think that the most immediate issue is that today's infrastructure is likely to be very vulnerable to AI-powered attacks. The basic mechanisms of society are likely to be very vulnerable: voting, communication, banking, etc. Already we are seeing AI-driven hacking, impersonation, fake news, spam, etc. Next year's more powerful AI agents will likely take these to the next level.
Social media sites are already getting overwhelmed by spam, fake images, fake videos, blackmail attempts, phishing, etc. The only way to counteract the speed and volume of massive AI-driven attacks is with AI-powered defenses. These defenses need rules. If those rules aren't formal and proven robust, then they will likely be hacked and exploited by adversarial AIs. So at the most basic level, we need infrastructure rules which are provably robust against classes of attacks. What those attack classes are and what properties those rules guarantee is part of what I'm arguing we need to be working on right now. It looks like the upheaval is coming whether we want it or not. And the better prepared we are for it, the better the outcome.
Over the longer term (which may not be that long if major societal transformation comes around 2030), I totally agree that there are many important philosophical issues that we will need to address. And I agree that we shouldn't try to solve all those problems right now. But we should create a path along which we can make the choices and which is implementable with the technology which appears imminent. Today, democratic societies use processes like voting to choose governance rules. The blockchain world has explored various precise variants of those procedures. I think a good path forward might involve precisely formalizing effective mechanisms like prediction markets, quadratic voting, etc. so that we have confidence that future social infrastructure actually implements them.
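To give a flavor of what "precisely formalizing" such a mechanism might mean (this is just the textbook quadratic-voting cost rule, not anything new from this discussion): the cost of casting $v$ votes is
$$C(v) = v^2, \qquad C(v) - C(v-1) = 2v - 1,$$
so each additional vote costs strictly more than the last. A formal specification would pin down properties like this, and a verifier would check that any proposed implementation of the mechanism actually satisfies them.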
It appears that rather simple provable hardware (what Max and I called "provable contracts") can enable solutions to many social dilemmas, producing better outcomes for all. Provable hardware can impose joint constraints on the choices of multiple parties and can include externalities in the execution of contracts. If implemented correctly, these could dramatically improve the efficiency, fairness, and effectiveness of human societies.
I think a good path forward might involve precisely formalizing effective mechanisms like prediction markets, quadratic voting, etc. so that we have confidence that future social infrastructure actually implements them.
In the Background section, you talk about "superhuman AI in 2028 or 2029", so I interpreted you as trying to design AIs that are provably safe even as they scale to superhuman intelligence, or designing social mechanisms that can provably ensure that overall society will be safe even when used by superhuman AIs.
But here you only mention proving that prediction markets and quadratic voting are implemented correctly, which seems like a much lower level of ambition. That is good as far as feasibility goes, but it does not address many safety concerns, such as AI-created bioweapons, or the specific concern I gave in my grandparent comment. Given this lower level of ambition, I fail to see how this approach or agenda can be positioned as an alternative to pausing AI.
Yes, I think there are many, many more possibilities as these systems get more advanced. A number of us are working to flesh out possible stable endpoints. One class of world states comprises humans, AIs, and provable infrastructure (including manufacturing and datacenters). In that world, humans have total freedom and abundance as long as they don't harm other humans or the infrastructure. In those worlds, it appears possible to put absolute limits on the total world compute, on AI agency, on Moloch-like economic competition, on harmful evolution, on uncontrolled self-improvement, on uncontrolled nanotech, and on many other of today's risks.
But everything depends on how we get from the present state to that future state. I think the Aschenbrenner timelines are plausible for purely technological development. But the social reality may change them drastically. For example, if the social upheaval leads to extensive warfare, I would imagine that GPU datacenters and chip fabs would be likely targets. That could limit the available compute sufficiently to slow down AI development dramatically.
I've spoken about some of the future possibilities in different talks like this one:
but I think it's easier for most people to think about the more immediate issues before considering the longer term consequences.
Given this lower level of ambition, I fail to see how this approach or agenda can be positioned as an alternative to pausing AI.
Oh, the low level of ambition is just for this initial stage, without which the more advanced solutions can't be created. For example, without a provably unbreakable hardware layer supporting provably correct software, you can't have social contracts which everyone can absolutely trust to be carried out as specified.
And I think pausing would be great if it were possible! I signed the Pause letter and I think it generated good and important discussion, but I don't think it slowed down AI development. And the instability of pausing means that at best we get a short extension of the AI development timeline. For that to be of long term value, humanity needs to use the extra time to do something! What should it do? It should flesh out all the pieces needed to effectively implement provable safety!
For example, this came out yesterday showing the sorry state of today's voting machines: "The nation’s best hackers found vulnerabilities in voting machines — but no time to fix them" https://www.politico.com/news/2024/08/12/hackers-vulnerabilities-voting-machines-elections-00173668 And they've been working on making voting machines secure for decades!
Social media sites are already getting overwhelmed by spam, fake images, fake videos, blackmail attempts, phishing, etc. The only way to counteract the speed and volume of massive AI-driven attacks is with AI-powered defenses. These defenses need rules. If those rules aren't formal and proven robust, then they will likely be hacked and exploited by adversarial AIs. So at the most basic level, we need infrastructure rules which are provably robust against classes of attacks. What those attack classes are and what properties those rules guarantee is part of what I'm arguing we need to be working on right now.
Maybe it would be more productive to focus on these nearer-term topics, which perhaps can be discussed more concretely. Have you talked to any experts in formal methods who think that it would be feasible (in the near future) to define such AI-driven attack classes and desirable properties for defenses against them, and do they have any specific ideas for doing so? Again from my own experience in cryptography, it took decades to formally define/refine seemingly much simpler concepts, so it's hard for me to understand where your relative optimism comes from.
I am a big fan of verification in principle, but don't think it's feasible in such broad contexts in practice. Most of these proposals are far too vague for me to have an opinion on, but I'd be happy to make bets with anyone interested that
for up to 10k of my dollars at up to 10:1 odds (i.e. min your $1k), placed by end of 2024, to resolve by end of 2026 or as agreed. (I'm somewhat dubious about very long-term bets for implementation reasons) I'm also happy to discuss conditioning on a registered attempt.
For what it's worth, I was offered and chose to decline co-authorship of Towards Guaranteed Safe AI. Despite my respect for the authors and strong support for this research agenda (as part of a diverse portfolio of safety research), I disagree with many of the central claims of the paper, and see it as uncritically enthusiastic about formal verification and overly dismissive of other approaches.
I do not believe that any of the proposed approaches actually guarantee safety. Safety is a system property,[1] which includes the surrounding social and technical context; and I see serious conceptual and in-practice-fatal challenges in defining both a world-model and a safety spec. (exercise: what could these look like for a biology lab? Try to block unsafe outcomes without blocking Kevin Esvelt's research!)
I do think it's important to reach appropriately high safety assurances before developing or deploying future AI systems which would be capable of causing a catastrophe. However, I believe that the path there is to extend and complement current techniques, including empirical and experimental approaches alongside formal verification - whatever actually works in practice.
- (8) won't be attempted, or will fail at some combination of design, manufacture, or just-being-pickable.
This is a great proposal and a beautifully compact crux for the overall approach.
I agree with you that this feels like a 'compact crux' for many parts of the agenda. I'd like to take your bet; let me reflect on whether there are any additional operationalizations or conditions.
However, I believe that the path there is to extend and complement current techniques, including empirical and experimental approaches alongside formal verification - whatever actually works in practice.
FWIW, in Towards Guaranteed Safe AI we endorse this: "Moreover, while we have argued for the need for verifiable quantitative safety guarantees, it is important to note that GS AI may not be the only route to achieving such guarantees. An alternative approach might be to extract interpretable policies from black-box algorithms via automated mechanistic interpretability... it is ultimately an empirical question whether it is easier to create interpretable world models or interpretable policies in a given domain of operation."
I agree with you that this feels like a 'compact crux' for many parts of the agenda. I'd like to take your bet; let me reflect on whether there are any additional operationalizations or conditions.
quick proposals:
Unsolicited suggestion: it is probably useful for y'all to define further what "pick a lock" means -- e.g., if someone builds a custom defeat device that operates non-destructively but uses a mechanical operation very surprising to someone thinking of traditional lock-picking methods, does that count?
(I think you'd probably say yes, so long as the device isn't, e.g., a robot arm that's nondestructively grabbing the master key for the lock out of Zac's pocket and inserting it into the lock, but some sort of defining-in-advance would likely help.)
Nonetheless, I think this would be awesome as an open challenge at Defcon (I suspect you can convince them to Black Badge the challenge...)
it is probably useful for y'all to define further what "pick a lock" means
Well, a lot of the difficulty of any kind of formal proof is indeed that any specification you come up with will have holes you didn't anticipate. As such, coming up with a solid definition of what "pick a lock" means is a large part of the difficulty of this formal verification endeavor, and so I don't think it makes sense to try to get that out of the way before the bet is even made. I think deferring to a trusted third-party arbiter on whether the definition chosen for the proof is indeed an adequate definition would be a better choice.
I mean, I think it's worth doing an initial loose and qualitative discussion to make sure that you're thinking about overlapping spaces conceptually. Otherwise, not worth the more detailed effort.
This seems mostly good to me, thank you for the proposals (and sorry for my delayed response, this slipped my mind).
OR less than three consistent physical instances have been manufactured. (e.g. a total of three including prototypes or other designs doesn't count)
Why this condition? It doesn't seem relevant to the core contention, and if someone prototyped a single lock using a GS AI approach but didn't figure out how to manufacture it at scale, I'd still consider it to have been an important experiment.
Besides that, I'd agree to the above conditions!
I don't think that a thing you can only manufacture once is a practically usable lock; having multiple is also practically useful to facilitate picking attempts and in case of damage - imagine that a few hours into an open pick-this-lock challenge, someone bent a part such that the key no longer opens the lock. I'd suggest resolving neutral in this case, as we only saw a partial attempt.
Other conditions:
I nominate @raemon as our arbiter, if both he and you are willing, and the majority vote or nominee of the Lightcone team if Raemon is unavailable for some reason (and @habryka approves that).
(note: for @mentioning to work, you need to be in the LessWrong Docs editor, or in markdown you actually type out [@Raemon](https://www.lesswrong.com/users/raemon?mention=user) (the "@" in the "@Raemon" doesn't actually do anything, the important part is that the url has mention=user). We should probably try to make this work more intuitively in markdown but it's not quite obvious how to do it.)
I think in practice my take here would probably be something like a deferral to the Lightcone Team majority vote, since I'm not very informed on this field, but I'm happy to own the metacognition of making sure that happens and sanity checking the results.
Given your rationale, I'm on board with requiring that 3 or more consistent physical instances of the lock have been manufactured.
Let's 'lock' it in.
Nice! I look forward to seeing how this resolves.
Ah, by 'size' I meant the stakes, not the number of locks - did you want to bet the maximum $1k against my $10k, or some smaller proportional amount?
I am a big fan of verification in principle, but don't think it's feasible in such broad contexts in practice. Most of these proposals are far too vague for me to have an opinion on, but I'd be happy to make bets that
- (3) won't happen (with nontrivial bounds) due to the difficulty of modeling nondeterministic GPUs, varying floating-point formats, etc.
- (8) won't be attempted, or will fail at some combination of design, manufacture, or just-being-pickable.
- with some mutually agreed operationalization against the success of other listed ideas with a physical component
I'm afraid I'm not into betting and I'm not saying these projects will happen, but rather that they should happen. And I'm not saying they are easy but rather that they are important and I don't see any huge barriers to doing them.
In general, non-deterministic hardware isn't a problem for formalization. We're not trying to simulate these systems, we're trying to put bounds on their possible behaviors. In general, proofs work with enclosures that are guaranteed to contain the actual behavior. As far as I know, the various floating point formats are all precisely defined, so there's no problem there. I know that some processor specs stated that their implementations of functions like sin and cos were only guaranteed up to the last bit or something. That's an annoyance but not a real problem for formalization. A simple approach is to maintain intervals which are guaranteed to contain the actual values and to prove that output intervals don't overlap the unsafe region.
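As a minimal sketch of that enclosure idea (illustrative names only, not code from this comment), here is how the "output interval avoids the unsafe region" argument looks when written down for a verifier like Dafny:

```dafny
// Sketch: we never compute exact outputs, only an interval guaranteed to
// contain them, and we prove that interval misses the unsafe band.
datatype Interval = Interval(lo: real, hi: real)

predicate Contains(i: Interval, x: real) { i.lo <= x <= i.hi }

// The enclosure lies entirely below or entirely above the unsafe band.
predicate AvoidsUnsafe(i: Interval, unsafeLo: real, unsafeHi: real) {
  i.hi < unsafeLo || i.lo > unsafeHi
}

// If the true value is inside the enclosure and the enclosure avoids the
// unsafe band, then the true value is safe. Dafny discharges this
// automatically from linear arithmetic.
lemma EnclosureSafety(i: Interval, x: real, unsafeLo: real, unsafeHi: real)
  requires Contains(i, x) && AvoidsUnsafe(i, unsafeLo, unsafeHi)
  ensures !(unsafeLo <= x <= unsafeHi)
{
}
```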
Probabilistic programs do provide greater challenges. It's not a problem to formally model and prove properties of probability distributions. But it's often no longer possible to provide 100% guarantees of desired safety properties. The approach of precise probabilistic theories like "Probably approximately correct learning" (https://en.wikipedia.org/wiki/Probably_approximately_correct_learning) is to provide a provable bound on the probability of failure.
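For reference, the standard bound in the finite-hypothesis-class, realizable PAC setting (a textbook result, not something specific to this thread) has exactly this shape: if the number of i.i.d. samples satisfies
$$m \ge \frac{1}{\varepsilon}\left(\ln|H| + \ln\frac{1}{\delta}\right),$$
then with probability at least $1-\delta$ every hypothesis in $H$ consistent with the samples has true error at most $\varepsilon$. The guarantee is probabilistic, but the bound on the failure probability $\delta$ is itself provable.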
There is subtlety in adversarial situations where there is a difference between uncertain actions which the adversary can't affect, for which an ordinary probabilistic analysis suffices, and those where the adversary can influence apparently random variables. For example, I believe we need to create formal models of all of the engineering disciplines (e.g. mechanical engineering, electrical engineering, chemical engineering, etc.). The rules that those engineers use are already quite formal and there are heuristic arguments as to why they are grounded in fundamental physics. But in an adversarial context, an adversary may be able to influence a deeper level of physics in a way that can break the assumptions of the model. For example, mechanical engineers have models of the elastic properties of steel beams, but if an adversary can alter the manufacture of those beams in a way that makes them vulnerable to certain kinds of failure, then the mechanical engineer's assumptions can be invalidated. So the formal models also need to model potential vulnerabilities in the assumptions of the higher-level models.
I believe that these developments are not optional. It's just a question of whether humanity prepares now for what's coming or waits until after the basic assumptions of our infrastructure and society are upended.
A simple approach is to maintain intervals which are guaranteed to contain the actual values and to prove that output intervals don't overlap the unsafe region.
For actual inference stacks in use (e.g. llama-3-405b 8 bit float) interval propagation will blow up massively and result in vacuous bounds. So, you'll minimally have to assume that floating point approximation is non-adversarial aka random (because if it actually was adversarial, you would probably die!).
(My supporting argument for this is that you see huge blow ups on interval arguments for even tiny models on toy tasks[1] and that modern inference stacks have large amounts of clearly observable floating point non-determinism, which means that the maximum amount of deviation would actually have strong effects if all errors (including across token positions) were controlled by an adversary.)
Though note that the interval propagation issues in this case aren't from floating point rounding and are instead from imperfections in the model weights implementing the algorithm. ↩︎
Intervals are often a great simple form of "enclosure" in continuous domains. For simple functions there is also "interval arithmetic" which cheaply produces a bounding interval on the output of a function given intervals on its inputs: https://en.wikipedia.org/wiki/Interval_arithmetic But, as you say, for complex functions it can blow up. For a simple example of why, consider the function "f(x)=x-x" evaluated on the input interval [0,1]. In the simplest interval arithmetic, the interval for subtraction has to bound the worst possible members of the intervals of its two inputs. In this case that would be a lower bound of "0-1" and an upper bound of "1-0", producing the resulting interval: [-1,1]. But, of course, "x-x" is always 0, so this is a huge over-approximation. People have developed all kinds of techniques for capturing the correlations between variables in evaluating circuits on intervals. And you can always shrink the error by splitting the input intervals and doing "branch and bound". But all of those are just particular implementation choices in proving bounds on the output of the function. Advanced AI theorem provers (like AlphaProof) can use very sophisticated techniques to accurately get the true bound on the output of a function.
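A minimal sketch of that dependency problem, written as Dafny one could actually check (the names are illustrative, not from the comment above):

```dafny
datatype Interval = Interval(lo: real, hi: real)

// Naive interval subtraction: worst-case bounds, ignoring any correlation
// between the two arguments.
function Sub(a: Interval, b: Interval): Interval {
  Interval(a.lo - b.hi, a.hi - b.lo)
}

// The rule is sound: the true difference always lies inside the result.
lemma SubSound(a: Interval, b: Interval, x: real, y: real)
  requires a.lo <= x <= a.hi && b.lo <= y <= b.hi
  ensures Sub(a, b).lo <= x - y <= Sub(a, b).hi
{
}

// ...but it is not tight: applying it to "x - x" on [0,1] gives [-1,1],
// even though the exact answer is always 0.
lemma DependencyProblem()
  ensures Sub(Interval(0.0, 1.0), Interval(0.0, 1.0)) == Interval(-1.0, 1.0)
{
}
```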
But, it may be that it's not a fruitful approach to try to bound the behavior of complex neural nets such as transformers. In our approach, we don't need to understand or constrain a complex AI generating a solution or a control policy. Rather, we require the AI to generate a program, control policy, or simple network for taking actions in the situation of interest. And we force it to generate a proof that it satisfies given safety requirements. If it can't do that, then it has no business taking actions in a dangerous setting.
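As an illustrative sketch of that pattern (the names and the toy "safety requirement" are invented for this example, not taken from the dialogue): the untrusted AI can propose whatever it likes, but the action that actually gets executed comes from a small policy whose safety property is machine-checked.

```dafny
// Toy safety spec: whatever speed the untrusted model suggests, the
// commanded speed must stay in [0, maxSpeed]. Dafny will not accept this
// method unless the ensures clause provably holds for all inputs.
method SafeSpeedPolicy(suggested: real, maxSpeed: real) returns (command: real)
  requires maxSpeed >= 0.0
  ensures 0.0 <= command <= maxSpeed
{
  if suggested < 0.0 {
    command := 0.0;
  } else if suggested > maxSpeed {
    command := maxSpeed;
  } else {
    command := suggested;
  }
}
```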
Rather, we require the AI to generate a program, control policy, or simple network for taking actions in the situation of interest. And we force it to generate a proof that it satisfies given safety requirements. If it can't do that, then it has no business taking actions in a dangerous setting.
This seems near certain to be cripplingly uncompetitive[1] even with massive effort on improving verification.
If applied to all potentially dangerous applications. ↩︎
I agree you can do better than naive interval propagation by taking into account correlations. However, it will be tricky to get a much better bound while avoiding having this balloon in time complexity (tracking all possible correlations requires exponential time).
More strongly, I think that if an adversary controlled the non-determinism (e.g. summation order) in current efficient inference setups, they would actually be able to strongly influence the AI to an actually dangerous extent - we are likely to depend on this non-determinism being non-adversarial (which is a reasonable assumption to make).
(And you can't prove a false statement...)
See also heuristic arguments which try to resolve this sort of issue by assuming a lack of structure.
I do think it's important to reach appropriately high safety assurances before developing or deploying future AI systems which would be capable of causing a catastrophe. However, I believe that the path there is to extend and complement current techniques, including empirical and experimental approaches alongside formal verification - whatever actually works in practice.
for what it's worth, I do see GSAI as largely a swiss-cheese-style worldview. Though I can see how you might read some of the authors involved to be implying otherwise! I should knock out a post on this.
Mod note: this comment was originally written in Markdown in the rich text editor. I converted it. Let me know if that's not what you want, Zac.
Unfortunately, that does not appear to be a stable solution. Even if the US paused its AI development, China or other countries could gain an advantage by accelerating their own work.
Arguing-for-pausing does not need to be a stable solution to help. If it buys time, that's already helpful. If the US pauses AI development but China doesn't, that's still fewer people working on AI that might kill everyone.
Mark Zuckerberg has concluded that it is better to open source powerful AI models for the white hats than to try to keep them secret from the black hats.
Of course, Llama models aren't actually open source - here's my summary, here's LeCun testifying under oath (which he regularly contradicts on Twitter). I think this substantially undercuts Zuckerberg's argument, and I don't believe that it's a sincere position, just tactically useful right now.
Verify human designs and automatically create AI-generated designs which provably cannot be opened by mechanical picking.
Such a proof would be subject to its definition of "mechanical picking" and a sufficiently accurate physics model. (For example, would an electronically-controllable key-looking object with adjustable key-cut depths with pressure sensors qualify as a "pick"?)
I don't dispute the value of formal proofs for safety. If accomplished, they move the conversation to "is the proof correct?" and "are we proving the right thing?". Both are steps in the right direction, I think.
Absolutely! In different circumstances different classes of adversarial capabilities are appropriate. Any formal security claim will always be relative to a specified class of attackers. For the most secure infrastructure which must be secure against any physical agent (including the most powerful AGIs), you would show security properties against any agent which obeys the laws of physics. Max and I talk about "provable contracts" which sense any form of physical tampering and erase internal cryptographic keys to guarantee that contained information cannot be extracted.
Today, many residential front doors are vulnerable to "bump key" attacks which involve inserting a key with all the key bittings filed down and tapping it with a mallet to rapidly open the door with no skill. There would be tremendous benefit to mechanical locks which are not vulnerable to this or more sophisticated purely mechanical attacks.
It would also be great to create designs which are provably invulnerable to the kind of sophisticated key you suggest. A general approach is to physically isolate the part of the lock which senses the pattern on the key from the part which checks that that pattern is correct. In today's locks, those two functions happen together, and lockpickers exploit that. In the kinds of designs I'd prefer to see, no information about the correct key is exposed in the part of the lock that the attacker can reach.
Similar issues plague hardware correctness and security, and it will be a much larger project to replace today's flawed and insecure hardware. Max and Steve propose formal methods grounded in mathematical physics to produce provably safe physical designs. The same AI techniques which are revolutionizing theorem proving and provable software synthesis are also applicable to provable hardware design.
I'd like to add that hardware formal verification is not virgin territory.
Of course, physics is a huge complication, but in terms of the algorithms to be verified, hardware verification is much more tractable than software verification. E.g., LCF-LSM was published in the early 1980s (M. J. C. Gordon, "LCF-LSM: A system for specifying and verifying hardware," Technical Report 41, University of Cambridge Computer Laboratory, 1983), and defense-related prototypes followed soon thereafter: https://apps.dtic.mil/sti/pdfs/ADA168248.pdf (United Kingdom, 1985).
After the infamous Pentium FDIV bug, Intel gave multi-million-dollar grants to academic formal methods researchers in the 1990s. What I don't know is to what extent methods of formal verification have actually been adopted by the hardware industry (we do know that hardware people are much more interested in having their systems correct than their software counterparts, since bug fixes in shipped hardware are next to impossible).
Thanks for the links! Yes, the cost of errors is high for chips, so chip companies have been very supportive of formal methods. But they've primarily focused on the correctness of digital circuits. There are many other physical attack surfaces.
For example, most DRAM chips are vulnerable to the "Rowhammer" attack https://en.wikipedia.org/wiki/Row_hammer and as memories get denser, they become more vulnerable. This involves memory access patterns which flip bits in neighboring rows, which can be exploited to discover cryptographic keys and break security assumptions in processors. The solutions so far are fairly hacky and attackers have found workarounds. A full solution will likely involve a formal model including the electromagnetic and material physics of the memory chip. This might be used to provably characterize all problematic access patterns and to translate software into equivalent versions which provably don't produce the dangerous accesses.
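As a toy sketch of what such a characterization might start from (the threshold and names are invented; a real model would be derived from the chip's physics): a "safe" access pattern within one refresh window is one in which no row is activated more than some hammer threshold.

```dafny
// Count how many times a given DRAM row appears in a trace of row activations.
function CountRow(accesses: seq<nat>, row: nat): nat
  decreases |accesses|
{
  if |accesses| == 0 then 0
  else (if accesses[0] == row then 1 else 0) + CountRow(accesses[1..], row)
}

// Toy safety predicate: within one refresh window, no activated row exceeds
// the hammer threshold. A compiler pass could then be proven to emit only
// code whose possible traces satisfy a predicate like this.
predicate SafeWindow(accesses: seq<nat>, hammerThreshold: nat) {
  forall i :: 0 <= i < |accesses| ==> CountRow(accesses, accesses[i]) <= hammerThreshold
}
```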
Another example from yesterday is the AMD "Sinkclose" flaw which affects all AMD chips back to 2006 and makes them vulnerable to attacks after which "You basically have to throw your computer away." In a world of powerful AI-enabled attackers, that kind of flaw is unlikely to be acceptable.
With the recent developments in drones, humanoid robots, and autonomous vehicles, we are likely to need provable constraints on the behavior of a wide variety of physical systems. In the paper with Max we argue that this is possible and that the laws of physics provide an absolute constraint on the behavior of AIs. But it's not easy! And the time is likely short.
"‘Sinkclose’ Flaw in Hundreds of Millions of AMD Chips Allows Deep, Virtually Unfixable Infections" https://archive.ph/MsA0m
Some Useful Projects
Even with powerful AI assistance, a pretty confident lower bound on what's required to achieve this seems like a 5-10-year Manhattan project; so I expect most of this wouldn't happen in time for Aschenbrenner-like timelines, even with significant resources poured in.
Timelines are certainly short. But if AI systems progress as rapidly as Aschenbrenner and others are predicting, then most of the work will be done by AIs. Of course, we have to know what we want them to accomplish! And that's something we can work on before the AIs are dangerous or powerful enough to build the safe infrastructure.
Max and I argued that systems which are not built with security proofs will likely have vulnerabilities, and that powerful AIs can search for them using formal models and are likely to find them. So if things unfold in an adversarial way, any software, hardware, or social systems not built with provable safety will likely be exploited.
I buy the automation argument, I don't expect that to be the (main) bottleneck under short timelines. My worries are about needing to change the behavior of humans (e.g. programmers starting to integrate formal verification, etc.) and to upgrade large parts of existing infrastructure, especially when it's physical.
I agree that the human responses to all of this are the great unknown. The good news is that human programmers don't need to learn to do formal verification; they just need to realize that it's important and to get the AIs to do it. My suspicion is that these big shifts won't happen until we start having AI-caused disasters. It's interesting to look at the human response to the CrowdStrike disaster and other recent security issues to get some sense of how the responses might go. It's also interesting to see the unbelievable speed with which Microsoft, Meta, Google, etc. have been building huge new GPU datacenters in response to the projected advances in AI.
I see a lot of mention of "Dafny" here - is this a language that's getting traction? I'm surprised there aren't any examples on their website of real companies using it for real problems.
This looks like a very interesting area.
In September 2023, Max Tegmark and Steve Omohundro proposed "Provably Safe AI" as a strategy for AI Safety. In May 2024, a larger group delineated the broader concept of "Guaranteed Safe AI" which includes Provably Safe AI and other related strategies. In July, 2024, Ben Goldhaber and Steve discussed Provably Safe AI and its future possibilities, as summarized in this document.
Background
In June 2024, ex-OpenAI AI Safety Researcher Leopold Aschenbrenner wrote a 165-page document entitled "Situational Awareness: The Decade Ahead" summarizing AI timeline evidence and beliefs which are shared by many frontier AI researchers. He argued that human-level AI is likely by 2027 and will likely lead to superhuman AI in 2028 or 2029. "Transformative AI" was coined by Open Philanthropy to describe AI which can "precipitate a transition comparable to the agricultural or industrial revolution". There appears to be a significant probability that Transformative AI may be created by 2030. If this probability is, say, greater than 10%, then humanity must immediately begin to prepare for it.
The social changes and upheaval caused by Transformative AI are likely to be enormous. There will likely be many benefits but also many risks and dangers, perhaps even existential risks for humanity. Today's technological infrastructure is riddled with flaws and security holes. Power grids, cell service, and internet services have all been very vulnerable to accidents and attacks. Terrorists have attacked critical infrastructure as a political statement. Today's cybersecurity and physical security barely keep human attackers at bay. When these groups obtain access to powerful cyberattack AIs, they will likely be able to cause enormous social damage and upheaval.
Humanity has known how to write provably correct and secure software since Alan Turing's 1949 paper. Unfortunately, proving program correctness requires mathematical sophistication and is rare in current software development practice. Fortunately, modern deep learning systems are becoming proficient at proving mathematical theorems and generating provably correct code. When combined with techniques like "autoformalization," this should enable powerful AI to rapidly replace today's flawed and insecure codebase with optimized, secure, and provably correct replacements. Many researchers working in these areas believe that AI theorem-proving at the level of human PhDs is likely about two years away.
Similar issues plague hardware correctness and security, and it will be a much larger project to replace today's flawed and insecure hardware. Max and Steve propose formal methods grounded in mathematical physics to produce provably safe physical designs. The same AI techniques which are revolutionizing theorem proving and provable software synthesis are also applicable to provable hardware design.
Finally, today's social mechanisms like money, contracts, voting, and the structures of governance will also need to be updated for the new realities of an AI-driven society. Here too, the underlying rules of social interaction can be formalized, provably effective social protocols can be designed, and secure hardware implementing the new rules can be synthesized using powerful theorem-proving AIs.
What's next?
Given the huge potential risk of uncontrolled powerful AI, many have argued for a pause in Frontier AI development. Unfortunately, that does not appear to be a stable solution. Even if the US paused its AI development, China or other countries could gain an advantage by accelerating their own work.
There have been similar calls to limit the power of open source AI models. But, again, any group anywhere in the world can release their powerful AI model weights on BitTorrent. Mark Zuckerberg has concluded that it is better to open source powerful AI models for the white hats than to try to keep them secret from the black hats. In July 2024, Meta released the Llama 3.1 405B large language model, which has performance comparable to the best closed-source models. We should expect access to the most powerful AI models to grow, and that this will enable both new levels of harmful cyber/social attacks and powerful new safety and security capabilities.
Society tends to respond to threats only after they have already caused harm. We have repeatedly seen that once harmful actions have occurred, however, citizens and politicians rapidly demand preventive measures against recurrences. An AI-powered cyberattack which brings down the power grid, financial system, or communication networks would likely cause a huge demand for AI-secure infrastructure. One powerful way to have an impact is to prepare a strategy in advance for responding to that demand. The Manifold prediction market currently estimates a 29% chance of an AI "warning shot" before October 20, 2026. If this is accurate, it would be of great value to create detailed plans over the next two years.
Here is a potential positive scenario: In 2025, groups who see the potential for "Provably Safe Technology" fine tune Llama 4 to create a powerful model for generating verified software and hardware. This model spreads widely and by using "autoformalization," groups are able to feed it their current documentation and code and have it generate highly optimized, provably correct, provably secure replacements. The world's software vulnerabilities rapidly disappear. Chipmakers use the new AI to rapidly generate provably correct, efficient, and provably secure chips for every function. The new AI redesigns cryptographic protocols to eliminate quantum-vulnerabilities and other cryptographic weaknesses. The AI also generates new social mechanisms, eliminating current inefficiencies and vulnerabilities to corruption. And the new mechanisms enable governments to create mutually-beneficial treaties with hardware enforcement. "Provably-Constrained Self-Improvement" replaces "Unconstrained Self-Improvement" and GPU hardware is changed to only execute constrained models. AI improves to help humanity and we create infrastructure that enables all humans to thrive!
With more provably secure infrastructure in place, the surface area for disruption from AI misuse, as well as from exploitation by rogue autonomous AIs, will be reduced. Expanding the ‘formally safe zone’ will increase the difficulty of takeover by AIs in general.
In the longer term, the "provably safe" approach has the potential to reduce the risk of AI takeover and to solve many of humanity's current problems. This requires building infrastructure which is invulnerable to AI attacks (as outlined in the Tegmark and Omohundro paper). It would enable a future in which powerful AIs can be tasked with solving humanity's biggest challenges (e.g. curing cancer, solving the climate crisis, eliminating social dilemmas, etc.) while provably preventing runaway self-improvement.
Some Useful Projects
Strategically, the above considerations suggest that we emphasize preparation for the use of powerful future AI tools rather than putting large efforts into building systems with today's tools. For example, today a software team might spend a year designing, building, and testing a system with a good user interface to use AI to safely control an industrial process. In two years, an AI synthesis tool may be able to generate a formally verified implementation in an afternoon from a specification of the requirements. As AI software synthesis and theorem proving systems advance, they should be able to rapidly impact human infrastructure.
The currently best developed of the needed tools are software formal methods systems like the Dafny Programming and Verification Language, which can prove that code satisfies a formal specification. But they require accurate specifications of the semantics of the programming language and of the desired software behavior.
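As a minimal illustration of what "prove that code satisfies a formal specification" means in practice (a toy example, not from the document): in Dafny the ensures clauses are the specification, and the verifier rejects the code unless the body provably meets them for every input.

```dafny
// The specification: the result is at least as large as both inputs and is
// equal to one of them. Dafny's verifier checks this statically, for all inputs.
method MaxOf(a: int, b: int) returns (m: int)
  ensures m >= a && m >= b
  ensures m == a || m == b
{
  if a >= b {
    m := a;
  } else {
    m := b;
  }
}
```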
Below are ideas for projects which can advance aspects of the Provably Safe agenda by advancing the principles and methods for building Provably Safe systems, or by demonstrating the potential of provably safe systems using current or likely-to-be-created-soon tools.
1) Improve methods for creating formal specifications: of systems, desired behaviors, and safety requirements. For example, there should be databases of existing specification elements and semantic search to find and verify desired behaviors.
2) Formally translate between formal languages: There are many theorem proving languages (e.g. Lean, Coq, Isabelle, MetaMath, etc.) and many formal methods languages (e.g. Dafny, Alloy, PRISM, etc.) but almost no interconvertibility between them. All the powerful mathematical theorems in Lean should be available to a programmer using Dafny. Modern AI tools should be able to translate between any of these and to represent the semantics of each.
3) A semantic library for probabilistic programming and machine learning: All of the popular statistical and machine learning methods should be formally specified with a range of provable bounds on behavior.
4) Precise formal specifications for physical systems: Each of the engineering disciplines (e.g. Mechanical Engineering, Electrical Engineering, Chemical Engineering, etc.) needs to be formally codified and grounded in fundamental physics. Engineers should be able to provably verify properties of their designs. AI systems should be able to generate designs which provably meet desired specification requirements.
5) Formal models of security and cryptographic properties: Provable verification of human designs, automatic generation of verifiable designs, autoformalization of existing designs and resynthesis to eliminate flaws and security holes.
6) Demonstrate provable safety in a domain: by showing that an adversary with a specified set of capabilities is not able to push a designed system into a specified unsafe state.
7) Build a formal model of mechanical engineering: capturing all the elements that today's engineers use in mechanical designs. While much of mathematics has been formalized with great success, many mechanical engineering practices are not yet represented.
8) Provably unpickable mechanical lock: As an example, a concrete project which combines security with mechanical engineering would be to design a "provably unpickable mechanical lock". This would involve simulating designs of mechanical locks and processes for mechanically picking them. Verify human designs and automatically create AI-generated designs which provably cannot be opened by mechanical picking.
9) Composable Specifications Library: Develop a comprehensive library of safety specifications that can be composed and reused across different GS AI projects. This initiative would include
a) Cataloging and standardizing existing specifications from various domains.
b) Creating tools and methodologies for combining and adapting specifications for new use cases.
c) Implementing adversarial processes to continually generate and refine specifications, potentially leveraging automated systems (example)
d) Researching methods to provide assurance of specification completeness and coverage.
10) Guaranteed Safe by Design Development Process: At the heart of the agenda is using a world model, safety specifications, and a verifier (described in the recent paper Towards Guaranteed Safe AI) to verify the plans and actions of AIs. This is a framework and methodology for developing AI systems, and needs to be operationalized for real world systems. Examples of projects here include:
a) Defining the adversarial exploration process for generating specs
b) Creating the process for defining the ‘relevant’ world model for a domain
c) Creating processes for updating world models and safety specifications over time
d) Creating a prize and incentive system for generating the pieces of a system. This could be meta - for example, can we create the safety specs needed for operationalizing prize generation to resist adversary exploitation?
11) Provable hardware projects: Create foundational building blocks for guaranteed safe AI systems, focusing on components like tamper-evident sensors, secure memory, and verifiable computation units. An initial goal is to demonstrate that it's possible to construct hardware with mathematically provable security properties, providing a robust foundation for safe AI development.
12) Digital Twins for Guaranteed Safe AI: Create accurate digital representations of both cyber-cyber systems and cyber-physical systems following GS AI principles, and projects which could be pushed forward by non-profit or for-profit companies. Ideally these would be interoperable simulations, perhaps building on existing initiatives like NVIDIA's Omniverse. Examples include building digital twins for existing AI evaluation workflows, using digital twins for cybersecurity, and building provably secure buildings which include specs that represent a wide range of human preferences.
13) Demo Networks: Create small-scale, fully provable secure systems that serve as evidence of the feasibility of Guaranteed Safe AI approaches. These demo networks would showcase end-to-end security, from hardware to software, would provide tangible examples for stakeholders and policymakers, and could be adopted during "warning shot" scenarios to illustrate the practicality and effectiveness of GS AI principles. For example, a provably secure camera system, or a provably secure communication network.
14) Coordination and Governance: Industry coordination projects (example) that develop the protocols and standards for formally verifying software and hardware, or working on policy ideas and initiatives for using strong hardware-enforced mechanisms to verify treaties for executing constrained models.