I’ve seen a fair amount of skepticism about the “Provably Safe AI” paradigm, but I think detractors give it too little credit. I suspect this is largely because of idea inoculation - people have heard an undeveloped or weak man version of the idea, for example, that we can use formal methods to state our goals and prove that an AI will do that, and have already dismissed it. (Not to pick on him at all, but see my question for Scott Aaronson here.)

I will not argue that Guaranteed Safe AI solves AI safety generally, or that it could do so - I will leave that to others. Instead, I want to provide a concrete example of a near-term application, to respond to critics who say that proveability isn’t useful because it can’t be feasibly used in real world cases when it involves the physical world, and when it is embedded within messy human systems. [Edit to add: Doing this does require assumptions in addition to simple provability, as outlined below, so as @tdietterich, suggested, this leads to the amended title.] 

I am making far narrower claims than the general ones which have been debated, but at the very least I think it is useful to establish whether this is actually a point of disagreement. And finally, I will admit that the problem I’m describing would be adding proveability to a largely solved problem, but it provides a concrete example for where the approach is viable.

A path to provably safe autonomous vehicles

To start, even critics agree that formal verification is possible, and is already used in practice in certain places. And given (formally specified) threat models in different narrow domains, there are ways to do threat and risk modeling and get different types of guarantees. For example, we already have proveably verifiable code for things like microkernels, and that means we can prove that buffer overflows, arithmetic exceptions, and deadlocks are impossible, and have hard guarantees for worst case execution time. This is a basis for further applications - we want to start at the bottom and build on provably secure systems, and get additional guarantees beyond that point. If we plan to make autonomous cars that are provably safe, we would build starting from that type of kernel, and then we “only” have all of the other safety issues to address.

Secondly, everyone seems to agree that provable safety in physical systems requires a model of the world, and given the limits of physics, the limits of our models, and so on, any such approach can only provide approximate guarantees, and proofs would be conditional on those models. For example, we aren’t going to formally verify that Newtonian physics is correct, we’re instead formally verifying that if Newtonian physics is correct, the car will not crash in some situation.

Proven Input Reliability

Given that, can we guarantee that a car has some low probability of crashing?

Again, we need to build from the bottom up. We can show that sensors have some specific failure rate, and use that to show a low probability of not identifying other cars, or humans - not in the direct formal verification sense, but instead with the types of guarantees typically used for hardware, with known failure rates, built in error detection, and redundancy. I’m not going to talk about how to do that class of risk analysis, but (modulus adversarial attacks, which I’ll mention later,) estimating engineering reliability is a solved problem - if we don’t have other problems to deal with. But we do, because cars are complex and interact with the wider world - so the trick will be integrating those risk analysis guarantees that we can prove into larger systems, and finding ways to build broader guarantees on top of them.

But for the engineering reliability, we don’t only have engineering proof. Work like DARPA’s VerifAI is “applying formal methods to perception and ML components.” Building guarantees about perception failure rates based on the sensors gives us another layer of proven architecture to build on. And we could do similar work for how cars deal with mechanical failures, other types of sensor failures, and so on as inputs to the safety model. Of course, this is not a challenge particularly related to AI, and it is a (largely solved) problem related to vehicle reliability, and traditional engineering analysis and failure rates could instead be one of the inputs to the model assumptions, with attendant issues dealing with uncertainty propagation so we get proven probabilistic guarantees at this level as well.

Proven Safe Driving

Proving that a car is not just safe to drive, but is being driven safely requires a very different set of assumptions and approaches. To get such an assurance, we’d need provable formal statements about what safe driving is, in order to prove them. And it’s important to figure out what safety means in multi-agent sociotechnical systems. For example, we say someone is a safe driver when they drive defensively, even if another car could crash into them. That’s because safety in multiperson systems isn’t about guaranteeing no harm, it’s about guaranteeing that the agents behavior doesn’t cause that harm.

Luckily, it turns out that there’s something remarkably close to what we want already. Responsibility Sensitive Safety (RSS) [Edit to add: with a formal paper laying out the details and proof here.] formalizes what a car needs to do not to cause accidents. That is, if a car drives safely, any accidents which could occur will be attributable to other cars actions. In the case of RSS, it’s provable that if other cars follow the law, and/or all cars on the road abide by the rules[Edit to add: , and no pedestrians jump out in front of the car when not at a crosswalk, or do similar things which the car cannot dodge, nor perform other illegal acts, and there are no acts of god such as meteors hitting the car or lightning strikes, then with some finite and low probability,] those cars will only have [edit: cause] accidents if their sensors are incorrect. Of course, if another car[ or pedestrian] fails to abide by the rules, safety isn’t guaranteed - but as we’ll mention later, safe driving can’t mean that the car cannot be hit by a negligent or malicious driver, otherwise the safety we’re working towards is impossible! 

Proving that cars won’t cause crashes now builds on the risk analysis we described as provable above. Relying on assumptions about other car behavior is a limit to provable safety - but one that provides tremendously more leverage for proof than just removing buffer overflow attacks. That is, if we can show formally that given correct sensor data, a car will only do things allowed in the RSS model, and we build that system on top of the sensors described above, we have shown that it is safe in a very strong sense[; again, this means the system is proveably safe modulo assumptions]. 

This three part blueprint for provable safety of cars can address the levels in between the provably safe kernel, the responsibility sensitive safety guarantees, and the risk analysis for sensors. If we can prove that code running on the safe kernel can proveably provide the assurances needed for driving, on the condition that the sensors work correctly, and can provide engineering reliability results for those sensors, we have built a system that has provably bounded risk.

Provably Safe ML

Unfortunately, self-driving cars don’t actually just use code of types that can be formally verified, they use neural networks - systems which are poorly understood and vulnerable to a wide variety of hard to characterize failures. Thankfully, we do not need to solve AI safety in general to have safety narrowly. How can we do this?

One potential solution is to externally gate the AI system with provable code. In this case, the driving might be handled by an unsafe AI system, but its behavior would have “safety in the loop” by having simpler and provably safe code restrict what the driving system can output, to respect the rules noted above. This does not guarantee that the AI is a safe driver - it just keeps such systems in a provably safe box.

That isn’t, of course, the only approach. Another is to have the AI system trained to drive the way we want, then use model parroting or a similar approach in order to train a much simpler and fully interpretable model, such that we can verify its properties formally. Alternatively, we can use something like constructible AI in place of black-box systems, and prove properties of the composed parts. In each case, Guaranteed Safe AI is not a tool for guaranteeing AI alignment in general, it is a tool for building specific safe systems.

Adversarial Concerns

Once a self-driving car is constrained to follow the rules of the road with some provable reliability, despite failures in its systems, we still need to worry about other concerns. Most critically, we need to consider adversarial attacks and robustness, on two fronts. The first is indirectly malicious adversarial behavior, accidentally or purposefully using the self-driving cars limitations and rule sets to exploit their behavior. These can be severe, such as causing crashes, as discussed here. But even a safe car cannot eliminate such attacks, as mentioned earlier. In fact we would hope that, say, a car driving towards an autonomous vehicle at high speed would cause the autonomous vehicle to move out of the way, even if that meant it instead hits a stationary object or a different car, if it can do so in ways that reduce damage. Less extreme are attacks that cause the car to move to the side of the road unnecessarily, or create other nuisances for drivers. These acts, while unfortunate, are not safety issues.

However, there is a more widely discussed concern that engineered attacks on our otherwise safe system could “hide” stop signs, as has been shownrepeatedly, or perhaps modify other car’s paint so that the other car is not recognized by sensors. This is definitely a concern that robustness research has been working on, and such work is useful. On the other hand, we do not blame human drivers if someone maliciously removes a stop sign; one again, provable safety does not imply that others cannot cause harm. 

We also note that the claim that an adversarial attack was conducted without violating laws, including traffic laws, does not shield the perpetrator from criminal charges including attempted murder, and the “intentional act exception“ for insurance would mean that the perpetrator of such acts would be personally liable, without any insurance protection. Our extant legal system can handle this without any need for specific changes.

Defining Broader Sociotechnological Proveably Safe AI

There are other domains where the types of safety guarantees we want for AI systems are much stronger than simply not causing harm. For example, an AI system that explains how to make bioweapons would not itself have caused harm, it would only have enabled a malicious actor to do so. But what we have shown is that we can build a sociotechnical definition of responsible behavior that is coherent, and can be specified in a given domain. 

In some cases, similar to the self-driving car example, the rule we desire could closely parallel current law for humans. In the bioweapons case, materially assisting in the commission of a bioterror attack would certainly be illegal for humans, and the same could be required for artificial intelligence. Formalizing this is obviously difficult, but it is not equivalent to the effectively impossible task of fully modeling the entire human body and its interaction with some novel pathogen. (To avoid the need, we can allow the false positive rate to be far higher than the false negative rate, as many humans do when unsure if an action is strictly illegal, and only allow things we're resonably sureare safe.) 

But in other cases, such as autonomy risks from AI, we would expect that the rules needed for AI systems would differ substantially from human law, and definitions for what qualifies as safety would need to be developed before proveablity could be meaningful. 

Conclusion

It seems that building provably safe systems in the real world is far from an unsolvable problem, as long as we restrict the problem to solve to something that is clearly defined. We can imagine similar guarantees for AI systems for cyber-security, with protections against privilege escalation attacks performed by the AI, or for “escape” scenarios, with protections against self-exfiltration, or (as suggested but, in my view mistakenly, derided) for DNA synthesis, with guarantees that all synthesis orders were screened for misuse potential. None of these will stop people from doing these unsafely in ways not covered by safety guarantees, nor will they prevent hypothetical strongly superhuman AI from finding devious edge cases, or inventing and exploiting new physics.

And if that is the concern, looking more towards AI alignment, we could even construct systems with formal guarantees that all planning is approved by a specific process, to scaffold proposals like Humans consulting HCH, or stronger alternatives. But to provide these stronger guarantees, we need some fundamental progress in proveability of types which are needed for more prosaic applications like self-driving. And in the meantime, there are many cases which could benefit from the clearer definitions and stronger guarantees of safety that proveably safe AI would provide.

New Comment
26 comments, sorted by Click to highlight new comments since:

Thank you for this post. You are following the standard safety engineering playbook. It is good at dealing with known hazards. However, it has traditionally been based entirely on failure probabilities of components (brakes, engines, landing gear, etc.). Extending it to handle perception is challenging. In particular, the perceptual system must accurately quantify its uncertainty so that the controller can act cautiously under uncertainty. For aleatoric uncertainty (i.e., in-distribution), this is a solved problem. But for epistemic uncertainty, there are still fundamental challenges. As you are well aware, epistemic uncertainty encompasses out-of-distribution / anomaly / novelty detection. These methods rely on having a representation that separates the unknowns from the knowns -- the anomalies from the nominals. Given such separation, flexible density estimators can do the job of detecting novelty. However, I believe it is impossible to provide guarantees for such representations. The best we can do is ensure that the representations capture all known environmental variability, and for this, vision foundation models are a practical way forward. But there may be other technical advances that can strengthen the learned representations.

Another challenge goes beyond perception to prediction. You discussed predicting the behavior of other cars, but what about pedestrians, dogs, cats, toddlers, deer, kangaroos, etc.? New modes of personal transportation are marketed all the time, and the future trajectories of people using those modes are highly variable. This is another place where epistemic uncertainty must be quantified.

Bottom line: Safety in an open world cannot be proved, because it relies on perfect epistemic uncertainty quantification, which is impossible. 

Finally, I highly recommend Nancy Leveson's book, Engineering a Safer World. She approaches safety as a socio-technical control problem. It is not sufficient to design and prove that a system is safe. Rather, safety is a property that must be actively maintained against disturbances. This requires post-deployment monitoring, detection of anomalies and near misses, diagnosis of the causes of those anomalies and near missses, and updates to the system design to accommodate them. One form of disturbance is novelty, as described above. But the biggest threat is often budget cuts, staff layoffs, and staff turnover in the organization responsible for post-deployment maintenance. She documents a recurring phenomenon -- that complex systems tend to migrate toward increased risk because of these disturbances. A challenge for all of us is to figure out how to maintain robust socio-technical systems that in turn can maintain the safety of deployed systems. 

--Tom Dietterich

I agree that "safety in an open world cannot be proved," at least as a general claim, but disagree that this impinges on the narrow challenge of designing cars that do not cause accidents - a misunderstanding which I tried to be clear about, but which I evidently failed to make sufficiently clear, as Oliver's misunderstanding illustrates. That said, I strongly agree that better methods for representing grain of truth problems, and considering hypotheses outside those which are in the model is critical. It's a key reason I'm supporting work on infra-Bayesian approaches, which are designed explicitly to handle this class of problem. Again, it's not necessary for the very narrow challenge I think I addressed above, but I certainly agree that it's critical.

 

Second, I'm a huge proponent of complex system engineering approaches, and have discussed this in previous unrelated work. I certainly agree that these issues are critical, and should receive more attention - but I think it's counterproductive to try to embed difficult problems inside of addressable ones. To offer an analogy, creating provably safe code that isn't vulnerable to any known technical exploit still will not prevent social engineering attacks, but we can still accomplish the narrow goal.

If, instead of writing code that can't be fuzzed for vulnerabilities, doesn't contain buffer overflow or null-pointer vulnerabilities, and can't be exploited via transient execution CPU vulnerabilities, and isn't vulnerable to rowhammer attacks, you say that we need to address social engineering before trying to make the code provably safe, and should address social engineering with provable properties, you're sabotaging progress in a tractable area in order to apply a paradigm ill-suited to the new problem you're concerned with.

That's why, in this piece, I started by saying I wasn't proving anything general, and "I am making far narrower claims than the general ones which have been debated." I agree that the larger points are critical. But for now, I wanted to make a simpler point.

I enthusiastically support doing everything we can to reduce vulnerabilities. As I indicated, a foundation model approach to representation learning would greatly improve the ability of systems to detect novelty. Perhaps we should rename the "provable safety" area as "provable safety modulo assumptions" area and be very explicit about our assumptions. We can then measure progress by the extent to which we can shrink those assumptions. 

rename the "provable safety" area as "provable safety modulo assumptions" area and be very explicit about our assumptions.

Very much agree. I gave some feedback along those lines as the term was coined; and am sad it didn't catch on. But of course "provable safety modulo assumptions" isn't very short and catchy...

I do like the word "guarantee" as a substitute. We can talk of formal guarantees, but also of a store guaranteeing that an item you buy will meet a certain standard. So it's connotations are nicely in the direction of proof but without, as it were, "proving too much" :)

That seems fair!

I really like that idea, and the clarity it provides, and have renamed the post to reflect it! (Sorryr this was so slow- I'm travelling.)

[-]habryka2113

Responsibility Sensitive Safety (RSS) formalizes what a car needs to do not to cause accidents. That is, if a car drives safely, any accidents which could occur will be attributable to other cars actions. In the case of RSS, it’s provable that if other cars follow the law, and/or all cars on the road abide by the rules, those cars will only have accidents if their sensors are incorrect.

This seems like a total misrepresentation. RSS seems miles away from anything that one could describe as a formalization of how to avoid an accident. It doesn't even mention pedestrians, it's definitions are hilariously oversimplified, and these rules clearly don't model the vast majority of driving behaviors and causes of driving problems.

It's of course not provable that cars that follow RSS only have accidents if their sensors are broken. Maybe the road is too slippery and the system's estimates of how much breaking time it needs is inaccurate. Maybe the sensors are all accurate but the system you proved things about didn't correctly parse the layout of the roads. Maybe there is an obtrusion and now cars have to navigate a outside of a road environment for a few meters.

I really would like to give this provable AI stuff a chance, but I do feel pretty irritated when posts state such obvious falsehoods as straightforward facts. I am really confused what is going on when people say statements like the above with a straight-face, I feel like even a cursory glance at any of the RSS stuff would have made it clear that this is not anywhere close to a formalization of safe driving with the attributes you claim it has.

Have you reviewed the paper? (It is the first link under "The RSS Concept" in the page which was linked to before, though perhaps I should have linked to it directly.) It seems to lay out the proof, and discusses pedestrians, and deals with most of the objections you're raising, including obstructions and driving off of marked roads. I admit I have not worked through the proof in detail, but I have read through it, and my understanding is that it was accepted, and a large literature has been built that extends it.

And the objections about slippery roads and braking are the set of things I noted under "traditional engineering analysis and failure rates" I agree that guarantees are non-trivial, but they also aren't outside of what is done already in safety analysis, and there is explicit work in the literature on the issue, both from the verification and validation side, and from the perception and sensing weather conditions side.

I looked at a bunch of the articles and rules listed on the website but didn't review the paper in-depth. The rules don't mention pedestrians, but you are right that the paper does has an extremely cursory treatment of them. It basically says "assume the pedestrians are also following RSS", which of course, is in direct contradiction to your statement that if all cars on the road follow RSS, that there will be no accidents, since pedestrians are of course part of the road environment, and they do not follow RSS in-reality (and as such RSS cannot prove that no collisions with pedestrians will occur, or that the only way to do a reasonable thing when interfacing with pedestrians will require some collision with another vehicle).

Look, I think the paper makes some valid proofs, but it is completely (and IMO obviously) inaccurate to say that RSS proves that if all cars follow RSS, that no accidents will take place. 

What it of course proves, as all formal system analysis does, is that in an extremely simplified toy world, that no such accidents occur. The applicability of that toy world to the analysis of real situations is then of course a classical problem that engineering professionals face every day. But that transfer is the whole central issue at stake in arguing for "provably safe AI" and so this article is absolutely not the place to just glance over the extremely difficult and crucial step of translating a form toy world like RSS into real world outcomes, and in this case you straightforwardly made an inaccurate unqualified statement.

To start at the end, you claim I "straightforwardly made an inaccurate unqualified statement," but replaced  my statement about "what a car needs to do not to cause accidents" with "no accidents will take place." And I certainly agree that there is an "extremely difficult and crucial step of translating a form toy world like RSS into real world outcomes," but the toy model that the paper is dealing with is therefore one of rule-following entities, both pedestrians and cars. That's why it's not going to require accounting for "what if pedestrians do something illegal and unexpected."

Of course, I agree that this drastically limits the proof, or as I said initially, "relying on assumptions about other car behavior is a limit to provable safety," but you seem to insist that because the proof doesn't do something I never claimed it did, it's glossing over something.

That said, I agree that I did not discuss pedestrians, but as you sort-of admit, the paper does - it treats stationary pedestrians not at crosswalks, and not on sidewalks, as largely unpredictable entities that may enter the road. For example, it notes that "even if pedestrians do not have priority, if they entered the road at a safe distance, cars must brake and let them pass." But again, you're glossing over the critical assumption for the entire section, which is responsibility for accidents. And this is particularly critical; the claim is not that pedestrians and other cars cannot cause accidents, but that the safe car will not do so.

 

Given all of that, to get back to the beginning, your initial position was that "RSS seems miles away from anything that one could describe as a formalization of how to avoid an accident." Do you agree that it's close to "a formalization of how to avoid causing an accident"?

To start at the end, you claim I "straightforwardly made an inaccurate unqualified statement," but replaced  my statement about "what a car needs to do not to cause accidents" with "no accidents will take place."

I don't think I msisquoted you. You said: 

In the case of RSS, it’s provable that if other cars follow the law, and/or all cars on the road abide by the rules, those cars will only have accidents if their sensors are incorrect.

You did not say "cause" you said "have". I did read your words carefully. You said all accidents would be the result of sensor-defects. This means that if there are no sensor-defects, no accidents will take place. That's a straightforwardly inaccurate statement about what RSS proves.

I don't think either of us inherently cares about "responsibility". I think the theme of that section (as well as other sections) was pretty clearly that we could formally prove certain outcomes impossible. 

"Causing" is a confusing word. I don't think RSS proves at all that a car could not cause an accident. I haven't engaged with the formalism in-detail, but as far as I can tell it would totally be compatible with a car driving extremely recklessly in a pedestrian environment due to making assumptions about pedestrian behavior that are not accurate. If you really care, I could formally give you an example given its modeling assumptions about pedestrians, but I am pretty sure you could do the same.

Yes, after saying it was about what they need "to do not to cause accidents" and that "any accidents which could occur will be attributable to other cars actions," which I then added caveats to regarding pedestrians, I said "will only have accidents" when I should have said "will only cause accidents." I have fixed that with another edit. But I think you're confused about what I'm trying to show .
 

Principally, I think you are wrong about what needs to be shown here for safety in the sense I outlined, or are trying to say that the sense I outlined doesn't lead to something I don't claim. If what is needed in order for a system to be safe is that no damage will be caused in situations which involve the system, you're heading in the direction of a claim that the only safe AI is a universal dictator that has sufficient power to control all outcomes. My claim, on the other hand, is that in sociotechnological systems, the way that safety is achieved is by creating guarantees that each actor - human or AI - behaves according to rules that minimizes foreseeable dangers. That would include safeguards for stupid, malicious, or dangerous human actions, much like human systems have laws about dangerous actions. However, in a domain like driving, in the same way that it's impossible for human drivers to both get where they are going, and never hit pedestrians who act erratically and jump out from behind obstacles into the road with an oncoming car, a safe autonomous vehicle wouldn't be expected to solve every possible case of human misbehavior - just to drive responsibly.

More specifically, you make the claim that "as far as I can tell it would totally be compatible with a car driving extremely recklessly in a pedestrian environment due to making assumptions about pedestrian behavior that are not accurate." The paper, on the other hand, says "For example, in a typical residential street, a pedestrian has the priority over the vehicles, and it follows that vehicles must yield and be cautious with respect to pedestrians," and formalizes this with statements like "a vehicle must be in a kinematic state such that if it will apply a proper response (acceleration for ρ seconds and when braking) it will remain outside of a ball of radius 50cm around the pedestrian." 

I also think that it formalizes reasonable behavior for pedestrians, but I agree that it won't cover every case - pedestrians oblivious to cars that are driving in ways that are otherwise safe, who rapidly change their path to jump in front of cars, are sometimes able to be hit by those cars - but I think fault is pretty clear here. (And the paper is clear that even in those cases, the car would need to both drive safely in residential areas, and attempt to brake or avoid the pedestrian in order to avoid crashes even in cases with irresponsible and erratic humans!)

But again, as I said initially, this isn't solving the general case of AI safety, it's solving a much narrower problem. And if you wanted to make the case that this isn't enough for similar scenarios that we care about, I will strongly agree that for more capable systems, the set of situations it would need to avoid are correspondingly larger, and the set of necessary guarantees are far stronger. But as I said at the beginning, I'm not making that argument - just the much simpler one that proveability can work in physical systems, and can be applied in sociotechnological systems in ways that make sense.

The crux of these types of arguments seems to be conflating the provable safety of an agent in a system with the expectation of absolute safety. In my experience, this is the norm, not the exception, and needs to be explicitly addressed.

In agreement with what you posted above, I think it is formally trivial to construct a scenario in which a pedestrian jumps in front of a car, making it provably impossible for a vehicle to stop in time to avoid a collision using high school physics. 

Likewise, I have the intuition that AI safety, in general, should have various "no-go theorems" about unprovability outside a reasonable problem scope or that finding such proofs would be np-hard or worse. If you know of any specific results( outside of general computability theory) , could you please share them? It would be nice if the community could avoid falling into the trap of trying to prove too much. 


(Sorry if this isn't the correct location for this post.)

 

On the absolute safety, I very much like the way you put it, and will likely use that framing in the future, so thanks!

On impossibility results, there are some, andI definitely think that this is a good question, but also agree this isn't quite the right place to ask. I'd suggest talking to some of the agents foundations people for suggestions

I think the "Provably Safe ML" section is my main crux. For example, you write:

One potential solution is to externally gate the AI system with provable code. In this case, the driving might be handled by an unsafe AI system, but its behavior would have “safety in the loop” by having simpler and provably safe code restrict what the driving system can output, to respect the rules noted above. This does not guarantee that the AI is a safe driver - it just keeps such systems in a provably safe box.

I currently believe that if you try to do this, you will either have to restrict the outputs so much that the car wouldn't be able to drive well, or else fail to prove that the actions allowed by the gate are safe. Perhaps you can elaborate on why this approach seems like it could work?

(I feel similarly about other proposals in that section.)

I agree that it's the most challenging part, and there are unsolved problems, but I don't share your intuition that it's in some way unsolvable, so I suspect we're thinking of very different types of things.

For RSS specifically, Rule 5 is obviously the most challenging, but it's also not in general required for the not-being-at-fault guarantee, and Rule 4 is largely about ensuring the relationship between sensor uncertainty in low visibility areas and the other rules - respecting distance and not hitting things - are enforced. Other than that, right of way rules are very simple, if the car correctly detects that the situation is one where they apply, and changing lanes is based on a very simple formula for distance, and assuming the car isn't changing lanes, during driving, in order to follow the rules, you essentially only need to restrict speed, which seems like something you can check very easily. 

yes, in some cases a much weaker (because it's constrained to be provable) system can restrict the main ai, but in the case of llm jailbreaks there is no particular hope that such a guard system could work (eg jailbreaks where the llm answers in base64 require the guard to understand base64 and any other code the main ai could use)

I agree that in the most general possible framing, with no restrictions on output, you cannot guard against all possible side-channels. But that's not true for proposals like safeguarded AI, where a proof must accompany the output, and it's not obviously true if the LLM is gated by a system that rejects unintelligible or not-clearly-safe outputs.

there's steganography, you'd need to limit total bits not accounted for by the gating system or something to remove them

I partly disagree; steganography is only useful when it's possible for the outside / receiving system to detect and interpret the hidden messages, so if the messages are of a type that outside systems would identify, they can and should be detectable by the gating system as well. 

That said, I'd be very interested in looking at formal guarantees that the outputs are minimally complex in some computationally tractable sense, or something similar - it definitely seems like something that @davidad would want to consider.

Yes, I am indeed thinking about this.

  1. Let’s first consider the easiest case, in which we ask for a solution/output to be not merely provably correct according to some spec, but provably unique. Then there is clearly no space for steganography.
  2. It is a little bit harder if the solution has some symmetries, like permutations of the order in which things are serialized. For this we can use sorting and other normalization techniques (converting an output to a unique representative of its equivalence class).
  3. If normalization is hard, we can set up the box to be interactive, so that users cannot see the solution, but can only run queries on it (e.g. “next action please”) and the user must prove that the queries are invariant under (aka “respect”) the equivalence relation.
  4. But for most real-world problems there are many equally specification-satisfying strategies that are not behaviourally equivalent. The next hardest thing to do is to establish an upper bound on the log-cardinality of the set of behavioural-equivalence-classes that satisfy the specification. We want to keep the channel capacity (in bits per day) available for steganography, to humans or insecure cyber-infrastructure or biosequence synthesis, to a very low threshold (maybe 2 bits per day is okay).
  5. Often there will be too many degrees of freedom for this, so instead we need to do entropy regularization, which can be done elegantly with GFlowNets. I have not worked out all the details here but the hope is that we can derive a sound bound or PAC estimate of the KL divergence of the observable behaviour (which is sampled from the policy distribution dependent upon a stream of certified randomness) from the Frechet mean/centroid of all correct behaviours. This quantity would provide an upper bound on how much information the optimization process inside the box can transmit to the outside by having selecting a biased policy distribution.

I would expect there to be a time complexity blowup if you try to drive the entropy all the way to zero, unfortunately: such things usually have a multiplier like  where  is the desired entropy leakage. In practice I think that would make it feasible to not leak something like a bit per sentence, and then if you have 1000 sentence you have 1000 bits. That may mean you can get a "not 1GB" guarantee, but not something smaller than that.

Hear Hear! Beautifully written! Safety is a systems issue and if components at any level have vulnerabilities, they can spread to other parts of the system. As you write, we already have effective techniques for formal verification at many of the lower levels: software, OS, chips, etc.  And, as you say, the AI level and social levels are more challenging. We don't yet have a detailed understanding of transformer-based controllers and methods like RLHF diminish but don't eliminate harmful behaviors. But if we can choose precise coarse safety criteria, then formal gatekeepers between the AI and the actuators can ensure those constraints are obeyed.  

At the social level, you mention legality and violation of laws. Those work well for humans because humans care about being fined or jailed. I think the AI situation is more challenging because it may be unclear what the source of an AI involved in a harmful event was. If a malicious human actor or AI can spin off harmful AI agents anonymously, then after-the-fact punishment doesn't serve as a deterrent. That suggests that we need to both create accountability for the provenance of AIs and to harden infrastructure so that even malicious AIs can't cause harmful outcomes. I believe that both of those can be accomplished using formally verified techniques like Max's and my "provable contracts". 

But there is clearly a lot of work to be done! What are the next steps and how long do we have? I've been waiting for LLMs to reach proficiency at theorem proving, verified software synthesis, and autoformalization. Companies like DeepSeek (https://github.com/deepseek-ai/DeepSeek-Prover-V1.5 ), Harmonic (https://harmonic.fun/news ) and xAI (https://x.ai/blog ) are making rapid progress on each of these capabilities. But OpenAI's release of o1 this week may be the enabling event I've been waiting for. For example, check out mathematician Terence Tao's experience with it: https://mathstodon.xyz/@tao/113132502735585408 "I could imagine a model of this capability that was specifically finetuned on Lean and Mathlib, and integrated into an IDE, being extremely useful in formalization projects."

Of course, those new abilities also enable new capability levels for cyberattack and social manipulation. So the urgency for hardening today's infrastructure may happen earlier than many of us expected. That suggests that we may need to rapidly institute coarse protective measures before we can build out the truly human-beneficial infrastructure we ultimately want. But the sooner we can build precise models of our systems, their failure modes, and the full range of adversarial attacks, the less harm humanity will have to suffer.

David, I really like that your description discusses the multiple interacting levels involved in self-driving cars (eg. software, hardware, road rules, laws, etc.). Actual safety requires reasoning about those interacting levels and ensuring that the system as a whole doesn't have vulnerabilities. Malicious humans and AIs are likely to try to exploit systems in unfamiliar ways. For example, here are 10 potential harmful actions (out of many many more possibilities!) that AI-enabled malicious humans or AIs could gain benefits from which involve controlling self-driving cars in harmful ways:

1) Murder for hire: Cause accidents that kill a car's occupant, another car's occupant, or pedestrians

2) Terror for hire: Cause accidents that injure groups of pedestrians

3) Terror for hire: Control vehicles to deliver explosives or disperse pathogens

4) Extortion: Demand payment to not destroy a car, harm the driver, or harm others

5) Steal delivery contents: Take over delivery vehicles to steal their contents

6) Steal car components: Take over cars to steal components like catalytic converters

7) Surreptitious delivery or taxis: Control vehicles to earn money for illegal deliveries or use in illegal activities

8) Insurance fraud for hire: Cause damage to property for insurance fraud

9) Societal distraction for hire: Cause multiple crashes to overwhelm the societal response

10) Blocked roads for hire: Control multiple vehicles to block roadways for social harm or extortion

To prevent these and many related harms, safety analyses must integrate analyses on multiple levels. We need formal models of the relevant dynamics of each level and proof of adherence to overall safety criteria. For societal trust in the actual safety of a self-driving vehicle, I believe manufacturers will need to provide machine-checkable "certificates" of these analyses.

I think these are all really great things that we could formalize and build guarantees around. I think some of them are already ruled out by the responsibility sensitive safety guarantees, but others certainly are not. On the other hand, I don't think that use of cars to do things that violate laws completely unrelated to vehicle behavior are in scope; similar to what I mentioned to Oliver,  if what is needed in order for a system to be safe is that nothing bad can be done, you're heading in the direction of a claim that the only safe AI is a universal dictator that has sufficient power to control all outcomes.

But in cases where provable safety guarantees are in place, and the issues relate to car behavior - such as cars causing damage, blocking roads, or being redirected away from the intended destination - I think hardware guarantees on the system, combined with software guarantees, combined with verifying that only trusted code is being run, could be used to ignition-lock cars which have been subverted.

And I think that in the remainder of cases, where cars are being used for dangerous or illegal purposes, we need to trade off freedom and safety. I certainly don't want AI systems which can conspire to break the law - and in most cases, I expect that this is something LLMs can already detect - but I also don't want a car which will not run if it determines that a passenger is guilty of some unrelated crime like theft. But for things like "deliver explosives or disperse pathogens," I think vehicle safety is the wrong path to preventing dangerous behavior; it seems far more reasonable to have separate systems that detect terrorism, and separate types of guarantees to ensure LLMs don't enable that type of behavior.

(Disclaimer: I'm not very knowledgeable about safety engineering or formal proofs)

I notice that whenever someone brings up "What if this unexpected thing happens?", you emphasize that it's about not causing accidents. I'm worried that it's hard to define exactly who caused an accident, for the same reason that deciding who's liable in the legal system is hard.

It seems quite easy to say that the person who sabotaged the stop sign was at fault for the accident. What if the saboteur poured oil on the road instead? Is it their fault if the car crashes from sliding on the oil? Okay, they're being malicious, so they're at fault. But what if the oil spill was an accident from a truck tipping over? Is it the truck driver's fault? What if the road was slippery because of ice? Nobody causes the weather, right? On the contrary: the city could've cleared and salted the roads earlier, but they didn't. In the counterfactual world where they did it earlier, the accident wouldn't have happened.

Okay, how about instead of backward chaining forever, we just check whether the system could have avoided the accident in the counterfactual where it took different actions. The problem is: even in the case where an adversarial stop sign leads to the car crashing, the system potentially could've avoided it. Stop signs are placed by humans somewhat arbitrarily using heuristics to determine if an intersection is risky. Shouldn't the system be able to tell that an intersection is risky, even when there truly isn't a stop sign there?

The paper tackles the problem by formalizing which behaviors and assumptions regarding the movement of cars and pedestrians are "reasonable" or "unreasonable", then proving within the toy model that only unreasonable behavior leads to crashes. Makes sense, but in the real world people don't just follow paths, they do all kinds of things that influence the world. Wouldn't the legal system be simple if we could just use equations like these to determine liability? I'm just not sure we should expect to eventually cover the long tail of potential situations sufficiently enough to make "provably safe" meaningful.

Also, I'm concerned because they don't seem to describe it as a toy model despite the extremely simplified set of things they're considering, and there might be questionable incentives at play to make it seem more sound than it is. From another document on their website:

We believe the industry can come together to create a collaborative platform which provides a “safety seal” that first and foremost will create a safer product, but at the same time will protect OEMs from unreasonable and unwarranted liability assignment by regulators and society.

So they want the cars to be safe, but they also want to avoid liability by proving the accident was someone else's fault.