Aur Saraf hosted a 3-hour MIRIx meeting on December 18. Yoav Hollander, chip verification pioneer and creator of the e verification language, was there, as well as MIRI Research Associate Vadim Kosoy. Also in attendance were Benjamin Fox, Ofer Mustigman, Matan Mor, Aur Saraf, Eli Sennesh, and Joshua Fox.
Our discussion had its roots in Eliezer Yudkowsky’s 2008 article, in which he suggested that FAI researchers take an example from “computer engineers prov[ing] a chip valid”. Yet, as Yoav pointed out (in a lecture at LessWrong Tel Aviv in October 2015), there are strong limitations on formal verification at every level, from the smallest arithmetic component on a chip up to entire systems like an airport. Even something as simple as floating-point division can barely be proven formally; as we move up to higher levels of complexity, formal proofs always rest on very tightly constrained specifications of the input and environment. It is impossible to prove even a tiny part of the full range of relevant predicates.
Formal verification has a role, but most verification is done dynamically, for example by running a simulation against test cases. The goal of this meeting was to come up with a list of directions for applying ideas from the verification world to FAI research.
The state of FAI research
Vadim Kosoy described the state of the art in FAI research, catching us up on the last few years of MIRI’s work. FAI research can be divided into three levels: modules, optimization under self-improvement, and the selection of a human-compatible goal function.
I. Most basically, we can verify lower-level modules that can make up an AGI.
II. Second -- and this is most of the research effort -- we can make sure that future AIs optimize for a given implicit human-compatible goal function, even as they grow in strength.
MIRI is focused on accomplishing this with verifiable goal preservation under self-improvement. Some other ideas include:
Agents that are deliberately kept weak.
Limited-intelligence AIs evaluated on their mathematical ability, but with no knowledge of physics or our real world. (Such AIs might not be strong in induction given real-world physics, but at least this evaluation procedure might allow the relatively safe development of a certain kind of AI.)
AIs locked in cryptographic boxes. They run with homomorphic encryption that prevents any side effects of their computation from being revealed to the outside world: Only their defined output can reach us.
Such an AI could still accomplish a lot, while keeping potentially dangerous information from us. As an illustration, you might ask it to prove the Riemann Hypothesis, also passing in a proof verifier. Operating under the protection of homomorphic encryption, the AI might find a proof of the Riemann Hypothesis and feed it to the proof verifier. The verifier outputs a single bit, “Yes, there is a proof of the Riemann Hypothesis,” but we never see the proof itself. (A minimal sketch of this protocol appears after this list.)
Negative utility for any act of self-analysis.
Corrigibility: Ensuring that the AI will allow us to turn it off if we so wish, typically by carefully defining its utility function.
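Here is a minimal sketch of that cryptographic-box protocol, in Python, with the homomorphic encryption layer itself omitted for clarity; the proof stays inside the box, and only the verifier’s single bit comes out. All function names are hypothetical.

    # Sketch of the "cryptographic box" protocol: the AI searches for a proof,
    # a verifier checks it, and only the verifier's single bit leaves the box.
    # In a real system both functions would be evaluated under homomorphic
    # encryption; that layer is omitted here.

    def run_boxed_ai(find_proof, verify, statement):
        """Return one bit: did the boxed AI find a proof of `statement` that verifies?"""
        candidate_proof = find_proof(statement)    # never leaves this function
        return bool(verify(statement, candidate_proof))

    # Toy stand-ins for the AI and the proof verifier.
    def toy_find_proof(statement):
        return "proof of " + statement             # placeholder "proof"

    def toy_verify(statement, proof):
        return proof == "proof of " + statement    # placeholder check

    if __name__ == "__main__":
        # We learn only that a proof exists, never the proof itself.
        print(run_boxed_ai(toy_find_proof, toy_verify, "Riemann Hypothesis"))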
III. The third area of FAI research is in choosing a goal function that matches human values, or, less ambitiously, a function that has some characteristics that match human values.
Verification for Autonomous Robots
Yoav Hollander asked that we focus on autonomous robots like drones and self-driving cars, given the extreme difficulty of verification for AGIs--while fully recognizing the different natures of the two challenges.
The state of the art in validating safety for these systems is pretty basic. There is some work on formal verification of some aspects of autonomous robots (e.g. here), and some initial work on dynamic, coverage-driven verification (e.g. here). The most advanced work in autonomous vehicle verification consists of dynamic verification of the entire software stack on automatically generated scenarios. These scenarios are based on recordings of video, LIDAR, and other sensors taken while driving on real roads; interesting events, such as a pedestrian jumping into the road, are superimposed on these recordings.
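As a rough sketch of that kind of scenario-based verification (the recorded trace, the injected pedestrian event, and the driving stack below are all invented stand-ins), one might superimpose an event on a recorded trace and check the stack’s response:

    # Toy sketch of scenario-based dynamic verification: superimpose an
    # "interesting event" on a recorded sensor trace and check the response
    # of a stand-in driving stack. All names are illustrative.

    def inject_pedestrian(trace, t):
        """Return a copy of the recorded trace with a pedestrian appearing at time t."""
        modified = list(trace)
        modified[t] = dict(trace[t], pedestrian_ahead=True)
        return modified

    def toy_driving_stack(frame):
        """Stand-in for the real software stack: brake if a pedestrian is ahead."""
        return "brake" if frame.get("pedestrian_ahead") else "cruise"

    def check_scenario(trace, t):
        """Verify that the stack brakes in the frame where the pedestrian appears."""
        actions = [toy_driving_stack(frame) for frame in inject_pedestrian(trace, t)]
        return actions[t] == "brake"

    if __name__ == "__main__":
        recorded_trace = [{"speed": 50} for _ in range(100)]   # stand-in for real sensor data
        print(all(check_scenario(recorded_trace, t) for t in range(100)))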
An important model for robot behavior is “Belief, Desire, Intention” (BDI), which is expressed in the AgentSpeak language (among others). The Java-based Jason interpreter (among others) then executes these behavioral models.
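To make the BDI structure concrete, here is a toy deliberation cycle in Python rather than AgentSpeak; the beliefs, desires, and plan library are invented for illustration. Beliefs select among desires, and the chosen plan’s actions become the intention.

    # Caricature of a BDI (Belief, Desire, Intention) deliberation cycle.
    # Real systems (e.g. Jason interpreting AgentSpeak) are far richer; this
    # only shows how beliefs pick a desire and a plan becomes an intention.

    beliefs = {"battery_low": False, "at_waypoint": False}
    desires = ["recharge", "reach_waypoint"]          # in priority order

    # Plan library: desire -> (context condition over beliefs, action sequence)
    plans = {
        "recharge": (lambda b: b["battery_low"], ["fly_to_base", "dock"]),
        "reach_waypoint": (lambda b: not b["at_waypoint"], ["plan_route", "fly_route"]),
    }

    def deliberate(beliefs):
        """Adopt as the intention the first desire whose plan context holds."""
        for desire in desires:
            context, actions = plans[desire]
            if context(beliefs):
                return desire, actions
        return None, []

    if __name__ == "__main__":
        intention, actions = deliberate(beliefs)
        print(intention, actions)    # ('reach_waypoint', ['plan_route', 'fly_route'])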
We can connect the three areas of FAI research (above) to the work of autonomous robot engineers:
Analyzing the modules is a good idea, though figuring out what the modules of an FAI would be is not easy.
Optimizing goal functions according to our true intentions--Do What I Mean--is relevant for autonomous robots just as it would be for an AGI.
Choosing a utility function looks more difficult if we cannot even have the AI-under-test output a half-page description of its preferences for humans to read. There is no clear way to identify unexpected perversities fully automatically, even in the limited autonomous robots of today.
Safety Tools for AI Development
Aur Saraf suggested a software tool that checks a proposed utility function for ten simple perversities. Researchers would be urged to run it before pressing the Big Red Button.
After a few runs, the researcher would start to abandon their facile assumptions about the safety of their AI. For example, the researcher runs the tool and learns that “Your AI would have turned us all into paperclips.” The researcher fixes the problem, runs the tool again, and learns “Your improved AI would have cryogenically frozen all of us.” Again, the researcher fixes the AI and runs the tool, which answers “Your twice-fixed AI would have turned the universe into a huge blob of molten gold.” At this point, maybe they would start realizing the danger.
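What might the interface to such a tool look like? A minimal sketch, in which the catastrophic scenarios, the threshold, and the deliberately naive utility function are all invented for illustration:

    # Sketch of a "perversity checker": evaluate a proposed utility function on
    # a hand-written list of catastrophic outcomes and flag any that it prefers.
    # The scenarios and the threshold are placeholders.

    CATASTROPHES = {
        "universe tiled with paperclips": {"paperclips": 1e30, "humans_alive": 0},
        "all humans cryogenically frozen": {"paperclips": 0, "humans_alive": 0},
        "universe turned into molten gold": {"paperclips": 0, "humans_alive": 0},
    }

    def check_perversities(utility, threshold=0.0):
        """Return the catastrophic outcomes that `utility` rates above the threshold."""
        return [name for name, world in CATASTROPHES.items()
                if utility(world) > threshold]

    def naive_utility(world):
        """A deliberately naive utility function: maximize paperclips."""
        return world["paperclips"]

    if __name__ == "__main__":
        for name in check_perversities(naive_utility):
            print("Your AI would have preferred:", name)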
If we can create an industry-standard “JUnit for AGI testing,” we can then distribute safety tests as part of that framework. The real danger in AGI, as Vadim pointed out, is a developer with a light finger on the Run button while developing the AI (“Let’s just run it for a few minutes, see what happens”). A generic test harness for AGI testing, widely distributed and a bother for others to reimplement, could then be a platform for instilling much more awareness of and care about safety among AI developers, for example by bundling checks like the “perversity checker” Aur suggested.
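Such a harness could wrap checks like the one above in a standard test framework, so that running the safety suite becomes the path of least resistance. A minimal unittest-style sketch, assuming the previous snippet has been saved as a (hypothetical) perversity_checker module:

    # Sketch of a generic "JUnit for AGI" harness: a standard test suite a
    # developer would run before pressing the Run button.

    import unittest
    from perversity_checker import check_perversities, naive_utility  # hypothetical module

    class AGISafetySuite(unittest.TestCase):
        """Standard safety tests distributed with the harness."""

        def test_no_known_perversities(self):
            flagged = check_perversities(naive_utility)
            self.assertEqual(flagged, [],
                             "Utility function prefers catastrophic outcomes: %s" % flagged)

    if __name__ == "__main__":
        unittest.main()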
________
Notes:
Here is Yoav’s summary of his opinions on this, as formed after the meeting.
Thanks to Aur Saraf for the notes on which this is based and to other participants for comments. Any flaws are mine.