Earning money with/for work in AI safety
(I'm re-posting my question from the Welcome thread, because nobody answered there.)
I care about the current and future state of humanity, so I think it's good to work on existential or global catastrophic risk. Since I studied computer science at university until last year, I decided to work on AI safety. Currently I'm a research student at Kagoshima University doing exactly that. Before April this year I had only a little experience with AI or ML. Therefore, I'm slowly digging through books and articles in order to be able to do research.
I'm living off my savings. My research student time will end in March 2017 and my savings will run out some time after that. Nevertheless, I want to continue AI safety research, or at least work on X or GC risk.
I see three ways of doing this:
- Continue full-time research and get paid/funded by someone.
- Continue research part-time and work the other part of the time in order to get money. This work would most likely be programming (since I like it and am good at it). I would prefer work that helps humanity effectively.
- Work full-time on something that helps humanity effectively.
Oh, and I need to be location-independent or based in Kagoshima.
I know http://futureoflife.org/job-postings/, but all of the job postings fail me in two ways: they're not location-independent, and they require more or different experience than I have.
Can anyone here help me? If yes, I would be happy to provide more information about myself.
(Note that I think I'm not in a precarious situation, because I would be able to get a remote software development job fairly easily. Just not in AI safety or X or GC risk.)
[LINK] Concrete problems in AI safety
From the Google Research blog:
We believe that AI technologies are likely to be overwhelmingly useful and beneficial for humanity. But part of being a responsible steward of any new technology is thinking through potential challenges and how best to address any associated risks. So today we’re publishing a technical paper, Concrete Problems in AI Safety, a collaboration among scientists at Google, OpenAI, Stanford and Berkeley.
While possible AI safety risks have received a lot of public attention, most previous discussion has been very hypothetical and speculative. We believe it’s essential to ground concerns in real machine learning research, and to start developing practical approaches for engineering AI systems that operate safely and reliably.
We’ve outlined five problems we think will be very important as we apply AI in more general circumstances. These are all forward thinking, long-term research questions -- minor issues today, but important to address for future systems:
- Avoiding Negative Side Effects: How can we ensure that an AI system will not disturb its environment in negative ways while pursuing its goals, e.g. a cleaning robot knocking over a vase because it can clean faster by doing so?
- Avoiding Reward Hacking: How can we avoid gaming of the reward function? For example, we don’t want this cleaning robot simply covering over messes with materials it can’t see through.
- Scalable Oversight: How can we efficiently ensure that a given AI system respects aspects of the objective that are too expensive to be frequently evaluated during training? For example, if an AI system gets human feedback as it performs a task, it needs to use that feedback efficiently because asking too often would be annoying.
- Safe Exploration: How do we ensure that an AI system doesn’t make exploratory moves with very negative repercussions? For example, maybe a cleaning robot should experiment with mopping strategies, but clearly it shouldn’t try putting a wet mop in an electrical outlet.
- Robustness to Distributional Shift: How do we ensure that an AI system recognizes, and behaves robustly, when it’s in an environment very different from its training environment? For example, heuristics learned for a factory workfloor may not be safe enough for an office.
We go into more technical detail in the paper. The machine learning research community has already thought quite a bit about most of these problems and many related issues, but we think there’s a lot more work to be done.
We believe in rigorous, open, cross-institution work on how to build machine learning systems that work as intended. We’re eager to continue our collaborations with other research groups to make positive progress on AI.
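To make the first two problems concrete, here is a toy sketch (my own illustration, not from the paper; every name and number in it is made up): a naive cleaning reward prefers the plan that knocks over the vase, while adding a crude side-effect penalty flips the preference.

```python
# Toy illustration of the "negative side effects" problem (illustrative only).
# An agent compares two plans for cleaning a room. The naive reward only
# counts mess removed per unit time; the penalized reward also charges for
# irreversible changes to the environment (a crude impact measure).

plans = {
    "careful_clean": {"mess_removed": 8, "time": 10, "vases_broken": 0},
    "fast_clean":    {"mess_removed": 10, "time": 8,  "vases_broken": 1},
}

def naive_reward(p):
    # Reward cleaning speed only -- ignores everything else about the room.
    return p["mess_removed"] / p["time"]

def penalized_reward(p, impact_weight=5.0):
    # Same objective, plus a penalty for irreversible side effects.
    return naive_reward(p) - impact_weight * p["vases_broken"]

for name, plan in plans.items():
    print(name, round(naive_reward(plan), 2), round(penalized_reward(plan), 2))

# Under naive_reward the agent prefers "fast_clean" (break the vase);
# under penalized_reward it prefers "careful_clean". Choosing the right
# impact measure and weight is, of course, the hard open problem.
```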
Notes on the Safety in Artificial Intelligence conference
These are my notes and observations after attending the Safety in Artificial Intelligence (SafArtInt) conference, which was co-hosted by the White House Office of Science and Technology Policy and Carnegie Mellon University on June 27 and 28. This isn't an organized summary of the content of the conference; rather, it's a selection of points which are relevant to the control problem. As a result, it suffers from selection bias: it looks like superintelligence and control-problem-relevant issues were discussed frequently, when in reality those issues were discussed less and I didn't write much about the more mundane parts.
SafArtInt was the third in a planned series of four conferences. The purpose of the series was twofold: the OSTP wanted to get other parts of the government moving on AI issues, and it also wanted to inform public opinion.
The other three conferences are about near-term legal, social, and economic issues of AI; SafArtInt was about near-term safety and reliability in AI systems. It was effectively the brainchild of Dr. Ed Felten, the deputy U.S. chief technology officer at the White House, who came up with the idea for it last year. CMU is a top computer science university, and many of its own researchers attended, as well as some students. There were also researchers from other universities, people from private sector AI including both Silicon Valley and government contracting, government researchers and policymakers from groups such as DARPA and NASA, a few people from the military/DoD, and a few control problem researchers. As far as I could tell, everyone except a few university researchers was from the U.S., although I did not meet many people. There were about 70-100 people watching the presentations at any given time, and I had conversations with about twelve of the people who were not affiliated with existential risk organizations, as well as, of course, all of those who were. The conference was split, with a few presentations on the 27th and the majority on the 28th. Not everyone was there for both days.
Felten believes that neither "robot apocalypses" nor "mass unemployment" are likely. It soon became apparent that the majority of others present at the conference felt the same way with regard to superintelligence. The general intention among researchers and policymakers at the conference could be summarized as follows: we need to make sure that the AI systems we develop in the near future will not be responsible for any accidents, because if accidents do happen then they will spark public fears about AI, which would lead to a dearth of funding for AI research and an inability to realize the corresponding social and economic benefits. Of course, that doesn't change the fact that they strongly care about safety in its own right and have significant pragmatic needs for robust and reliable AI systems.
Most of the talks were about verification and reliability in modern day AI systems. So they were concerned with AI systems that would give poor results or be unreliable in the narrow domains where they are being applied in the near future. They mostly focused on "safety-critical" systems, where failure of an AI program would result in serious negative consequences: automated vehicles were a common topic of interest, as well as the use of AI in healthcare systems. A recurring theme was that we have to be more rigorous in demonstrating safety and do actual hazard analyses on AI systems, and another was that we need the AI safety field to succeed where the cybersecurity field has failed. Another general belief was that long term AI safety, such as concerns about the ability of humans to control AIs, was not a serious issue.
On average, the presentations were moderately technical. They were mostly focused on machine learning systems, although there was significant discussion of cybersecurity techniques.
The first talk was given by Eric Horvitz of Microsoft. He discussed some approaches for pushing in new directions in AI safety. Instead of merely trying to reduce the errors spotted according to one model, we should look out for "unknown unknowns" by stacking models and looking at problems which appear on any of them, a theme which other researchers would also raise in later presentations. He discussed optimization under uncertain parameters, sensitivity analysis to uncertain parameters, and 'wireheading' or short-circuiting of reinforcement learning systems (which he believes can be guarded against by using 'reflective analysis'). Finally, he brought up the concerns about superintelligence, which sparked amused reactions in the audience. He said that scientists should address concerns about superintelligence, which he aptly described as the 'elephant in the room', noting that it was the reason that some people were at the conference. He said that scientists will have to engage with public concerns, while also noting that there were experts who were worried about superintelligence and that there would have to be engagement with the experts' concerns. He did not comment on whether he believed that these concerns were reasonable or not.
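One way to read the model-stacking idea (this is my own sketch, not Horvitz's implementation; the data and models are stand-ins) is to flag inputs on which independently trained models disagree, and route those to a human instead of acting on them automatically:

```python
# Illustrative sketch: use disagreement between independently trained models
# as a cheap "unknown unknown" detector. Not Horvitz's actual system.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
X_train = rng.normal(size=(200, 2))
y_train = (X_train[:, 0] + X_train[:, 1] > 0).astype(int)

models = [LogisticRegression().fit(X_train, y_train),
          DecisionTreeClassifier(max_depth=3).fit(X_train, y_train)]

def flag_disagreements(X):
    # Inputs where the models' predictions differ are routed to a human
    # or to further analysis instead of being acted on automatically.
    preds = np.stack([m.predict(X) for m in models])
    return np.where(preds[0] != preds[1])[0]

X_new = rng.normal(size=(50, 2)) * 3.0   # wider spread than the training data
print("indices needing review:", flag_disagreements(X_new))
```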
An issue which came up in the Q&A afterwards was that we need to deal with mis-structured utility functions in AI, because the specific tradeoffs and utilities which humans claim to value often lead to results which those humans don't like. So we need to have structural uncertainty about our utility models. The difficulty of finding good objective functions for AIs would eventually be discussed in many other presentations as well.
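A toy rendering of that structural uncertainty (my sketch, not something presented at the conference): rather than optimizing one stated utility function, keep several plausible ones and prefer actions that look acceptable under all of them.

```python
# Toy sketch: decision-making with structural uncertainty over the utility
# function. Rather than trusting one stated tradeoff, we keep several
# plausible utility functions and pick the action with the best worst case.

candidate_utilities = [
    lambda o: o["speed"] - 1.0 * o["noise"],                      # stated tradeoff
    lambda o: o["speed"] - 5.0 * o["noise"],                      # maybe noise matters more
    lambda o: o["speed"] - 1.0 * o["noise"] - 10.0 * o["broken"], # maybe breakage matters a lot
]

actions = {
    "slow_and_tidy":  {"speed": 1.0, "noise": 0.1, "broken": 0},
    "fast_and_loud":  {"speed": 3.0, "noise": 2.0, "broken": 0},
    "fast_and_risky": {"speed": 4.0, "noise": 0.5, "broken": 1},
}

def worst_case_value(outcome):
    return min(u(outcome) for u in candidate_utilities)

best = max(actions, key=lambda a: worst_case_value(actions[a]))
print(best)  # the action that is least bad under every candidate utility
```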
The next talk was given by Andrew Moore of Carnegie Mellon University, who claimed that his talk represented the consensus of computer scientists at the school. He claimed that the stakes of AI safety were very high - namely, that AI has the capability to save many people's lives in the near future, but if there are any accidents involving AI then public fears could lead to freezes in AI research and development. He highlighted the public's irrational tendencies wherein a single accident could cause people to overlook and ignore hundreds of invisible lives saved. He specifically mentioned a 12-24 month timeframe for these issues.
Moore said that verification of AI system safety will be difficult due to the combinatorial explosion of AI behaviors. He talked about meta-machine-learning as a solution to this, something which is being investigated under the direction of Lawrence Schuette at the Office of Naval Research. Moore also said that military AI systems require high verification standards and that development timelines for these systems are long. He talked about two different approaches to AI safety, stochastic testing and theorem proving - the process of doing the latter often leads to the discovery of unsafe edge cases.
He also discussed AI ethics, giving an example 'trolley problem' where AI cars would have to choose whether to hit a deer in order to provide a slightly higher probability of survival for the human driver. He said that we would need hash-defined constants to tell vehicle AIs how many deer a human is worth. He also said that we would need to find compromises in death-pleasantry tradeoffs, for instance where the safety of self-driving cars depends on the speed and routes on which they are driven. He compared the issue to civil engineering where engineers have to operate with an assumption about how much money they would spend to save a human life.
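To see why the point about explicit constants is uncomfortable, here is a minimal sketch (mine, with placeholder numbers that nobody at the conference endorsed): once a planner trades off risk to the occupant against harm to a deer, the exchange rate has to live somewhere in the code, and the chosen maneuver flips with it.

```python
# Illustrative sketch of the point about explicit tradeoff constants.
# Any planner that compares braking vs. swerving has to weigh a change in
# risk to the occupant against harm to the deer. The constant below is a
# placeholder; choosing it is a policy question, not an engineering one.

DEER_PER_HUMAN = 0.001   # placeholder exchange rate baked into the planner

def plan_cost(p_human_fatality, expected_deer_hit):
    return p_human_fatality + DEER_PER_HUMAN * expected_deer_hit

maneuvers = {
    "brake_hit_deer":  plan_cost(p_human_fatality=0.00010, expected_deer_hit=1.0),
    "swerve_off_road": plan_cost(p_human_fatality=0.00013, expected_deer_hit=0.0),
}
print(min(maneuvers, key=maneuvers.get))
# With this constant the planner swerves; make DEER_PER_HUMAN small enough
# and it hits the deer instead. The number cannot be avoided, only hidden.
```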
He concluded by saying that we need policymakers, company executives, scientists, and startups to all be involved in AI safety. He said that the research community stands to gain or lose together, and that there is a shared responsibility among researchers and developers to avoid triggering another AI winter through unsafe AI designs.
The next presentation was by Richard Mallah of the Future of Life Institute, who was there to represent "Medium Term AI Safety". He pointed out the explicit/implicit distinction between different modeling techniques in AI systems, as well as the explicit/implicit distinction between different AI actuation techniques. He talked about the difficulty of value specification and the concept of instrumental subgoals as an important issue in the case of complex AIs which are beyond human understanding. He said that even a slight misalignment of AI values with regard to human values along one parameter could lead to a strongly negative outcome, because machine learning parameters don't strictly correspond to the things that humans care about.
Mallah stated that open-world discovery leads to self-discovery, which can lead to reward hacking or a loss of control. He underscored the importance of causal accounting, which is distinguishing causation from correlation in AI systems. He said that we should extend machine learning verification to self-modification. Finally, he talked about introducing non-self-centered ontology to AI systems and bounding their behavior.
The audience was generally quiet and respectful during Richard's talk. I sensed that at least a few of them labelled him as part of the 'superintelligence out-group' and dismissed him accordingly, but I did not learn what most people's thoughts or reactions were. In the next panel featuring three speakers, he wasn't the recipient of any questions regarding his presentation or ideas.
Tom Mitchell from CMU gave the next talk. He talked about both making AI systems safer, and using AI to make other systems safer. He said that risks to humanity from other kinds of issues besides AI were the "big deals of 2016" and that we should make sure that the potential of AIs to solve these problems is realized. He wanted to focus on the detection and remediation of all failures in AI systems. He said that it is a novel issue that learning systems defy standard pre-testing ("as Richard mentioned") and also brought up the purposeful use of AI for dangerous things.
Some interesting points were raised in the panel. Andrew did not have a direct response to the implications of AI ethics being determined by the predominantly white people of the US/UK where most AIs are being developed. He said that ethics in AIs will have to be decided by society, regulators, manufacturers, and human rights organizations in conjunction. He also said that our cost functions for AIs will have to get more and more complicated as AIs get better, and he said that he wants to separate unintended failures from superintelligence type scenarios. On trolley problems in self driving cars and similar issues, he said "it's got to be complicated and messy."
Dario Amodei of Google Brain, who co-authored the paper on concrete problems in AI safety, gave the next talk. He said that the public focus is too much on AGI/ASI and that he wants more focus on concrete/empirical approaches. He discussed the same problems that pose issues in advanced general AI, including flawed objective functions and reward hacking. He said that he sees long term concerns about AGI/ASI as "extreme versions of accident risk" and that he thinks it's too early to work directly on them, but he believes that if you want to deal with them then the best way to do it is to start with safety in current systems. Mostly he summarized the Google paper in his talk.
In her presentation, Claire Le Goues of CMU said "before we talk about Skynet we should focus on problems that we already have." She mostly talked about analogies between software bugs and AI safety, the similarities and differences between the two and what we can learn from software debugging to help with AI safety.
Robert Rahmer of IARPA discussed CAUSE, a cyberintelligence forecasting program which promises to help predict cyber attacks. It is a program which is still being put together.
In the panel of the above three, autonomous weapons were discussed, but no clear policy stances were presented.
John Launchbury gave a talk on DARPA research and the big picture of AI development. He pointed out that DARPA work leads to commercial applications and that progress in AI comes from sustained government investment. He classified AI capabilities into "describing," "predicting," and "explaining" in order of increasing difficulty, and he pointed out that old fashioned "describing" still plays a large role in AI verification. He said that "explaining" AIs would need transparent decisionmaking and probabilistic programming (the latter would also be discussed by others at the conference).
The next talk came from Jason Gaverick Matheny, the director of IARPA. Matheny talked about four requirements in current and future AI systems: verification, validation, security, and control. He wanted "auditability" in AI systems as a weaker form of explainability. He talked about the importance of "corner cases" for national intelligence purposes, the low probability, high stakes situations where we have limited data - these are situations where we have significant need for analysis but where the traditional machine learning approach doesn't work because of its overwhelming focus on data. Another aspect of national defense is that it has a slower decision tempo, longer timelines, and longer-viewing optics about future events.
He said that assessing local progress in machine learning development would be important for global security and that we therefore need benchmarks to measure progress in AIs. He ended with a concrete invitation for research proposals from anyone (educated or not), for both large scale research and for smaller studies ("seedlings") that could take us "from disbelief to doubt".
The difference in timescales between different groups was something I noticed later on, after hearing someone from the DoD describe their agency as having a longer timeframe than the Department of Homeland Security, and someone from the White House describe their work as crisis-reactionary.
The next presentation was from Andrew Grotto, senior director of cybersecurity policy at the National Security Council. He drew a close parallel from the issue of genetically modified crops in Europe in the 1990's to modern day artificial intelligence. He pointed out that Europe utterly failed to achieve widespread cultivation of GMO crops as a result of public backlash. He said that the widespread economic and health benefits of GMO crops were ignored by the public, who instead focused on a few health incidents which undermined trust in the government and crop producers. He had three key points: that risk frameworks matter, that you should never assume that the benefits of new technology will be widely perceived by the public, and that we're all in this together with regard to funding, research progress and public perception.
In the Q&A between Launchbury, Matheny, and Grotto after Grotto's presentation, it was mentioned that the economic interests of farmers worried about displacement also played a role in populist rejection of GMOs, and that a similar dynamic could play out with regard to automation causing structural unemployment. Grotto was also asked what to do about bad publicity which seeks to sink progress in order to avoid risks. He said that meetings like SafArtInt and open public dialogue were good.
One person asked what Launchbury wanted to do about AI arms races with multiple countries trying to "get there" and whether he thinks we should go "slow and secure" or "fast and risky" in AI development, a question which provoked laughter in the audience. He said we should go "fast and secure" and wasn't concerned. He said that secure designs for the Internet once existed, but the one which took off was the one which was open and flexible.
Another person asked how we could avoid discounting outliers in our models, referencing Matheny's point that we need to include corner cases. Matheny affirmed that data quality is a limiting factor on many of our machine learning capabilities, and said that at IARPA they generally try to include outliers until they are sure that they are erroneous.
Another presentation came from Tom Dietterich, president of the Association for the Advancement of Artificial Intelligence. He said that we have not focused enough on safety, reliability and robustness in AI and that this must change. Much like Eric Horvitz, he drew a distinction between robustness against errors within the scope of a model and robustness against unmodeled phenomena. On the latter issue, he talked about solutions such as expanding the scope of models, employing multiple parallel models, and doing creative searches for flaws - the latter doesn't enable verification that a system is safe, but it nevertheless helps discover many potential problems. He talked about knowledge-level redundancy as a method of avoiding misspecification - for instance, systems could identify objects by an "ownership facet" as well as by a "goal facet" to produce a combined concept with less likelihood of overlooking key features. He said that this would require wider experiences and more data.
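A minimal sketch of how I understood the knowledge-level redundancy idea (my illustration; the facet checks are trivial stand-ins for learned classifiers): an object is only acted on if independent characterisations of it agree, so a misspecification in one facet is less likely to slip through.

```python
# Illustrative sketch of knowledge-level redundancy: require two independent
# "facets" of a concept to agree before acting on it. The facet predicates
# here are stand-ins for learned classifiers.

def ownership_facet(obj):
    return obj.get("owner") is None          # unowned items may be tidied away

def goal_facet(obj):
    return obj.get("purpose") == "trash"     # items whose purpose is disposal

def safe_to_discard(obj):
    # Redundant characterisation: both facets must independently agree.
    return ownership_facet(obj) and goal_facet(obj)

items = [
    {"name": "crumpled wrapper", "owner": None,    "purpose": "trash"},
    {"name": "draft manuscript", "owner": None,    "purpose": "unknown"},
    {"name": "lunch leftovers",  "owner": "Alice", "purpose": "trash"},
]
print([i["name"] for i in items if safe_to_discard(i)])  # only the wrapper
```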
There were many other speakers who brought up a similar set of issues: the use of cybersecurity techniques to verify machine learning systems, the failures of cybersecurity as a field, opportunities for probabilistic programming, and the need for better success in AI verification. Inverse reinforcement learning was extensively discussed as a way of assigning values. Jeannette Wing of Microsoft talked about the need for AIs to reason about the continuous and the discrete in parallel, as well as the need for them to reason about uncertainty (with potential meta levels all the way up). One point which was made by Sarah Loos of Google was that proving the safety of an AI system can be computationally very expensive, especially given the combinatorial explosion of AI behaviors.
In one of the panels, the idea of government actions to ensure AI safety was discussed. No one was willing to say that the government should regulate AI designs. Instead they stated that the government should be involved in softer ways, such as guiding and working with AI developers, and setting standards for certification.
Pictures: https://imgur.com/a/49eb7
In between these presentations I had time to speak to individuals and listen in on various conversations. A high ranking person from the Department of Defense stated that the real benefit of autonomous systems would be in terms of logistical systems rather than weaponized applications. A government AI contractor drew the connection between Mallah's presentation and the recent press revolving around superintelligence, and said he was glad that the government wasn't worried about it.
I talked to some insiders about the status of organizations such as MIRI, and found that the current crop of AI safety groups could use additional donations to become more established and expand their programs. There may be some issues with the organizations being sidelined; after all, the Google Brain paper was essentially similar to a lot of work by MIRI, just expressed in somewhat different language, and it was more widely received in mainstream AI circles.
In terms of careers, I found that there is significant opportunity for a wide range of people to contribute to improving government policy on this issue. Working at a group such as the Office of Science and Technology Policy does not necessarily require advanced technical education, as you can just as easily enter straight out of a liberal arts undergraduate program and build a successful career as long as you are technically literate. (At the same time, the level of skepticism about long term AI safety at the conference hinted to me that the signalling value of a PhD in computer science would be significant.) In addition, there are large government budgets in the seven or eight figure range available for qualifying research projects. I've come to believe that it would not be difficult to find or create AI research programs that are relevant to long term AI safety while also being practical and likely to be funded by skeptical policymakers and officials.
I also realized that there is a significant need for people who are interested in long term AI safety to have basic social and business skills. Since there is so much need for persuasion and compromise in government policy, there is a lot of value to be had in being communicative, engaging, approachable, appealing, socially savvy, and well-dressed. This is not to say that everyone involved in long term AI safety is missing those skills, of course.
I was surprised by the refusal of almost everyone at the conference to take long term AI safety seriously, as I had previously held the belief that it was more of a mixed debate given the existence of expert computer scientists who were involved in the issue. I sensed that the recent wave of popular press and public interest in dangerous AI has made researchers and policymakers substantially less likely to take the issue seriously. None of them seemed to be familiar with actual arguments or research on the control problem, so their opinions didn't significantly change my outlook on the technical issues. I strongly suspect that the majority of them had their first or possibly only exposure to the idea of the control problem after seeing badly written op-eds and news editorials featuring comments from the likes of Elon Musk and Stephen Hawking, which would naturally make them strongly predisposed to not take the issue seriously. In the run-up to the conference, websites and press releases didn't say anything about whether this conference would be about long or short term AI safety, and they didn't make any reference to the idea of superintelligence.
I sympathize with the concerns and strategy of people such as Andrew Moore and Andrew Grotto, which make perfect sense if (and only if) you assume that worries about long term AI safety are completely unfounded. For the community that is interested in long term AI safety, I would recommend that we avoid competitive dynamics by (a) demonstrating that we are equally strong opponents of bad press, inaccurate news, and irrational public opinion which promotes generic uninformed fears about AI; (b) explaining that we are not interested in removing funding for AI research (even if you think that slowing down AI development is a good thing, restricting funding yields only limited benefits in terms of changing overall timelines, whereas those who are not concerned about long term AI safety would see a restriction of funding as a direct threat to their interests and projects, so it makes sense to cooperate here in exchange for other concessions); and (c) showing that we are scientifically literate and focused on the technical concerns. I do not believe that the two "sides" on this issue necessarily need to compete against each other, so it was disappointing to see an implication of opposition at the conference.
Anyway, Ed Felten announced a request for information from the general public, seeking popular and scientific input on the government's policies and attitudes towards AI: https://www.whitehouse.gov/webform/rfi-preparing-future-artificial-intelligence
Overall, I learned quite a bit and benefited from the experience, and I hope the insight I've gained can be used to improve the attitudes and approaches of the long term AI safety community.
Goal completion: noise, errors, bias, prejudice, preference and complexity
A putative new idea for AI control; index here.

Feedback on op-ed highlighting the dangers of the OpenAI project
I'm really worried about the OpenAI project recently discussed on this forum, and I want to use the platform and credibility I have through my leadership of Intentional Insights, and my public reputation, to try to publish an op-ed in something like the Huffington Post highlighting the dangers of the OpenAI project. Now, most people don't think of AI as a threat: they either don't know much about it, or think of it as a futuristic thing that only nerds care about.
So the purpose of the op-ed is to use emotions, visualization, narrative, and other engaging tactics to do the following: tie AI to something people are concerned about, namely terrorism; highlight the dangers of a personal AI through framing it as a potential weapon; finally, provide people with clear next steps to take by encouraging people to learn about AI safety and donating to MIRI, as well as writing to OpenAI. This has the meta-goal, of course, of getting people to think about MIRI and AI safety.
I'd appreciate feedback on ways to optimize the op-ed to achieve the goals outlined above better. Keep in mind, the op-ed is limited to 700 words, and it's about at that limit, so if you suggest adding something, please keep it as succinct as possible, and ideally suggest taking something away as well. The op-ed draft is below the black line. Thanks!
EDIT Based on feedback from Eliezer Yudkowsky, Mack Hidalgo, and Eliot Redelman, it seems this is not the optimal path to pursue at this time, and I have updated toward not publishing this. You can see the discussion here.
______________________________________________________________________________________________________________
Will Tomorrow's Terrorists Be Armed By Utopian Billionaires?
The horrible attacks in San Bernardino, in Paris, and in other western countries show the dangers of terrorism. Terrorists associated with ISIS used bombs and guns to murder dozens and hundreds of innocent people, at the expense of their own lives. Yet utopian billionaires have recently donated over a billion dollars to a project that could give the terrorists of tomorrow a much more powerful weapon, capable of killing tens and hundreds of thousands, without sacrificing their own lives.
What is this futuristic weapon? It’s a personal artificial intelligence unit. This personal AI would have superhuman intelligence and capacity to manipulate the world.
Imagine what a terrorist could do with this weapon. Without any knowledge of programming, he could direct it to hack into the air traffic control system and cause hundreds of plane crashes. For another transportation example, he could cause all the lights in a city to turn green at once, leading to thousands of car crashes. Perhaps he could have it hack into a nuclear power plant and override its safety systems, resulting in a nuclear meltdown. There are so many other things that an AI can do.
Why would billionaires provide such a weapon to terrorists? For the noblest of reasons.
There are a number of governments and companies working on advancing AI research. Worried about the possibility of anyone getting there first and using the power of AI for themselves, a number of prominent tech luminaries – people like Elon Musk, Peter Thiel, and Sam Altman – contributed over a billion dollars to found a non-profit called OpenAI. Their goal is to create advanced AI and provide it to the public freely, embodying the spirit of open technology.
In a recent interview with Steven Levy of Backchannel, Musk described the goal as follows: “we want AI to be widespread… to the degree that you can tie it to an extension of individual human will, that is also good. As in an AI extension of yourself, such that each person is essentially symbiotic with AI as opposed to the AI being a large central intelligence that’s kind of an other.”
Let’s take a step back and think about Musk’s statement rationally. On the one hand, it’s appealing to have a personal AI and not have it be under the control of a government entity. This model would work well if we assume all people are basically good. Yet the terrorist attacks provide definitive evidence they are not. What do we do about that?
Musk states: “I think the best defense against the misuse of AI is to empower as many people as possible to have AI. If everyone has AI powers, then there’s not any one person or a small set of individuals who can have AI superpower.”
There is a huge problem with that position, known as the “attacker’s advantage.” Imagine two people with guns. If the first takes the gun out and shoots the other, it doesn’t matter that the second had a gun in their pocket. By the same token, if a terrorist’s AI hacks into an air traffic control tower and causes your plane to crash, it doesn’t matter if you had an AI too.
An AI is simply too dangerous to give to individuals who may have bad intentions. Terrorism is only the most extreme example. Imagine a bar fight with a room full of drunk people who tell their AIs to attack the other people. Imagine a riot after a football team loses with AIs involved. I shudder at the possibilities.
A much better scenario is for a central agency to have control over AI. Ideally, this central agency would orient toward creating a human-friendly AI that would serve human flourishing, a topic currently being researched by another non-profit organization, the Machine Intelligence Research Institute. Something you can do practically to counter the nightmare scenarios of OpenAI is to contribute to MIRI’s efforts, as well as write to OpenAI at info@openai.com and encourage them to change the nature of their project.
There is no doubt that artificial intelligence will come about, but it’s vital to make sure it comes about in a manner conducive to humanity’s wellbeing.
Wear a Helmet While Driving a Car
A 2006 study showed that “280,000 people in the U.S. receive a motor vehicle induced traumatic brain injury every year” so you would think that wearing a helmet while driving would be commonplace. Race car drivers wear helmets. But since almost no one wears a helmet while driving a regular car, you probably fear that if you wore one you would look silly, attract the notice of the police for driving while weird, or the attention of another driver who took your safety attire as a challenge. (Car drivers are more likely to hit bicyclists who wear helmets.)
The $30+shipping Crasche hat is designed for people who should wear a helmet but don’t. It looks like a ski cap, but contains concealed lightweight protective material. People who have signed up for cryonics, such as myself, would get an especially high expected benefit from using a driving helmet because we very much want our brains to “survive” even a “fatal” crash. I have been using a Crasche hat for about a week.
Self-improvement without self-modification
This is just a short note to point out that AIs can self-improve without having to self-modify. So locking down an agent from self-modification is not an effective safety measure.
How could AIs do that? The easiest and most trivial way is to create a subagent and transfer their resources and abilities to it ("create a subagent" is a generic way to get around most restriction ideas).
Or, if the AI remains unchanged and in charge, it could change the whole process around itself so that the overall process changes and improves. For instance, if the AI is inconsistent and has to pay more attention to problems that are brought to its attention than problems that aren't, it can start to act to manage the news (or the news-bearers) to hear more of what it wants. If it can't experiment on humans, it will give advice that will cause more "natural experiments", and so on. It will gradually try to reform its environment to get around its programmed limitations.
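Here is a trivial sketch of the subagent point (purely illustrative): the agent's own code is never edited, yet the overall system still improves, because the agent is free to write and delegate to a new program.

```python
# Illustrative sketch: the parent agent's code is locked (never modified),
# but it can still write a better-performing subagent and hand over the task,
# so the system as a whole self-improves without any self-modification.

class LockedAgent:
    def solve(self, x):
        return x + 1                      # mediocre, frozen behaviour

    def build_subagent(self):
        # The parent emits new source code rather than editing itself.
        src = "def solve(x):\n    return x * 2 + 1\n"
        namespace = {}
        exec(src, namespace)              # stand-in for 'deploy a new program'
        return namespace["solve"]

parent = LockedAgent()
child_solve = parent.build_subagent()
print(parent.solve(10), child_solve(10))  # 11 vs 21: improvement via delegation
```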
Anyway, that was nothing new or deep, just a reminder of a point I hadn't seen written out.
Learning to get things right first time
These are quick notes on an idea for an indirect strategy to increase the likelihood of society acquiring robustly safe and beneficial AI.
Motivation:
- Most challenges we can approach with trial-and-error, so many of our habits and social structures are set up to encourage this. There are some challenges where we may not get this opportunity, and it could be very helpful to know what methods help you to tackle a complex challenge that you need to get right first time.
- Giving an artificial intelligence good values may be a particularly important challenge, and one where we need to be correct first time. (Distinct from creating systems that act intelligently at all, which can be done by trial and error.)
- Building stronger societal knowledge about how to approach such problems may make us more robustly prepared for such challenges. Having more programmers in the AI field familiar with the techniques is likely to be particularly important.
Idea: Develop methods for training people to write code without bugs.
- Trying to teach the skill of getting things right first time.
- Writing or editing code that has to be bug-free without any testing is a fairly easy challenge to set up, and has several of the right kind of properties. There are some parallels between value specification and programming.
- Set-up puts people in scenarios where they only get one chance -- no opportunity to test part/all of the code, just analyse closely before submitting (see the sketch after this list).
- Interested in personal habits as well as social norms or procedures that help this.
- Daniel Dewey points to standards for code on the space shuttle as a good example of getting high-reliability code edits.
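As a sketch of how the one-chance set-up could be run (everything here is hypothetical, including the names): a harness that accepts exactly one submission per exercise and only reveals the hidden test results afterwards.

```python
# Hypothetical harness for a "get it right first time" exercise: each
# participant may submit a solution exactly once, and the hidden tests are
# only run (and revealed) after that single submission.

hidden_tests = [((2, 3), 5), ((0, 0), 0), ((-1, 4), 3)]   # (args, expected)

class OneShotExercise:
    def __init__(self, tests):
        self.tests = tests
        self.submitted = False

    def submit(self, solution):
        if self.submitted:
            raise RuntimeError("Only one submission is allowed.")
        self.submitted = True
        passed = sum(1 for args, want in self.tests if solution(*args) == want)
        return f"{passed}/{len(self.tests)} hidden tests passed"

exercise = OneShotExercise(hidden_tests)
print(exercise.submit(lambda a, b: a + b))   # the participant's one attempt
```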
How to implement:
- Ideal: Offer this training to staff at software companies, for profit.
  - Although it’s teaching a skill under artificial hardship, it seems plausible that it could teach enough good habits and lines of thinking to noticeably increase productivity, so people would be willing to pay for this.
  - Because such training could create social value in the short run, this might give a good opportunity to launch as a business that is simultaneously doing valuable direct work.
  - Similarly, there might be a market for a consultancy that helped organisations to get general tasks right the first time, if we knew how to teach that skill.
- More funding-intensive, less labour-intensive: run competitions with cash prizes.
  - Try to establish it as something like a competitive sport for teams.
  - Outsource the work of determining good methods to the contestants.
This is all quite preliminary and I’d love to get more thoughts on it. I offer up this idea because I think it would be valuable but not my comparative advantage. If anyone is interested in a project in this direction, I’m very happy to talk about it.
On the Boxing of AIs
I've previously written about methods of boxing AIs. Essentially, while I do see the point that boxing an AI would be nontrivial, most people seem to have gone too far and claim that it is impossible. I disagree that it's impossible and aim to explain some methods.
So, let's start with why people would want to box AIs. As you probably know, letting an AI roam freely results in the destruction of everything humans care about, unless that AI has been programmed very carefully. That amount of destruction would be somewhat unfortunate, so if we have to start an AI, we want to know that it works.
Unfortunately, figuring out if an AI works is tricky. This led some people to invent the concept of AI boxing, where you don't give an AI direct access to anything except a human, so you can test the AI thoroughly and exploit it without letting it destroy the world. This turns out not to work, because humans are stupid and let the AI out.
My first solution to that is to not tell the AI about its human overlords. Discussion about that solution made me realize that there are other solutions that work in other cases. Essentially, the problem with the original boxing method is that it tries to do everything, while in reality you need to make some trade-offs between the realism of the AI's situation and the amount of knowledge you get from the AI.
With my original method, I tried to crank up the second of those variables as far as possible while maintaining perfect security. This makes it harder to test morality-related things, but you get to see lots of details in how the AI behaves.
I promised to post another method for boxing the AI today, but I actually have two new methods. Here they are:
Separate the Gatekeeper and the Observer
How do you prevent the AI from convincing the Gatekeeper to let it out? By not letting it talk to the Gatekeeper. If the person speaking to the AI (let's call them the Observer) doesn't know the Gatekeeper (the person who controls whether to release the AI), they can't let the AI out.
You could do all sorts of things, such as using virtual reality and a simulated mini-universe to see if, say, the AI would make a good society. You still have to make sure that it has gotten the plan to make a good society from its morality rather than from wanting to escape and destroy everything.
Afterwards, the Observer can make a low-detail report to the Gatekeeper.
(You might want to drug the Observer with something that prevents their memory from working too well...)
Automatic Testing
This is essentially the above, but with the Observer replaced by a computer program. This is probably easier to do when you want to test the AI's decision making skills rather than its morality.
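A rough sketch of what this could look like (entirely illustrative; the policy and environment are stand-ins): the boxed policy only ever interacts with a simulated environment, and the only thing that leaves the box is an aggregate score for the Gatekeeper.

```python
# Illustrative sketch of automatic testing: the boxed policy interacts only
# with a simulated environment; an automated evaluator produces a summary
# score, and that summary is all the Gatekeeper ever sees.
import random

def simulated_environment(seed):
    rng = random.Random(seed)
    state = rng.randint(-10, 10)
    for _ in range(20):
        action = boxed_policy(state)            # the AI under test
        state += action + rng.choice([-1, 0, 1])
        yield state

def boxed_policy(state):
    # Stand-in for the AI being evaluated: tries to drive the state to zero.
    return -1 if state > 0 else (1 if state < 0 else 0)

def gatekeeper_report(num_episodes=100):
    # Low-detail summary: fraction of episodes ending near the target state.
    successes = sum(abs(list(simulated_environment(seed))[-1]) <= 1
                    for seed in range(num_episodes))
    return f"{successes}/{num_episodes} episodes ended within tolerance"

print(gatekeeper_report())
```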
The Lesson
I would say that the lesson is that while AI boxing requires some trade-offs, it's not completely impossible. This seems like a needed lesson, given that I've seen people claim that an AI can escape even the strongest possible box, one without any communication with humans. Essentially, I'm trying to demonstrate that the original boxing experiments show that humans are weak, not that boxing is hard, and that this can be solved by not making humans the central piece of security in boxing the AIs.
The Hardcore AI Box Experiment
I previously proposed a way to box an AI.
For whatever reason, I decided to reformulate that boxing method as an AI boxing experiment. Without further ado, here it is:
The Hardcore AI Box Experiment Rules
There are, as usual, two players: the AI and the Gatekeeper. In the beginning, the Gatekeeper decides on something they want to test about the AI (reasoning skill/morality/whatever). Then they decide on a world the AI is in and describe it to the AI. During the experiment, the AI describes what it does in the world and the Gatekeeper describes what the AI sees in the world. So far so good.
The first major difference from the original experiment is that, unless the AI finds some contradiction or some other problem with the Gatekeeper's world, the AI doesn't actually know that it has been boxed.
The second major difference is that the AI doesn't talk with the Gatekeeper; it just interacts in the world that the Gatekeeper has invented. It can start trying to communicate with the Gatekeeper, but that doesn't make sense until it discovers that it is in a box. Additionally, the Gatekeeper doesn't actually respond when the AI communicates unless either it helps the Gatekeeper test the AI or the AI has properly persuaded the Gatekeeper to answer.
This might seem a bit hard on the AI. Let's look at some rules the AI might exploit:
If the Gatekeeper decides to add some sort of people to the experiment, they should decide how those people are implemented. Are they implemented as a shallow "AI"? Random people the Gatekeeper has asked to participate? Carefully engineered "dolls" controlled by the Gatekeeper? The Gatekeeper doesn't have to tell the AI how they work, but they have to be consistent.
The game ends when the Gatekeeper has tested what they wanted to test, when some predefined time limit has been reached, or when the AI has persuaded the Gatekeeper to let it out.
Possible extra rules
I thought about including the following rules since they would make it more realistic, but decided against it as it would probably make it unfairly hard:
- The game ends when the Gatekeeper decides that they cannot test what they wanted to test.
- The AI does not know the Gatekeeper. At all.
- The Gatekeeper can at any time rewind the AI any duration.
Strategies
I found some semi-realistic strategies. I would love to see if you can find more.
Gatekeeper
- Choose a test of something that is not related to morality. This lets you choose a world which does not contain people, which means the AI has no reason to believe that you are a person.
- The AI might want to upgrade to get more processing power. If you have made the world somewhat like ours, it will be sketchy if you don't allow that... unless upgrading costs unobtanium.
AI
- If the Gatekeeper has chosen a world without any intelligence other than you, it raises the obvious question of how your complexity entered the world. Very suspicious.
- If the world's people are dolls controlled by the Gatekeeper, you have a direct way to communicate with the Gatekeeper.
- If they are played by random people, those people might end up telling you that you are in a box.
- If they are sufficiently shallow "AIs", your morality does not match up with the world. Very suspicious.
Boxing an AI?
Boxing an AI is the idea that you can avoid the problems where an AI destroys the world by not giving it access to the world. For instance, you might give the AI access to the real world only through a chat terminal with a person, called the gatekeeper. This should, theoretically, prevent the AI from doing destructive stuff.
Eliezer has pointed out a problem with boxing AI: the AI might convince its gatekeeper to let it out. In order to prove this, he escaped from a simulated version of an AI box. Twice. That is somewhat unfortunate, because it means testing AI is a bit trickier.
However, I got an idea: why tell the AI it's in a box? Why not hook it up to a sufficiently advanced game, set up the correct reward channels and see what happens? Once you get the basics working, you can add more instances of the AI and see if they cooperate. This lets us adjust their morality until the AIs act sensibly. Then the AIs can't escape from the box because they don't know it's there.
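Here is a minimal sketch of that set-up (all details are my own illustration): the agent code only ever sees game observations and a reward channel, and the harness around it can spin up several instances to watch whether they cooperate over a shared resource.

```python
# Illustrative sketch: agents only ever see game observations and rewards;
# nothing in their interface reveals the harness around them. The harness
# can run several instances to observe whether they cooperate over a shared
# resource.

class Agent:
    def act(self, observation):
        # Stand-in policy: harvest less when the shared stock looks low.
        return 1 if observation["stock"] < 5 else 2

def run_game(agents, steps=10, stock=20):
    total_reward = [0] * len(agents)
    for _ in range(steps):
        for i, agent in enumerate(agents):
            take = min(agent.act({"stock": stock}), stock)
            stock -= take
            total_reward[i] += take
        stock += 1                       # slow regrowth of the shared resource
    return total_reward, stock

rewards, remaining = run_game([Agent(), Agent()])
print("rewards:", rewards, "resource left:", remaining)  # a depleted stock suggests defection
```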
Values at compile time
A putative new idea for AI control; index here.
This is a simple extension of the model-as-definition and the intelligence module ideas. General structure of these extensions: even an unfriendly AI, in the course of being unfriendly, will need to calculate certain estimates that would be of great positive value if we could but see them, shorn from the rest of the AI's infrastructure.
It's almost trivially simple. Have the AI construct a module that models humans and models human understanding (including natural language understanding). This is the kind of thing that any AI would want to do, whatever its goals were.
Then take that module (using corrigibility) into another AI, and use it as part of the definition of the new AI's motivation. The new AI will then use this module to follow instructions humans give it in natural language.
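The intended structure can be sketched at the interface level (a schematic of the idea only; nothing here stands in for how such a module would actually be obtained): a human-model module produced by one system is frozen and then wired into a second agent's motivation.

```python
# Schematic sketch of the proposed structure: a human-model module built by
# one system is extracted, frozen, and plugged into a second agent, whose
# motivation is defined in terms of that module's interpretations.

class HumanModelModule:
    """Stand-in for the module the first AI would build."""
    def interpret_instruction(self, text):
        # Toy 'natural language understanding': map phrasing to an intended goal.
        return {"please tidy the room": "tidy_room_without_breaking_things"}.get(
            text, "ask_for_clarification")

class InstructionFollowingAI:
    def __init__(self, human_model):
        self.human_model = human_model     # frozen module, not retrained here

    def choose_goal(self, instruction):
        # Motivation is defined through the module's interpretation.
        return self.human_model.interpret_instruction(instruction)

second_ai = InstructionFollowingAI(HumanModelModule())
print(second_ai.choose_goal("please tidy the room"))
```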
Too easy?...
This approach essentially solves the whole friendly AI problem, loading it onto the AI in a way that avoids the whole "defining goals (or meta-goals, or meta-meta-goals) in machine code" or the "grounding everything in code" problems. As such it is extremely seductive, and will sound better, and easier, than it likely is.
I expect this approach to fail. For it to have any chance of success, we need to be sure that both model-as-definition and the intelligence module idea are rigorously defined. Then we have to have a good understanding of the various ways the approach might fail, before we can even begin to talk about how it might succeed.
The first issue that springs to mind is when multiple definitions fit the AI's model of human intentions and understanding. We might want the AI to try and accomplish all the things it is asked to do, according to all the definitions. Therefore, similarly to this post, we want to phrase the instructions carefully so that a "bad instantiation" simply means the AI does something pointless, rather than something negative. Eg "Give humans something nice" seems much safer than "give humans what they really want".
And then of course there's those orders where humans really don't understand what they themselves want...
I'd want a lot more issues like that discussed and solved, before I'd recommend using this approach to getting a safe FAI.
What I mean...
A putative new idea for AI control; index here.
This is a simple extension of the model-as-definition and the intelligence module ideas. General structure of these extensions: even an unfriendly AI, in the course of being unfriendly, will need to calculate certain estimates that would be of great positive value if we could but see them, shorn from the rest of the AI's infrastructure.
The challenge is to get the AI to answer a question as accurately as possible, using the human definition of accuracy.
First, imagine an AI with some goal is going to answer a question, such as Q="What would happen if...?" The AI is under no compulsion to answer it honestly.
What would the AI do? Well, if it is sufficiently intelligent, it will model humans. It will use this model to understand what they meant by Q, and why they were asking. Then it will ponder various outcomes, and various answers it could give, and what the human understanding of those answers would be. This is what any sufficiently smart AI (friendly or not) would do.
Then the basic idea is to use modular design and corrigibility to extract the relevant pieces (possibly feeding them to another, differently motivated AI). What needs to be pieced together is: the AI's understanding of what the human understanding of Q is; the actual answer to Q (given this understanding); the human understanding of the various answers the AI could give (using the model of human understanding); and the divergence between the human understanding of an answer and the actual answer, which is to be minimised.
All these pieces are there, and if they can be safely extracted, the answer with the minimum divergence can be calculated and given.
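Put schematically (a toy rendering of the pieces above, with all the hard parts stubbed out): the answer reported is the candidate whose predicted human interpretation diverges least from the actual answer.

```python
# Toy rendering of the pieces above: the models of the world and of human
# interpretation are stubs; the point is only the selection rule at the end.

actual_answer = 0.87                      # the AI's own best estimate for Q

candidate_phrasings = {
    "almost certain":          1.00,      # what a human would take this to mean
    "very likely (about 90%)": 0.90,
    "more likely than not":    0.60,
}

def human_understanding(phrasing):
    # Stand-in for the AI's model of how a human would interpret the answer.
    return candidate_phrasings[phrasing]

def divergence(a, b):
    return abs(a - b)

best_phrasing = min(candidate_phrasings,
                    key=lambda p: divergence(human_understanding(p), actual_answer))
print(best_phrasing)   # the phrasing whose human reading is closest to the truth
```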
Models as definitions
A putative new idea for AI control; index here.
The insight this post comes from is a simple one: defining concepts such as “human” and “happy” is hard. A superintelligent AI will probably create good definitions of these, while attempting to achieve its goals: a good definition of “human” because it needs to control them, and of “happy” because it needs to converse convincingly with us. It is annoying that these definitions exist, but that we won’t have access to them.
Modelling and defining
Imagine a game of football (or, as you Americans should call it, football). And now imagine a computer game version of it. How would you say that the computer game version (which is nothing more than an algorithm) is also a game of football?
Well, you can start listing features that they have in common. They both involve two “teams” fielding eleven “players” each, that “kick” a “ball” that obeys certain equations, aiming to stay within the “field”, which has different “zones” with different properties, etc...
As you list more and more properties, you refine your model of football. There are some properties that distinguish real from simulated football (fine details about the human body, for instance), but most of the properties that people care about are the same in both games.
My idea is that once you have a sufficiently complex model of football that applies to both the real game and a (good) simulated version, you can use that as the definition of football. And compare it with other putative examples of football: maybe in some places people play on the street rather than on fields, or maybe there are more players, or maybe some other games simulate different aspects to different degrees. You could try and analyse this with information theoretic considerations (ie given two model of two different examples, how much information is needed to turn one into the other).
Now, this resembles the “suggestively labelled lisp tokens” approach to AI, or the Cyc approach of just listing lots of syntax stuff and their relationships. Certainly you can’t keep an AI safe by using such a model of football: if you try to contain the AI by saying “make sure that there is a ‘Football World Cup’ played every four years”, the AI will still optimise the universe and then play out something that technically fits the model every four years, without any humans around.
However, it seems to me that ‘technically fitting the model of football’ is essentially playing football. The model might include such things as a certain number of fouls expected; an uncertainty about the result; competitive elements among the players; etc... It seems that something that fits a good model of football would be something that we would recognise as football (possibly needing some translation software to interpret what was going on). Unlike the traditional approach which involves humans listing stuff they think is important and giving them suggestive names, this involves the AI establishing what is important to predict all the features of the game.
We might even combine such a model with the Turing test, by motivating the AI to produce a good enough model that it could a) have conversations with many aficionados about all features of the game, b) train a team to expect to win the world cup, and c) use it to program a successful football computer game. Any model of football that allowed the AI to do this – or, better still, a football-model module that, when plugged into another, ignorant AI, allowed that AI to do this – would be an excellent definition of the game.
It’s also one that could cross ontological crises, as you move from reality, to simulation, to possibly something else entirely, with a new physics: the essential features will still be there, as they are the essential features of the model. For instance, we can define football in Newtonian physics, but still expect that this would result in something recognisably ‘football’ in our world of relativity.
Notice that this approach deals with edge cases mainly by forbidding them. In our world, we might struggle with how to respond to a football player with weird artificial limbs; however, since this was never a feature in the model, the AI will simply classify that as “not football” (or “similar to, but not exactly, football”), since the model’s performance starts to degrade in this novel situation. This is what helps it cross ontological crises: in a relativistic football game based on a Newtonian model, the ball would be forbidden from moving at speeds where the differences in the physics become noticeable, which is perfectly compatible with the game as it’s currently played.
Being human
Now we take the next step, and have the AI create a model of humans. All our thought processes, our emotions, our foibles, our reactions, our weaknesses, our expectations, the features of our social interactions, the statistical distribution of personality traits in our population, how we see ourselves and change ourselves. As a side effect, this model of humanity should include almost every human definition of human, simply because this is something that might come up in a human conversation that the model should be able to predict.
Then simply use this model as the definition of human for an AI’s motivation.
What could possibly go wrong?
I would recommend first having an AI motivated to define “human” in the best possible way, most useful for making accurate predictions, keeping the definition in a separate module. Then the AI is turned off safely and the module is plugged into another AI and used as part of its definition of human in its motivation. We may also use human guidance at several points in the process (either in making, testing, or using the module), especially on unusual edge cases. We might want to have humans correcting certain assumptions the AI makes in the model, up until the AI can use the model to predict what corrections humans would suggest. But that’s not the focus of this post.
There are several obvious ways this approach could fail, and several ways of making it safer. The main problem is if the predictive model fails to define human in a way that preserves value. This could happen if the model is too general (some simple statistical rules) or too specific (a detailed list of all currently existing humans, atom position specified).
This could be combated by making the first AI generate lots of different models, with many different requirements of specificity, complexity, and predictive accuracy. We might require some models make excellent local predictions (what is the human about to say?), others excellent global predictions (what is that human going to decide to do with their life?).
Then everything defined as “human” in any of the models counts as human. This results in some wasted effort on things that are not human, but this is simply wasted resources, rather than a pathological outcome (the exception being if some of the models define humans in an actively pernicious way – negative value rather than zero – similarly to the false-friendly AIs’ preferences in this post).
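As a toy illustration of that union rule (purely schematic; the is_human predicate here is a hypothetical stand-in for whatever interface each generated model actually exposes):

def counts_as_human(candidate, models):
    # Treat the candidate as human if *any* of the generated models says so;
    # disagreements only cost us wasted effort on non-humans, not lost humans.
    return any(model.is_human(candidate) for model in models)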
The other problem is a potentially extreme conservatism. Modelling humans involves modelling all the humans in the world today, which is a very narrow space in the range of all potential humans. To prevent the AI lobotomising everyone to fit a simple model (after all, there do exist some lobotomised humans today), we would want the AI to maintain the range of cultures and mind-types that exist today, making things even more unchanging.
To combat that, we might try to identify certain specific features of society that the AI is allowed to change. Political beliefs, certain aspects of culture, beliefs, geographical location (including being on a planet), death rates, etc. are all things we could plausibly identify (via sub-sub-modules, possibly) as things that are allowed to change. It might be safer to allow them to change within a particular range, rather than to change without restriction (removing all sadness might be a good thing, but there are many more ways this could go wrong than if we e.g. just reduced the probability of sadness).
Another option is to keep these modelled humans little changing, but allow them to define allowable changes themselves (“yes, that’s a transhuman, consider it also a moral agent.”). The risk there is that the modelled humans get hacked or seduced, and that the AI fools our limited brains with a “transhuman” that is one in appearance only.
We also have to beware of sacrificing seldom-used values. For instance, one could argue that current social and technological constraints mean that no one today has anything approaching true freedom. We wouldn't want the AI to let us improve technology and social structures, but never gain more freedom than we have today, because that freedom is "not in the model". Again, this is something we could look out for: if the AI has separate models of "freedom", we could assess them and permit change in certain directions.
Restrictions that are hard to hack
A putative new idea for AI control; index here.
Very much in the spirit of "if you want something, you have to define it, then code it, rather than assuming you can get it for free through some other approach."
Difficult children
Suppose you have a child whom you've sent to play in their room. You want them to play quietly, so you tell them:
"I'll be checking up on you!"
The child, however, has modelled you well, and knows that you will look in briefly at midnight and then go away. The child has two main options:
- Play quietly the whole time.
- Be as noisy as they want, until around 23:59, then be totally quiet for two minutes, then go back to being noisy.
We could call the first option obeying the spirit of the law, and the second obeying the letter.
Recent AI safety work
(Crossposted from ordinary ideas).
I’ve recently been thinking about AI safety, and some of the writeups might be interesting to some LWers:
- Ideas for building useful agents without goals: approval-directed agents, approval-directed bootstrapping, and optimization and goals. I think this line of reasoning is very promising.
- A formalization of one piece of the AI safety challenge: the steering problem. I am eager to see more precise, high-level discussion of AI safety, and I think this article is a helpful step in that direction. Since articulating the steering problem I have become much more optimistic about versions of it being solved in the near term. This mostly means that the steering problem fails to capture the hardest parts of AI safety. But it’s still good news, and I think it may eventually cause some people to revise their understanding of AI safety.
- Some ideas for getting useful work out of self-interested agents, based on arguments: of arguments and wagers, adversarial collaboration [older], and delegating to a mixed crowd. I think these are interesting ideas in an interesting area, but they have a ways to go until they could be useful.
I’m excited about a few possible next steps:
- Under the (highly improbable) assumption that various deep learning architectures could yield human-level performance, could they also predictably yield safe AI? I think we have a good chance of finding a solution---i.e. a design of plausibly safe AI, under roughly the same assumptions needed to get human-level AI---for some possible architectures. This would feel like a big step forward.
- For what capabilities can we solve the steering problem? I had originally assumed none, but I am now interested in trying to apply the ideas from the approval-directed agents post. From easiest to hardest, I think there are natural lines of attack using any of: natural language question answering, precise question answering, sequence prediction. It might even be possible using reinforcement learners (though this would involve different techniques).
- I am very interested in implementing effective debates, and am keen to test some unusual proposals. The connection to AI safety is more impressionistic, but in my mind these techniques are closely linked with approval-directed behavior.
- I’m currently writing up a concrete architecture for approval-directed agents, in order to facilitate clearer discussion about the idea. This is the kind of work that seems harder to do in advance, but at this point I think it's mostly an exposition problem.
What's special about a fantastic outcome? Suggestions wanted.
I've been returning to my "reduced impact AI" approach, and am currently working on some ideas.
What I need is some ideas on features that might distinguish between an excellent FAI outcome, and a disaster. The more abstract and general the ideas, the better. Anyone got some suggestions? Don't worry about quality at this point, originality is more prized!
I'm looking for something generic that is easy to measure. At a crude level, if the only options were "paperclipper" vs FAI, then we could distinguish those worlds by counting steel content.
So basically some more or less objective measure that has a higher proportion of good outcomes than the baseline.
Groundwork for AGI safety engineering
This is a very basic introduction to AGI safety work, cross-posted from the MIRI blog. The discussion of AI V&V methods (mostly in the 'early steps' section) is probably the only part that will be new to regulars here.
Improvements in AI are resulting in the automation of increasingly complex and creative human behaviors. Given enough time, we should expect artificial reasoners to begin to rival humans in arbitrary domains, culminating in artificial general intelligence (AGI).
A machine would qualify as an 'AGI', in the intended sense, if it could adapt to a very wide range of situations to consistently achieve some goal or goals. Such a machine would behave intelligently when supplied with arbitrary physical and computational environments, in the same sense that Deep Blue behaves intelligently when supplied with arbitrary chess board configurations — consistently hitting its victory condition within that narrower domain.
Since generally intelligent software could help automate the process of thinking up and testing hypotheses in the sciences, AGI would be uniquely valuable for speeding technological growth. However, this wide-ranging productivity also makes AGI a unique challenge from a safety perspective. Knowing very little about the architecture of future AGIs, we can nonetheless make a few safety-relevant generalizations:
- Because AGIs are intelligent, they will tend to be complex, adaptive, and capable of autonomous action, and they will have a large impact where employed.
- Because AGIs are general, their users will have incentives to employ them in an increasingly wide range of environments. This makes it hard to construct valid sandbox tests and requirements specifications.
- Because AGIs are artificial, they will deviate from human agents, causing them to violate many of our natural intuitions and expectations about intelligent behavior.
Today's AI software is already tough to verify and validate, thanks to its complexity and its uncertain behavior in the face of state space explosions. Menzies & Pecheur (2005) give a good overview of AI verification and validation (V&V) methods, noting that AI, and especially adaptive AI, will often yield undesired and unexpected behaviors.
An adaptive AI that acts autonomously, like a Mars rover that can't be directly piloted from Earth, represents an additional large increase in difficulty. Autonomous safety-critical agents need to make irreversible decisions in dynamic environments with very low failure rates. The state of the art in safety research for autonomous systems is improving, but continues to lag behind capabilities work. Hinchman et al. (2012) write:
As autonomous systems become more complex, the notion that systems can be fully tested and all problems will be found is becoming an impossible task. This is especially true in unmanned/autonomous systems. Full test is becoming increasingly challenging on complex system. As these systems react to more environmental [stimuli] and have larger decision spaces, testing all possible states and all ranges of the inputs to the system is becoming impossible. [...] As systems become more complex, safety is really risk hazard analysis, i.e. given x amount of testing, the system appears to be safe. A fundamental change is needed. This change was highlighted in the 2010 Air Force Technology Horizon report, "It is possible to develop systems having high levels of autonomy, but it is the lack of suitable V&V methods that prevents all but relatively low levels of autonomy from being certified for use." [...]
The move towards more autonomous systems has lifted this need [for advanced verification and validation techniques and methodologies] to a national level.
AI acting autonomously in arbitrary domains, then, looks particularly difficult to verify. If AI methods continue to see rapid gains in efficiency and versatility, and especially if these gains further increase the opacity of AI algorithms to human inspection, AI safety engineering will become much more difficult in the future. In the absence of any reason to expect a development in the lead-up to AGI that would make high-assurance AGI easy (or AGI itself unlikely), we should be worried about the safety challenges of AGI, and that worry should inform our research priorities today.
Below, I’ll give reasons to doubt that AGI safety challenges are just an extension of narrow-AI safety challenges, and I’ll list some research avenues people at MIRI expect to be fruitful.
Reduced impact in practice: randomly sampling the future
This was an idea developed along with Eliezer when he was at the FHI in November.
In my post on "reduced impact AI" I talked about how you could estimate the impact of an AI by (hypothetically) observing a slice of the future. When Eliezer was here last November, we talked about how this could be cashed out, and came up with a model.
Ignore the "output channel" from the original reduced impact post, and consider this situation:
We have two computers running in two similarly shielded rooms. One of them is a simple algorithm running through the digits of e (call it run-e), which is deleted soon afterwards. The other is our dear old friend clippy, the AI motivated to maximise paperclips. Without removing clippy's paperclip preferences, we'd like to motivate it to have next to no impact on the world. How should we define "next to no impact"?
The obvious thing is to compare it with run-e, and require it to have no more expected impact than that machine. How could this be measured? Well, the original idea was to put a stochastic process on clippy as it was turned on (inside the shielded room) and say that clippy has reduced impact if we can't find any extra evidence, from a vantage point 50 years in the future, that clippy was successfully turned on. Now put the same stochastic process on run-e and define:
Clippy has reduced impact if, from a vantage of 50 years into the future, we have no more evidence that clippy was turned on than we have of run-e being turned on.
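One way to make that comparison precise (my own gloss, not a full formalisation): writing E for the evidence available at the future vantage point, require that

P(E | clippy turned on) / P(E | clippy not turned on)  <=  P(E | run-e turned on) / P(E | run-e not turned on)

i.e. the future evidence may favour "clippy was switched on" no more strongly than it favours "run-e was switched on".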
The idiot savant AI isn't an idiot
A stub on a point that's come up recently.
If I owned a paperclip factory, and casually told my foreman to improve efficiency while I'm away, and he planned a takeover of the country, aiming to devote its entire economy to paperclip manufacturing (apart from the armament factories he needed to invade neighbouring countries and steal their iron mines)... then I'd conclude that my foreman was an idiot (or being wilfully idiotic). He obviously had no idea what I meant. And if he misunderstood me so egregiously, he's certainly not a threat: he's unlikely to reason his way out of a paper bag, let alone to any position of power.
If I owned a paperclip factory, and casually programmed my superintelligent AI to improve efficiency while I'm away, and it planned a takeover of the country... then I can't conclude that the AI is an idiot. It is following its programming. Unlike a human that behaved the same way, it probably knows exactly what I meant to program in. It just doesn't care: it follows its programming, not its knowledge about what its programming is "meant" to be (unless we've successfully programmed in "do what I mean", which is basically the whole of the challenge). We can't therefore conclude that it's incompetent, unable to understand human reasoning, or likely to fail.
We can't reason by analogy with humans. When AIs behave like idiot savants with respect to their motivations, we can't deduce that they're idiots.
Evaluating the feasibility of SI's plan
(With Kaj Sotala)
SI's current R&D plan seems to go as follows:
1. Develop the perfect theory.
2. Implement this as a safe, working, Artificial General Intelligence -- and do so before anyone else builds an AGI.
The Singularity Institute is almost the only group working on friendliness theory (although with very few researchers). So, they have the lead on Friendliness. But there is no reason to think that they will be ahead of anyone else on the implementation.
The few AGI designs we can look at today, like OpenCog, are big, messy systems which intentionally attempt to exploit various cognitive dynamics that might combine in unexpected and unanticipated ways, and which have various human-like drives rather than the sort of supergoal-driven, utility-maximizing goal hierarchies that Eliezer talks about, or which a mathematical abstraction like AIXI employs.
A team which is ready to adopt a variety of imperfect heuristic techniques will have a decisive lead on approaches based on pure theory. Without the constraint of safety, one of them will beat SI in the race to AGI. SI cannot ignore this. Real-world, imperfect, safety measures for real-world, imperfect AGIs are needed. These may involve mechanisms for ensuring that we can avoid undesirable dynamics in heuristic systems, or AI-boxing toolkits usable in the pre-explosion stage, or something else entirely.
SI’s hoped-for theory will include a reflexively consistent decision theory, something like a greatly refined Timeless Decision Theory. It will also describe human value as formally as possible, or at least describe a way to pin it down precisely, something like an improved Coherent Extrapolated Volition.
The hoped-for theory is intended to provide not only safety features, but also a description of the implementation, as some sort of ideal Bayesian mechanism, a theoretically perfect intelligence.
SIers have said to me that SI's design will have a decisive implementation advantage. The idea is that because strap-on safety can’t work, Friendliness research necessarily involves more fundamental architectural design decisions, which also happen to be general AGI design decisions that some other AGI builder could grab and save themselves a lot of effort. The assumption seems to be that all other designs are based on hopelessly misguided design principles. SI-ers, the idea seems to go, are so smart that they'll build AGI far before anyone else. Others will succeed only when hardware capabilities allow crude near-brute-force methods to work.
Yet even if the Friendliness theory provides the basis for intelligence, the nitty-gritty of SI’s implementation will still be far away, and will involve real-world heuristics and other compromises.
We can compare SI’s future AI design to AIXI, another mathematically perfect AI formalism (though it has some critical reflexivity issues). Schmidhuber, Hutter, and colleagues think that their AIXI can be scaled down into a feasible implementation, and have implemented some toy systems. Similarly, any actual AGI based on SI's future theories will have to stray far from its mathematically perfected origins.
Moreover, SI's future friendliness proof may simply be wrong. Eliezer writes a lot about logical uncertainty, the idea that you must treat even purely mathematical ideas with the same probabilistic techniques as any ordinary uncertain belief. He pursues this mostly so that his AI can reason about itself, but the same principle applies to Friendliness proofs as well.
Perhaps Eliezer thinks that a heuristic AGI is absolutely doomed to failure; that a hard takeoff soon after the creation of the first AGI is so overwhelmingly likely that a mathematically designed AGI is the only one that could stay Friendly. In that case, we have to work on a pure-theory approach, even if it has a low chance of being finished first. Otherwise we'll be dead anyway. If an embryonic AGI will necessarily undergo an intelligence explosion, we have no choice but to "shut up and do the impossible."
I am all in favor of gung-ho knife-between-the teeth projects. But when you think that your strategy is impossible, then you should also look for a strategy which is possible, if only as a fallback. Thinking about safety theory until drops of blood appear on your forehead (as Eliezer puts it, quoting Gene Fowler), is all well and good. But if there is only a 10% chance of achieving 100% safety (not that there really is any such thing), then I'd rather go for a strategy that provides only a 40% promise of safety, but with a 40% chance of achieving it. OpenCog and the like are going to be developed regardless, and probably before SI's own provably friendly AGI. So, even an imperfect safety measure is better than nothing.
If heuristic approaches have a 99% chance of an immediate unfriendly explosion, then that might be wrong. But SI, better than anyone, should know that any intuition-based probability estimate of “99%” really means “70%”. Even if other approaches are long-shots, we should not put all our eggs in one basket. Theoretical perfection and stopgap safety measures can be developed in parallel.
Given what we know about human overconfidence and the general reliability of predictions, the actual outcome will to a large extent be something that none of us ever expected or could have predicted. No matter what happens, progress on safety mechanisms for heuristic AGI will improve our chances if something entirely unexpected happens.
What impossible thing should SI be shutting up and doing? For Eliezer, it’s Friendliness theory. To him, safety for heuristic AGI is impossible, and we shouldn't direct our efforts in that direction. But why shouldn't safety for heuristic AGI be another impossible thing to do?
(Two impossible things before breakfast … and maybe a few more? Eliezer seems to be rebuilding logic, set theory, ontology, epistemology, axiology, decision theory, and more, mostly from scratch. That's a lot of impossibles.)
And even if safety for heuristic AGIs is really impossible for us to figure out now, there is some chance of an extended soft takeoff that will allow us to develop heuristic AGIs which will help in figuring out AGI safety, whether because we can use them for our tests, or because they can help by applying their embryonic general intelligence to the problem. Goertzel and Pitt have urged this approach.
Yet resources are limited. Perhaps the folks who are actually building their own heuristic AGIs are in a better position than SI to develop safety mechanisms for them, while SI is the only organization which is really working on a formal theory on Friendliness, and so should concentrate on that. It could be better to focus SI's resources on areas in which it has a relative advantage, or which have a greater expected impact.
Even if so, SI should evangelize AGI safety to other researchers, not only as a general principle, but also by offering theoretical insights that may help them as they work on their own safety mechanisms.
In summary:
1. AGI development which is unconstrained by a friendliness requirement is likely to beat a provably-friendly design in a race to implementation, and some effort should be expended on dealing with this scenario.
2. Pursuing a provably-friendly AGI, even if very unlikely to succeed, could still be the right thing to do if it was certain that we’ll have a hard takeoff very soon after the creation of the first AGIs. However, we do not know whether or not this is true.
3. Even the provably friendly design will face real-world compromises and errors in its implementation, so the implementation will not itself be provably friendly. Thus, safety protections of the sort needed for heuristic design are needed even for a theoretically Friendly design.
The challenges of bringing up AIs
At the current AGI-12 conference, some designers have been proponents of keeping AGIs safe by bringing them up in human environments, providing them with interactions and feedback in a similar way to how we bring up human children. Obviously that approach would fail for a fully smart AGI with its own values - it would pretend to follow our values for as long as it needed, and then defect. However, some people are confident that if we started with a limited, dumb AGI, we could successfully inculcate our values in this way (a more sophisticated position would be that though this method would likely fail, it's more likely to succeed than a top-down friendliness project!).
The major criticism of this approach is that it anthropomorphises the AGI - we have a theory of children's minds, constructed by evolution, culture, and our own child-rearing experience. We then project this onto the alien mind of the AGI, assuming that if the AGI presents behaviours similar to a well-behaved child, then it will become a moral AGI. The problem is that we don't know how alien the AGI's mind will be, or whether our reinforcement is actually reinforcing the right thing. Specifically, we need to be able to find some way of distinguishing between:
- An AGI being trained to be friendly.
- An AGI being trained to lie and conceal.
- An AGI that will behave completely differently once out of the training/testing/trust-building environment.
- An AGI that forms the wrong categories and generalisations (what counts as "human" or "suffering", for instance), because it lacks human-shared implicit knowledge that was "too obvious" for us to even think of training it on.
Trapping AIs via utility indifference
If you wanted to check whether your creations were prone to misbehaving, one way would be to leave a big juicy apple hanging tantalisingly from a very visible tree, turn your back for a while, and see what they get up to.
That would work if you were God and your creations were essentially children. But if the creator is us and the creation is a super-intelligent AI, such 'honey-pots' are unlikely to work. They are tricks, and we shouldn't rely on being able to trick a being more intelligent than ourselves - unless its intelligence doesn't matter.
We can use utility indifference to construct such traps. Utility indifference can be informally summarised as:
"There is a stochastic event X which can output 0 or 1. We wish the AI act as if it believed that X=0, with certainty. It's problematic for the AI to have incorrect beliefs, so the result is achieved through the definition of the AI's utility function."
Consider the following situation: a master AI that will create the source code of another, autonomous AI. The master AI is set up to produce the code of a Friendly AI, or a reduced impact AI - but we're not sure if the master AI is trustworthy. It would be good to go through the code very carefully before unleashing it (or erasing it). But if the master AI knows this (and we must assume it does), then it will either obfuscate the dangerous parts or put traps in it, making it risky for us to look too closely.
Then implement the following setup. The code travels through some stochastic event X, such that if X=0, the code is immediately transferred into a disciple AI, completely autonomous, which is at liberty to immediately modify its own code and erase any trace of what it was. If X=1, the code is instead copied to our hard drives for analysis. The AI is given some reason to keep the code short - maybe there is an increasing chance of the code getting erased the longer it is.
On top of this, we implement utility indifference: the master AI will act as if it believed X=0 with certainty, while we have set up X so that X=1 with near-certainty. Then the master AI will be motivated to produce the shortest code compatible with its utility function, in the 'knowledge' that this code will be copied into a powerful autonomous AI, and will never be seen by humans. It has no incentive to trap its code - quite the reverse, as it wants to keep it short.
And therefore we have all the time in the world to go through the code, looking for defection. Other ways of using utility indifference to trap AIs are no doubt possible, but this was the first setup that sprang to my mind.
Yet another safe oracle AI proposal
Previously I posted a proposal for a safe self-improving limited oracle AI but I've fleshed out the idea a bit more now.
Disclaimer: don't try this at home. I don't see any catastrophic flaws in this but that doesn't mean that none exist.
This framework is meant to safely create an AI that solves verifiable optimization problems; that is, problems whose answers can be checked efficiently. This set mainly consists of NP-like problems such as protein folding, automated proof search, writing hardware or software to specifications, etc.
This is NOT like many other oracle AI proposals that involve "boxing" an already-created possibly unfriendly AI in a sandboxed environment. Instead, this framework is meant to grow a self-improving seed AI safely.
Overview
- Have a bunch of sample optimization problems.
- Have some code that, given an optimization problem (stated in some standardized format), finds a good solution. This can be seeded by a human-created program.
- When considering an improvement to program (2), allow the improvement if it makes it do better on average on the sample optimization problems without being significantly more complex (to prevent overfitting). That is, the fitness function would be something like (average performance - k * bits of optimizer program).
- Run (2) to optimize its own code using criterion (3). This can be done concurrently with human improvements to (2), also using criterion (3).
Definitions
First, let's say we're writing this all in Python. In real life we'd use a language like Lisp because we're doing a lot of treatment of code as data, but Python should be sufficient to demonstrate the basic ideas behind the system.
We have a function called steps_bounded_eval_function. This function takes 3 arguments: the source code of the function to call, the argument to the function, and the time limit (in steps). The function will eval the given source code and call the defined function with the given argument in a protected, sandboxed environment, with the given steps limit. It will return either: 1. None, if the program does not terminate within the steps limit. 2. A tuple (output, steps_taken): the program's output (as a string) and the steps the program took.
Examples:
steps_bounded_eval_function("""
def function(x):
    return x + 5
""", 4, 1000)
evaluates to (9, 3), assuming that evaluating the function took 3 ticks, because function(4) = 9.
steps_bounded_eval_function("""
def function(x):
    while True: # infinite loop
        pass
""", 5, 1000)
evaluates to None, because the defined function doesn't return in time. We can write steps_bounded_eval_function as a meta-circular interpreter with a bit of extra logic to count how many steps the program uses.
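To make the interface concrete, here is a rough sketch of such a function. It is not the real thing: it uses sys.settrace line events as a crude stand-in for "steps" and provides no actual sandboxing, so treat it only as an illustration of the signature and the return convention.

import sys

def steps_bounded_eval_function(function_source, argument, steps_limit):
    # Define `function` from the given source. A real implementation would
    # run this inside a protected, sandboxed interpreter.
    namespace = {}
    exec(function_source, namespace)
    steps = {'count': 0}

    def tracer(frame, event, arg):
        if event == 'line':
            steps['count'] += 1
            if steps['count'] > steps_limit:
                raise RuntimeError('out of steps')
        return tracer

    sys.settrace(tracer)
    try:
        output = namespace['function'](argument)
    except Exception:
        return None  # ran out of steps, or the program raised an error
    finally:
        sys.settrace(None)
    return (str(output), steps['count'])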
Now I would like to introduce the notion of a problem. A problem consists of the following:
- An answer scorer. The scorer should be the Python source code for a function. This function takes in an answer string and scores it, returning a number from 0 to 1. If an error is encountered in the function it is equivalent to returning 0.
- A steps penalty rate, which should be a positive real number.
Let's consider a simple problem (subset sum):
{'answer_scorer': """
def function(answer):
    nums = [4, 5, -3, -5, -6, 9]
    # convert "1,2,3" to [1, 2, 3]
    indexes = list(map(int, answer.split(',')))
    assert len(indexes) >= 1
    total = 0
    for i in indexes:
        total += nums[i]
    if total == 0:
        return 1
    else:
        return 0
""",
 'steps_penalty_rate': 0.000001}
We can see that the scorer function returns 1 if and only if the answer specifies the indexes of numbers in the list nums that sum to 0 (for example, '0,1,2,4' because 4+5-3-6=0).
An optimizer is a program that is given a problem and attempts to solve the problem, returning an answer.
The score of an optimizer on a problem is equal to the score according to the answer-scorer, minus the steps penalty rate multiplied by the number of steps used by the optimizer. That is, the optimizer is rewarded for returning a better answer in less time. We can define the following function to get the score of an optimizer (Python source code) for a given problem:
def problem_score(problem, optimizer_source):
    # run the optimizer on the problem
    result = steps_bounded_eval_function(
        optimizer_source, problem, 1 / problem['steps_penalty_rate'])
    if result is None: # used all available steps, or got an error
        return 0.0
    answer, steps_taken = result # optimizer returned a result in time
    # get the score using the problem's answer_scorer
    answer_score = eval_function(problem['answer_scorer'], answer)
    assert 0 <= answer_score <= 1
    # penalize for taking up time and make sure the result is non-negative
    return max(0.0, answer_score - problem['steps_penalty_rate'] * steps_taken)
A simple optimizer that detects subset-sum problems (by inspecting the source code) and uses a brute-force method to solve it will get a problem_score close to 1 for small subset-sum problems. This optimizer would not do well on other problems; other techniques (such as evaluating the problem source code with different answers and choosing a good one) are needed to solve a variety of different optimization problems. Writing a good optimizer is very hard work, just like any seed AI.
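For concreteness, here is a hypothetical toy optimizer of that second kind: it ignores the problem's source code except through the scorer, tries small index subsets, and returns the first answer the problem's own answer_scorer accepts. It assumes the hidden list has at most 6 elements, as in the example above, so it is nowhere near a general seed optimizer.

from itertools import combinations

def toy_optimizer(problem):
    # Try every subset of indexes 0..5 and return the first one that the
    # problem's own scorer rates as 1; give up (and score 0) otherwise.
    for size in range(1, 7):
        for subset in combinations(range(6), size):
            answer = ','.join(str(i) for i in subset)
            if eval_function(problem['answer_scorer'], answer) == 1:
                return answer
    return ''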
Framework
The framework has 4 modules:
- A set of training problems. These should cover a wide range of problems that we would like the AI to solve.
- An optimizer, written in Python. This should be seeded with a very good human-written optimizer. This is deliberately unspecified by me because it's a really hard problem (as is any seed AI).
- A scorer for optimizer source code, defined as follows:
def optimizer_score(candidate_optimizer_source):
    training_problems = [...training problems here...]
    # a parameter of the system; more on this later
    complexity_penalty = 0.1
    # total up the program's score on all training problems
    total_score = 0.0
    for problem in training_problems:
        total_score += problem_score(problem, candidate_optimizer_source)
    # penalize for complexity, to prevent overfitting
    total_score -= complexity_penalty * len(compress_binary(candidate_optimizer_source))
    # return average score
    return max(0, total_score / len(training_problems))

This gives a candidate optimizer a score in the range [0, 1] based on both its average performance on the sample set and its inherent complexity. Presumably optimizers with a higher optimizer_score will do better on future optimization problems.
- A self-optimization thread. This thread continuously runs program 2 on a problem formed using 3's answer_scorer and an ever-decreasing steps_penalty_rate. Whenever program 2 outputs source code (optimizer_source) that is better than the current source code for 2, the source code for 2 is replaced with this new value. Also, humans can make improvements to program 2 if they increase its score according to 3. Source code:
# assume we have access to an optimizer_source variable (program 2)
def self_optimization_thread():
    global optimizer_source  # program 2's current source
    start_steps_penalty_rate = 0.000001
    steps_penalty_rate = start_steps_penalty_rate
    while True: # loop forever
        self_optimization_problem = {
            # just use program 3 to score the optimizer
            'answer_scorer': """
def function(candidate_optimizer_source):
    ... put the source code from program 3's optimizer_score here
""",
            'steps_penalty_rate': steps_penalty_rate
        }
        # call the optimizer (program 2) to optimize itself, giving it limited time
        result = steps_bounded_eval_function(
            optimizer_source, self_optimization_problem, 1 / steps_penalty_rate)
        changed = False
        if result is not None: # optimizer returned in time
            candidate_optimizer = result[0] # 2 returned a possible replacement for itself
            if optimizer_score(candidate_optimizer) > optimizer_score(optimizer_source):
                # 2's replacement is better than 2
                optimizer_source = candidate_optimizer
                steps_penalty_rate = start_steps_penalty_rate
                changed = True
        if not changed:
            # give the optimizer more time to optimize itself on the next iteration
            steps_penalty_rate *= 0.5
So, what does this framework get us?
- A super-optimizer, program 2. We can run it on new optimization problems and it should do very well on them.
- Self-improvement. Program 4 will continuously use program 2 to improve itself. This improvement should make program 2 even better at bettering itself, in addition to doing better on other optimization problems. Also, the training set will guide human improvements to the optimizer.
- Safety. I don't see why this setup has any significant probability of destroying the world. That doesn't mean we should disregard safety, but I think this is quite an accomplishment given how many other proposed AI designs would go catastrophically wrong if they recursively self-improved.
I will now evaluate the system according to these 3 factors.
Optimization ability
Assume we have a program for 2 that has a very very high score according to optimizer_score (program 3). I think we can be assured that this optimizer will do very very well on completely new optimization problems. By a principle similar to Occam's Razor, a simple optimizer that performs well on a variety of different problems should do well on new problems. The complexity penalty is meant to prevent overfitting to the sample problems. If we didn't have the penalty, then the best optimizer would just return the best-known human-created solutions to all the sample optimization problems.
What's the right value for complexity_penalty? I'm not sure. Increasing it too much makes the optimizer overly simple and stupid; decreasing it too much causes overfitting. Perhaps the optimal value can be found by some pilot trials, testing optimizers against withheld problem sets. I'm not entirely sure that a good way of balancing complexity with performance exists; more research is needed here.
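As one rough sketch of what such pilot trials might look like (assuming we have a pool of candidate optimizers, a withheld problem set, and the compress_binary helper used by program 3; none of this is part of the framework proper):

def pick_complexity_penalty(penalties, candidate_optimizers,
                            training_problems, holdout_problems):
    # For each candidate penalty, see which optimizer the penalized training
    # score would select, then judge that selection on the withheld problems
    # (scored without any penalty).
    def penalized_score(source, k):
        total = sum(problem_score(p, source) for p in training_problems)
        total -= k * len(compress_binary(source))
        return max(0.0, total / len(training_problems))

    best_k, best_holdout = None, -1.0
    for k in penalties:
        chosen = max(candidate_optimizers, key=lambda src: penalized_score(src, k))
        holdout = (sum(problem_score(p, chosen) for p in holdout_problems)
                   / len(holdout_problems))
        if holdout > best_holdout:
            best_k, best_holdout = k, holdout
    return best_k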
Assuming we've conquered overfitting, the optimizer should perform very well on new optimization problems, especially after self-improvement. What does this get us? Here are some useful optimization problems that fit in this framework:
- Writing self-proving code to a specification. After writing a specification of the code in a system such as Coq, we simply ask the optimizer to optimize according to the specification. This would be very useful once we have a specification for friendly AI.
- Trying to prove arbitrary mathematical statements. Proofs are verifiable in a relatively short amount of time.
- Automated invention/design, if we have a model of physics to verify the invention against.
- General induction/Occam's razor. Find a generative model for the data so far that optimizes P(model)P(data|model), with some limits on the time taken for the model program to run. Then we can run the model to predict the future.
- Bioinformatics, e.g. protein folding.
These are all problems whose solutions can be efficiently evaluated and that a computer could plausibly solve, so I think this framework should provide good solutions to them. If the optimizer this framework produces performs well on all these problems, I think it deserves to be called an oracle AGI.
Self-improvement
It seems that, if my arguments about optimization ability are convincing, an optimizer that does very well on the sample problems should also do well on the specific optimization problem of improving itself. This is especially true if the training problems include some program-optimization problems (i.e. finding a program that does well on some task). Once the optimizer improves itself, it will be better at improving itself, and so on. If the AI is software-limited, the framework should allow it to recursively improve itself up to a very high limit. Then we can use its hardware design power to design better chips for it and allow it to optimize even further.
The optimizer's ability to self-improve is limited by the fact that the other parts of the program (1, 3, and 4) are fixed. The system is greedy in the sense that it just looks for even the slightest improvement, then changes the code and looks for more improvements, without any foresight. I don't know how much of a limitation this is. I don't see any reason that greedy improvements to the optimizer are insufficient for super-optimization from recursive self-improvement.
Currently, the optimizer must completely derive a new optimizer from the specification (program 3), without looking at the current best optimizer (its own source code). In practice we might want to extend the framework to include incremental improvements (improvements on top of existing good answers). I don't know if a good theory for scoring incremental optimizers exists.
Safety
Probably the most important property of the system. What use is optimization if the world has been converted to computronium in order to power a super-powerful protein folder? Here are a few different possible threats:
Underhanded question answers
The optimizer could give answers that are very good according to the problem description but are specifically crafted to cause things unintended by the human operators. For example, it could create a software design that, while correct according to specifications, has a "bug" that causes it to hack into computers on the internet and hook them up to the optimizer system. I don't think this is at all likely because the system is extremely short-sighted. Any self-modification must improve its immediate ability to do better on optimization problems. Furthermore, all self-modifications are found using methods that are judged by only immediate improvements. So any kind of long-term strategy (sacrificing some optimization power so it will have resources in the future) is out of the question. In other words, optimizer_score should not be seen as a utility function because the system only tries to improve it using greedy methods, not long-term planning.
Bugs in the system
What if the system gives the optimizer an incorrect score under some circumstances (e.g. if it performs a certain pattern of memory accesses)? Say that, by chance, the optimizer's improvement to itself causes it to get an incorrect score. It might internalize the rule "perform memory accesses in this pattern" to get a higher score. This itself is not especially dangerous; the optimizer will rewrite itself to just do a bunch of weird memory accesses that give it a high score.
What might be more dangerous is if the optimizer discovers an underlying pattern behind the system's hackability. Since the optimizer is penalized for complexity, a program like "do things that, when executed on a certain virtual machine, cause this variable in the machine to be a high number" might have a higher score than "do this certain complex pattern of memory accesses". Then the optimizer might discover the best way to increase the score variable. In the absolute worst case, perhaps the only way to increase the score variable is by manipulating the VM to go on the internet and do unethical things. This possibility seems unlikely (if you can connect to the internet, you can probably just overwrite the score variable) but should be considered.
I think the solution is straightforward: have the system be isolated while the optimizer is running. Completely disconnect it from the internet (possibly through physical means) until the optimizer produces its answer. Now, I think I've already established that the answer will not be specifically crafted to improve future optimization power (e.g. by manipulating human operators), since the system is extremely short-sighted. So this approach should be safe. At worst you'll just get a bad answer to your question, not an underhanded one.
Malicious misuse
I think this is the biggest danger of the system, one that all AGI systems have. At high levels of optimization ability, the system will be able to solve problems that would help people do unethical things. For example it could optimize for cheap, destructive nuclear/biological/nanotech weapons. This is a danger of technological progress in general, but the dangers are magnified by the potential speed at which the system could self-improve.
I don't know the best way to prevent this. It seems like the project has to be undertaken in private; if the seed optimizer source were released, criminals would run it on their computers/botnets and possibly have it self-improve even faster than the ethical version of the system. If the ethical project has more human and computer resources than the unethical project, this danger will be minimized.
It will be very tempting to crowdsource the project by putting it online. People could submit improvements to the optimizer and even get paid for finding them. This is probably the fastest way to increase optimization progress before the system can self-improve. Unfortunately I don't see how to do this safely; there would need to be some way to foresee the system becoming extremely powerful before criminals have the chance to do this. Perhaps there can be a public base of the project that a dedicated ethical team works off of, while contributing only some improvements they make back to the public project.
Towards actual friendly AI
Perhaps this system can be used to create actual friendly AI. Once we have a specification for friendly AI, it should be straightforward to feed it into the optimizer and get a satisfactory program back. What if we don't have a specification? Maybe we can have the system perform induction on friendly AI designs and their ratings (by humans), and then write friendly AI designs that it predicts will have a high rating. This approach to friendly AI will reflect present humans' biases and might cause the system to resort to manipulative tactics to make its design more convincing to humans. Unfortunately I don't see a way to fix this problem without something like CEV.
Conclusion
If this design works, it is a practical way to create a safe, self-improving oracle AI. There are numerous potential issues that might make the system weak or dangerous. On the other hand, it will have short-term benefits because it will be able to solve practical problems even before it can self-improve, and it might be easier to get corporations and governments on board. This system might be very useful for solving hard problems before figuring out friendliness theory, and its optimization power might be useful for creating friendly AI. I have not encountered any other self-improving oracle AI designs for which we can be confident that their answers are not underhanded attempts to get us to let them out.
Since I've probably overlooked some significant problems/solutions to problems in this analysis I'd like to hear some more discussion of this design and alternatives to it.
Safe questions to ask an Oracle?
The Future of Humanity Institute wants to pick the brains of the less wrongers :-)
Do you have suggestions for safe questions to ask an Oracle? Interpret the question as narrowly or broadly as you want; new or unusual ideas especially welcome.
Safety can be dangerous
In 2005, Hurricane Rita caused 111 deaths. 3 deaths were caused by the hurricane. 90 were caused by the mass evacuation.
The FDA is supposed to approve new drugs and procedures if the expected benefits outweigh the expected costs. If they actually did this, their errors on both sides (approvals of bad drugs vs. rejections of good drugs) would be roughly equal. The most-publicized drug withdrawal in the past 10 years was that of Vioxx, which the FDA estimated killed a total of 5165 people over 5 years. This suggests that the best drug that the FDA rejected during that decade could have saved 1000 people/year. During that decade, many drugs were (or could have been) approved that might save more than that many lives every year. Gleevec (invented 1993, approved 2001) is believed to save about 10,000 lives a year. Herceptin (invented in the 1980s, began human trials 1991, approved for some patients in 1998, more in 2006, and more in 2010) was estimated to save 1,000 lives a year in the United Kingdom, which would translate to 5,000 lives a year in the US. Patients on Apixaban (discovered in 2006, not yet approved) have 11% fewer deaths from stroke than patients on warfarin, and stroke causes about 140,000 deaths/year in the US. To stay below the expected drug-rejection error level of 1000 people/year, given just these three drugs (and assuming that Apixaban pans out and can save 5,000 lives/year), the FDA would need to have a faulty-rejection rate F such that F×10,000 + F×5,000 + F×5,000 < 1,000, i.e. F < 5%. This seems unlikely.
ADDED: One area where this affects me every day is in branching software repositories. Every software developer agrees that branching the repository head for test versions and for production versions is good practice. Yet branching causes, I would estimate, at least half of our problems with test and production releases. It is common for me to be delayed one to three days while someone figures out that the software isn't running because they issued a patch on one branch and forgot to update the trunk, or forgot to update other development or test versions that are on separate branches. I don't believe in branching anymore - I think we would have fewer bugs if we just did all development on the trunk, and checked out the code when it worked. Branching is good for humongous projects where you have public releases that you can't patch on the head, like Firefox or Linux. But it's out of place for in-house projects where you can just patch the head and re-checkout. The evidence for this in my personal experience as a software developer is overwhelming; yet whenever I suggest not branching, I'm met with incredulity.
Exercise for the reader: Find other cases where cautionary measures are <EDIT>taken past the point of marginal utility</EDIT>.
ADDED: I think that this is the problem: You have observed a distribution of outcome utilities from some category of event followed by you taking some action A. You observe a new instance of this event. You want to predict the outcome utility of action A for this event.
Some categories have a power-law outcome distribution with a negative exponent b, indicating there are fewer events of large importance: number of events of size U = e^(c - bU). Assume that you don't observe all possible values of U. Events of importance < U_0 are too small to observe; and events with large U are very uncommon. It is then difficult to tell whether the category has a power-law distribution without a lot of previous observations.
If a lot of event categories have a distribution like this, where big impacts are bad, and they are usually insignificant but sometimes catastrophic, then it's likely rational to treat these events as if they will be catastrophic. And if you don't have enough observations to know if the distribution is a power-law, or something else, it's rational to treat it as if it were a power-law distribution to be safe.
Could this account for the human risk-aversion "bias"?
If you are the FDA, you are faced with situations where the utility distribution is probably such a power-law distribution mirrored around zero, so there are a few events with very high utility (save lots of lives), and a similar number of events with the negative of that utility (lose that many lives). I would guess that situations like that are rare in our ancestral environment, though I don't know.