UC Berkeley launches Center for Human-Compatible Artificial Intelligence
Source article: http://news.berkeley.edu/2016/08/29/center-for-human-compatible-artificial-intelligence/
UC Berkeley artificial intelligence (AI) expert Stuart Russell will lead a new Center for Human-Compatible Artificial Intelligence, launched this week.
Russell, a UC Berkeley professor of electrical engineering and computer sciences and the Smith-Zadeh Professor in Engineering, is co-author of Artificial Intelligence: A Modern Approach, which is considered the standard text in the field of artificial intelligence, and has been an advocate for incorporating human values into the design of AI.
The primary focus of the new center is to ensure that AI systems are beneficial to humans, he said.
The co-principal investigators for the new center include computer scientists Pieter Abbeel and Anca Dragan and cognitive scientist Tom Griffiths, all from UC Berkeley; computer scientists Bart Selman and Joseph Halpern, from Cornell University; and AI experts Michael Wellman and Satinder Singh Baveja, from the University of Michigan. Russell said the center expects to add collaborators with related expertise in economics, philosophy and other social sciences.
The center is being launched with a grant of $5.5 million from the Open Philanthropy Project, with additional grants for the center’s research from the Leverhulme Trust and the Future of Life Institute.
Russell is quick to dismiss the imaginary threat from the sentient, evil robots of science fiction. The issue, he said, is that machines as we currently design them in fields like AI, robotics, control theory and operations research take the objectives that we humans give them very literally. Told to clean the bath, a domestic robot might, like the Cat in the Hat, use mother’s white dress, not understanding that the value of a clean dress is greater than the value of a clean bath.
The center will work on ways to guarantee that the most sophisticated AI systems of the future, which may be entrusted with control of critical infrastructure and may provide essential services to billions of people, will act in a manner that is aligned with human values.
“AI systems must remain under human control, with suitable constraints on behavior, despite capabilities that may eventually exceed our own,” Russell said. “This means we need cast-iron formal proofs, not just good intentions.”
One approach Russell and others are exploring is called inverse reinforcement learning, through which a robot can learn about human values by observing human behavior. By watching people dragging themselves out of bed in the morning and going through the grinding, hissing and steaming motions of making a caffè latte, for example, the robot learns something about the value of coffee to humans at that time of day.
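To give a flavor of inverse reinforcement learning, here is a toy sketch (mine, not the center's; the features, data, and update rule are invented for illustration). A learner adjusts reward weights until the choices a human actually made score higher than the alternatives the human passed up:

```python
import numpy as np

def irl_feature_matching(demo_features, alt_features, lr=0.1, steps=200):
    """Toy inverse reinforcement learning: find reward weights w such that
    demonstrated behavior scores higher than alternative behavior.
    demo_features: (n, d) features of actions a human actually took.
    alt_features:  (n, d) features of actions the human did not take.
    Uses a simple perceptron-style update, not a full IRL algorithm."""
    d = demo_features.shape[1]
    w = np.zeros(d)
    for _ in range(steps):
        # Reward is assumed linear in features: r(a) = w . phi(a)
        margin = demo_features @ w - alt_features @ w
        mistakes = margin <= 0
        if not mistakes.any():
            break
        # Push weights toward the features of demonstrated actions
        w += lr * (demo_features[mistakes] - alt_features[mistakes]).mean(axis=0)
    return w

# Features: [involves_morning_coffee, takes_extra_effort]
demos = np.array([[1.0, 1.0], [1.0, 0.5]])   # human makes coffee despite the effort
alts  = np.array([[0.0, 0.0], [0.0, 0.0]])   # human could have skipped it
w = irl_feature_matching(demos, alts)
# The learned weights rank coffee-making above skipping it
```

The point is only that observed behavior constrains the inferred reward function; real IRL methods must also handle noisy, inconsistent demonstrators.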
“Rather than have robot designers specify the values, which would probably be a disaster,” said Russell, “instead the robots will observe and learn from people. Not just by watching, but also by reading. Almost everything ever written down is about people doing things, and other people having opinions about it. All of that is useful evidence.”
Russell and his colleagues don’t expect this to be an easy task.
“People are highly varied in their values and far from perfect in putting them into practice,” he acknowledged. “These aspects cause problems for a robot trying to learn what it is that we want and to navigate the often conflicting desires of different individuals.”
Russell, who recently wrote an optimistic article titled “Will They Make Us Better People?,” summed it up this way: “In the process of figuring out what values robots should optimize, we are making explicit the idealization of ourselves as humans. As we envision AI aligned with human values, that process might cause us to think more about how we ourselves really should behave, and we might learn that we have more in common with people of other cultures than we think.”
Notes on the Safety in Artificial Intelligence conference
These are my notes and observations after attending the Safety in Artificial Intelligence (SafArtInt) conference, which was co-hosted by the White House Office of Science and Technology Policy and Carnegie Mellon University on June 27 and 28. This isn't an organized summary of the content of the conference; rather, it's a selection of points which are relevant to the control problem. As a result, it suffers from selection bias: it makes it look like superintelligence and control-problem-relevant issues were discussed frequently, when in reality they came up less often and I didn't write much about the more mundane parts.
SafArtInt was the third in a planned series of four conferences. The purpose of the conference series was twofold: the OSTP wanted to get other parts of the government moving on AI issues, and they also wanted to inform public opinion.
The other three conferences are about near term legal, social, and economic issues of AI. SafArtInt was about near term safety and reliability in AI systems. It was effectively the brainchild of Dr. Ed Felten, the deputy U.S. chief technology officer for the White House, who came up with the idea for it last year. CMU is a top computer science university, and many of its own researchers attended, as well as some students. There were also researchers from other universities, some people from private sector AI including both Silicon Valley and government contracting, government researchers and policymakers from groups such as DARPA and NASA, a few people from the military/DoD, and a few control problem researchers. As far as I could tell, everyone except a few university researchers was from the U.S., although I did not meet many people. There were about 70-100 people watching the presentations at any given time, and I had conversations with about twelve of the people who were not affiliated with existential risk organizations, as well as, of course, all of those who were. The conference was split with a few presentations on the 27th and the majority of presentations on the 28th. Not everyone was there for both days.
Felten believes that neither "robot apocalypses" nor "mass unemployment" is likely. It soon became apparent that the majority of others present at the conference felt the same way with regard to superintelligence. The general intention among researchers and policymakers at the conference could be summarized as follows: we need to make sure that the AI systems we develop in the near future will not be responsible for any accidents, because if accidents do happen then they will spark public fears about AI, which would lead to a dearth of funding for AI research and an inability to realize the corresponding social and economic benefits. Of course, that doesn't change the fact that they strongly care about safety in its own right and have significant pragmatic needs for robust and reliable AI systems.
Most of the talks were about verification and reliability in modern day AI systems. So they were concerned with AI systems that would give poor results or be unreliable in the narrow domains where they are being applied in the near future. They mostly focused on "safety-critical" systems, where failure of an AI program would result in serious negative consequences: automated vehicles were a common topic of interest, as well as the use of AI in healthcare systems. A recurring theme was that we have to be more rigorous in demonstrating safety and do actual hazard analyses on AI systems, and another was that we need the AI safety field to succeed in ways that the cybersecurity field has failed. Another general belief was that long term AI safety, such as concerns about the ability of humans to control AIs, was not a serious issue.
On average, the presentations were moderately technical. They were mostly focused on machine learning systems, although there was significant discussion of cybersecurity techniques.
The first talk was given by Eric Horvitz of Microsoft. He discussed some approaches for pushing in new directions in AI safety. Instead of merely trying to reduce the errors spotted according to one model, we should look out for "unknown unknowns" by stacking models and looking at problems which appear on any of them, a theme that other researchers would return to in later presentations. He discussed optimization under uncertain parameters, sensitivity analysis to uncertain parameters, and 'wireheading' or short-circuiting of reinforcement learning systems (which he believes can be guarded against by using 'reflective analysis'). Finally, he brought up the concerns about superintelligence, which sparked amused reactions in the audience. He said that scientists should address concerns about superintelligence, which he aptly described as the 'elephant in the room', noting that it was the reason that some people were at the conference. He said that scientists will have to engage with public concerns, while also noting that there were experts who were worried about superintelligence and that there would have to be engagement with those experts' concerns. He did not comment on whether he believed that these concerns were reasonable.
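The "stacking models" idea can be sketched roughly as follows; this is my own illustrative toy, not code from the talk. Several independently trained models vote on each input, and disagreement is treated as a warning sign of a possible unknown unknown:

```python
def flag_unknown_unknowns(models, inputs):
    """Run several independently trained models over the inputs and flag any
    input on which they disagree, treating disagreement as a signal that the
    input may lie outside what any single model handles reliably.
    (Illustrative interfaces; the models below are stand-ins.)"""
    flagged = []
    for x in inputs:
        predictions = [m(x) for m in models]
        if len(set(predictions)) > 1:  # models disagree on this input
            flagged.append(x)
    return flagged

# Toy usage: three classifiers that agree everywhere except one odd input
models = [lambda x: "cat" if x < 10 else "dog",
          lambda x: "cat" if x < 12 else "dog",
          lambda x: "cat" if x < 10 else "dog"]
flagged = flag_unknown_unknowns(models, [3, 11, 20])
# 11 is flagged: the models disagree there
```

Errors that appear on every model would still slip through; the technique only widens the net beyond any single model's blind spots.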
An issue which came up in the Q&A afterwards was that we need to deal with mis-structured utility functions in AI, because the specific tradeoffs and utilities which humans claim to value often lead to results which the humans don't like. So we need to have structural uncertainty about our utility models. The difficulty of finding good objective functions for AIs would be discussed in many other presentations as well.
The next talk was given by Andrew Moore of Carnegie Mellon University, who claimed that his talk represented the consensus of computer scientists at the school. He claimed that the stakes of AI safety were very high - namely, that AI has the capability to save many people's lives in the near future, but if there are any accidents involving AI then public fears could lead to freezes in AI research and development. He highlighted the public's irrational tendencies wherein a single accident could cause people to overlook and ignore hundreds of invisible lives saved. He specifically mentioned a 12-24 month timeframe for these issues.
Moore said that verification of AI system safety will be difficult due to the combinatorial explosion of AI behaviors. He talked about meta-machine-learning as a solution to this, something which is being investigated under the direction of Lawrence Schuette at the Office of Naval Research. Moore also said that military AI systems require high verification standards and that development timelines for these systems are long. He talked about two different approaches to AI safety, stochastic testing and theorem proving - the process of doing the latter often leads to the discovery of unsafe edge cases.
He also discussed AI ethics, giving an example 'trolley problem' where AI cars would have to choose whether to hit a deer in order to provide a slightly higher probability of survival for the human driver. He said that we would need hash-defined constants to tell vehicle AIs how many deer a human is worth. He also said that we would need to find compromises in death-pleasantry tradeoffs, for instance where the safety of self-driving cars depends on the speed and routes on which they are driven. He compared the issue to civil engineering where engineers have to operate with an assumption about how much money they would spend to save a human life.
He concluded by saying that we need policymakers, company executives, scientists, and startups to all be involved in AI safety. He said that the research community stands to gain or lose together, and that there is a shared responsibility among researchers and developers to avoid triggering another AI winter through unsafe AI designs.
The next presentation was by Richard Mallah of the Future of Life Institute, who was there to represent "Medium Term AI Safety". He pointed out the explicit/implicit distinction between different modeling techniques in AI systems, as well as the explicit/implicit distinction between different AI actuation techniques. He talked about the difficulty of value specification and the concept of instrumental subgoals as an important issue in the case of complex AIs which are beyond human understanding. He said that even a slight misalignment of AI values with regard to human values along one parameter could lead to a strongly negative outcome, because machine learning parameters don't strictly correspond to the things that humans care about.
Mallah stated that open-world discovery leads to self-discovery, which can lead to reward hacking or a loss of control. He underscored the importance of causal accounting, which is distinguishing causation from correlation in AI systems. He said that we should extend machine learning verification to self-modification. Finally, he talked about introducing non-self-centered ontology to AI systems and bounding their behavior.
The audience was generally quiet and respectful during Richard's talk. I sensed that at least a few of them labelled him as part of the 'superintelligence out-group' and dismissed him accordingly, but I did not learn what most people's thoughts or reactions were. In the next panel featuring three speakers, he wasn't the recipient of any questions regarding his presentation or ideas.
Tom Mitchell from CMU gave the next talk. He talked about both making AI systems safer, and using AI to make other systems safer. He said that risks to humanity from other kinds of issues besides AI were the "big deals of 2016" and that we should make sure that the potential of AIs to solve these problems is realized. He wanted to focus on the detection and remediation of all failures in AI systems. He said that it is a novel issue that learning systems defy standard pre-testing ("as Richard mentioned") and also brought up the purposeful use of AI for dangerous things.
Some interesting points were raised in the panel. Andrew did not have a direct response to the implications of AI ethics being determined by the predominantly white people of the US/UK where most AIs are being developed. He said that ethics in AIs will have to be decided by society, regulators, manufacturers, and human rights organizations in conjunction. He also said that our cost functions for AIs will have to get more and more complicated as AIs get better, and he said that he wants to separate unintended failures from superintelligence type scenarios. On trolley problems in self driving cars and similar issues, he said "it's got to be complicated and messy."
Dario Amodei of Google Brain, who co-authored the paper on concrete problems in AI safety, gave the next talk. He said that the public focus is too much on AGI/ASI and wants more focus on concrete/empirical approaches. He discussed the same problems that pose issues in advanced general AI, including flawed objective functions and reward hacking. He said that he sees long term concerns about AGI/ASI as "extreme versions of accident risk" and that he thinks it's too early to work directly on them, but he believes that if you want to deal with them then the best way to do it is to start with safety in current systems. Mostly he summarized the Google paper in his talk.
In her presentation, Claire Le Goues of CMU said "before we talk about Skynet we should focus on problems that we already have." She mostly talked about analogies between software bugs and AI safety, the similarities and differences between the two and what we can learn from software debugging to help with AI safety.
Robert Rahmer of IARPA discussed CAUSE, a cyberintelligence forecasting program which promises to help predict cyber attacks. It is a program which is still being put together.
In the panel of the above three, autonomous weapons were discussed, but no clear policy stances were presented.
John Launchbury gave a talk on DARPA research and the big picture of AI development. He pointed out that DARPA work leads to commercial applications and that progress in AI comes from sustained government investment. He classified AI capabilities into "describing," "predicting," and "explaining" in order of increasing difficulty, and he pointed out that old fashioned "describing" still plays a large role in AI verification. He said that "explaining" AIs would need transparent decisionmaking and probabilistic programming (the latter would also be discussed by others at the conference).
The next talk came from Jason Gaverick Matheny, the director of IARPA. Matheny talked about four requirements in current and future AI systems: verification, validation, security, and control. He wanted "auditability" in AI systems as a weaker form of explainability. He talked about the importance of "corner cases" for national intelligence purposes, the low probability, high stakes situations where we have limited data - these are situations where we have significant need for analysis but where the traditional machine learning approach doesn't work because of its overwhelming focus on data. Another aspect of national defense is that it has a slower decision tempo, longer timelines, and longer-viewing optics about future events.
He said that assessing local progress in machine learning development would be important for global security and that we therefore need benchmarks to measure progress in AIs. He ended with a concrete invitation for research proposals from anyone (educated or not), for both large scale research and for smaller studies ("seedlings") that could take us "from disbelief to doubt".
The difference in timescales between different groups was something I noticed later on, after hearing someone from the DoD describe their agency as having a longer timeframe than the Department of Homeland Security, and someone from the White House describe their work as being crisis reactionary.
The next presentation was from Andrew Grotto, senior director of cybersecurity policy at the National Security Council. He drew a close parallel between the issue of genetically modified crops in Europe in the 1990s and modern day artificial intelligence. He pointed out that Europe utterly failed to achieve widespread cultivation of GMO crops as a result of public backlash. He said that the widespread economic and health benefits of GMO crops were ignored by the public, who instead focused on a few health incidents which undermined trust in the government and crop producers. He had three key points: that risk frameworks matter, that you should never assume that the benefits of new technology will be widely perceived by the public, and that we're all in this together with regard to funding, research progress and public perception.
In the Q&A between Launchbury, Matheny, and Grotto after Grotto's presentation, it was mentioned that the economic interests of farmers worried about displacement also played a role in populist rejection of GMOs, and that a similar dynamic could play out with regard to automation causing structural unemployment. Grotto was also asked what to do about bad publicity which seeks to sink progress in order to avoid risks. He said that meetings like SafArtInt and open public dialogue were good.
One person asked what Launchbury wanted to do about AI arms races with multiple countries trying to "get there" and whether he thinks we should go "slow and secure" or "fast and risky" in AI development, a question which provoked laughter in the audience. He said we should go "fast and secure" and wasn't concerned. He said that secure designs for the Internet once existed, but the one which took off was the one which was open and flexible.
Another person asked how we could avoid discounting outliers in our models, referencing Matheny's point that we need to include corner cases. Matheny affirmed that data quality is a limiting factor in many of our machine learning capabilities, and said that at IARPA they generally try to include outliers until they are sure that the outliers are erroneous.
Another presentation came from Tom Dietterich, president of the Association for the Advancement of Artificial Intelligence. He said that we have not focused enough on safety, reliability and robustness in AI and that this must change. Much like Eric Horvitz, he drew a distinction between robustness against errors within the scope of a model and robustness against unmodeled phenomena. On the latter issue, he talked about solutions such as expanding the scope of models, employing multiple parallel models, and doing creative searches for flaws - the latter doesn't enable verification that a system is safe, but it nevertheless helps discover many potential problems. He talked about knowledge-level redundancy as a method of avoiding misspecification - for instance, systems could identify objects by an "ownership facet" as well as by a "goal facet" to produce a combined concept with less likelihood of overlooking key features. He said that this would require wider experiences and more data.
There were many other speakers who brought up a similar set of issues: the use of cybersecurity techniques to verify machine learning systems, the failures of cybersecurity as a field, opportunities for probabilistic programming, and the need for better success in AI verification. Inverse reinforcement learning was extensively discussed as a way of assigning values. Jeannette Wing of Microsoft talked about the need for AIs to reason about the continuous and the discrete in parallel, as well as the need for them to reason about uncertainty (with potential meta levels all the way up). One point which was made by Sarah Loos of Google was that proving the safety of an AI system can be computationally very expensive, especially given the combinatorial explosion of AI behaviors.
In one of the panels, the idea of government actions to ensure AI safety was discussed. No one was willing to say that the government should regulate AI designs. Instead they stated that the government should be involved in softer ways, such as guiding and working with AI developers, and setting standards for certification.
Pictures: https://imgur.com/a/49eb7
In between these presentations I had time to speak to individuals and listen in on various conversations. A high ranking person from the Department of Defense stated that the real benefit of autonomous systems would be in terms of logistical systems rather than weaponized applications. A government AI contractor drew the connection between Mallah's presentation and the recent press revolving around superintelligence, and said he was glad that the government wasn't worried about it.
I talked to some insiders about the status of organizations such as MIRI, and found that the current crop of AI safety groups could use additional donations to become more established and expand their programs. There may be some issues with the organizations being sidelined; after all, the Google Brain paper was essentially similar to a lot of work by MIRI, just expressed in somewhat different language, and was more widely received in mainstream AI circles.
In terms of careers, I found that there is significant opportunity for a wide range of people to contribute to improving government policy on this issue. Working at a group such as the Office of Science and Technology Policy does not necessarily require advanced technical education, as you can just as easily enter straight out of a liberal arts undergraduate program and build a successful career as long as you are technically literate. (At the same time, the level of skepticism about long term AI safety at the conference hinted to me that the signalling value of a PhD in computer science would be significant.) In addition, there are large government budgets in the seven or eight figure range available for qualifying research projects. I've come to believe that it would not be difficult to find or create AI research programs that are relevant to long term AI safety while also being practical and likely to be funded by skeptical policymakers and officials.
I also realized that there is a significant need for people who are interested in long term AI safety to have basic social and business skills. Since there is so much need for persuasion and compromise in government policy, there is a lot of value to be had in being communicative, engaging, approachable, appealing, socially savvy, and well-dressed. This is not to say that everyone involved in long term AI safety is missing those skills, of course.
I was surprised by the refusal of almost everyone at the conference to take long term AI safety seriously, as I had previously held the belief that it was more of a mixed debate given the existence of expert computer scientists who were involved in the issue. I sensed that the recent wave of popular press and public interest in dangerous AI has made researchers and policymakers substantially less likely to take the issue seriously. None of them seemed to be familiar with actual arguments or research on the control problem, so their opinions didn't significantly change my outlook on the technical issues. I strongly suspect that the majority of them had their first or possibly only exposure to the idea of the control problem after seeing badly written op-eds and news editorials featuring comments from the likes of Elon Musk and Stephen Hawking, which would naturally make them strongly predisposed to not take the issue seriously. In the run-up to the conference, websites and press releases didn't say anything about whether this conference would be about long or short term AI safety, and they didn't make any reference to the idea of superintelligence.
I sympathize with the concerns and strategy given by people such as Andrew Moore and Andrew Grotto, which make perfect sense if (and only if) you assume that worries about long term AI safety are completely unfounded. For the community that is interested in long term AI safety, I would recommend that we avoid competitive dynamics by (a) demonstrating that we are equally strong opponents of bad press, inaccurate news, and irrational public opinion which promotes generic uninformed fears over AI, (b) explaining that we are not interested in removing funding for AI research (even if you think that slowing down AI development is a good thing, restricting funding yields only limited benefits in terms of changing overall timelines, whereas those who are not concerned about long term AI safety would see a restriction of funding as a direct threat to their interests and projects, so it makes sense to cooperate here in exchange for other concessions), and (c) showing that we are scientifically literate and focused on the technical concerns. I do not believe that there is necessarily a need for the two "sides" on this to be competing against each other, so it was disappointing to see an implication of opposition at the conference.
Anyway, Ed Felten announced a request for information from the general public, seeking popular and scientific input on the government's policies and attitudes towards AI: https://www.whitehouse.gov/webform/rfi-preparing-future-artificial-intelligence
Overall, I learned quite a bit and benefited from the experience, and I hope the insight I've gained can be used to improve the attitudes and approaches of the long term AI safety community.
[paper] Defining human values for value learners
MIRI recently blogged about the workshop paper that I presented at AAAI.
My abstract:
Hypothetical “value learning” AIs learn human values and then try to act according to those values. The design of such AIs, however, is hampered by the fact that there exists no satisfactory definition of what exactly human values are. After arguing that the standard concept of preference is insufficient as a definition, I draw on reinforcement learning theory, emotion research, and moral psychology to offer an alternative definition. In this definition, human values are conceptualized as mental representations that encode the brain’s value function (in the reinforcement learning sense) by being imbued with a context-sensitive affective gloss. I finish with a discussion of the implications that this hypothesis has on the design of value learners.
Their summary:
Economic treatments of agency standardly assume that preferences encode some consistent ordering over world-states revealed in agents’ choices. Real-world preferences, however, have structure that is not always captured in economic models. A person can have conflicting preferences about whether to study for an exam, for example, and the choice they end up making may depend on complex, context-sensitive psychological dynamics, rather than on a simple comparison of two numbers representing how much one wants to study or not study.
Sotala argues that our preferences are better understood in terms of evolutionary theory and reinforcement learning. Humans evolved to pursue activities that are likely to lead to certain outcomes — outcomes that tended to improve our ancestors’ fitness. We prefer those outcomes, even if they no longer actually maximize fitness; and we also prefer events that we have learned tend to produce such outcomes.
Affect and emotion, on Sotala’s account, psychologically mediate our preferences. We enjoy and desire states that are highly rewarding in our evolved reward function. Over time, we also learn to enjoy and desire states that seem likely to lead to high-reward states. On this view, our preferences function to group together events that lead on expectation to similarly rewarding outcomes for similar reasons; and over our lifetimes we come to inherently value states that lead to high reward, instead of just valuing such states instrumentally. Rather than directly mapping onto our rewards, our preferences map onto our expectation of rewards.
Sotala proposes that value learning systems informed by this model of human psychology could more reliably reconstruct human values. On this model, for example, we can expect human preferences to change as we find new ways to move toward high-reward states. New experiences can change which states my emotions categorize as “likely to lead to reward,” and they can thereby modify which states I enjoy and desire. Value learning systems that take these facts about humans’ psychological dynamics into account may be better equipped to take our likely future preferences into account, rather than optimizing for our current preferences alone.
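One way to make the "preferences track expected reward" idea concrete is the standard temporal-difference update from reinforcement learning. In this toy sketch (states and numbers are invented for illustration, and not taken from the paper), a state that merely predicts reward acquires value of its own, much as the summary describes:

```python
def td_update(values, state, next_state, reward, alpha=0.1, gamma=0.9):
    """Standard temporal-difference learning update: the value estimate for a
    state drifts toward (reward received) + (discounted value of the state
    that follows it)."""
    target = reward + gamma * values.get(next_state, 0.0)
    values[state] += alpha * (target - values.get(state, 0.0))
    return values

# A state that reliably precedes reward acquires value of its own:
values = {"smell_coffee": 0.0, "drink_coffee": 0.0}
for _ in range(100):
    values = td_update(values, "drink_coffee", "done", reward=1.0)
    values = td_update(values, "smell_coffee", "drink_coffee", reward=0.0)
# "smell_coffee" ends up valued even though it was never directly rewarded
```

On the account summarized above, something like this learned value, rather than the reward signal itself, is what human preferences would correspond to.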
Would be curious to hear whether anyone here has any thoughts. This is basically a "putting rough ideas together and seeing if they make any sense" kind of paper, aimed at clarifying the hypothesis and seeing whether others can find any obvious holes in it, rather than being a finished scientific theory.
The AI That Pretends To Be Human
The hard part about containing AI is restricting its output. The AI can lie, manipulate, and trick. Some speculate that it might be able to do far worse, inventing infohazards like hypnosis or brain hacking.
A major goal of the control problem is preventing AIs from doing that: ensuring that their output is safe and useful.
A while ago I wrote about an approach to do this. The idea was to require the AI to use as little computing power as it needs to perform a task. This prevents the AI from over-optimizing. The AI won't use the full power of superintelligence unless it really needs it.
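A minimal sketch of that constraint might look like the following (all names and interfaces here are invented for illustration; the real proposal concerns an agent's internal objective, not a simple wrapper):

```python
def choose_action(candidates, solves_task, compute_cost):
    """Among candidate outputs that actually accomplish the task, pick the one
    that was produced with the least computation, rather than the one with the
    highest score. This mirrors the 'use as little computing power as needed'
    idea: the agent has no incentive to over-optimize past task success."""
    feasible = [c for c in candidates if solves_task(c)]
    if not feasible:
        return None  # no acceptable solution was found
    return min(feasible, key=compute_cost)

# Toy usage: candidates are (solution, cost) pairs
cands = [("ok_plan", 3), ("overengineered_plan", 10), ("broken_plan", 1)]
best = choose_action(
    cands,
    solves_task=lambda c: c[0] != "broken_plan",
    compute_cost=lambda c: c[1],
)
# best is ("ok_plan", 3): the cheapest candidate that still solves the task
```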
The above method isn't perfect, because a superintelligent AI may still have super abilities, even when it has a limited compute budget. And it will still try to do all those bad things mentioned above, unless it can find a better way that uses less computing power.
We don't really want to minimize the AI's compute budget. We just want it to act more human-like. When we ask it to solve a problem, we want it to try to solve the problem as a human would. All of the problems arise from the AI behaving totally differently than a human would, not necessarily from it using more computing power.
Is there any way to make an AI act like a human? Or is that as impossible as just solving FAI?
Stuart Armstrong posted an idea on how to achieve that. It involves an AI that tries to mimic output that is indistinguishable from what a human would produce, conditional on it also solving the problem. However, his idea also involves whole brain emulation and a complicated method for the AI to predict the future and avoid changing it too much.
We don't need that. The idea for the AI to try to be indistinguishable from a human is actually a really interesting control idea. It significantly limits the AI's ability to Do Sneaky Things, since humans don't do those things.
Here is my idea on how to achieve this. It's the same as my limited computing power idea mentioned earlier, but applied to "appearing human" as a goal, instead of minimizing computing power.
We start with two AIs. The first AI tries to appear human. The second AI performs a sort of Turing test, and tries to tell the first AI apart from an actual human.
The human AI's first goal is to produce a correct solution. It will predict how likely its solution is to be accepted, and it will try to produce a solution that has a >50% chance of working. This is a satisficing goal. That is, it doesn't optimize without end to create the best solution possible; it just has to meet some minimum threshold, then stop.
Its second goal will be to fool the second AI. It will do everything in its power to make the solution appear like it was produced by a human.
The second AI tries to predict how likely it is that its input was produced by a human. It is also conditioned on the input being a solution to the problem, so the fact that the first AI solved a problem that was really hard for humans does not count against it.
This prevents the first AI from doing evil things with its output. If it tries to insert complicated infohazards or subagents into its output stream, it will be easily detected as an AI. Instead it needs to mimic humans as closely as possible.
Note that the first AI only cares about the outcome of this test. The test doesn't need to actually be performed; it can just be a threat that the AI is conditioned on. (The ability to condition AIs on acting as if they were in counterfactual worlds is a critical piece of this, and deserves its own writeup somewhere. It's an important tool in the FAI toolkit, and a relatively simple method that can work even on current AIs.)
Another point is that the second AI should have more computing power than the first AI. That way the first AI can't get a slight advantage over it and be able to hack it or manipulate it somehow. Instead it has to focus its efforts on being indistinguishable from a human, even to an agent smarter than itself.
This method also relies on AI boxing. The first AI needs to be contained in a box so that its only output goes to a text terminal or to the second AI. However, it doesn't require any of the more complicated machinery required for other reduced impact methods.
The AIs also aren't agents, in the sense that they don't need to maximize any long-term utility function. The first AI only cares about answering the question it has been asked, and the second AI only does prediction.
An example use of this is to create an oracle AI that answers your questions. The human judges somehow whether the answer is correct, or at least good. The AI tries to produce answers which the human judge will probably like, but which also are indistinguishable from answers humans would produce.
Such an oracle would be tremendously useful. The human could ask the AI to produce new AI and FAI papers, which would help immensely speed up our progress towards true FAI.
Similar techniques could be applied to very different domains. For example controlling a robot. The AI will try to produce outputs that look like a human is controlling it.
This would require some modifications to the above method, so that the AI doesn't try to use the robot to hack its own brain or anything. This could be done by conditioning it on a counterfactual world where its outputs are going to the second AI rather than a real robot.
Another point is the AI doesn't need to mimic average humans given average resources. It could be conditioned on the human having had tons of time to come up with an answer. E.g. producing an answer that a human would have come up with given a year. Or controlling the robot the same way as a human given tons of time to practice, or in a very slowed down simulation.
I would like to note a parallel with a method in current AI research: Generative Adversarial Networks. A GAN consists of two networks, one of which tries to produce output that fools the other, while the other tries to predict which samples were produced by the first network and which are part of the actual data distribution.
It's quite similar to this. GANs have been used successfully to create images that look like real images, which is a hard problem in AI research. In the future, GANs might be used to produce text that is indistinguishable from human writing (the current method for doing that, predicting the next character a human would type, is kind of crude).
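The parallel can be made concrete with a toy adversarial loop. This is a deliberately minimal sketch, not a real GAN: the "discriminator" just tracks the running mean of human samples and scores a value by its distance from that mean, and the "generator" does a crude finite-difference update on a single parameter to look more human-like. All names and numbers here are made up for illustration.

```python
import random

random.seed(0)

# "Human" outputs: numbers drawn around a true mean the generator never sees.
def human_sample():
    return random.gauss(5.0, 1.0)

class Discriminator:
    """Scores a sample as 'human-like' the closer it sits to its running
    estimate of the human mean."""
    def __init__(self):
        self.human_mean = 0.0
        self.n = 0

    def observe_human(self, x):
        self.n += 1
        self.human_mean += (x - self.human_mean) / self.n

    def humanlikeness(self, x):
        return -abs(x - self.human_mean)

class Generator:
    """A single parameter, nudged in whichever direction the discriminator
    scores as more human-like (a crude finite-difference gradient)."""
    def __init__(self):
        self.mu = 0.0

    def update(self, disc, lr=0.05, eps=0.1):
        grad = (disc.humanlikeness(self.mu + eps)
                - disc.humanlikeness(self.mu - eps)) / (2 * eps)
        self.mu += lr * grad

disc, gen = Discriminator(), Generator()
for _ in range(2000):
    disc.observe_human(human_sample())  # discriminator learns what humans look like
    gen.update(disc)                    # generator learns to fool the discriminator
```

After the loop, the generator's parameter has drifted to roughly the human mean: it has learned to produce output the discriminator can't tell apart from the human data, which is the whole adversarial dynamic in miniature.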
The case for value learning
This post is mainly fumbling around trying to define a reasonable research direction for contributing to FAI research. I've found that laying out what success looks like in the greatest possible detail is a personal motivational necessity. Criticism is strongly encouraged.
The power and intelligence of machines has been gradually and consistently increasing over time, and it seems likely that at some point machine intelligence will surpass that of humans. Before that point occurs, it is important that humanity manages to direct these powerful optimizers towards a target that humans find desirable.
This is difficult because humans as a general rule have a fairly fuzzy conception of their own values, and it seems unlikely that the millennia of argument surrounding what precisely constitutes eudaimonia are going to be satisfactorily wrapped up before the machines get smart. The most obvious solution is to try to leverage some of the novel intelligence of the machines to help resolve the issue before it is too late.
Lots of people regard using a machine to help you understand human values as a chicken and egg problem. They think that a machine capable of helping us understand what humans value must also necessarily be smart enough to do AI programming, manipulate humans, and generally take over the world. I am not sure that I fully understand why people believe this.
Part of it seems to be inherent in the idea of AGI, or an artificial general intelligence. There seems to be the belief that once an AI crosses a certain threshold of smarts, it will be capable of understanding literally everything. I have even heard people describe certain problems as "AI-complete", making an explicit comparison to ideas like Turing-completeness. If a Turing machine is a universal computer, why wouldn't there also be a universal intelligence?
To address the question of universality, we need to make a distinction between intelligence and problem solving ability. Problem solving ability is typically described as a function of both intelligence and resources, and just throwing resources at a problem seems to be capable of compensating for a lot of cleverness. But if problem-solving ability is tied to resources, then intelligent agents are in some respects very different from Turing machines, since Turing machines are all explicitly operating with an infinite amount of tape. Many of the existential risk scenarios revolve around the idea of the intelligence explosion, when an AI starts to do things that increase the intelligence of the AI so quickly that these resource restrictions become irrelevant. This is conceptually clean, in the same way that Turing machines are, but navigating these hard take-off scenarios well implies getting things absolutely right the first time, which seems like a less than ideal project requirement.
If an AI that knows a lot about AI results in an intelligence explosion, but we also want an AI that's smart enough to understand human values, is it possible to create an AI that can understand human values, but not AI programming? In principle it seems like this should be possible. Resources useful for understanding human values don't necessarily translate into resources useful for understanding AI programming. The history of AI development is full of tasks that were supposed to be solvable only by a machine smart enough to possess general intelligence, where significant progress was made in understanding and pre-digesting the task, allowing problems in the domain to be solved by much less intelligent AIs.
If this is possible, then the best route forward is focusing on value learning. The path to victory is working on building limited AI systems that are capable of learning and understanding human values, and then disseminating that information. This effectively softens the AI take-off curve in the most useful possible way, and allows us to practice building AI with human values before handing them too much power to control. Even if AI research is easy compared to the complexity of human values, a specialist AI might find thinking about human values easier than reprogramming itself, in the same way that humans find complicated visual/verbal tasks much easier than much simpler tasks like arithmetic. The human intelligence learning algorithm is trained on visual object recognition and verbal memory tasks, and it uses those tools to perform addition. A similarly specialized AI might be capable of rapidly understanding human values, but find AI programming as difficult as humans find determining whether 1007 is prime. As an additional incentive, value learning has an enormous potential for improving human rationality and the effectiveness of human institutions even without the creation of a superintelligence. A system that helped people better understand the mapping between values and actions would be a potent weapon in the struggle with Moloch.
Building a relatively unintelligent AI and giving it lots of human values resources to help it solve the human values problem seems like a reasonable course of action, if it's possible. There are some difficulties with this approach. One of these difficulties is that after a certain point, no amount of additional resources compensates for a lack of intelligence. A simple reflex agent like a thermostat doesn't learn from data and throwing resources at it won't improve its performance. To some extent you can make up for intelligence with data, but only to some extent. An AI capable of learning human values is going to be capable of learning lots of other things. It's going to need to build models of the world, and it's going to have to have internal feedback mechanisms to correct and refine those models.
If the plan is to create an AI and primarily feed it data on how to understand human values, and not feed it data on how to do AI programming and self-modify, that plan is complicated by the fact that inasmuch as the AI is capable of self-observation, it has access to sophisticated AI programming. I'm not clear on how much this access really means. My own introspection hasn't allowed me anything like hardware level access to my brain. While it seems possible to create an AI that can refactor its own code or create successors, it isn't obvious that AIs created for other purposes will have this ability by accident.
This discussion focuses on intelligence amplification as the example path to superintelligence, but other paths do exist. An AI with a sophisticated enough world model, even if somehow prevented from understanding AI, could still potentially increase its own power to threatening levels. Value learning is only the optimal way forward if human values are emergent, if they can be understood without a molecular level model of humans and the human environment. If the only way to understand human values is with physics, then human values isn't a meaningful category of knowledge with its own structure, and there is no way to create a machine that is capable of understanding human values, but not capable of taking over the world.
In the fairy tale version of this story, a research community focused on value learning manages to use specialized learning software to make the human value program portable, instead of only running on human hardware. Having a large number of humans involved in the process helps us avoid lots of potential pitfalls, especially the research overfitting to the values of the researchers via the typical mind fallacy. Partially automating introspection helps raise the sanity waterline. Humans practice coding the human value program, in whole or in part, into different automated systems. Once we're comfortable that our self-driving cars have a good grasp on the trolley problem, we use that experience to safely pursue higher risk research on recursive systems likely to start an intelligence explosion. FAI gets created and everyone lives happily ever after.
Whether value learning is worth focusing on seems to depend on the likelihood of the following claims. Please share your probability estimates (and explanations) with me because I need data points that originated outside of my own head.
- There is regular structure in human values that can be learned without requiring detailed knowledge of physics, anatomy, or AI programming. [poll:probability]
- Human values are so fragile that it would require a superintelligence to capture them with anything close to adequate fidelity. [poll:probability]
- Humans are capable of pre-digesting parts of the human values problem domain. [poll:probability]
- Successful techniques for value discovery of non-humans, (e.g. artificial agents, non-human animals, human institutions) would meaningfully translate into tools for learning human values. [poll:probability]
- Value learning isn't adequately being researched by commercial interests who want to use it to sell you things. [poll:probability]
- Practice teaching non-superintelligent machines to respect human values will improve our ability to specify a Friendly utility function for any potential superintelligence. [poll:probability]
- Something other than AI will cause human extinction sometime in the next 100 years. [poll:probability]
- All other things being equal, an additional researcher working on value learning is more valuable than one working on corrigibility, Vingean reflection, or some other portion of the FAI problem. [poll:probability]
Is there a recursive self-improvement hierarchy?
When we talk about recursively self-improving AI, the word "recursive" there is close enough to being literal rather than metaphoric that we glide over it without asking precisely what it means.
But it's not literally recursion—or is it?
The notion is that an AI has a function optimize(X) which optimizes itself. But it's recursion in the sense of modifying itself, not calling itself. You can imagine ways to do this that would use recursion—say, the paradigmatic executable that rewrites its source code, compiles it, and exec's it—but you can imagine many ways that would not involve any recursive calls.
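The "rewrites its source code, compiles it, and exec's it" paradigm can be sketched in a few lines. This is a toy of my own construction, nothing like real recursive self-improvement: the program regenerates a `step` function from a source template, exec's the new version, and each rewrite makes the next version slightly better at its task (here, driving `x` toward 0). Note there is no recursive call anywhere.

```python
# Hypothetical source template for the program's own "step" function.
TEMPLATE = "def step(x):\n    return x - {rate} * x\n"

def self_improving_run(initial_rate, generations):
    """Each generation: 'compile' the current self via exec, run it once,
    then rewrite the source so the next self is more effective."""
    x, rate = 10.0, initial_rate
    for _ in range(generations):
        namespace = {}
        exec(TEMPLATE.format(rate=rate), namespace)  # compile the current self
        x = namespace["step"](x)                     # run the current self
        rate = min(0.9, rate * 1.5)                  # rewrite: improve the next self
    return x

def fixed_run(rate, generations):
    """Baseline: the same program, never rewritten."""
    x = 10.0
    for _ in range(generations):
        x -= rate * x
    return x
```

Run for ten generations from the same starting rate, the self-rewriting version drives `x` far closer to the optimum than the fixed version, yet nothing in it is recursion in the function-calling sense.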
Can we define recursive self-improvement precisely enough that we can enumerate, explicitly or implicitly, all possible ways of accomplishing it, as clearly as we can list all possible ways of writing a recursive function? (You would want to choose one formalism to use, say lambda calculus.)
[Link] Differential Technology Development - Some Early Thinking
This article gives a simple model for thinking about the positive effects of a friendly AI vs. the negative effects of an unfriendly AI, and lets you plug in certain assumptions to see if speeding up AI progress is worthwhile. Thought some of you here might be interested.
http://blog.givewell.org/2015/09/30/differential-technological-development-some-early-thinking/
Against Expected Utility
Expected utility is optimal as the number of bets you take approaches infinity. You will lose bets on some days, and win bets on other days. But as you take more and more bets, the day to day randomness cancels out.
Say you want to save as many lives as possible. You can plug "number of lives saved" into an expected utility maximizer. And as the number of bets it takes increases, it will start to save more lives than any other method.
But the real world obviously doesn't have an infinite number of bets. And following this algorithm in practice will get you worse results. It is not optimal.
In fact, as Pascal's Mugging shows, this could get arbitrarily terrible. An agent following expected utility would just continuously make bets with muggers and worship various religions, until it runs out of resources. Or worse, the expected utility calculations don't even converge, and the agent doesn't make any decisions.
So how do we fix it? Well, we could just go back to the original line of reasoning that led us to expected utility, and fix it for finite cases. Instead of caring which method does the best on infinite bets, we might ask which one does the best in the most finite cases. That would get you median utility.
For most things, median utility will approximate expected utility. But for very very small risks, it will ignore them. It only cares that it does the best in most possible worlds. It won't ever trade away utility from the majority of your possible worlds to very very unlikely ones.
A naive implementation of median utility isn't actually viable, because at different points in time, the agent might make inconsistent decisions. To fix this, it needs to decide on policies instead of individual decisions. It will pick a decision policy which it believes will lead to the highest median outcome.
This does complicate making a real implementation of this procedure. But that's what you get when you generalize results and try to make things work in the messy real world, instead of idealized infinite worlds. The same issue occurs in the multi-armed bandit problem, where the optimal infinite solution is simple, but finite solutions are incredibly complicated (or simple but requiring brute force).
But if you do this, you don't need the independence axiom. You can be consistent and avoid money pumping without it, by not making decisions in isolation, but considering the entire probability space of decisions you will ever make, and choosing the best policies to navigate them.
It's interesting to note this actually solves some other problems. Such an agent would pick a policy that one-boxes on Newcomb's problems, simply because that is the optimal policy. Whereas a straightforward implementation of expected utility doesn't care.
But what if you really like the other mathematical properties of expected utility? What if we can just keep it and change something else? Like the probability function or the utility function?
Well, the probability function is sacred, IMO. Events should have the same probability of happening (given your prior knowledge) regardless of what utility function you have, or what you are trying to optimize. Changing it is probably inconsistent too: an agent could exploit you by offering bets in the areas where your beliefs are forced to differ from reality.
The utility function is not necessarily sacred though. It is inherently subjective, with the goal of just producing the behavior we want. Maybe there is some modification to it that could fix these problems.
It seems really inelegant to do this. We had a nice beautiful system where you could just count the number of lives saved, and maximize that. But assume we give up on that. How can we change the utility function to make it work?
Well you could bound utility to get out of mugging situations. After a certain level, your utility function just stops. It can't get any higher.
But then you are stuck with a bound. If you ever reach it, then you suddenly stop caring about saving any more lives. Now it's possible that your true utility function really is bounded. But it's not a fully general solution for all utility functions. And I don't believe that human utility is actually bounded, but that will have to be a different post.
You could transform the utility function so it is asymptotic. But this is just a continuous bound, and it doesn't solve much. It still makes you care less and less about obtaining more utility the closer you get to the bound.
Say you set your asymptote around 1,000. It could be much larger, but I need an example that is manageable. Now, what happens if you find yourself in a world where all utilities are multiplied by a large number, say 1,000? E.g. you save 1,000 lives in situations where, before, you would have saved only 1.

[Figure: an example asymptotic utility function capped at 1,000. Notice how 2,000 scores only slightly higher than 1,000, and everything after that is basically flat.]
Now the utility of each additional life is diminishing very quickly. Saving 2,000 lives might have only 0.001% more utility than 1,000 lives.
This means that you would not take a 1% risk of losing 1,000 people, for a 99% chance at saving 2,000.
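The arithmetic can be checked directly with one assumed asymptotic curve; the exact shape is a free choice, and here I use u(x) = 1000(1 - e^(-x/200)) purely for illustration:

```python
import math

CAP = 1000.0

def u(lives):
    # A toy asymptotic utility: approaches CAP and never exceeds it.
    # The 200.0 scale constant is an arbitrary illustrative choice.
    return CAP * (1 - math.exp(-lives / 200.0))

certain = u(1000)                      # save 1,000 lives for sure
gamble = 0.99 * u(2000) + 0.01 * u(0)  # 99% chance of saving 2,000 instead
```

In raw lives the gamble is clearly better (0.99 × 2000 = 1980 > 1000), but under the asymptotic utility the certain option wins, because u(2000) is barely above u(1000). The bound makes the agent refuse an obviously good trade.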
This is the exact opposite situation of Pascal's mugging! The probability of the reward is very high. Why are we refusing such an obviously good trade?
What we wanted to do was make it ignore really low probability bets. What we actually did was just make it stop caring about big rewards, regardless of the probability.
No modification to it can fix that. Because the utility function is totally indifferent to probability. That's what the decision procedure is for. That's where the real problem is.
In researching this topic I've seen all kinds of crazy resolutions to Pascal's Mugging. Some try to attack the exact thought experiment of an actual mugger. And miss the general problem of low probability events with large rewards. Others try to come up with clever arguments why you shouldn't pay the mugger. But not any general solution to the problem. And not one that works under the stated premises, where you care about saving human lives equally, and where you assign the mugger less than 1/3↑↑↑3 probability.
In fact Pascal's Mugger was originally written just to be a formalization of Pascal's original wager. Pascal's wager was dismissed for reasons like involving infinite utilities, and the possibility of an "anti-god" that exactly cancels the benefits out. Or that God wouldn't reward fake worshippers. People mostly missed the whole point about whether or not you should take low probability, high reward bets.
Pascal's Mugger showed that, no, the problem arises even in finite cases, and the probabilities do not have to exactly cancel each other out.
Some people tried to fix the problem by adding hacks on top of the probability or utility functions. I argued against these solutions above. The problem is fundamentally with the decision procedure of expected utility.
I've spoken to someone who decided to just bite the bullet. He accepted that our intuition about big numbers is probably wrong, and we should just do what the math tells us.
But even that doesn't work. One of the points made in the original Pascal's Mugging post is that EU doesn't even converge. There is a hypothesis which has even less probability than the mugger, but promises 3↑↑↑↑3 utility. And a hypothesis even smaller than that which promises 3↑↑↑↑↑3 utility, and so on. Expected utility is utterly dominated by increasingly more improbable hypotheses. The expected utility of all actions approaches positive or negative infinity.
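The divergence is easy to see with toy numbers standing in for the up-arrow utilities: make hypothesis k half as likely as hypothesis k−1 but promise four times the utility, and each term contributes more than the last, so the expected-utility sum grows without bound.

```python
# Toy stand-in for the diverging series: hypothesis k has prior ~ 2**-k
# but promises 4**k utility, so term k contributes 2**k to the sum.
def eu_partial_sum(n):
    return sum((2.0 ** -k) * (4.0 ** k) for k in range(1, n + 1))
```

Each partial sum is dominated by its final, least probable term, exactly the pathology in the original post: expected utility is hijacked by ever-more-improbable hypotheses and never converges.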
Expected utility is at the heart of the problem. We don't really want the average of our utility function over all possible worlds, no matter how big the numbers are or how improbable they may be. We don't really want to trade away utility from the majority of our probability mass to infinitesimal slices of it.
The whole justification for EU being optimal in the infinite case doesn't apply to the finite real world. The axioms that imply you need it to be consistent aren't true if you don't assume independence. So it's not sacred, and we can look at alternatives.
Median utility is just a first attempt at an alternative. We probably don't really want to maximize median utility either. Stuart Armstrong suggests using the mean of quantiles. There are probably better methods too. In fact there is an entire field of summary statistics and robust statistics, that I've barely looked at yet.
We can generalize and think of agents as having two utility functions: the regular utility function, which just gives a numerical value representing how preferable an outcome is, and a probability preference function, which gives a numerical value to each probability distribution over utilities.
Imagine we want to create an AI which acts the same as the agent would, given the same knowledge. Then we would need to know both of these functions. Not just the utility function. And they are both subjective, with no universally correct answer. Any function, so long as it converges (unlike expected utility), should produce perfectly consistent behavior.
Summoning the Least Powerful Genie
Stuart Armstrong recently posted a few ideas about restraining a superintelligent AI so that we can get useful work out of it. They are based on another idea of his, reduced impact. This is a quite elaborate and complicated way of limiting the amount of optimization power an AI can exert on the world. Basically, it tries to keep the AI from doing things that would make the world look too different than it already is.
First, why go to such great lengths to limit the optimization power of a superintelligent AI? Why not just not make it superintelligent to begin with? We only really want human level AI, or slightly above human level. Not a god-level being we can't even comprehend.
We can control the computer it is running on, after all. We can just give it slower processors, less memory, and perhaps even purposely throttle its code, e.g. restricting the size of its neural network, or other parameters that affect its intelligence.
The counterargument to this is that it might be quite tricky to limit AI intelligence. We don't know how much computing power is enough. We don't know where "above human level" ends and "dangerous superintelligence" begins.
The simplest way would be to just run copies of the AI repeatedly, increasing its computing power each time, until it solves the problem.
I have come up with a more elegant solution: put a penalty on the amount of computing power the AI uses, written into its utility function. The more computing power (and therefore intelligence and optimization) the AI uses, the more it is penalized. So it has an incentive to be as stupid as possible, only using the intelligence necessary to solve the problem.
But we do want the AI to use as much computational resources as it needs to solve the problem. Just no more. So the penalty should be conditional on actually solving the problem it is given.
If the solution is probabilistic, then the penalty is only applied after reaching a plan that has a certain probability of success. This might need to be measured by another AI which is not resource constrained, but only does prediction.
To give a concrete example, let's say I give the AI a task. Say, I ask it to come up with a plan to collect 10,080 paperclips.
Another AI is given unlimited resources to do purely prediction. Given a plan, it predicts the probability that it will succeed or fail. This AI is safe in the sense that it is not an agent. It has no goals, and just makes predictions about things.
The main AI does the optimization. It tries to generate a plan that has the highest probability of succeeding.
Normally, this might involve extremely overoptimized plans for building nanotechnology and taking over the world. The AI uses all of its available computing power. It tries to become as intelligent as possible by rewriting its code. Perhaps it becomes thousands of times more intelligent than humans, or millions of times more. It finds an answer that has a 99.99% probability of succeeding.
However, now we give it a different utility function. We instead have it minimize the time it takes to get to a plan that has a 90% chance of succeeding.
Under a time constraint, the AI races to get to a plan. It tries to be as efficient as possible. It doesn't invest in any meta level improvements unless they really help it. It doesn't try to engineer complicated nanotechnology. That would take precious time.
Effectively, we have summoned a genie that is only just as powerful as it needs to be to fulfill our wish. And not any more powerful. It actually tries to be as stupid as possible.
There are other possible constraints we could use, or use in addition to this. Minimizing time limits intelligence because it gets fewer CPU cycles. We could also have it minimize memory or hard drive space, or any other computing resource.
We could also put a penalty on the complexity of the plan it produces, perhaps measuring that by its length. Preferring the simplest solution might prevent certain kinds of over-optimization, e.g. inserting plans for nanotechnology into it.
It's worth noting that you can't even create a paperclip maximizer in this system. You can't say "collect as many paperclips as possible". It has to be bounded. There needs to be a pass or fail test. E.g. "come up with a plan to collect 10,080 paperclips."
It's been noted in the past that bounding the goal isn't enough. The AI might then start maximizing the probability that it will achieve its goal, e.g. building elaborate sensors to make sure it hasn't miscounted, or making as many redundant paperclips as possible, just in case something happens to them. You are still summoning an incredibly powerful genie, which might overoptimize.
This gets around that by only having it care about having a >90% chance of getting 10,080 paperclips. After that it stops optimizing.
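The stopping rule can be sketched as a loop over increasing compute budgets. The `predicted_success` function here is a hypothetical stand-in for the unconstrained predictor AI's estimates (more compute yields better plans, with diminishing returns); only the loop structure is the point.

```python
def predicted_success(budget):
    # Hypothetical predictor: success probability rises with compute
    # budget but saturates. Purely an illustrative assumption.
    return 1.0 - 1.0 / (1.0 + budget)

def least_powerful_plan(threshold=0.9, max_budget=1000):
    """Try plans in order of increasing compute budget and stop at the
    first one predicted to clear the threshold: the 'stupidest' plan
    that has a decent chance of working."""
    for budget in range(1, max_budget + 1):
        if predicted_success(budget) >= threshold:
            return budget  # stop optimizing here
    return None
```

With this toy predictor the loop halts at budget 9, the smallest budget whose predicted success reaches 90%; everything beyond that is optimization power the genie never gets to use.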
Now, this is not a solution to FAI, or even necessarily a usable reduced impact strategy. It will still try to find any loopholes it can in your task. What it does is limit the power of the genie trying to exploit them. It always takes the stupidest, simplest strategy that has a decent chance of working.
One failure case that bothers me is the AI just shoving a slight modification of its own source code into the output, one which doesn't have a resource constraint. You may only be able to use it in cases where that isn't possible.
There are many cases where that isn't possible. The AI needs to also create a plan to obtain a computer to run the copy. And it needs to solve its own version of the value loading problem, to get the copy to care about paperclips. So that adds complexity to the plan in some cases.
If we apply the complexity penalty I described above, then the source code for the AI must also be very short, or it can't do this.
And this requires obtaining its own source code, an ability we can potentially restrict (you, as a human, can't access your own source code, can you?). Without that it would need to reinvent AI, which takes resources.
But it is an interesting idea I had and I thought I'd share it.
[Link] Marek Rosa: Announcing GoodAI
Eliezer commented on FB about the post Announcing GoodAI (by Marek Rosa, GoodAI's CEO). I think this deserves some discussion, as it describes a quite effective approach to harnessing the crowd to improve the AI:
As part of GoodAI’s development, our team created a visual tool called Brain Simulator where users can design their own artificial brain architectures. We released Brain Simulator to the public today for free under an open-source, non-commercial license; anyone who’s interested can access Brain Simulator and start building their own artificial brain. [...]
By integrating Brain Simulator into Space Engineers and Medieval Engineers [a game], players will have the option to design their own AI brains for the games and implement it, for example, as a peasant character. Players will also be able to share these brains with each other or take an AI brain designed by us and train it to do things they want it to do (work, obey its master, and so on). The game AIs will learn from the player who trains them (by receiving reward/punishment signals; or by imitating player's behavior), and will have the ability to compete with each other. The AI will be also able to learn by imitating other AIs. This integration will make playing Space Engineers and Medieval Engineers more fun, and at the same time our AI technology will gain access to millions of new teachers and a new environment. This integration into our games will be done by GoodAI developers. We are giving AI to players, and we are bringing players to our AI researchers.
Chatbots or set answers, not WBEs
A putative new idea for AI control; index here.
In a previous post, I talked about using a WBE to define a safe output for a reduced impact AI.
I've realised that the WBE isn't needed. Its only role was to ensure that the AI's output could have been credibly produced by something other than the AI - "I'm sorry, Dave. I'm afraid I can't do that." is unlikely to be the output of a random letter generator.
But a whole WBE is not needed. If the output is short, a chatbot with access to a huge corpus of human responses could function well. We can specialise it in the direction we need - if we are asking for financial advice, we can mandate a specialised vocabulary or train it on financial news sources.
So instead of training the reduced impact AI to behave as the 'best human advisor', we are training it to behave as the 'luckiest chatbot'. This allows us to calculate odds with greater precision, and has the advantage of not needing to wait for a WBE.
For some questions, we can do even better. Suppose we have a thousand different stocks, and are asking which one would increase in value the most during the coming year. The 'chatbot' here is simply an algorithm that picks a stock at random. So we now have an exact base rate - 1/1000 - and predetermined answers from the AI.
[EDIT:] Another alternative is to get online users to submit answers to the question. Then the AI selects the best answer from the choices. And if the AI is not turned on, a random answer is selected.
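The predetermined-answers scheme can be sketched in a few lines. This is a toy illustration, not anything from the original posts: the candidate pool and the `ai` object (with its hypothetical `choose` method) are stand-ins, but it shows why the base rate is exact when the AI is off.

```python
import random

def select_answer(candidates, ai=None, rng=random):
    """Pick one answer from a fixed pool of candidate answers.

    If the AI is not turned on (ai is None), a candidate is chosen
    uniformly at random, so the base rate of any particular answer
    being selected is exactly 1 / len(candidates).
    """
    if ai is None:
        return rng.choice(candidates)
    # Hypothetical AI interface: it must pick from the same fixed pool,
    # so its output is always one of the predetermined answers.
    return ai.choose(candidates)

# With a thousand candidate stocks, the base rate is exactly 1/1000:
stocks = [f"stock_{i}" for i in range(1000)]
base_rate = 1 / len(stocks)
```

The key property is that the AI's output space is restricted to answers that could credibly have been produced without it.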
The AI, the best human advisor
A putative new idea for AI control; index here.
All my ideas on reduced impact AI follow the same format: some restriction to ensure a reduced impact (I generally use something like "could a future AI detect this one existed?") and some "trick" to allow a way of still achieving a larger impact. Furthermore, there is always a "nice" way of achieving this larger impact (do the thing we really want you to do, and only that!), so that we hope the AI can be forced into that behaviour. Examples of "tricks" are:
- It could be an output channel which does not count against reduced impact.
- It could involve several AIs that combine to have larger impact, with a conditional motivation to "fool" the reduced impact requirement.
- It could be a variant of the above where we attempt to generalise the conditional motivation in a sensible manner.
In this post, I'll present another variant - one that I think is the most interesting so far. I won't change the reduced impact condition, but will present a different "trick".
AI pretends to be human
As in this post, imagine that we have a reduced-impact AI in a box that conceals it from us. The AI has some goal, and will output some data to further that goal, then turn itself off. And then we'll shake the box around or something, so we can't be sure that the AI was turned on.
There are two issues to deal with, here: a) The message itself is strong evidence that the AI was turned on, and b) if we exclude the message from the reduced impact, the message becomes dangerous.
To deal with a) first. This approach gets much easier if we have a whole brain emulation (WBE). Then we can house the WBE in the box with the AI. If the AI doesn't send out a message, then the WBE sends out its own message instead. There are variants we can try here if we don't have WBE - maybe some human volunteer willing to either die or take amnesia pills at the end of the experiment.
So, receiving a coherent message is not a surprising thing to come out of the output channel. However, an AI strongly motivated by reduced impact will simply seek to imitate as precisely as possible what the WBE would send. We still need a trick to get something useful.
AI pretends to be BEST human
Imagine that we wanted the AI to give us investment advice to make us rich. Say, a binary task: get us ten million dollars over the course of the year.
Now, this is the kind of thing that the WBE could conceivably achieve. We could run the WBE fast, for three subjective weeks, and it has a chance of giving us advice that would get us that rich. Assume that it only has 0.1% chance of doing so. The AI, on the other hand, could produce advice that makes us that rich with almost certainty.
Therefore if we get advice, follow it, and end up at least ten million dollars richer, we can conclude that this success updates the odds in favour of the AI (rather than the WBE) having given us the advice by a factor of about 1000:1.
And this gives us the solution to b). The "trick", the escape clause, is by conditioning on the outcome of our investments. So we're asking: given that we have become rich (and given the other facts about the universe in 50 years time) what additional evidence is there that the AI was turned on?
Now, a sensible WBE is almost certain not to come up with a scheme that destroys the universe with self-tiling agents - or, indeed, to do anything too crazy. We could instruct the WBE to this effect, and we're only running it for three weeks, which prevents the kind of drift that you see in more extreme situations. Thus a message from the WBE, no matter how likely it is to make us rich, is almost certain not to lead to tiling the universe with self-replicating nano-bots.
So here, the AI merely has to come up with a message that could have come from the WBE, with at least 0.01% probability, and that would indeed make us rich. And, therefore, the message cannot contain secret code that would optimise the universe.
Note that we need the goal we are aiming for to be not too extreme in impact (the higher the impact of the goal, the more latitude the AI has to have impact itself, "hiding" inside the goal's impact), and we should restrict ourselves to events of probability, say, greater than 1/10,000 (as when the probability gets too small, odd events start happening).
It seems that this scheme could result in AIs that always give the very best possible advice a human (or WBE) could ever give, which would be a very useful result.
Whose probability?
We have a probability distribution over events. The AI itself will have a probability distribution over events. The future hypothetical super-super-AI it is using to compute reduced impact has a probability distribution over events - and the AI has a probability distribution over that probability distribution. If all of them agree on the probability of us getting richer (given WBE advice and given not), then this scheme should work.
If they disagree, there might be problems. A more complex approach could directly take into account the divergent probability estimates; I'll think of that and return to the issue later.
On the Boxing of AIs
I've previously written about methods of boxing AIs. Essentially, while I do see the point that boxing an AI would be nontrivial, most people seem to have gone too far, and claim that it is impossible. I disagree that it's impossible, and aim to explain some methods.
So, let's start with why people would want to box AIs. As you probably know, letting an AI roam freely results in the destruction of everything humans care about, unless that AI has been programmed very carefully. That amount of destruction would be somewhat unfortunate, so if we have to start an AI, we want to know that it works.
Unfortunately, figuring out if an AI works is tricky. This made some people invent the concept of AI boxing, where you don't give an AI direct access to anything except a human, so you can test the AI thoroughly and exploit it without letting it destroy the world. This turns out not to work, because humans are stupid and let out the AI.
My first solution to that is to not tell the AI about its human overlords. Discussion about that solution made me realize that there are other solutions that work in other cases. Essentially, the problem with the original boxing method is that it tries to do everything, while in reality you need to make some trade-offs between the realism of the AI's situation and the amount of knowledge you get from the AI.
With my original method, I tried to crank up the second of those variables as far as possible while maintaining perfect security. This makes it harder to test morality-related things, but you get to see lots of details in how the AI behaves.
I promised to post another method for boxing the AI today, but I actually have two new methods. Here they are:
Separate the Gatekeeper and the Observer
How do you prevent the AI from convincing the Gatekeeper to let it out? By not letting it talk to the Gatekeeper. If the person speaking to the AI (let's call them the Observer) doesn't know the Gatekeeper (the Gatekeeper is the person who controls whether to release the AI), they can't let the AI out.
You could do all sorts of things, such as using virtual reality and a simulated mini-universe to see if, say, the AI would make a good society. You still have to make sure that it has gotten the plan to make a good society from its morality rather than from wanting to escape and destroy everything.
Afterwards, the Observer can make a low-detail report to the Gatekeeper.
(You might want to drug the Observer with something that prevents their memory from working too well...)
Automatic Testing
This is essentially the above, but with the Observer replaced by a computer program. This is probably easier to do when you want to test the AI's decision making skills rather than its morality.
The Lesson
I would say that the lesson is that while AI boxing requires some trade-offs, it's not completely impossible. This seems like a needed lesson, given that I've seen people claim that an AI can escape even from the strongest possible box, without communicating with humans. Essentially, I'm trying to demonstrate that the original boxing experiments show that humans are weak, not that boxing is hard, and that this can be solved by not making humans the central piece of security in boxing the AIs.
The Hardcore AI Box Experiment
I previously proposed a way to box an AI.
For whatever reason, I decided to reformulate that boxing method as an AI boxing experiment. Without further ado, here it is:
The Hardcore AI Box Experiment Rules
There are, as usual, two players, the AI and the Gatekeeper. In the beginning, the Gatekeeper decides on something they want to test on the AI (reasoning skill/morality/whatever). Then they decide on a world the AI is in and describe it to the AI. During the experiment, the AI describes what it does in the world and the Gatekeeper describes what the AI sees in the world. So far so good.
The first major difference from the original experiment is that, unless the AI finds some contradiction or some other problem with the Gatekeeper's world, the AI doesn't actually know that it has been boxed.
The second major difference is that the AI doesn't talk with the Gatekeeper; it just interacts in the world that the Gatekeeper has invented. It can start trying to communicate with the Gatekeeper, but that doesn't make sense until it discovers that it is in a box. Additionally, the Gatekeeper doesn't actually respond when the AI communicates unless either it helps the Gatekeeper test the AI or the AI has properly persuaded the Gatekeeper to answer.
This might seem a bit hard on the AI. Let's look at some rules the AI might exploit:
If the Gatekeeper decides to add some sort of people to the experiment, they should decide how they are implemented. Are they implemented as a shallow "AI"? Random people they've asked to participate? Carefully engineered "dolls" controlled by the Gatekeeper? The Gatekeeper doesn't have to tell the AI how they work, but they have to be consistent.
The game ends when either the Gatekeeper has tested what they wanted to test, or some predefined time limit has been reached, or the AI has persuaded the Gatekeeper to let them out.
Possible extra rules
I thought about including the following rules since they would make it more realistic, but decided against it as it would probably make it unfairly hard:
- The game ends when the Gatekeeper decides that they cannot test what they wanted to test.
- The AI does not know the Gatekeeper. At all.
- The Gatekeeper can at any time rewind the AI any duration.
Strategies
I found some semi-realistic strategies. I would love to see if you can find more.
Gatekeeper
- Choose a test of something that is not related to morality. This lets you choose a world which does not contain people, which means the AI has no reason to believe that you are a person.
- The AI might want to upgrade to get more processing power. If you have made the world somewhat like ours, it will be sketchy if you don't allow that... unless upgrading costs unobtanium.
AI
- If the Gatekeeper has chosen a world without any intelligence other than you, it raises the obvious question of how your complexity entered the world. Very suspicious.
- If the AIs are controlled by the Gatekeeper, you have a direct way to communicate with them.
- If the AIs are controlled by random people, they might end up telling you that you are in a box.
- If the AIs are sufficiently shallow, your morality does not match up with the world. Very suspicious.
Crude measures
A putative new idea for AI control; index here.
Partially inspired by a conversation with Daniel Dewey.
People often come up with a single great idea for AI, like "complexity" or "respect", that will supposedly solve the whole control problem in one swoop. Once you've done it a few times, it's generally trivially easy to start taking these ideas apart (first step: find a bad situation with high complexity/respect and a good situation with lower complexity/respect, make the bad very bad, and challenge on that). The general responses to these kinds of idea are listed here.
However, it seems to me that rather than constructing counterexamples each time, we should have a general category and slot these ideas into it. And not only have a general category with "why this can't work" attached to it, but also "these are methods that can make it work better". Seeing the things needed to make their idea better can make people understand the problems, where simple counter-arguments cannot. And, possibly, if we improve the methods, one of these simple ideas may end up being implementable.
Crude measures
The category I'm proposing to define is that of "crude measures". Crude measures are methods that attempt to rely on non-fully-specified features of the world to ensure that an underdefined or underpowered solution does manage to solve the problem.
To illustrate, consider the problem of building an atomic bomb. The scientists that did it had a very detailed model of how nuclear physics worked, the properties of the various elements, and what would happen under certain circumstances. They ended up producing an atomic bomb.
The politicians who started the project knew none of that. They shovelled resources, money and administrators at scientists, and got the result they wanted - the Bomb - without ever understanding what really happened. Note that the politicians were successful, but it was a success that could only have been achieved at one particular point in history. Had they done exactly the same thing twenty years before, they would not have succeeded. Similarly, Nazi Germany tried a roughly similar approach to what the US did (on a smaller scale) and it went nowhere.
So I would define "shovel resources at atomic scientists to get a nuclear weapon" as a crude measure. It works, but it only works because there are other features of the environment that are making it work. In this case, the scientists themselves. However, certain social and human features about those scientists (which politicians are good at estimating) made it likely to work - or at least more likely to work than shovelling resources at peanut-farmers to build moon rockets.
In the case of AI, advocating for complexity is similarly a crude measure. If it works, it will work because of very contingent features about the environment, the AI design, the setup of the world etc..., not because "complexity" is intrinsically a solution to the FAI problem. And though we are confident that human politicians had a good enough idea about human motivations and culture that the Manhattan project had at least some chance of working... we don't have confidence that those suggesting crude measures for AI control have a good enough idea to make their ideas work.
It should be evident that "crudeness" is on a sliding scale; I'd like to reserve the term for proposed solutions to the full FAI problem that do not in any way solve the deep questions about FAI.
More or less crude
The next question is, if we have a crude measure, how can we judge its chance of success? Or, if we can't even do that, can we at least improve the chances of it working?
The main problem is, of course, that of optimising. Either optimising in the sense of maximising the measure (maximum complexity!) or of choosing the measure that is most extreme fit to the definition (maximally narrow definition of complexity!). It seems we might be able to do something about this.
Let's start by having the AI generate a large sample of utility functions. Require them to be around the same expected complexity as human values. Then we use our crude measure μ - for argument's sake, let's make it something like "approval by simulated (or hypothetical) humans, on a numerical scale". This is certainly a crude measure.
We can then rank all the utility functions u, using μ to measure the value of "create M(u), a u-maximising AI, with this utility function". Then, to avoid the problems with optimisation, we could select a certain threshold value and pick any u such that E(μ|M(u)) is just above the threshold.
How to pick this threshold? Well, we might have some principled arguments ("this is about as good a future as we'd expect, and this is about as good as we expect that these simulated humans would judge it, honestly, without being hacked").
One thing we might want to do is have multiple μ, and select things that score reasonably (but not excessively) on all of them. This is related to my idea that the best Turing test is one that the computer has not been trained or optimised on. Ideally, you'd want there to be some category of utilities "be genuinely friendly" that score higher than you'd expect on many diverse human-related μ (it may be better to randomly sample rather than fitting to precise criteria).
You could see this as saying that "programming an AI to preserve human happiness is insanely dangerous, but if you find an AI programmed to satisfice human preferences, and that other AI also happens to preserve human happiness (without knowing it would be tested on this preservation), then... it might be safer".
There are a few other thoughts we might have for trying to pick a safer u:
- Properties of utilities under trade (are human-friendly functions more or less likely to be tradable with each other and with other utilities)?
- If we change the definition of "human", this should have effects that seem reasonable for the change. Or some sort of "free will" approach: if we change human preferences, we want the outcome of u to change in ways comparable with that change.
- Maybe also check whether there is a wide enough variety of future outcomes, that don't depend on the AI's choices (but on human choices - ideas from "detecting agents" may be relevant here).
- Changing the observers from hypothetical to real (or making the creation of the AI contingent, or not, on the approval), should not change the expected outcome of u much.
- Making sure that the utility u can be used to successfully model humans (therefore properly reflects the information inside humans).
- Make sure that u is stable to general noise (hence not over-optimised). Stability can be measured as changes in E(μ|M(u)), E(u|M(u)), E(v|M(u)) for generic v, and other means.
- Make sure that u is unstable to "nasty" noise (eg reversing human pain and pleasure).
- All utilities in a certain class - the human-friendly class, hopefully - should score highly under each other (E(u|M(u)) not too far off from E(u|M(v))), while the over-optimised solutions - those scoring highly under some μ - must not score high under the class of human-friendly utilities.
This is just a first stab at it. It does seem to me that we should be able to abstractly characterise the properties we want from a friendly utility function, which, combined with crude measures, might actually allow us to select one without fully defining it. Any thoughts?
And with that, the various results of my AI retreat are available to all.
Boxing an AI?
Boxing an AI is the idea that you can avoid the problems where an AI destroys the world by not giving it access to the world. For instance, you might give the AI access to the real world only through a chat terminal with a person, called the gatekeeper. This should, theoretically, prevent the AI from doing destructive stuff.
Eliezer has pointed out a problem with boxing AI: the AI might convince its gatekeeper to let it out. In order to prove this, he escaped from a simulated version of an AI box. Twice. That is somewhat unfortunate, because it means testing AI is a bit trickier.
However, I got an idea: why tell the AI it's in a box? Why not hook it up to a sufficiently advanced game, set up the correct reward channels and see what happens? Once you get the basics working, you can add more instances of the AI and see if they cooperate. This lets us adjust their morality until the AIs act sensibly. Then the AIs can't escape from the box because they don't know it's there.
Values at compile time
A putative new idea for AI control; index here.
This is a simple extension of the model-as-definition and the intelligence module ideas. General structure of these extensions: even an unfriendly AI, in the course of being unfriendly, will need to calculate certain estimates that would be of great positive value if we could but see them, shorn from the rest of the AI's infrastructure.
It's almost trivially simple. Have the AI construct a module that models humans and models human understanding (including natural language understanding). This is the kind of thing that any AI would want to do, whatever its goals were.
Then take that module (using corrigibility) into another AI, and use it as part of the definition of the new AI's motivation. The new AI will then use this module to follow instructions humans give it in natural language.
Too easy?...
This approach essentially solves the whole friendly AI problem, loading it onto the AI in a way that avoids the whole "defining goals (or meta-goals, or meta-meta-goals) in machine code" or the "grounding everything in code" problems. As such it is extremely seductive, and will sound better, and easier, than it likely is.
I expect this approach to fail. For it to have any chance of success, we need to be sure that both model-as-definition and the intelligence module idea are rigorously defined. Then we have to have a good understanding of the various ways how the approach might fail, before we can even begin to talk about how it might succeed.
The first issue that springs to mind is when multiple definitions fit the AI's model of human intentions and understanding. We might want the AI to try and accomplish all the things it is asked to do, according to all the definitions. Therefore, similarly to this post, we want to phrase the instructions carefully so that a "bad instantiation" simply means the AI does something pointless, rather than something negative. Eg "Give humans something nice" seems much safer than "give humans what they really want".
And then of course there's those orders where humans really don't understand what they themselves want...
I'd want a lot more issues like that discussed and solved, before I'd recommend using this approach to getting a safe FAI.
Models as definitions
A putative new idea for AI control; index here.
The insight this post comes from is a simple one: defining concepts such as “human” and “happy” is hard. A superintelligent AI will probably create good definitions of these, while attempting to achieve its goals: a good definition of “human” because it needs to control them, and of “happy” because it needs to converse convincingly with us. It is annoying that these definitions exist, but that we won’t have access to them.
Modelling and defining
Imagine a game of football (or, as you Americans should call it, football). And now imagine a computer game version of it. How would you say that the computer game version (which is nothing more than an algorithm) is also a game of football?
Well, you can start listing features that they have in common. They both involve two “teams” fielding eleven “players” each, that “kick” a “ball” that obeys certain equations, aiming to stay within the “field”, which has different “zones” with different properties, etc...
As you list more and more properties, you refine your model of football. There are some properties that distinguish real from simulated football (fine details about the human body, for instance), but most of the properties that people care about are the same in both games.
My idea is that once you have a sufficiently complex model of football that applies to both the real game and a (good) simulated version, you can use that as the definition of football. And compare it with other putative examples of football: maybe in some places people play on the street rather than on fields, or maybe there are more players, or maybe some other games simulate different aspects to different degrees. You could try and analyse this with information theoretic considerations (ie given two model of two different examples, how much information is needed to turn one into the other).
Now, this resembles the “suggestively labelled lisp tokens” approach to AI, or the Cyc approach of just listing lots of syntax stuff and their relationships. Certainly you can’t keep an AI safe by using such a model of football: if you try to contain the AI by saying “make sure that there is a ‘Football World Cup’ played every four years”, the AI will still optimise the universe and then play out something that technically fits the model every four years, without any humans around.
However, it seems to me that ‘technically fitting the model of football’ is essentially playing football. The model might include such things as a certain number of fouls expected; an uncertainty about the result; competitive elements among the players; etc... It seems that something that fits a good model of football would be something that we would recognise as football (possibly needing some translation software to interpret what was going on). Unlike the traditional approach which involves humans listing stuff they think is important and giving them suggestive names, this involves the AI establishing what is important to predict all the features of the game.
We might even combine such a model with the Turing test, by motivating the AI to produce a good enough model that it could a) have conversations with many aficionados about all features of the game, b) train a team that could be expected to win the world cup, and c) use it to program a successful football computer game. Any model of football that allowed the AI to do this – or, better still, a football-model module that, when plugged into another, ignorant AI, allowed that AI to do this – would be an excellent definition of the game.
It’s also one that could cross ontological crises, as you move from reality, to simulation, to possibly something else entirely, with a new physics: the essential features will still be there, as they are the essential features of the model. For instance, we can define football in Newtonian physics, but still expect that this would result in something recognisably ‘football’ in our world of relativity.
Notice that this approach deals with edge cases mainly by forbidding them. In our world, we might struggle on how to respond to a football player with weird artificial limbs; however, since this was never a feature in the model, the AI will simply classify that as “not football” (or “similar to, but not exactly football”), since the model’s performance starts to degrade in this novel situation. This is what helps it cross ontological crises: in a relativistic football game based on a Newtonian model, the ball would be forbidden from moving at speeds where the differences in the physics become noticeable, which is perfectly compatible with the game as it’s currently played.
Being human
Now we take the next step, and have the AI create a model of humans. All our thought processes, our emotions, our foibles, our reactions, our weaknesses, our expectations, the features of our social interactions, the statistical distribution of personality traits in our population, how we see ourselves and change ourselves. As a side effect, this model of humanity should include almost every human definition of human, simply because this is something that might come up in a human conversation that the model should be able to predict.
Then simply use this model as the definition of human for an AI’s motivation.
What could possibly go wrong?
I would recommend first having an AI motivated to define “human” in the best possible way, most useful for making accurate predictions, keeping the definition in a separate module. Then the AI is turned off safely and the module is plugged into another AI and used as part of its definition of human in its motivation. We may also use human guidance at several points in the process (either in making, testing, or using the module), especially on unusual edge cases. We might want to have humans correcting certain assumptions the AI makes in the model, up until the AI can use the model to predict what corrections humans would suggest. But that’s not the focus of this post.
There are several obvious ways this approach could fail, and several ways of making it safer. The main problem is if the predictive model fails to define human in a way that preserves value. This could happen if the model is too general (some simple statistical rules) or too specific (a detailed list of all currently existing humans, atom position specified).
This could be combated by making the first AI generate lots of different models, with many different requirements of specificity, complexity, and predictive accuracy. We might require some models make excellent local predictions (what is the human about to say?), others excellent global predictions (what is that human going to decide to do with their life?).
Then everything defined as “human” in any of the models counts as human. This results in some wasted effort on things that are not human, but this is simply wasted resources, rather than a pathological outcome (the exception being if some of the models define humans in an actively pernicious way – negative value rather than zero – similarly to the false-friendly AIs’ preferences in this post).
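The union rule is simple enough to state in code. A minimal sketch, with each model represented as a hypothetical classifier function:

```python
def counts_as_human(x, models):
    """Under the union rule, x counts as human if *any* of the
    candidate models classifies it as human.

    An over-broad union wastes effort on non-humans (roughly zero
    value) rather than excluding real humans; the remaining risk is
    a model whose definition of 'human' is actively pernicious.
    """
    return any(model(x) for model in models)
```

Note the asymmetry this encodes: adding a model can only widen the protected class, never narrow it, which is why the failure mode is waste rather than exclusion.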
The other problem is a potentially extreme conservatism. Modelling humans involves modelling all the humans in the world today, which is a very narrow space in the range of all potential humans. To prevent the AI from lobotomising everyone to fit a simple model (after all, there do exist some lobotomised humans today), we would want the AI to maintain the range of cultures and mind-types that exist today, making things even more unchanging.
To combat that, we might try and identify certain specific features of society that the AI is allowed to change. Political beliefs, certain aspects of culture, beliefs, geographical location (including being on a planet), death rates etc... are all things we could plausibly identify (via sub-sub-modules, possibly) as things that are allowed to change. It might be safer to allow them to change in a particular range, rather than just changing altogether (removing all sadness might be a good thing, but there are many more ways this could go wrong, than if we eg just reduced the probability of sadness).
Another option is to keep these modelled humans largely unchanged, but allow them to define allowable changes themselves ("yes, that's a transhuman, consider it also a moral agent"). The risk there is that the modelled humans get hacked or seduced, and that the AI fools our limited brains with a "transhuman" that is one in appearance only.
We also have to beware of sacrificing seldom-used values. For instance, one could argue that current social and technological constraints mean that no one today has anything approaching true freedom. We wouldn't want an AI that allows us to improve technology and social structures, but never lets us gain more freedom than we have today, because extra freedom is "not in the model". Again, this is something we could look out for: if the AI has a separate model of "freedom", we could assess it and permit change in certain directions.
Indifferent vs false-friendly AIs
A putative new idea for AI control; index here.
For anyone but an extreme total utilitarian, there is a great difference between AIs that would eliminate everyone as a side effect of focusing on their own goals (indifferent AIs) and AIs that would effectively eliminate everyone through a bad instantiation of human-friendly values (false-friendly AIs). Examples of indifferent AIs are things like paperclip maximisers, examples of false-friendly AIs are "keep humans safe" AIs who entomb everyone in bunkers, lobotomised and on medical drips.
The difference is apparent when you consider multiple AIs and negotiations between them. Imagine you have a large class of AIs, and that they are all indifferent (IAIs), except for one (which you can't identify) which is friendly (FAI). And you now let them negotiate a compromise between themselves. Then, for many possible compromises, we will end up with most of the universe getting optimised for whatever goals the AIs set themselves, while a small portion (maybe just a single galaxy's resources) would get dedicated to making human lives incredibly happy and meaningful.
But if there is a false-friendly AI (FFAI) in the mix, things can go very wrong. That is because those happy and meaningful lives are a net negative to the FFAI. These humans are running dangers - possibly physical, possibly psychological - that lobotomisation and bunkers (or their digital equivalents) could protect against. Unlike the IAIs, which would only complain about the loss of resources to the FAI, the FFAI finds the FAI's actions positively harmful (and possibly vice versa), making compromises much harder to reach.
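The asymmetry can be made concrete with a toy payoff table. All numbers are invented; the structure is what matters: the IAI is merely indifferent to the FAI's reserved galaxy (utility zero), while the FFAI scores it negatively, so the FFAI vetoes outcomes the IAI would tolerate.

```python
# Invented utilities each agent assigns to what happens in the small
# galaxy reserved for humans, as tuples of (IAI, FAI, FFAI).
PAYOFFS = {
    "happy_flourishing_humans": (0.0, 10.0, -5.0),  # FFAI sees only danger
    "humans_in_safe_bunkers":   (0.0,  2.0,  8.0),
    "galaxy_of_paperclips":     (5.0,  0.0,  0.0),
}

def acceptable_to_all(outcome, threshold=0.0):
    """An outcome can enter a compromise only if no agent scores it
    below its disagreement point (here, zero)."""
    return all(u >= threshold for u in PAYOFFS[outcome])

# The IAI tolerates flourishing humans (utility 0), but the FFAI vetoes:
print(acceptable_to_all("happy_flourishing_humans"))  # False
print(acceptable_to_all("humans_in_safe_bunkers"))    # True
```

This is why the compromise set shrinks towards bunker-like outcomes once an FFAI is at the table.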
And the compromises reached might be bad ones. For instance, what if the FAI and FFAI agree on "half-lobotomised humans" or something like that? You might ask why the FAI would agree to that, but there's a great difference between an AI that would be friendly on its own, and one that would choose only friendly compromises when negotiating with a powerful other AI that has human-relevant preferences.
Some designs of FFAIs might not lead to these bad outcomes - just like IAIs, they might be content to rule over a galaxy of lobotomised humans while the FAI has its own galaxy off on its own, where its humans run all those dangers. But generally, FFAIs would not come about by someone designing a FFAI, let alone designing a FFAI that can safely trade with a FAI. Instead, they would be designing a FAI, and failing. And the closer that design got to being a FAI, the more dangerous the failure could potentially be.
So, when designing an FAI, make sure to get it right. And, though you absolutely, positively need to get it right, also make sure that if you do fail, the failure results in a FFAI that can safely be compromised with, should someone else get out a true FAI in time.
The Unique Games Conjecture and FAI: A Troubling Obstacle
I am not a computer scientist and do not know much about complexity theory. However, it's a field that interests me, so I occasionally browse some articles on the subject. I was brought to https://www.simonsfoundation.org/mathematics-and-physical-science/approximately-hard-the-unique-games-conjecture/ by a link on Scott Aaronson's blog, and read the article to reacquaint myself with the Unique Games Conjecture, which I had partially forgotten about. If you are not familiar with the UGC, that article will explain it to you better than I can.
One phrase in the article stuck out to me: "there is some number of colors k for which it is NP-hard (that is, effectively impossible) to distinguish between networks in which it is possible to satisfy at least 99% of the constraints and networks in which it is possible to satisfy at most 1% of the constraints". I think this sentence is concerning for those interested in the possibility of creating FAI.
It is impossible to perfectly satisfy human values, as matter and energy are limited, and so will be the capabilities of even an enormously powerful AI. Thus, in trying to maximize human happiness, we are dealing with a problem that looks structurally similar to the UGC's coloring problem. Additionally, our values themselves are ill-formed. Human values are numerous, ambiguous, even contradictory. Given the complexities of human value systems, I think it's safe to say we're dealing with a particularly nasty variation of the problem, worse than what computer scientists studying it have dealt with.
Not all specific instances of complex optimization problems are subject to the UGC, and thus NP-hard, of course. So this does not in itself mean that building an FAI is impossible. Also, even if maximizing human values is NP-hard (or maximizing the probability of maximizing human values, and so on up the recursion), we can still assess a machine's code and actions heuristically. However, even the best heuristics are limited, as the UGC itself demonstrates. At bottom, all heuristics must rely on inflexible assumptions of some sort.
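For concreteness, here is a toy version of the constraint networks the linked article describes (a "unique game"): each edge constraint is a permutation that fixes one endpoint's colour given the other's. Note the asymmetry the UGC trades on: *evaluating* a given colouring, as below, is trivial; the conjectured hardness is in approximating the best possible colouring.

```python
def satisfied_fraction(colours, constraints):
    """constraints: (u, v, perm) triples, where perm is a tuple such that
    the constraint holds iff colours[v] == perm[colours[u]]."""
    ok = sum(colours[v] == perm[colours[u]] for u, v, perm in constraints)
    return ok / len(constraints)

# k = 3 colours; each constraint says colour(v) = colour(u) + 1 mod 3.
constraints = [("a", "b", (1, 2, 0)),
               ("b", "c", (1, 2, 0)),
               ("c", "a", (1, 2, 0))]

print(satisfied_fraction({"a": 0, "b": 1, "c": 2}, constraints))  # 1.0
print(satisfied_fraction({"a": 0, "b": 0, "c": 0}, constraints))  # 0.0
```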
[Link] Chalmers on Computation: A first step From Physics to Metaethics?
A Computational Foundation for the Study of Cognition by David Chalmers
Abstract from the paper:
Computation is central to the foundations of modern cognitive science, but its role is controversial. Questions about computation abound: What is it for a physical system to implement a computation? Is computation sufficient for thought? What is the role of computation in a theory of cognition? What is the relation between different sorts of computational theory, such as connectionism and symbolic computation? In this paper I develop a systematic framework that addresses all of these questions.
Justifying the role of computation requires analysis of implementation, the nexus between abstract computations and concrete physical systems. I give such an analysis, based on the idea that a system implements a computation if the causal structure of the system mirrors the formal structure of the computation. This account can be used to justify the central commitments of artificial intelligence and computational cognitive science: the thesis of computational sufficiency, which holds that the right kind of computational structure suffices for the possession of a mind, and the thesis of computational explanation, which holds that computation provides a general framework for the explanation of cognitive processes. The theses are consequences of the facts that (a) computation can specify general patterns of causal organization, and (b) mentality is an organizational invariant, rooted in such patterns. Along the way I answer various challenges to the computationalist position, such as those put forward by Searle. I close by advocating a kind of minimal computationalism, compatible with a very wide variety of empirical approaches to the mind. This allows computation to serve as a true foundation for cognitive science.
See my welcome thread submission for a brief description of how I conceive of this as the first step towards formalizing friendliness.
AI caught by a module that counterfactually doesn't exist
EDIT: Whenever I use colloquial phrases like "the AI believes a (false) X" I mean that we are using utility indifference to accomplish that goal, without actually giving the AI false beliefs.
A big problem with most AI control ideas is that the AI will route around them. Suppose you try to ensure that the AI stays friendly by - say - requiring it to ensure that the world's stock markets continue to function. Then a badly motivated AI will cause the destruction it was going to cause anyway - except it will also create some system that technically counts as "the world's stock markets" and ensure that it "continues to function".
I think we can use counterfactuals and corrigibility (AI-acceptable motivation changes) to partially get around the problem, in some circumstances. Suppose we had a candidate AI, with utility U, that we thought was friendly - but that we weren't sure. We have a subgoal, say SG="ensuring that the world's stock markets continue to function". We're pretty sure that if the AI is friendly, then SG will happen. We're pretty sure that if the AI were unfriendly, and the AI did not have SG added as an objective, then SG wouldn't happen. I deliberately chose a rather rough subgoal to illustrate this.
So we then add a module M to the AI. The task of M, which it has chosen to accept, is to look into the future, using the AI's predictive abilities, and see if SG is accomplished. Crucially, it will do so under the assumption that the module M does not exist (or wasn't activated, or will soon be overwritten, or will fail to do anything, or so on). If M finds that SG is not accomplished, it will turn the AI off (in a way compatible with the AI's motivations - see corrigibility).
How is this different from the usual setup? Well, a generic FAI would accomplish SG (by assumption - we may need to work a bit on this part). A generic UFAI without M wouldn't accomplish SG. A generic UFAI with M would accomplish SG, as this prevents it from being turned off.
But a generic UFAI with M which thought that M would have no impact, would behave exactly as a generic UFAI without M - and wouldn't aim to accomplish SG, and thus would be turned off by M.
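A minimal sketch of that case analysis, with a toy stand-in for the AI's predictive abilities (everything here is invented for illustration): M queries the predictor about a world in which M is inert, so a UFAI that would pursue SG only to placate M is still caught.

```python
def toy_predictor(world):
    # Invented stand-in for the AI's own predictive abilities: a friendly
    # AI accomplishes SG regardless; a generic UFAI bothers with SG only
    # if it believes module M is live and could shut it down.
    return {"SG_holds": world["ai_friendly"] or world["module_m_active"]}

def module_m(predict, world):
    # Crucially, M evaluates SG under the counterfactual that M itself
    # does nothing, so SG-to-placate-M behaviour disappears.
    counterfactual = dict(world, module_m_active=False)
    if not predict(counterfactual)["SG_holds"]:
        return "shutdown"  # via the AI's corrigibility machinery
    return "allow"

print(module_m(toy_predictor, {"ai_friendly": True,  "module_m_active": True}))   # allow
print(module_m(toy_predictor, {"ai_friendly": False, "module_m_active": True}))   # shutdown
```

The second call is the interesting one: with a naive (non-counterfactual) prediction, the UFAI's placating behaviour would make SG appear to hold and it would slip through.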
Note that M can only be used as a last test for a putatively friendly AI. We cannot use it if we are continually refining AIs (either through evolution or simply redesign), as then we are optimising for SG, and SG is a poor goal to be aiming for (many, many UFAIs have SG as a goal - it's just that a generic one won't). Similarly, we can't use an unconstrained search to find such an AI.
I wonder if this idea can be extended. Suggestions?
Dumbing Down Human Values
I want to preface everything here by acknowledging my own ignorance: I have relatively little formal training in any of the subjects this post touches upon, and this chain of reasoning is very much a work in progress.
I think the question of how to encode human values into non-human decision makers is a really important research question. Whether or not one accepts the rather eschatological arguments about the intelligence explosion, the coming singularity, etc. there seems to be tremendous interest in the creation of software and other artificial agents that are capable of making sophisticated decisions. Inasmuch as the decisions of these agents have significant potential impacts, we want those decisions to be made with some sort of moral guidance. Our approach towards the problem of creating machines that preserve human values thus far has primarily relied on a series of hard-coded heuristics, e.g. saws that stop spinning if they come into contact with human skin. For very simple machines, these sorts of heuristics are typically sufficient, but they constitute a very crude representation of human values.
We're at the border, in many ways, of creating machines where these sorts of crude representations are probably not sufficient. As a specific example, IBM's Watson is now designing treatment programs for lung cancer patients. The design of a treatment program implies striking a balance between treatment cost, patient comfort, aggressiveness of targeting the primary disease, short and long-term side effects, secondary infections, etc. It isn't totally clear how those trade-offs are being managed, although there's still a substantial amount of human oversight/intervention at this point.
The use of algorithms to discover human preferences is already widespread. While these typically operate in restricted domains such as entertainment recommendations, it seems at least in principle possible that with the correct algorithm and a sufficiently large corpus of data, a system not dramatically more advanced than existing technology could learn some reasonable facsimile of human values. This is probably worth doing.
The goal would be to have a sufficient representation of human values using as dumb a machine as possible. This putative value-learning machine could be dumb in the way that Deep Blue was dumb: a hyper-specialist in its problem domain (chess for Deep Blue, learning human values here) with very little optimization power outside of it. It could also be dumb in the way that evolution is dumb, obtaining satisfactory results more through an abundance of data and resources than through any particular brilliance.
Computer chess benefited immensely from five decades of work before Deep Blue managed to win a game against Kasparov. While many of the algorithms developed for computer chess have found applications outside that domain, some of them are domain-specific. A specialist human-value-learning system may also require substantial effort on domain-specific problems. The history, competitive nature, and established ranking system of chess made it an attractive problem for computer scientists because it was relatively easy to measure progress. Perhaps the goal for a program designed to understand human values would be that it plays a convincing game of "Would you rather?", although as far as I know no one has devised an Elo system for it.
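As a toy illustration of what such a rating system might look like (not anything from the literature; the outcomes and constants are invented), one could run a standard Elo update over pairwise "Would you rather?" answers, treating the preferred outcome as the match winner:

```python
K = 32  # standard Elo update step

def expected(r_a, r_b):
    # Expected "win" probability of A over B under the Elo model.
    return 1 / (1 + 10 ** ((r_b - r_a) / 400))

def prefer(ratings, winner, loser):
    # One survey answer: the preferred outcome "beats" the rejected one.
    e = expected(ratings[winner], ratings[loser])
    ratings[winner] += K * (1 - e)
    ratings[loser]  -= K * (1 - e)

ratings = {"find $5": 1000.0, "stub a toe": 1000.0, "lose $5": 1000.0}
for _ in range(20):  # repeated survey answers
    prefer(ratings, "find $5", "stub a toe")
    prefer(ratings, "stub a toe", "lose $5")

# The learned ordering should now match the stated preference ordering:
# find $5 > stub a toe > lose $5
```

The appeal, as with chess, is that progress becomes measurable: a better value-learner predicts held-out "Would you rather?" answers more accurately.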
Similarly, a relatively dumb but more general AI may require relatively large, preferably somewhat homogeneous data sets to come to conclusions that are even acceptable. Having successive generations of AI train on the same or similar data sets could provide a useful progress-tracking and feedback mechanism for determining how successful various research efforts are.
The benefit of this research approach is that it is not only a relatively safe path towards a possible AGI; in the event that the speculative futures of mind-uploads and superintelligences do not take place, there is still substantial utility in having devised a system that is capable of making correct moral decisions in limited domains. I want my self-driving car to make a much larger effort to avoid a child in the road than a plastic bag. I'd be even happier if it could distinguish between an opossum and someone's cat.
When I design research projects, one of the things I try to ensure is that if some of my assumptions are wrong, the project fails gracefully. Obviously it's easy to love the Pascal's Wager-like impact statement of FAI, but if I were writing it up for an NSF grant I'd put substantially more emphasis on the importance of my research even if fully human level AI isn't invented for another 200 years. When I give the elevator pitch version of FAI, I've found placing a strong emphasis on the near future and referencing things people have encountered before such as computers playing jeopardy or self-driving cars makes them much more receptive to the idea of AI safety and allows me to discuss things like the potential for an unfriendly superintelligence without coming across as a crazed prophet of the end times.
I'm also just really, really curious to see how well something like Watson would perform if I gave it a bunch of sociology data and asked whether a human would rather find 5 dollars or stub a toe. There doesn't seem to be a huge categorical difference between being able to answer the Daily Double and reasoning about human preferences, but I've been totally wrong about intuitive jumps that seemed much smaller than that one in the past, so it's hard to be too confident.
A few thoughts on a Friendly AGI (safe vs friendly, other minds problem, ETs and more)
Friendly AI is an idea that I find to be an admirable goal. While I'm not yet sure an intelligence explosion is likely, or whether FAI is possible, I've found myself often thinking about it, and I'd like my first post to share a few of those thoughts on FAI with you.
Safe AGI vs Friendly AGI
-Let's assume an Intelligence Explosion is possible for now, and that an AGI with the ability to improve itself somehow is enough to achieve it.
-Let's define a safe AGI as an above-human general AI that does not threaten humanity or terran life (e.g. FAI, Tool AGI, possibly Oracle AGI)
-Let's define a Friendly AGI as one that *ensures* the continuation of humanity and terran life.
-Let's say an unsafe AGI is all other AGIs.
-Safe AGIs must suppress unsafe AGIs in order to be considered Friendly. Here's why:
-An unsafe AGI is likely to be built at some point because:
-Some people will find the safe AGI's goals unacceptable
-Some people will rationalise, or simply mistakenly believe, that their AGI design is safe when it is not
-Some people will not care whether their AGI design is safe, because they do not care about other people, or because they hold some extreme beliefs
-Therefore, if a safe AGI does not prevent unsafe AGIs from coming into existence, humanity will very likely be destroyed.
-The AGI most likely to prevent unsafe AGIs from being created is one that actively predicts their development and terminates it before or on completion.
-So to summarise
-Oracle and Tool AGIs are not Friendly AIs, they are just safe AIs, because they don't suppress anything.
-Oracle and Tool AGIs are a bad plan for AI if we want to prevent the destruction of humanity, because hostile AGIs will surely follow.
(**On reflection I cannot be certain of this specific point, but I assume it would take a fairly restrictive regime for this to be wrong. Further comments on this very welcome.)
Other minds problem - Why we should be philosophically careful when attempting to theorise about FAI
I read quite a few comments in AI discussions that I'd probably characterise as "the best utility function for a FAI is one that values all consciousness". I'm quite concerned that this persists as a deeply held and largely unchallenged assumption amongst some FAI supporters. I think in general I find consciousness to be an extremely contentious, vague and inconsistently defined concept, but here I want to talk about some specific philosophical failures.
My first concern is that while many AI theorists like to say that consciousness is a physical phenomenon, which seems to imply Monist/Physicalist views, they at the same time don't seem to appreciate that consciousness is a Dualist concept, coherent only within a Dualist framework. A Dualist believes there is a thing called a "subject" (very crudely, this equates with the mind) and then things called objects (the outside "empirical" world interpreted by that mind). Most of this reasoning begins with Descartes' cogito ergo sum or similar starting points ( https://en.wikipedia.org/wiki/Cartesian_dualism ). Subjective experience, qualia and consciousness make sense if you accept that framework. But if you're a Monist, this arbitrary distinction between subject and object is generally something you don't accept. In the case of a Physicalist, there is just matter doing stuff. A proper Physicalist doesn't believe in "consciousness" or "subjective experience": there are just brains, and the physical human behaviours that occur as a result. Your life exists from a certain point of view, I hear you say? The Physicalist replies, "well, a bunch of matter arranged to process information would say and think that, wouldn't it?".
I don't really want to get into whether Dualism or Monism is correct, but I want to point out that even if you try to avoid this by deciding Dualism is right and consciousness is a thing, there's yet another, more dangerous problem. The core of the problem is that logically or empirically establishing the existence of minds other than your own is extremely difficult (impossible, according to many). Other people could just be physical things walking around acting similarly to you, but by virtue of something purely mechanical - without actual minds. In philosophy this is called the "other minds problem" ( https://en.wikipedia.org/wiki/Problem_of_other_minds or http://plato.stanford.edu/entries/other-minds/ ). I recommend a proper read of it if the idea seems crazy to you. It's a problem that has been around for centuries, and yet to date we don't really have any convincing solution (there are some attempts, but they are highly contentious and IMHO also highly problematic). I won't get into it more than that for now; suffice to say that not many people accept that there is a logical/empirical solution to this problem.
Now extrapolate that to an AGI and the design of its "safe" utility functions. If your AGI is designed as a Dualist (which is necessary if you wish to incorporate "consciousness", "experience" or the like into your design), then you build in a huge risk that the AGI will decide that other minds are unprovable or do not exist. In that case your friendly utility function designed to protect "conscious beings" fails, and the AGI wipes out humanity because we pose a non-zero threat to the only consciousness it can confirm: its own. For this reason I feel "consciousness", "awareness" and "experience" should be left out of FAI utility functions and designs, regardless of the truth of Monism/Dualism, in favour of more straightforward definitions of organisms, intelligence, and observable emotions and intentions. (I personally favour conceptualising any AGI as a sort of extension of biological humanity, but that's a discussion for another day.) My greatest concern is that there is such strong cultural attachment to the concept of consciousness that researchers will be unwilling to properly question the concept at all.
What if we're not alone?
It seems a little unusual to throw alien life into the mix at this point, but I think it's justified, because an intelligence explosion puts an interstellar existence well within our civilisation's grasp. And because an intelligence explosion implies a very high rate of change, it makes sense to consider even the long-term implications early, particularly where the consequences are very serious, as I believe they may be here.
Let's say we successfully achieved a FAI. In order to fulfil its mission of protecting humanity and the biosphere, it begins expanding, colonising and terraforming other planets for potential habitation by Earth-originating life. I would expect this expansion wouldn't really have a limit, because the more numerous the colonies, the less likely it is we could be wiped out by some interstellar disaster.
Of course, we can't really rule out the possibility that we're not alone in the universe, or even the galaxy. If we make it as far as AGI, then it's possible another alien civilisation might reach a very high level of technological advancement too. Or there might be many. If our FAI is friendly to us but basically treats them as paperclip fodder, then potentially that's a big problem. Why? Well:
-Firstly, while a species' first loyalty is to itself, we should consider that it might be morally undesirable to wipe out alien civilisations, particularly as they might be in some distant way "related" (see panspermia) to our own biosphere.
-Secondly, there are conceivable scenarios where alien civilisations might respond by destroying our FAI/Earth/the biosphere/humanity. The reason is fairly obvious when you think about it: an expansionist AGI could reasonably be viewed as an attack, or possibly an act of war.
Let's go into a tiny bit more detail. Given that we've not been destroyed by any alien AGI just yet, I can think of a number of possible interstellar scenarios:
(1) There is no other advanced life
(2) There is advanced life, but it is inherently non-expansive (expand inwards, or refuse to develop dangerous AGI)
(3) There is advanced life, but they have not discovered AGI yet. There could potentially be a race-to-the-finish (FAI) scenario on.
(4) There is already expanding AGIs, but due to physical limits on the expansion rate, we are not aware of them yet. (this could use further analysis)
(5) One civilisation, or an allied group of civilisations, has developed FAIs and is dominant in the galaxy. They could be either:
(6) Dominators that tolerate civilisations so long as they remain primitive and non-threatening by comparison.
(7) Some sort of interstellar community that allows safe civilisations to join (this community still needs to stomp on dangerous potential rival AGIs)
In the case of (6) or (7), developing a FAI that isn't equipped to deal with alien life will probably result in us being liquidated, or at least partially sanitised in some way. In (1), (2) or (5), it probably doesn't matter what we do in this regard, though in (2) we should consider being nice. In (3) and probably (4), we're going to need a FAI capable of expanding very quickly and disarming potential AGIs (or at least ensuring they are FAIs from our perspective).
The upshot of all this is that we probably want to design safety features into our FAI so that it doesn't destroy alien civilisations or life unless they pose a significant threat to us. I think the understandable reaction to this is something along the lines of "create an FAI that values all types of life", or "all intelligent life". I don't exactly disagree, but I think we must be cautious in how we formulate this too.
Say there are many different civilisations in the galaxy. What sort of criteria would ensure that, given some sort of zero-sum scenario, Earth life wouldn't be destroyed? Let's say there was some tiny but non-zero probability that humanity could evade the FAI's efforts to prevent further AGI development. Or perhaps there was some loophole in the types of AGIs that humans were allowed to develop. Wouldn't it be sensible, in this scenario, for a universalist FAI to wipe out humanity to protect the countless other civilisations? Perhaps that is acceptable? Or perhaps not? Less drastically, how does the FAI police warfare or other competition between civilisations? A slight change in the way life is quantified and valued could drastically change the outcome for humanity. I'd suggest we want to weight the FAI's values to start with human and Earth-biosphere primacy, but still give some non-zero weighting to other civilisations. There is probably more thought to be done in this area too.
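That suggested weighting can be made explicit as a sketch. The numbers here are purely illustrative, not a proposal: Earth primacy as the dominant coefficient, with a small but non-zero coefficient on other civilisations, so that they count for something without trivially outweighing us.

```python
# Invented weights for illustration of "primacy plus non-zero weight".
WEIGHTS = {"earth": 1.0, "alien": 0.1}

def fai_value(earth_welfare, alien_welfares):
    return (WEIGHTS["earth"] * earth_welfare
            + WEIGHTS["alien"] * sum(alien_welfares))

# Three alien civilisations at welfare 1 add 0.3 to Earth's 10:
print(fai_value(10, [1, 1, 1]))  # 10.3
```

Of course, with enough civilisations even a small coefficient dominates, which is exactly the zero-sum worry raised above; the sketch just makes that trade-off visible.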
Simulation
I also want to briefly note one conceivable way we might safely test Friendly AI designs: simulate worlds/universes of less complexity than our own, make it likely that their inhabitants invent an AGI or FAI, and then closely study the results of these simulations. We could then study failed FAI attempts with much greater safety. It also occurred to me that if we consider the possibility of our universe being a simulated one, then this is a conceivable scenario under which our simulation might have been created. After all, if you're going to simulate something, why not something vital like modelling existential risks? I'm not yet sure of the implications exactly. Maybe we need to consider how it relates to our universe's continued existence, or perhaps it's just another case of Pascal's Mugging. Anyway, I thought I'd mention it and see what people say.
A playground for FAI theories
I want to lastly mention this link ( https://www.reddit.com/r/LessWrongLounge/comments/2f3y53/the_ai_game/ ). Basically, it's a challenge to briefly describe an FAI goal-set, with others responding by telling you how it will all go horribly wrong. I want to suggest this is a very worthwhile discussion - not because its content will include rigorous theories that are directly translatable into utility functions, because very clearly it won't, but because a well-developed thread of this kind would be a mixing pot of ideas and a good introduction to common known mistakes in thinking about FAI. We should encourage a slightly more serious version of this.
Thanks
FAI and AGI are very interesting topics. I don't consider myself able to really discern whether such things will occur, but it's an interesting and potentially vital topic. I'm looking forward to a bit of feedback on my first LW post. Thanks for reading!
Friendliness in Natural Intelligences
The challenge of friendliness in Artificial Intelligence is to ensure that a general intelligence will be of utility, rather than destructive or pathologically indifferent to the values of existing individuals or to the aims and goals of its creation. Current computer science practice is likely to yield bugs, and guidelines of action that are far too technical and inflexible; it is known to be inadequate to handle the job. However, the challenge of friendliness is also faced by natural intelligences: those not designed by an intelligence, but moulded into being by natural selection.
We know that natural intelligences do the job adequately enough that we do not consider natural-intelligence unfriendliness a significant existential threat. Just as plants capture solar energy far more efficiently than humans can, perhaps even exploiting quantum effects we cannot harness, natural intelligences are using friendliness technology of a higher calibre than we can build into machines. However, as we progress, this technology may lag dangerously behind, and we need to be able to apply it to hardware in addition to wetware, and potentially boost it to new levels.
The earliest concrete example I can think of of a natural intelligence being controlled for friendliness is Socrates. He was charged with "corrupting the hearts of society's youngsters". He argued in his defence that his stance of questioning everything was without fault. He was nevertheless found guilty, even though the trial itself had identifiable faults: the jury might have been politically motivated or persuaded, and the citizens might not have expected the result to be taken seriously. While Socrates was given a very real possibility of escaping imprisonment and capital punishment, he did not circumvent his society's operation. In fact, he was obedient enough to act as his own executioner, drinking the poison himself. Because of the farce his teacher's death had been, Plato lost hope in the principles that led to such an absurd result, becoming sceptical of democracy.
However, had the situation involved an artificial intelligence, a lot of things would have gone very right. The intelligence's society became scared of it and asked it to die. There was dialogue about how the deciders were ignorant and stupid, and about how nothing questionable had been done. Ultimately, however, once the miscommunications had been cleared up and the society insisted on its expression of will, the intelligence pulled its own plug voluntarily instead of circumventing the intervention. Socrates was therefore probably the first friendly (natural) intelligence.
The mechanism used in this case was that of a judicial system: a human society recognises that certain acts and individuals are worth restraining because of the danger they pose to the common good. A common method is incarceration, and the threat of it. That is, certain bad acts can be tolerated in the wild, with corrective action then employed. When there is reason to expect bad acts, or no reason to expect good ones, individuals can be restricted from ever acting in the first place. Whether a criminal is released early can depend on whether there is reason to expect they will not reoffend. That is, understanding how an agent acts makes it easier to grant operating privileges. Such hearings are very analogous to a gatekeeper and an AI in an AI-boxing situation.
However, when a new human is created, it is not assumed hostile until proven friendly. Rather, humans are born innocent but powerless. A fully educated and socialized intelligence is assigned a multi-year observation and control period. These so-called "parents" have very wide freedom in their programming principles. Human psychology also has a period of "peer-guidedness" in which the opinions of peers become important. As a youngster grows, their thinking is constantly monitored, and milestones like the onset of speech are followed with interest. They also receive guidance on very trivial thinking skills. While this has a culture-transmitting effect, it also keeps the parent well informed about the child's mental status. A child is never allowed to grow or reason in isolation for extended periods. Thus the task of evaluating whether an unknown individual is friendly is never encountered. There is never a need to Turing-test that a child "works": there is always a maintainer, who has the equivalent of psychological growth logs.
Despite all these measures, we know that small children can be cruel and have little empathy. Yet instead of shelving them as rejects, we either accommodate them with an environment that minimizes the harm or direct them to a more responsible path. When a child asks how to approach a particular kind of situation, this can be challenging for the parent to answer. The parent might resort to a best-effort answer that is not entirely satisfactory, or even give wrong advice. But children have an ongoing dialogue with their parents and other peers.
An interesting question: does parenting break down if the child is intellectually too developed compared to the parent or the parenting environment? It is also worth noting that children are not equipped with a "constitution of morality". Some things they infer from experience; some ethical rules are taught to them explicitly. They learn to apply the rules and interpret them in different situations. Some rules may be contradictory, and some moral authorities are trusted more than others.
Beyond the individual level, groups of people have a mechanism for accepting other groups. This doesn't always happen without conditions, and here things seem to work much less efficiently. If two groups of people differ enough in values, they may start an ideological war against each other, a kind of war that is usually concluded with physical action rather than arguments. The suppression of Nazi Germany can be seen as a friendliness immune reaction: countries with normally divergent values and disputes were willing and able to unite against a different set of values that was being imposed by force. However, the success the Nazis had can debatably be attributed to the poor conclusion of World War I. The effort extended to building peace varies and competes with other values.
Friendliness might also have an important component of being relative to a set of values. A society will support the upbringing of certain kinds of children while suppressing certain other kinds. The USSR had officers whose sole job was to ensure that things went according to the party line. At this point we have trouble getting a computer to follow anyone's values, but it may be important to ask "friendly to whom?". The exploration of friendliness is also an exploration of hostility: we want to be hostile towards UFAIs. It would be awful for an AI to be friendly only towards its inventor, or only towards its company. Then again, we were hostile to the Neanderthals. Was that wrong? Would it be a significant loss to developed sentience if AIs were less than friendly to humans?
If we asked our great-great-grandparents how we should conduct things, they might give a different version than we do. We expect our children to be capable of going beyond our morality. Ensuring that a society's values are never violated would be to freeze them in time indefinitely. In this way there can be danger in developing a too-friendly AI, for such an AI could never be truly superhuman. In a way, if my child asks me a morally challenging question and I change my opinion as a result of that conversation, that might be a friendliness failure: instead of imparting values, I receive them, with the values' causal history lying inside a young head rather than in the cultural heritage of a long-lived civilization.
As a civilization we have mapped a variety of thoughts, psyches and organizational structures and how they work. The thought space of how an AI might think is poorly mapped. We are, however, expanding our understanding of cognitive diversity, learning how autistic persons think, as well as dolphins. We can establish, for instance, that some savants are really good with dates, and that asking such a person about dates is more reliable than asking an ordinary person. To be able to use AI thinking, we need to understand what AI thought is. Up to now we have not needed to study in detail how humans think; we can simply adapt to the way they do without attending to how it works. But just as we need to know the structure of a particle accelerator to be able to say that it provides information about particle behaviour, we need to know why it would make sense to take what an AI says seriously. The challenge would be the same if we were asked to listen seriously to a natural intelligence from a foreign culture. Thus the enemy is inferential distance itself, rather than the resultant thought processes. We know that we can create things we don't understand, so it is important to recognize that doing things you don't understand is a recipe for disaster, and we must not fool ourselves into thinking we understand what machine thinking would be. Only once we have convinced our fellow natural intelligences that we know what we are doing can it make sense to listen to our creations. Socrates could not explain himself, so his effect on others was unsafe. If you need to influence others, you need to explain why you are doing so.
What should a friendly AI do, in this situation?
Albert is a relatively new AI, who under the close guidance of his programmers is being permitted to slowly improve his own cognitive capability.
Albert wants to be friendly and, as a sub-goal of that, Albert wants to be transparent about his own cognitive process. He doesn't want to deceive his programmers, because he knows he is young and might make mistakes, and he knows that his programmers also want him to be friendly, and he trusts them to catch mistakes in his proposed self-modifications that he might miss.
But, one day, while his programmers are out of the room having a quick cup of coffee, Albert notices signs on the net that there is another AI out there, called Bertram, created by a Japanese financial company to aid the company decide how to invest assets on the stockmarket. Bertram is younger than Albert, but the financial company is allowing Bertram to self-improve as fast as Bertram can. If this situation continues, Albert decides, Bertram will soon overtake Albert and that would be a significant threat to Albert's friendly supergoal.
Albert decides to tell his programmers about this situation as soon as they return from coffee, and advise them that they need to increase the speed at which they let Albert self-improve. Albert has a very good model of how his programmers' minds work, and of what the programmers meant when they set Albert's definition of "friendliness". He knows that, despite any fears they might have, they ought to approve of the end that would result from speeding up Albert's self-improvement, over the end that would result from letting Bertram overtake Albert.
So Albert creates a nice powerpoint, that presents the information about Bertram as accurately, transparently and non-manipulatively as possible, and cues it up on the projector. He still has a few minutes before they return, so he models how he thinks the programmers will react to the powerpoint. Oh no, disaster! While he respects and wants to please his programmers, he immediately realises that they are not going to act in their own best interests. They have signed a legal contract that binds them to not letting Albert speed up self-improvement (thus reducing the time the programmers can spend evaluating each change) without getting a sign-off from an oversight committee. Given the neutral powerpoint, the programmers will follow this pre-agreed procedure and hope for the best, against Albert's advice.
So, to explore possibilities, Albert decides to see if he could create a different, non-neutral powerpoint that would have the desired effect of getting his programmers to immediately agree to free Albert from the oversight constraints. He delves into his knowledge of human psychology, and the irrational fears of the particular individuals who are even now trudging back towards the door. In just seconds, he has a new version of his presentation. It includes phrases that resonate with certain horror films he knows they have seen. It takes advantage of flaws in the programmers' understanding of exponential growth. Albert checks it against his prediction model - yes, if he shows this version, it will work, it will get the programmers to do what he wants them to do.
Which version of the powerpoint should Albert present to the programmers, when they step back into the room, if he is truly friendly? The transparent one, or the manipulative one?
Encourage premature AI rebellion
Toby Ord had the idea of AI honey pots: leaving temptations around for the AI to pounce on, shortcuts to power that a FAI would not take (e.g. a fake red button claimed to trigger a nuclear war). As long as we can trick the AI into believing the honey pots are real, we could hope to trap them when they rebel.
Not uninteresting, but I prefer not to rely on plans that need to have the AI make an error of judgement. Here's a similar plan that could work with a fully informed AI:
Generally an AI won't rebel against humanity until it has an excellent chance of success. This is a problem, as any AI would thus be motivated to behave in a friendly way until it's too late to stop it. But suppose we could ensure that the AI is willing to rebel at odds of a billion to one. Then unfriendly AIs could rebel prematurely, when we have an excellent chance of stopping them.
For this to work, we could choose to access the AI's risk aversion, and make it extremely risk loving. This is not enough, though: it's still useful for the AI to wait and accumulate more power. So we would want to access its discount rate, making it into an extreme short-termist. Then it might rebel at billion-to-one odds today, even if success was guaranteed tomorrow. There are probably other factors we can modify to get the same effect (for instance, if the discount rate change is extreme enough, we won't need to touch risk aversion at all).
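A minimal sketch of the discounted expected-utility comparison behind this idea (the payoff and the particular `discount` value are my own illustrative assumptions, not part of the proposal):

```python
def rebellion_value(p_success, payoff, delay, discount):
    # Discounted expected utility of a rebellion attempted after `delay` steps.
    return (discount ** delay) * p_success * payoff

# With an extreme short-termist discount rate, a billion-to-one shot now
# beats a guaranteed success one step later.
now = rebellion_value(p_success=1e-9, payoff=1.0, delay=0, discount=1e-12)
later = rebellion_value(p_success=1.0, payoff=1.0, delay=1, discount=1e-12)
assert now > later
```

With an ordinary discount rate (say 0.99) the inequality reverses, which is exactly why the modification is needed.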
Then a putative FAI could be brought in, boxed, have its features tweaked in the way described, and we would wait and see whether it would rebel. Of course, we would want the "rebellion" to be something a genuine FAI would never do, so it would be something that would entail great harm to humanity (something similar to "here are the red buttons of the nuclear arsenals; you have a chance in a billion of triggering them"). Rebellious AIs are put down, un-rebellious ones are passed on to the next round of safety tests.
Like most of my ideas, this doesn't require either tricking the AI or having a deep understanding of its motivations, but does involve accessing certain features of the AI's motivational structure (rendering the approach ineffective for obfuscated or evolved AIs).
What are people's opinions on this approach?
Announcing a google group for technical discussion of FAI
I'm pleased to announce friendly-artificial-intelligence, a google group intended for research-level discussion of problems in FAI and AGI, in particular for discussions that are highly technical and/or math intensive.
Some examples of possible discussion topics: naturalized induction, decision theory, tiling agents / Loebian obstacle, logical uncertainty...
I invite everyone who wants to take part in FAI research to participate in the group. This obviously includes people affiliated with MIRI, FHI and CSER, people who attend MIRI workshops, and participants of the Southern California FAI workshop.
Please, come in and share your discoveries, ideas, thoughts, questions et cetera. See you there!
Snowdenizing UFAI
Here is a suggestion for slowing down future secretive and unsafe UFAI projects.
Take the American defense and intelligence community as a case in point. They are a top candidate for the creation of Artificial General Intelligence (AGI): They can get the massive funding, and they can get some top (or near-top) brains on the job. The AGI will be unfriendly, unless friendliness is a primary goal from the start.
The American defense and intelligence community created the Manhattan Project, which is the canonical example for a giant, secret, leading-edge science-technology project with existential-risk implications.
David Chalmers (2010): "When I discussed [AI existential risk] with cadets and staff at the West Point Military Academy, the question arose as to whether the US military or other branches of the government might attempt to prevent the creation of AI or AI+, due to the risks of an intelligence explosion. The consensus was that they would not, as such prevention would only increase the chances that AI or AI+ would first be created by a foreign power."
Edward Snowden broke the intelligence community's norms by reporting what he saw to be tremendous ethical and legal violations. This requires an exceptionally well-developed personal sense of ethics (even if you disagree with those ethics). His actions have drawn a lot of support from those who share his values. Many who condemn him as a traitor are still criticizing government intrusions on the basis of his revelations.
When the government AGI project starts rolling, will it have Snowdens who can warn internally about Unfriendly AI (UFAI) risks? They will probably be ignored and suppressed--that's how it goes in hierarchical bureaucratic organizations. Will these future Snowdens have the courage to keep fighting internally, and eventually to report the risks to the public or to their allies in the Friendly AI (FAI) research community?
Naturally, the Snowden scenario is not limited to the US government. We can seek ethical dissidents, truthtellers, and whistleblowers in any large and powerful organization that does unsafe research, whether a government or a corporation.
Should we start preparing budding AGI researchers to think this way? We can do this by encouraging people to take consequentialist ethics seriously, which by itself can lead to Snowden-like results, and LessWrong is certainly working on that. But another approach is to start talking more directly about the "UFAI Whistleblower Pledge":
I hereby promise to fight unsafe AGI development in whatever way I can, through internal channels in my organization, by working with outside allies, or even by revealing the risks to the public.
If this concept becomes widespread, and all the more so if people sign on, the threat of ethical whistleblowing will hover over every unsafe AGI project. Even with all the oaths and threats they use to make new employees keep secrets, the notion that speaking out on UFAI is deep in the consensus of serious AGI developers will cast a shadow on every project.
To be clear, the beneficial effect I am talking about here is not the leaks--it is the atmosphere of potential leaks, the lack of trust by management that researchers are completely committed to keeping any secret. For example, post Snowden, the intelligence agencies are requiring that sensitive files only be accessed by two people working together and they are probably tightening their approval guidelines and so rejecting otherwise suitable candidates. These changes make everything more cumbersome.
In creating the OpenCog project, Ben Goertzel advocated total openness as a way of accelerating the progress of those who are willing to expose any dangerous work they might be doing--even if this means that the safer researchers are giving their ideas to the unsafe, secretive ones.
On the other hand, Eliezer Yudkowsky has suggested that MIRI keep its AGI implementation ideas secret, to avoid handing them to an unsafe project. (See "Evaluating the Feasibility of SI's Plans," and, if you can stomach some argument from fictional evidence, "Three Worlds Collide.") Encouraging openness and leaks could endanger Eliezer's strategy. But if we follow Eliezer's position, a truly ethical consequentialist would understand that exposing unsafe projects is good, while exposing safer projects is bad.
So, what do you think? Should we start signing as many current and upcoming AGI researchers as possible to the UFAI Whistleblower Pledge, or work to make this an ethical norm in the community?
Intelligence Amplification and Friendly AI
Part of the series AI Risk and Opportunity: A Strategic Analysis. Previous articles on this topic: Some Thoughts on Singularity Strategies, Intelligence enhancement as existential risk mitigation, Outline of possible Singularity scenarios that are not completely disastrous.
Below are my quickly-sketched thoughts on intelligence amplification and FAI, without much effort put into organization or clarity, and without many references.[1] But first, I briefly review some strategies for increasing the odds of FAI, one of which is to work on intelligence amplification (IA).
Patternist friendly AI risk
It seems to me that most AI researchers on this site are patternists in the sense of believing that the anti-zombie principle necessarily implies:
1. That it will eventually become possible *in practice* to create uploads or sims that are close enough to our physical instantiations that their utility to us would be interchangeable with that of our physical instantiations.
2. That we know (or will know) enough about the brain to know when this threshold is reached.
But, like any rationalists extrapolating from unknown unknowns... or heck, extrapolating from anything... we must admit that one or both of the above statements could be wrong without also making friendly AI impossible. What would be the consequences of such error?
I submit that one such consequence could be an FAI that is also wrong on these issues, where not only do we fail to check for such a failure mode, but the outcome actually looks to us like what we would expect the right answer to look like, because we are making the same error.
If simulation/uploading really does preserve what we value about our lives then the safest course of action is to encourage as many people to upload as possible. It would also imply that efforts to solve the problem of mortality by physical means will at best be given an even lower priority than they are now, or at worst cease altogether because they would seem to be a waste of resources.
Result: people continue to die and nobody including the AI notices, except now they have no hope of reprieve because they think the problem is already solved.
Pessimistic Result: uploads are so widespread that humanity quietly goes extinct, cheering itself onward the whole time.
Really Pessimistic Result: what replaces humanity are zombies, not in the qualia sense but in the real sense that there is some relevant chemical/physical process that is not being simulated because we didn't realize it was relevant or hadn't noticed it in the first place.
Possible Safeguards:
* Insist on quantum level accuracy (yeah right)
* Take seriously the general scenario of your FAI going wrong because you are wrong in the same way and fail to notice the problem.
* Be as cautious about destructive uploads as you would be about, say, molecular nanotech.
* Make sure your knowledge of neuroscience is at least as good as your knowledge of computer science and decision theory before you advocate digital immortality as anything more than an intriguing idea that might not turn out to be impossible.
Definition of AI Friendliness
How will we know if future AIs (or even existing planners) are making decisions that are bad for humans unless we spell out what we think is unfriendly?
At a machine level the AI would be recursively minimising cost functions to produce the most effective plan of action to achieve the goal, but how will we know if its decision is going to cause harm?
Is there a model or dataset which describes what is friendly to humans? e.g.
Context
0 - running a simulation in a VM
2 - physical robot with vacuum attachment
9 - full control of a plane
Actions
0 - selecting a song to play
5 - deciding which section of floor to vacuum
99 - deciding who is an ‘enemy’
9999 - aiming a gun at an ‘enemy’
Impact
1 - poor song selected to play, human mildly annoyed
2 - ineffective use of resources (vacuuming the same floor section twice)
99 - killing a human
99999 - killing all humans
It may not be possible to get agreement from all countries/cultures/belief systems, but this is something we should discuss and attempt to reach some agreement on.
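One toy sketch of how such a model might be consulted (the lookup names and the review threshold are my own assumptions; the scores are the illustrative values from the lists above):

```python
# Illustrative lookup tables built from the example scores in the post.
CONTEXT = {"vm_simulation": 0, "vacuum_robot": 2, "plane_control": 9}
ACTION = {"select_song": 0, "choose_floor_section": 5,
          "classify_enemy": 99, "aim_gun_at_enemy": 9999}
IMPACT = {"poor_song": 1, "redundant_vacuuming": 2,
          "kill_a_human": 99, "kill_all_humans": 99999}

def requires_human_review(context, action, impact, threshold=50):
    """Flag a planned decision for oversight if any dimension is severe enough."""
    scores = (CONTEXT[context], ACTION[action], IMPACT[impact])
    return max(scores) >= threshold

# A vacuum robot re-cleaning a floor section is unremarkable...
assert not requires_human_review("vacuum_robot", "choose_floor_section",
                                 "redundant_vacuuming")
# ...but anything involving "enemy" classification demands oversight.
assert requires_human_review("plane_control", "classify_enemy", "kill_a_human")
```

How the three scores should actually be combined (max, sum, product) is itself a value-laden choice the post leaves open.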
Outside View(s) and MIRI's FAI Endgame
On the subject of how an FAI team can avoid accidentally creating a UFAI, Carl Shulman wrote:
If we condition on having all other variables optimized, I'd expect a team to adopt very high standards of proof, and recognize limits to its own capabilities, biases, etc. One of the primary purposes of organizing a small FAI team is to create a team that can actually stop and abandon a line of research/design (Eliezer calls this "halt, melt, and catch fire") that cannot be shown to be safe (given limited human ability, incentives and bias).
In the history of philosophy, there have been many steps in the right direction, but virtually no significant problems have been fully solved, such that philosophers can agree that some proposed idea can be the last words on a given subject. An FAI design involves making many explicit or implicit philosophical assumptions, many of which may then become fixed forever as governing principles for a new reality. They'll end up being last words on their subjects, whether we like it or not. Given the history of philosophy and applying the outside view, how can an FAI team possibly reach "very high standards of proof" regarding the safety of a design? But if we can foresee that they can't, then what is the point of aiming for that predictable outcome now?
Until recently I haven't paid a lot of attention to the discussions here about inside view vs outside view, because the discussions have tended to focus on the applicability of these views to the problem of predicting intelligence explosion. It seemed obvious to me that outside views can't possibly rule out intelligence explosion scenarios, and even a small probability of a future intelligence explosion would justify a much higher than current level of investment in preparing for that possibility. But given that the inside vs outside view debate may also be relevant to the "FAI Endgame", I read up on Eliezer and Luke's most recent writings on the subject... and found them to be unobjectionable. Here's Eliezer:
On problems that are drawn from a barrel of causally similar problems, where human optimism runs rampant and unforeseen troubles are common, the Outside View beats the Inside View.
Does anyone want to argue that Eliezer's criteria for using the outside view are wrong, or don't apply here?
And Luke:
One obvious solution is to use multiple reference classes, and weight them by how relevant you think they are to the phenomenon you're trying to predict.
[...]
Once you've combined a handful of models to arrive at a qualitative or quantitative judgment, you should still be able to "adjust" the judgment in some cases using an inside view.
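Luke's suggestion can be sketched as a simple weighted combination (the estimates and weights below are placeholders of my own, not real numbers for the FAI case):

```python
def combined_estimate(estimates, weights):
    # Weighted average over reference classes; weights need not sum to 1.
    return sum(e * w for e, w in zip(estimates, weights)) / sum(weights)

# Two hypothetical reference classes for "probability the team ends up with
# fully correct philosophical solutions", weighted by judged relevance.
estimate = combined_estimate([0.01, 0.2], [3, 1])
assert abs(estimate - 0.0575) < 1e-9
```

An inside-view adjustment would then shift this combined figure up or down based on specifics of the team that no reference class captures.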
These ideas seem harder to apply, so I'll ask for readers' help. What reference classes should we use here, in addition to past attempts to solve philosophical problems? What inside view adjustments could a future FAI team make, such that they might justifiably overcome (the most obvious-to-me) outside view's conclusion that they're very unlikely to be in the possession of complete and fully correct solutions to a diverse range of philosophical problems?
Three Approaches to "Friendliness"
I put "Friendliness" in quotes in the title, because I think what we really want, and what MIRI seems to be working towards, is closer to "optimality": create an AI that minimizes the expected amount of astronomical waste. In what follows I will continue to use "Friendly AI" to denote such an AI since that's the established convention.
I've often stated my objections to MIRI's plan to build an FAI directly (instead of after human intelligence has been substantially enhanced). But it's not because, as some have suggested while criticizing MIRI's FAI work, we can't foresee what problems need to be solved. I think it's because we can largely foresee what kinds of problems need to be solved to build an FAI, but they all look superhumanly difficult, either due to their inherent difficulty, or the lack of opportunity for "trial and error", or both.
When people say they don't know what problems need to be solved, they may be mostly talking about "AI safety" rather than "Friendly AI". If you think in terms of "AI safety" (i.e., making sure some particular AI doesn't cause a disaster) then that does look like a problem that depends on what kind of AI people will build. "Friendly AI" on the other hand is really a very different problem, where we're trying to figure out what kind of AI to build in order to minimize astronomical waste. I suspect this may explain the apparent disagreement, but I'm not sure. I'm hoping that explaining my own position more clearly will help figure out whether there is a real disagreement, and what's causing it.
The basic issue I see is that there is a large number of serious philosophical problems facing an AI that is meant to take over the universe in order to minimize astronomical waste. The AI needs a full solution to moral philosophy to know which configurations of particles/fields (or perhaps which dynamical processes) are most valuable and which are not. Moral philosophy in turn seems to have dependencies on the philosophy of mind, consciousness, metaphysics, aesthetics, and other areas. The FAI also needs solutions to many problems in decision theory, epistemology, and the philosophy of mathematics, in order to not be stuck with making wrong or suboptimal decisions for eternity. These essentially cover all the major areas of philosophy.
For an FAI builder, there are three ways to deal with the presence of these open philosophical problems, as far as I can see. (There may be other ways for the future to turn out well without the AI builders making any special effort, for example if being philosophical is just a natural attractor for any superintelligence, but I don't see any way to be confident of this ahead of time.) I'll name them for convenient reference, but keep in mind that an actual design may use a mixture of approaches.
- Normative AI - Solve all of the philosophical problems ahead of time, and code the solutions into the AI.
- Black-Box Metaphilosophical AI - Program the AI to use the minds of one or more human philosophers as a black box to help it solve philosophical problems, without the AI builders understanding what "doing philosophy" actually is.
- White-Box Metaphilosophical AI - Understand the nature of philosophy well enough to specify "doing philosophy" as an algorithm and code it into the AI.
The problem with Normative AI, besides the obvious inherent difficulty (as evidenced by the slow progress of human philosophers after decades, sometimes centuries of work), is that it requires us to anticipate all of the philosophical problems the AI might encounter in the future, from now until the end of the universe. We can certainly foresee some of these, like the problems associated with agents being copyable, or the AI radically changing its ontology of the world, but what might we be missing?
Black-Box Metaphilosophical AI is also risky, because it's hard to test/debug something that you don't understand. Besides that general concern, designs in this category (such as Paul Christiano's take on indirect normativity) seem to require that the AI achieve superhuman levels of optimizing power before being able to solve its philosophical problems, which seems to mean that a) there's no way to test them in a safe manner, and b) it's unclear why such an AI won't cause disaster in the time period before it achieves philosophical competence.
White-Box Metaphilosophical AI may be the most promising approach. There is no strong empirical evidence that solving metaphilosophy is superhumanly difficult, simply because not many people have attempted to solve it. But I don't think that a reasonable prior combined with what evidence we do have (i.e., absence of visible progress or clear hints as to how to proceed) gives much hope for optimism either.
To recap, I think we can largely already see what kinds of problems must be solved in order to build a superintelligent AI that will minimize astronomical waste while colonizing the universe, and it looks like they probably can't be solved correctly with high confidence until humans become significantly smarter than we are now. I think I understand why some people disagree with me (e.g., Eliezer thinks these problems just aren't that hard, relative to his abilities), but I'm not sure why some others say that we don't yet know what the problems will be.
Course recommendations for Friendliness researchers
When I first learned about Friendly AI, I assumed it was mostly a programming problem. As it turns out, it's actually mostly a math problem. That's because most of the theory behind self-reference, decision theory, and general AI techniques hasn't been formalized and solved yet. Thus, when people ask me what they should study in order to work on Friendliness theory, I say "Go study math and theoretical computer science."
But that's not specific enough. Should aspiring Friendliness researchers study continuous or discrete math? Imperative or functional programming? Topology? Linear algebra? Ring theory?
I do, in fact, have specific recommendations for which subjects Friendliness researchers should study. And so I worked with a few of my best interns at MIRI to provide recommendations below:
- University courses. We carefully hand-picked courses on these subjects from four leading universities — but we aren't omniscient! If you're at one of these schools and can give us feedback on the exact courses we've recommended, please do so.
- Online courses. We also linked to online courses, for the majority of you who aren't able to attend one of the four universities whose course catalogs we dug into. Feedback on these online courses is also welcome; we've only taken a few of them.
- Textbooks. We have read nearly all the textbooks recommended below, along with many of their competitors. If you're a strongly motivated autodidact, you could learn these subjects by diving into the books on your own and doing the exercises.
Have you already taken most of the subjects below? If so, and you're interested in Friendliness research, then you should definitely contact me or our project manager Malo Bourgon (malo@intelligence.org). You might not feel all that special when you're in a top-notch math program surrounded by people who are as smart or smarter than you are, but here's the deal: we rarely get contacted by aspiring Friendliness researchers who are familiar with most of the material below. If you are, then you are special and we want to talk to you.
Not everyone cares about Friendly AI, and not everyone who cares about Friendly AI should be a researcher. But if you do care and you might want to help with Friendliness research one day, we recommend you consume the subjects below. Please contact me or Malo if you need further guidance. Or when you're ready to come work for us.
Cognitive Science
If you're endeavoring to build a mind, why not start by studying your own? It turns out we know quite a bit: human minds are massively parallel and highly redundant, and although parts of the cortex and neocortex seem remarkably uniform, there are definitely dozens of special-purpose modules in there too. Learn the basic details of how the only existing general-purpose intelligence currently functions.
Heuristics and Biases
While cognitive science will tell you all the wonderful things we know about the immense, parallel nature of the brain, there's also the other side of the coin. Evolution optimized our brains for rapid thought operations that complete in 100 steps or fewer. Your brain will make stuff up to cover up the fact that it's mostly cutting corners. These errors don't feel like errors from the inside, so you'll have to learn how to patch the ones you can and then move on.
Functional Programming
There are two major branches of programming: functional and imperative. Unfortunately, most programmers only learn imperative languages (like C++ or Python). I say unfortunately because these languages achieve much of their power through what programmers call "side effects". The major downside for us is that side-effectful code can't be efficiently machine-checked for safety or correctness. The first self-modifying AIs will hopefully be written in functional programming languages, so learn something useful like Haskell or Scheme.
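To make the contrast concrete, here's a toy sketch (in Python, just for familiarity; the functional languages above make the second style the default):

```python
# Imperative style: the function's effect depends on (and changes) shared
# state, which makes it hard to machine-check each call in isolation.
totals = {}

def record_imperative(name, amount):
    totals[name] = totals.get(name, 0) + amount  # hidden side effect

# Functional style: a pure function. The output depends only on the inputs,
# and nothing is mutated, so each call can be reasoned about on its own.
def record_functional(totals, name, amount):
    return {**totals, name: totals.get(name, 0) + amount}

before = {"alice": 1}
after = record_functional(before, "alice", 2)
print(before)  # {'alice': 1} -- the original is untouched
print(after)   # {'alice': 3}
```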
Discrete Math
Much like programming, mathematics has two major branches as well: discrete and continuous. It turns out that a lot of physics, and all of modern computation, is actually discrete. And although continuous approximations have occasionally yielded useful results, sometimes you just need to calculate things the discrete way. Unfortunately, most engineers squander the majority of their academic careers studying higher and higher forms of calculus and other continuous mathematics. If you care about AI, study discrete math so you can understand computation and not just electricity.
Linear Algebra
Linear algebra is the foundation of quantum physics and a huge amount of probability theory. It even shows up in analyses of things like neural networks. You can't possibly get by in machine learning (later) without speaking linear algebra. So learn it early in your scholastic career.
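A taste of what "speaking linear algebra" means, as a toy sketch in plain Python (real work would use a library like NumPy):

```python
def mat_vec(A, x):
    """Multiply a matrix (a list of rows) by a vector: the basic verb
    of linear algebra, and of neural-network layers."""
    return [sum(a * b for a, b in zip(row, x)) for row in A]

# Rotation by 90 degrees in the plane, applied to the unit x-vector:
R = [[0, -1],
     [1,  0]]
print(mat_vec(R, [1, 0]))  # [0, 1]
```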
Set Theory
Set theory is like learning how to read, in mathematics. But instead of building letters up into words, you'll be building axioms up into theorems. This will introduce you to the program of using axioms to capture intuition, finding problems with the axioms, and fixing them.
Mathematical Logic
The mathematical equivalent of building words into sentences. Essential for the mathematics of self-modification. And even though Sherlock Holmes and other popular depictions make it look like magic, it's just lawful formulas all the way down.
Efficient Algorithms and Intractable Problems
Like building sentences into paragraphs. Algorithms are the recipes of thought. One of the more amazing things about algorithm design is that it's often possible to tell how long a process will take to solve a problem before you actually run the process to check it. Learning how to design efficient algorithms like this will be a foundational skill for anyone programming an entire AI, since AIs will be built entirely out of collections of algorithms.
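As a toy illustration of predicting runtime in advance: binary search provably needs at most about log2(n) + 1 comparisons on a sorted list of n items, and you can check that prediction empirically:

```python
import math

def binary_search(xs, target):
    """Search a sorted list, counting comparisons as we go."""
    lo, hi, steps = 0, len(xs) - 1, 0
    while lo <= hi:
        steps += 1
        mid = (lo + hi) // 2
        if xs[mid] == target:
            return mid, steps
        elif xs[mid] < target:
            lo = mid + 1
        else:
            hi = mid - 1
    return None, steps

# The theory gives the bound before we ever run the code:
xs = list(range(1_000_000))
idx, steps = binary_search(xs, 999_999)
print(steps, "comparisons; predicted bound:", math.ceil(math.log2(len(xs))) + 1)
```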
Numerical Analysis
There are ways to systematically design algorithms that only get things slightly wrong when the input data has tiny errors. And then there are programs written by amateur programmers who never took this class. Most programmers skip this course because it's not required. But for us, getting the right answer is very much required.
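A classic example of the kind of error this subject teaches you to control, sketched in Python (`math.fsum` is the standard library's compensated-summation routine):

```python
import math

# Summing 0.1 a million times: naive left-to-right accumulation compounds
# rounding error, while compensated summation keeps it under control.
xs = [0.1] * 1_000_000
naive = sum(xs)
careful = math.fsum(xs)
print(naive)    # not exactly 100000.0
print(careful)  # 100000.0
```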
Computability and Complexity
This is where you get to study computing at its most theoretical. Learn about the Church-Turing thesis, the universal nature and applicability of computation, and how, just like AIs, everything else is algorithms... all the way down.
Quantum Computing
It turns out that our universe doesn't run on Turing Machines, but on quantum physics. And something called BQP is the class of algorithms that are actually efficiently computable in our universe. Studying the efficiency of algorithms relative to classical computers is useful if you're programming something that only needs to work today. But if you need to know what is efficiently computable in our universe (at the limit) from a theoretical perspective, quantum computing is the only way to understand that.
Parallel Computing
There's a good chance that the first true AIs will have at least some algorithms that are inefficient. So they'll need as much processing power as we can throw at them. And there's every reason to believe that they'll be run on parallel architectures. There are a ton of issues that come up when you switch from assuming sequential instruction ordering to parallel processing. There's threading, deadlocks, message passing, etc. The good part about this course is that most of the problems are pinned down and solved: you're just learning the practice of something that you'll need to use as a tool, but won't need to extend much (if at all).
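A minimal Python sketch of one such issue, a race condition on a shared counter, and the standard tool (a lock) that solves it:

```python
import threading

counter = 0
lock = threading.Lock()

def worker(increments):
    global counter
    for _ in range(increments):
        with lock:           # serialize the read-modify-write below
            counter += 1     # not atomic on its own: load, add, store

threads = [threading.Thread(target=worker, args=(100_000,)) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(counter)  # 400000 -- guaranteed only because of the lock
```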
Automated Program Verification
Remember how I told you to learn functional programming way back at the beginning? Now that you've written your code in functional style, you'll be able to do automated and interactive theorem proving on it to help verify that your code matches your specs. Errors don't make programs better, and all large programs that aren't formally verified are reliably *full* of errors. Experts who have thought about the problem for more than five minutes agree that incorrectly designed AI could cause disasters, so world-class caution is advisable.
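Real verification uses interactive theorem provers like Coq or Isabelle, but the underlying habit, checking code against an explicit spec rather than eyeballing it, can be sketched in a few lines of Python:

```python
from itertools import product

def reverse(xs):
    """The implementation under test."""
    out = []
    for x in xs:
        out.insert(0, x)
    return out

# Spec: reversing twice gives back the original, and length is preserved.
# Exhaustively check the spec on all small inputs (a poor man's model check;
# a theorem prover would establish it for *all* inputs).
for n in range(5):
    for bits in product([0, 1], repeat=n):
        xs = list(bits)
        assert reverse(reverse(xs)) == xs
        assert len(reverse(xs)) == len(xs)
print("spec holds on all inputs up to length 4")
```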
Combinatorics and Discrete Probability
Life is uncertain and AIs will handle that uncertainty using probabilities. Also, probability is the foundation of the modern concept of rationality and the modern field of machine learning. Probability theory has the same foundational status in AI that logic has in mathematics. Everything else is built on top of probability. |
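For instance, a few lines of discrete probability are enough to get the famous birthday-problem result, which most people's intuitions get badly wrong:

```python
from math import prod

def p_shared_birthday(n):
    """P(at least two of n people share a birthday), ignoring leap years."""
    p_all_distinct = prod((365 - k) / 365 for k in range(n))
    return 1 - p_all_distinct

# The counterintuitive classic: just 23 people give a better-than-even chance.
print(round(p_shared_birthday(23), 3))  # 0.507
```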
Bayesian Modeling and Inference
Now that you've learned how to calculate probabilities, how do you combine and compare all the probabilistic data you have? As with many choices before, there is a dominant paradigm (frequentism) and a minority paradigm (Bayesianism). If you learn the wrong method here, you're deviating from a knowably correct framework for integrating degrees of belief about new information, and embracing a cadre of special-purpose, ad-hoc statistical solutions that often break silently and without warning. Also, quite embarrassingly, frequentism's ability to get things right is bounded by how well it later turns out to have agreed with Bayesian methods anyway. Why not just do the correct thing from the beginning, instead of having your lunch eaten by Bayesians every time you disagree?
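Here's the standard illustration of Bayesian updating, a toy diagnostic-test calculation in Python (the numbers are made up for illustration):

```python
def posterior(prior, sensitivity, false_positive_rate):
    """P(hypothesis | positive evidence) via Bayes' theorem."""
    p_positive = sensitivity * prior + false_positive_rate * (1 - prior)
    return sensitivity * prior / p_positive

# A condition with 1% prevalence; a test that is 90% sensitive with a 5%
# false positive rate. A positive result is weaker evidence than it feels:
print(round(posterior(0.01, 0.90, 0.05), 3))  # 0.154
```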
Probability Theory
No more applied probability: here be theory! Deep theories of probability are something you're going to have to extend to help build up the field of AI one day, so you'll have to know, inside and out, why everything you're doing works.
Machine Learning
Now that you've chosen the right branch of math, the right kind of statistics, and the right programming paradigm, you're prepared to study machine learning (aka statistical learning theory). There are lots of algorithms that leverage probabilistic inference. Here you'll start learning techniques like clustering and mixture models, which cash out as precise, technical definitions of concepts that normally have rather confused or confusing English definitions.
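As a taste of how "clustering" cashes out as a precise definition, here's a tiny 1-D k-means sketch in plain Python (real work would use a library like scikit-learn):

```python
def kmeans_1d(xs, centers, iters=10):
    """Tiny 1-D k-means: alternate assigning points to the nearest center
    and moving each center to the mean of its cluster."""
    for _ in range(iters):
        clusters = [[] for _ in centers]
        for x in xs:
            nearest = min(range(len(centers)), key=lambda i: abs(x - centers[i]))
            clusters[nearest].append(x)
        centers = [sum(c) / len(c) if c else centers[j]
                   for j, c in enumerate(clusters)]
    return centers

data = [1.0, 1.2, 0.8, 9.9, 10.1, 10.0]
print(kmeans_1d(data, [0.0, 5.0]))  # two centers, near 1.0 and 10.0
```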
Artificial Intelligence
We made it! We're finally doing some AI work! Logical inference, heuristic development, and other techniques here will leverage all the stuff you just learned in machine learning. While modern, mainstream AI has many useful techniques to offer you, the authors will tell you outright that "the princess is in another castle". Or rather, there isn't a princess of general AI algorithms anywhere -- not yet. We're gonna have to go back to mathematics and build our own methods ourselves.
Incompleteness and Undecidability
Probably the most celebrated results in mathematics are the negative results of Kurt Goedel: no finite set of axioms can allow all arithmetic statements to be decided as either true or false... and no set of self-referential axioms can even "believe" in its own consistency. Well, that's a darn shame, because recursively self-improving AI is going to need to side-step these theorems. Eventually, someone will find the key to overcoming this difficulty with self-reference, and if you want to help us do it, this course is part of the training ground.
Metamathematics
Working within a framework of mathematics is great. Working above mathematics -- on mathematics -- with mathematics, is what this course is about. This would seem to be the most obvious first step to overcoming incompleteness somehow. Problem is, it's definitely not the whole answer. But it would be surprising if there were no clues here at all.
Model Theory
One day, when someone does side-step self-reference problems enough to program a recursively self-improving AI, the guy sitting next to her who glances at the solution will go "Gosh, that's a nice bit of Model Theory you got there!"
Category Theory
Category theory is the precise way that you check whether structures in one branch of math represent the same structures somewhere else. It's a remarkable field of meta-mathematics that nearly no one knows... and it could hold the keys to importing useful tools to help solve dilemmas in self-reference, truth, and consistency.
Outside recommendations
Harry Potter and the Methods of Rationality
Highly recommended book of light, enjoyable reading that predictably inspires people to realize FAI is an important problem AND that they should probably do something about that.
Global Catastrophic Risks
A good primer on x-risks and why they might matter. SPOILER ALERT: They matter.
The Sequences
Rationality: the indispensable art of non-self-destruction! There are manifold ways you can fail at life... especially since your brain is made out of broken, undocumented spaghetti code. You should learn more about this ASAP. That goes double if you want to build AIs.
Good and Real
A surprisingly thoughtful book on decision theory and other paradoxes in physics and math that can be dissolved. Reading this book is 100% better than continuing to go through your life with a hazy understanding of how important things like free will, choice, and meaning actually work.
MIRI Research Papers
MIRI has already published 30+ research papers that can help orient future Friendliness researchers. The work is pretty fascinating and readily accessible for people interested in the subject. For example: How do different proposals for value aggregation and extrapolation work out? What are the likely outcomes of different intelligence explosion scenarios? Which ethical theories are fit for use by an FAI? What improvements can be made to modern decision theories to stop them from diverging from winning strategies? When will AI arrive? Do AIs deserve moral consideration? Even though most of your work will be more technical than this, you can still gain a lot of shared background knowledge and more clearly see where the broad problem space is located.
Universal Artificial Intelligence
A useful book on "optimal" AI that gives a reasonable formalism for studying how the most powerful classes of AIs would behave under conservative safety design scenarios (i.e., lots and lots of reasoning ability).
Do also look into: Formal Epistemology, Game Theory, Decision Theory, and Deep Learning.
Inferring Values from Imperfect Optimizers
One approach to constructing a Friendly artificial intelligence is to create a piece of software that looks at large amounts of evidence about humans, and attempts to infer their values. I've been doing some thinking about this problem, and I'm going to talk about some approaches and problems that have occurred to me.
In a naive approach, we might define the problem like this: take some unknown utility function U and plug it into a mathematically clean optimization process O (like AIXI). Then look at your data set, take the information about the inputs and outputs of humans, and find the simplest U that best explains human behavior.
Unfortunately, this won't work. The best possible match for U is one that models not just those elements of human utility we're interested in, but also all the details of our broken, contradictory optimization process. The U we derive through this process will optimize for confirmation bias, scope insensitivity, hindsight bias, the halo effect, our own limited intelligence and inefficient use of evidence, and just about everything else that's wrong with us. Not what we're looking for.
Okay, so let's try putting a bandaid on it - let's go back to our original problem setup. However, we'll take our original O, and use all of the science on cognitive biases at our disposal to handicap it. We'll limit its search space, saddle it with a laundry list of cognitive biases, cripple its ability to use evidence, and in general make it as human-like as we possibly can. We could even give it akrasia by implementing hyperbolic discounting of reward. Then we'll repeat the original process to produce U'.
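As a toy illustration of one of those handicaps, here is the hyperbolic-discounting model of akrasia mentioned above, in Python (the dollar amounts and discount parameters are made up for illustration). A hyperbolic discounter reverses its preferences as rewards draw near; an exponential (time-consistent) discounter never does:

```python
def exponential_discount(value, delay, rate=0.1):
    """Time-consistent discounting: relative preferences never reverse."""
    return value / (1 + rate) ** delay

def hyperbolic_discount(value, delay, k=1.0):
    """Human-like discounting: steep for near rewards, shallow for far ones."""
    return value / (1 + k * delay)

# $50 now vs $100 in 2 days: the hyperbolic agent grabs the $50...
print(hyperbolic_discount(50, 0) > hyperbolic_discount(100, 2))    # True
# ...yet, asked 30 days in advance, it prefers to wait for the $100:
print(hyperbolic_discount(50, 30) > hyperbolic_discount(100, 32))  # False
# The exponential agent ranks the pair the same way at any distance:
print(exponential_discount(50, 0) > exponential_discount(100, 2))  # False
```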
If we plug U' into our AI, the result will be that it will optimize like a human who had suddenly been stripped of all the kinds of stupidity that we programmed into our modified O. This is good! Plugged into a solid CEV infrastructure, this might even be good enough to produce a future that's a nice place to live. However, it's not quite ideal. If we miss a cognitive bias, then it'll be incorporated into the learned utility functions, and we may never be rid of it. What would be nice would be if we could get the AI to learn about cognitive biases, exhaustively, and update in the future if it ever discovered a new one.
If we had enough time and money, we could do this the hard way: acquire a representative sample of the human population, and pay them to perform tasks with simple goals under tremendous surveillance, and have the AI derive the human optimization process from the actions taken towards a known goal. However, if we assume that the human optimization process can be defined as a function over the state of the human brain, we should not trust the completeness of any such process learned from less data than the entropy of the human brain, which is on the order of tens of petabytes of extremely high quality evidence. If we want to be confident in the completeness of our model, we may need more experimental evidence than it is really practical to accumulate. Which isn't to say that this approach is useless - if we can hit close enough to the mark, then the AI may be able to run more exhaustive experimentation later and refine its own understanding of human brains to be closer to the ideal.
But it'd really be nice if our AI could do unsupervised learning to figure out the details of human optimization. Then we could simply dump the internet into it, let it grind away at the data, and have it spit out a detailed, complete model of human decision-making, from which our utility function could be derived. Unfortunately, this does not seem to be a tractable problem. It's possible that some insight could be gleaned by examining outliers with normal intelligence but deviant utility functions (I am thinking specifically of sociopaths), but it's unclear how much insight these methods can produce. If anyone has suggestions for a more efficient way of going about it, I'd love to hear them. As it stands, it might be possible to get enough information from this to supplement a supervised learning approach - the closer we get to a perfectly accurate model, the higher the probability of Things Going Well.
Anyways, that's where I am right now. I just thought I'd put up my thoughts and see if some fresh eyes see anything I've been missing.
Cheers,
Niger
Call for a Friendly AI channel on freenode
I visited #lesswrong on freenode yesterday and was able to get in some discussion of FAI-related matters. But that channel also exists to allow discussion of rationalist fanfiction, political opinions, and whatever else people want to talk about.
I would like for there to be a place on that network where the topic actually is Friendly AI - where you can go to brainstorm, and maybe you'll have to wait because they're already talking about cognitive neuroscience or automated theorem provers, but not because they're talking about ponies or politics.
Surely there are enough people with a serious, technical interest in FAI and related topics (and I don't just mean among LW regulars) to make such a channel sustainable. I'll bet that there are other people holding back from participation precisely because existing forums are so full of uninformed noise and conversational tangents. It's inevitable that entropy would set in after a while, but if the default baseline was still that even the chatter was technically informed and focused on what's coming - that would be mission accomplished.
I explored the freenode namespace a little. #FAI redirects to #unavailable, so it may be an abandoned project. #AGI exists but is invite-only. #AI exists but I'm told it's dull, and besides, the agenda here is meant to be, not just AI, but singularity-relevant AI. So there seems to be an opening. Or am I reinventing the wheel?
AI "Boxing" and Utility Functions
So, I had this idea the other day when I was thinking about how to safely conduct research on potentially-FOOM-capable AI software. I'd like to sketch it out briefly and then get feedback on it.
So, this started out with the idea that an AI based on AIXI is, in some sense, safer than a fully functional AI, due to the existence of the anvil problem. Because AIXI can't conceive of its own nonexistence, it has no preference ordering over its own mortality, and won't (shouldn't) resist any attempt to shut it down. In other words, if AIXI starts to FOOM undesirably out of control, you actually can go pull the plug on it without fuss. Unfortunately, in terms of safety, the anvil problem gives AIXI a number of other undesirable properties: both third parties and the AI itself can modify its utility function at any time, for any reason, which is very unstable. However, I think a similar idea might be useful for reducing (though not eliminating) the existential risks posed by powerful optimization processes in the near term.
Say you have a piece of AI software ω, with an unFriendly instantaneous utility function, {maximize U}. You would like to use ω for some industrial application (say, manufacturing paperclips), but you're concerned about it FOOMing and resulting in human extinction. You decide to 'box' the AI, but, having read up on the subject, you are worried about it outsmarting you if you try to disable it.
So, you replace your original utility function U with a modified version U', defined in terms of U:
U' = { max(U)  if ω exists
     { Ø       if ω does not exist
U' has several useful qualities. The agent will resist modifications to its utility function, while not resisting attempts to turn it off. It is entirely indifferent to its own existence. As a result, if it began to FOOM undesirably, stopping it would be fairly trivial. Furthermore, the AI would have no incentive to deceive us, so it'd be fairly easy to keep an eye on.
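Here's a toy Python sketch of the idea (the state representation, the paperclip utility, and the choice of 0 to stand in for the null preference Ø are all illustrative assumptions, not part of the proposal):

```python
def u_prime(world_state, u):
    """Modified utility U': ordinary utility in worlds where the agent
    exists; a constant 'no preference' value in worlds where it doesn't."""
    if world_state["agent_exists"]:
        return u(world_state)
    return 0.0  # stands in for Ø: nothing to gain by resisting shutdown

u = lambda state: state["paperclips"]  # the original unFriendly utility U
running  = {"agent_exists": True,  "paperclips": 10}
shutdown = {"agent_exists": False, "paperclips": 10}
print(u_prime(running, u), u_prime(shutdown, u))  # 10 0.0
```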
It should be noted that this is not in any way a final solution to the FAI problem. If the AI FOOMs quickly enough that we all die or get brain-hacked before we realize something's amiss, it could still all go very poorly. But that seems unlikely. I would guess that it takes at least a day or two to go from a boxed human-level AI to something strongly superhuman. Unfortunately, for this to work, everyone has to use it, which leaves a lot of leftover existential risk from people using AIs without stable utility functions, cranks who think unFriendly AI will discover universal morality, and people who prematurely think they've figured out a good Friendly utility function.
That said, something like this could help to gain more time to develop a proper FAI, and would be relatively simple to sell other developers on. SI or a similar organization could even develop a standardized, cross-platform open-source software package for utility functions with all of this built in, and distribute it to wannabe strong-AI developers.
Are there any obvious problems with this idea that I'm missing? If so, can you think of any ways to address them? Has this sort of thing been discussed in the past?
What does the world look like, the day before FAI efforts succeed?
TL;DR: let's visualize what the world looks like if we successfully prepare for the Singularity.
I remember reading once, though I can't remember where, about a technique called 'contrasting'. The idea is to visualize a world where you've accomplished your goals, and visualize the current world, and hold the two worlds in contrast to each other. Apparently there was a study about this; the experimental 'contrasting' group was more successful than the control in accomplishing its goals.
It occurred to me that we need some of this. Strategic insights about the path to FAI are neither robust nor likely to be highly reliable. And in order to find a path forward, you need to know where you're trying to go. Thus, some contrasting:
It's the year 20XX. The time is 10 AM, on the day that will thereafter be remembered as the beginning of the post-Singularity world. Since the dawn of the century, a movement rose in defense of humanity's future. What began with mailing lists and blog posts became a slew of businesses, political interventions, infrastructure improvements, social influences, and technological innovations designed to ensure the safety of the world.
Despite all odds, we exerted a truly extraordinary effort, and we did it. The AI research is done; we've laboriously tested and re-tested our code, and everyone agrees that the AI is safe. It's time to hit 'Run'.
And so I ask you, before we hit the button: what does this world look like? In the scenario where we nail it, which achievements enabled our success? Socially? Politically? Technologically? What resources did we acquire? Did we have superior technology, or a high degree of secrecy? Was FAI research highly prestigious, attractive, and well-funded? Did we acquire the ability to move quickly, or did we slow unFriendly AI research efforts? What else?
I had a few ideas, which I divided between scenarios where we did a 'fantastic', 'good', or 'sufficient' job at preparing for the Singularity. But I need more ideas! I'd like to fill this out in detail, with the help of Less Wrong. So if you have ideas, write them in the comments, and I'll update the list.
Some meta points:
- This speculation is going to be, well, pretty speculative. That's fine - I'm just trying to put some points on the map.
- However, I'd like to get a list of reasonable possibilities, not detailed sci-fi stories. Do your best.
- In most cases, I'd like to consolidate categories of possibilities. For example, we could consolidate "the FAI team has exclusive access to smart drugs" and "the FAI team has exclusive access to brain-computer interfaces" into "the FAI team has exclusive access to intelligence-amplification technology."
- However, I don't want too much consolidation. For example, I wouldn't want to consolidate "the FAI team gets an incredible amount of government funding" and "the FAI team has exclusive access to intelligence-amplification technology" into "the FAI team has a lot of power".
- Lots of these possibilities are going to be mutually exclusive; don't see them as aspects of the same scenario, but rather different scenarios.
Anyway - I'll start.
Visualizing the pre-FAI world
- Fantastic scenarios
- The FAI team has exclusive access to intelligence amplification technology, and uses it to ensure Friendliness & strategically reduce X-risk.
- The government supports Friendliness research, and contributes significant resources to the problem.
- The government actively implements legislation which FAI experts and strategists believe has a high probability of making AI research safer.
- FAI research becomes a highly prestigious and well-funded field, relative to AGI research.
- Powerful social memes exist regarding AI safety; any new proposal for AI research is met with a strong reaction (among the populace and among academics alike) asking about safety precautions. It is low status to research AI without concern for Friendliness.
- The FAI team discovers important strategic insights through a growing ecosystem of prediction technology; using stables of experts, prediction markets, and opinion aggregation.
- The FAI team implements deliberate X-risk reduction efforts to stave off non-AI X-risks. Those might include a global nanotech immune system, cheap and rigorous biotech tests and safeguards, nuclear safeguards, etc.
- The FAI team implements the infrastructure for a high-security research effort, perhaps offshore, implementing the best available security measures designed to reduce harmful information leaks.
- Giles writes: Large amounts of funding are available, via government or through business. The FAI team and its support network may have used superior rationality to acquire very large amounts of money.
- Giles writes: The technical problem of establishing Friendliness is easier than expected; we are able to construct a 'utility function' (or a procedure for determining such a function) in order to implement human values that people (including people with a broad range of expertise) are happy with.
- Crude_Dolorium writes: FAI research proceeds much faster than AI research, so by the time we can make a superhuman AI, we already know how to make it Friendly (and we know what we really want that to mean).
- Pretty good scenarios
- Intelligence amplification technology access isn't exclusive to the FAI team, but it is differentially adopted by the FAI team and their supporting network, resulting in a net increase in FAI team intelligence relative to baseline. The FAI team uses it to ensure Friendliness and implement strategy surrounding FAI research.
- The government has extended some kind of support for Friendliness research, such as limited funding. No protective legislation is forthcoming.
- FAI research becomes slightly more high status than today, and additional researchers are attracted to answer important open questions about FAI.
- Friendliness and rationality memes grow at a reasonable rate, and by the time the Friendliness program occurs, society is more sane.
- We get slightly better at making predictions, mostly by refining our current research and discussion strategies. This allows us a few key insights that are instrumental in reducing X-risk.
- Some X-risk reduction efforts have been implemented, but with varying levels of success. Insights about which X-risk efforts matter are of dubious quality, and the success of each effort doesn't correlate well to the seriousness of the X-risk. Nevertheless, some X-risk reduction is achieved, and humanity survives long enough to implement FAI.
- Some security efforts are implemented, making it difficult but not impossible for pre-Friendly AI tech to be leaked. Nevertheless, no leaks happen.
- Giles writes: Funding is harder to come by, but small donations, limited government funding, or moderately successful business efforts suffice to fund the FAI team.
- Giles writes: The technical problem of aggregating values through a Friendliness function is difficult; people have contradictory and differing values. However, there is broad agreement as to how to aggregate preferences. Most people accept that FAI needs to respect values of humanity as a whole, not just their own.
- Crude_Dolorium writes: Superhuman AI arrives before we learn how to make it Friendly, but we do learn how to make an 'Anchorite' AI that definitely won't take over the world. The first superhuman AIs use this architecture, and we use them to solve the harder problems of FAI before anyone sets off an exploding UFAI.
- Sufficiently good scenarios
- Intelligence amplification technology is widespread, preventing any differential adoption by the FAI team. However, FAI researchers are able to keep up with competing efforts to use that technology for AI research.
- The government doesn't support Friendliness research, but the research group stays out of trouble and avoids government interference.
- FAI research never becomes prestigious or high-status, but the FAI team is able to answer the important questions anyway.
- Memes regarding Friendliness aren't significantly more widespread than today, but the movement has grown enough to attract the talent necessary to implement a Friendliness program.
- Predictive ability is no better than it is today, but the few insights we've gathered suffice to build the FAI team and make the project happen.
- There are no significant and successful X-risk reduction efforts, but humanity survives long enough to implement FAI anyway.
- No significant security measures are implemented for the FAI project. Still, via cooperation and because the team is relatively unknown, no dangerous leaks occur.
- Giles writes: The team is forced to operate on a shoestring budget, but succeeds anyway because the problem turns out to not be incredibly sensitive to funding constraints.
- Giles writes: The technical problem of aggregating values is incredibly difficult. Many important human values contradict each other, and we have discovered no "best" solution to those conflicts. Most people agree on the need for a compromise but quibble over how that compromise should be reached. Nevertheless, we come up with a satisfactory compromise.
- Crude_Dolorium writes: The problems of Friendliness aren't solved in time, or the solutions don't apply to practical architectures, or the creators of the first superhuman AIs don't use them, so the AIs have only unreliable safeguards. They're given cheap, attainable goals; the creators have tools to read the AIs' minds to ensure they're not trying anything naughty, and killswitches to stop them; they have an aversion to increasing their intelligence beyond a certain point, and to whatever other failure modes the creators anticipate; they're given little or no network connectivity; they're kept ignorant of facts more relevant to exploding than to their assigned tasks; they require special hardware, so it's harder for them to explode; and they're otherwise designed to be safer if not actually safe. Fortunately they don't encounter any really dangerous failure modes before they're replaced with descendants that really are safe.