Notes on the Safety in Artificial Intelligence conference
These are my notes and observations after attending the Safety in Artificial Intelligence (SafArtInt) conference, which was co-hosted by the White House Office of Science and Technology Policy and Carnegie Mellon University on June 27 and 28. This isn't an organized summary of the content of the conference; rather, it's a selection of points which are relevant to the control problem. As a result, it suffers from selection bias: it looks like superintelligence and control-problem-relevant issues were discussed frequently, when in reality those issues were discussed less and I didn't write much about the more mundane parts.
SafArtInt has been the third out of a planned series of four conferences. The purpose of the conference series was twofold: the OSTP wanted to get other parts of the government moving on AI issues, and they also wanted to inform public opinion.
The other three conferences are about near term legal, social, and economic issues of AI. SafArtInt was about near term safety and reliability in AI systems. It was effectively the brainchild of Dr. Ed Felten, the deputy U.S. chief technology officer for the White House, who came up with the idea for it last year. CMU is a top computer science university and many of their own researchers attended, as well as some students. There were also researchers from other universities, some people from private sector AI including both Silicon Valley and government contracting, government researchers and policymakers from groups such as DARPA and NASA, a few people from the military/DoD, and a few control problem researchers. As far as I could tell, everyone except a few university researchers were from the U.S., although I did not meet many people. There were about 70-100 people watching the presentations at any given time, and I had conversations with about twelve of the people who were not affiliated with existential risk organizations, as well as of course all of those who were affiliated. The conference was split with a few presentations on the 27th and the majority of presentations on the 28th. Not everyone was there for both days.
Felten believes that neither "robot apocalypses" nor "mass unemployment" are likely. It soon became apparent that the majority of others present at the conference felt the same way with regard to superintelligence. The general intention among researchers and policymakers at the conference could be summarized as follows: we need to make sure that the AI systems we develop in the near future will not be responsible for any accidents, because if accidents do happen then they will spark public fears about AI, which would lead to a dearth of funding for AI research and an inability to realize the corresponding social and economic benefits. Of course, that doesn't change the fact that they strongly care about safety in its own right and have significant pragmatic needs for robust and reliable AI systems.
Most of the talks were about verification and reliability in modern day AI systems. So they were concerned with AI systems that would give poor results or be unreliable in the narrow domains where they are being applied in the near future. They mostly focused on "safety-critical" systems, where failure of an AI program would result in serious negative consequences: automated vehicles were a common topic of interest, as well as the use of AI in healthcare systems. A recurring theme was that we have to be more rigorous in demonstrating safety and do actual hazard analyses on AI systems, and another was that we need the AI safety field to succeed in ways that the cybersecurity field has failed. Another general belief was that long term AI safety, such as concerns about the ability of humans to control AIs, was not a serious issue.
On average, the presentations were moderately technical. They were mostly focused on machine learning systems, although there was significant discussion of cybersecurity techniques.
The first talk was given by Eric Horvitz of Microsoft. He discussed some approaches for pushing into new directions in AI safety. Instead of merely trying to reduce the errors spotted according to one model, we should look out for "unknown unknowns" by stacking models and looking at problems which appear on any of them, a theme which would be presented by other researchers as well in later presentations. He discussed optimization under uncertain parameters, sensitivity analysis to uncertain parameters, and 'wireheading' or short-circuiting of reinforcement learning systems (which he believes can be guarded against by using 'reflective analysis'). Finally, he brought up the concerns about superintelligence, which sparked amused reactions in the audience. He said that scientists should address concerns about superintelligence, which he aptly described as the 'elephant in the room', noting that it was the reason that some people were at the conference. He said that scientists will have to engage with public concerns, while also noting that there were experts who were worried about superintelligence and that there would have to be engagement with the experts' concerns. He did not comment on whether he believed that these concerns were reasonable or not.
An issue which came up in the Q&A afterwards was that we need to deal with mis-structured utility functions in AI, because it is often the case that the specific tradeoffs and utilities which humans claim to value often lead to results which the humans don't like. So we need to have structural uncertainty about our utility models. The difficulty of finding good objective functions for AIs would eventually be discussed in many other presentations as well.
The next talk was given by Andrew Moore of Carnegie Mellon University, who claimed that his talk represented the consensus of computer scientists at the school. He claimed that the stakes of AI safety were very high - namely, that AI has the capability to save many people's lives in the near future, but if there are any accidents involving AI then public fears could lead to freezes in AI research and development. He highlighted the public's irrational tendencies wherein a single accident could cause people to overlook and ignore hundreds of invisible lives saved. He specifically mentioned a 12-24 month timeframe for these issues.
Moore said that verification of AI system safety will be difficult due to the combinatorial explosion of AI behaviors. He talked about meta-machine-learning as a solution to this, something which is being investigated under the direction of Lawrence Schuette at the Office of Naval Research. Moore also said that military AI systems require high verification standards and that development timelines for these systems are long. He talked about two different approaches to AI safety, stochastic testing and theorem proving - the process of doing the latter often leads to the discovery of unsafe edge cases.
He also discussed AI ethics, giving an example 'trolley problem' where AI cars would have to choose whether to hit a deer in order to provide a slightly higher probability of survival for the human driver. He said that we would need hash-defined constants to tell vehicle AIs how many deer a human is worth. He also said that we would need to find compromises in death-pleasantry tradeoffs, for instance where the safety of self-driving cars depends on the speed and routes on which they are driven. He compared the issue to civil engineering where engineers have to operate with an assumption about how much money they would spend to save a human life.
He concluded by saying that we need policymakers, company executives, scientists, and startups to all be involved in AI safety. He said that the research community stands to gain or lose together, and that there is a shared responsibility among researchers and developers to avoid triggering another AI winter through unsafe AI designs.
The next presentation was by Richard Mallah of the Future of Life Institute, who was there to represent "Medium Term AI Safety". He pointed out the explicit/implicit distinction between different modeling techniques in AI systems, as well as the explicit/implicit distinction between different AI actuation techniques. He talked about the difficulty of value specification and the concept of instrumental subgoals as an important issue in the case of complex AIs which are beyond human understanding. He said that even a slight misalignment of AI values with regard to human values along one parameter could lead to a strongly negative outcome, because machine learning parameters don't strictly correspond to the things that humans care about.
Mallah stated that open-world discovery leads to self-discovery, which can lead to reward hacking or a loss of control. He underscored the importance of causal accounting, which is distinguishing causation from correlation in AI systems. He said that we should extend machine learning verification to self-modification. Finally, he talked about introducing non-self-centered ontology to AI systems and bounding their behavior.
The audience was generally quiet and respectful during Richard's talk. I sensed that at least a few of them labelled him as part of the 'superintelligence out-group' and dismissed him accordingly, but I did not learn what most people's thoughts or reactions were. In the next panel featuring three speakers, he wasn't the recipient of any questions regarding his presentation or ideas.
Tom Mitchell from CMU gave the next talk. He talked about both making AI systems safer, and using AI to make other systems safer. He said that risks to humanity from other kinds of issues besides AI were the "big deals of 2016" and that we should make sure that the potential of AIs to solve these problems is realized. He wanted to focus on the detection and remediation of all failures in AI systems. He said that it is a novel issue that learning systems defy standard pre-testing ("as Richard mentioned") and also brought up the purposeful use of AI for dangerous things.
Some interesting points were raised in the panel. Andrew did not have a direct response to the implications of AI ethics being determined by the predominantly white people of the US/UK where most AIs are being developed. He said that ethics in AIs will have to be decided by society, regulators, manufacturers, and human rights organizations in conjunction. He also said that our cost functions for AIs will have to get more and more complicated as AIs get better, and he said that he wants to separate unintended failures from superintelligence type scenarios. On trolley problems in self driving cars and similar issues, he said "it's got to be complicated and messy."
Dario Amodei of Google Deepbrain, who co-authored the paper on concrete problems in AI safety, gave the next talk. He said that the public focus is too much on AGI/ASI and wants more focus on concrete/empirical approaches. He discussed the same problems that pose issues in advanced general AI, including flawed objective functions and reward hacking. He said that he sees long term concerns about AGI/ASI as "extreme versions of accident risk" and that he thinks it's too early to work directly on them, but he believes that if you want to deal with them then the best way to do it is to start with safety in current systems. Mostly he summarized the Google paper in his talk.
In her presentation, Claire Le Goues of CMU said "before we talk about Skynet we should focus on problems that we already have." She mostly talked about analogies between software bugs and AI safety, the similarities and differences between the two and what we can learn from software debugging to help with AI safety.
Robert Rahmer of IARPA discussed CAUSE, a cyberintelligence forecasting program which promises to help predict cyber attacks. It is a program which is still being put together.
In the panel of the above three, autonomous weapons were discussed, but no clear policy stances were presented.
John Launchbury gave a talk on DARPA research and the big picture of AI development. He pointed out that DARPA work leads to commercial applications and that progress in AI comes from sustained government investment. He classified AI capabilities into "describing," "predicting," and "explaining" in order of increasing difficulty, and he pointed out that old fashioned "describing" still plays a large role in AI verification. He said that "explaining" AIs would need transparent decisionmaking and probabilistic programming (the latter would also be discussed by others at the conference).
The next talk came from Jason Gaverick Matheny, the director of IARPA. Matheny talked about four requirements in current and future AI systems: verification, validation, security, and control. He wanted "auditability" in AI systems as a weaker form of explainability. He talked about the importance of "corner cases" for national intelligence purposes, the low probability, high stakes situations where we have limited data - these are situations where we have significant need for analysis but where the traditional machine learning approach doesn't work because of its overwhelming focus on data. Another aspect of national defense is that it has a slower decision tempo, longer timelines, and longer-viewing optics about future events.
He said that assessing local progress in machine learning development would be important for global security and that we therefore need benchmarks to measure progress in AIs. He ended with a concrete invitation for research proposals from anyone (educated or not), for both large scale research and for smaller studies ("seedlings") that could take us "from disbelief to doubt".
The difference in timescales between different groups was something I noticed later on, after hearing someone from the DoD describe their agency as having a longer timeframe than the Homeland Security Agency, and someone from the White House describe their work as being crisis reactionary.
The next presentation was from Andrew Grotto, senior director of cybersecurity policy at the National Security Council. He drew a close parallel from the issue of genetically modified crops in Europe in the 1990's to modern day artificial intelligence. He pointed out that Europe utterly failed to achieve widespread cultivation of GMO crops as a result of public backlash. He said that the widespread economic and health benefits of GMO crops were ignored by the public, who instead focused on a few health incidents which undermined trust in the government and crop producers. He had three key points: that risk frameworks matter, that you should never assume that the benefits of new technology will be widely perceived by the public, and that we're all in this together with regard to funding, research progress and public perception.
In the Q&A between Launchbury, Matheny, and Grotto after Grotto's presentation, it was mentioned that the economic interests of farmers worried about displacement also played a role in populist rejection of GMOs, and that a similar dynamic could play out with regard to automation causing structural unemployment. Grotto was also asked what to do about bad publicity which seeks to sink progress in order to avoid risks. He said that meetings like SafArtInt and open public dialogue were good.
One person asked what Launchbury wanted to do about AI arms races with multiple countries trying to "get there" and whether he thinks we should go "slow and secure" or "fast and risky" in AI development, a question which provoked laughter in the audience. He said we should go "fast and secure" and wasn't concerned. He said that secure designs for the Internet once existed, but the one which took off was the one which was open and flexible.
Another person asked how we could avoid discounting outliers in our models, referencing Matheny's point that we need to include corner cases. Matheny affirmed that data quality is a limiting factor to many of our machine learning capabilities. At IARPA, we generally try to include outliers until they are sure that they are erroneous, said Matheny.
Another presentation came from Tom Dietterich, president of the Association for the Advancement of Artificial Intelligence. He said that we have not focused enough on safety, reliability and robustness in AI and that this must change. Much like Eric Horvitz, he drew a distinction between robustness against errors within the scope of a model and robustness against unmodeled phenomena. On the latter issue, he talked about solutions such as expanding the scope of models, employing multiple parallel models, and doing creative searches for flaws - the latter doesn't enable verification that a system is safe, but it nevertheless helps discover many potential problems. He talked about knowledge-level redundancy as a method of avoiding misspecification - for instance, systems could identify objects by an "ownership facet" as well as by a "goal facet" to produce a combined concept with less likelihood of overlooking key features. He said that this would require wider experiences and more data.
There were many other speakers who brought up a similar set of issues: the user of cybersecurity techniques to verify machine learning systems, the failures of cybersecurity as a field, opportunities for probabilistic programming, and the need for better success in AI verification. Inverse reinforcement learning was extensively discussed as a way of assigning values. Jeanette Wing of Microsoft talked about the need for AIs to reason about the continuous and the discrete in parallel, as well as the need for them to reason about uncertainty (with potential meta levels all the way up). One point which was made by Sarah Loos of Google was that proving the safety of an AI system can be computationally very expensive, especially given the combinatorial explosion of AI behaviors.
In one of the panels, the idea of government actions to ensure AI safety was discussed. No one was willing to say that the government should regulate AI designs. Instead they stated that the government should be involved in softer ways, such as guiding and working with AI developers, and setting standards for certification.
Pictures: https://imgur.com/a/49eb7
In between these presentations I had time to speak to individuals and listen in on various conversations. A high ranking person from the Department of Defense stated that the real benefit of autonomous systems would be in terms of logistical systems rather than weaponized applications. A government AI contractor drew the connection between Mallah's presentation and the recent press revolving around superintelligence, and said he was glad that the government wasn't worried about it.
I talked to some insiders about the status of organizations such as MIRI, and found that the current crop of AI safety groups could use additional donations to become more established and expand their programs. There may be some issues with the organizations being sidelined; after all, the Google Deepbrain paper was essentially similar to a lot of work by MIRI, just expressed in somewhat different language, and was more widely received in mainstream AI circles.
In terms of careers, I found that there is significant opportunity for a wide range of people to contribute to improving government policy on this issue. Working at a group such as the Office of Science and Technology Policy does not necessarily require advanced technical education, as you can just as easily enter straight out of a liberal arts undergraduate program and build a successful career as long as you are technically literate. (At the same time, the level of skepticism about long term AI safety at the conference hinted to me that the signalling value of a PhD in computer science would be significant.) In addition, there are large government budgets in the seven or eight figure range available for qualifying research projects. I've come to believe that it would not be difficult to find or create AI research programs that are relevant to long term AI safety while also being practical and likely to be funded by skeptical policymakers and officials.
I also realized that there is a significant need for people who are interested in long term AI safety to have basic social and business skills. Since there is so much need for persuasion and compromise in government policy, there is a lot of value to be had in being communicative, engaging, approachable, appealing, socially savvy, and well-dressed. This is not to say that everyone involved in long term AI safety is missing those skills, of course.
I was surprised by the refusal of almost everyone at the conference to take long term AI safety seriously, as I had previously held the belief that it was more of a mixed debate given the existence of expert computer scientists who were involved in the issue. I sensed that the recent wave of popular press and public interest in dangerous AI has made researchers and policymakers substantially less likely to take the issue seriously. None of them seemed to be familiar with actual arguments or research on the control problem, so their opinions didn't significantly change my outlook on the technical issues. I strongly suspect that the majority of them had their first or possibly only exposure to the idea of the control problem after seeing badly written op-eds and news editorials featuring comments from the likes of Elon Musk and Stephen Hawking, which would naturally make them strongly predisposed to not take the issue seriously. In the run-up to the conference, websites and press releases didn't say anything about whether this conference would be about long or short term AI safety, and they didn't make any reference to the idea of superintelligence.
I sympathize with the concerns and strategy given by people such as Andrew Moore and Andrew Grotto, which make perfect sense if (and only if) you assume that worries about long term AI safety are completely unfounded. For the community that is interested in long term AI safety, I would recommend that we avoid competitive dynamics by (a) demonstrating that we are equally strong opponents of bad press, inaccurate news, and irrational public opinion which promotes generic uninformed fears over AI, (b) explaining that we are not interested in removing funding for AI research (even if you think that slowing down AI development is a good thing, restricting funding yields only limited benefits in terms of changing overall timelines, whereas those who are not concerned about long term AI safety would see a restriction of funding as a direct threat to their interests and projects, so it makes sense to cooperate here in exchange for other concessions), and (c) showing that we are scientifically literate and focused on the technical concerns. I do not believe that there is necessarily a need for the two "sides" on this to be competing against each other, so it was disappointing to see an implication of opposition at the conference.
Anyway, Ed Felten announced a request for information from the general public, seeking popular and scientific input on the government's policies and attitudes towards AI: https://www.whitehouse.gov/webform/rfi-preparing-future-artificial-intelligence
Overall, I learned quite a bit and benefited from the experience, and I hope the insight I've gained can be used to improve the attitudes and approaches of the long term AI safety community.
Tackling the subagent problem: preliminary analysis
A putative new idea for AI control; index here.
Status: preliminary. This mainly to put down some of the ideas I've had, for later improvement or abandonment.
The subagent problem, in a nutshell, is that "create a powerful subagent with goal U that takes over the local universe" is a solution for many of the goals an AI could have - in a sense, the ultimate convergent instrumental goal. And it tends to evade many clever restrictions people try to program into the AI (eg "make use of only X amount of negentropy", "don't move out of this space").
So if the problem could be solved, many other control approaches could be potentially available.
The problem is very hard, because an imperfect definition of a subagent is simply an excuse to create an a subagent that skirts the limits of that definition (hum, that style of problem sounds familiar). For instance, if we want to rule out subagents by preventing the AI from having much influence if the AI itself were to stop ("If you die, you fail, no other can continue your quest"), then it is motivated to create powerful subagents that carefully reverse their previous influence if the AI were to be destroyed.
Controlling subagents
Some of the methods I've developed seem suitable for controlling the existence or impact of subagents.
- Reduced impact methods can prevent subagents from being created, by requiring that the AI's interventions be non-disruptive ("Twenty million questions") or undetectable.
- Reducing the AI's output options to a specific set can prevent them from being able to create any in the first place.
- Various methods around detecting importance can be used to ensure that, though subagents may exist, they won't be very influential.
- Pre-corriged methods can be used to ensure that any subagents remain value aligned with the original agent. Then, if there is some well-defined "die" goal for the agent, this could take all the agents with them.
These can be thought as ruling out the agent's existence, their creation, their influence (or importance) and their independence. The last two can be particularly tricky, as we want to make sure that our formal definition of importance matches up with our informal one, and we currently lack a well defined "die" goal.
We could also think of defining identity by using some of the tricks and restrictions that have caused humans to develop one (such as our existing in a single body with no east of copying), but it's not clear that this definition would remain stable once the restrictions were lifted (and it's not clear that a sense of identity prevents the creation of subagents in the first place).
Subagents processing information
Here I want to look at one other aspect of the subagents, the fact that they are subagents, and, as such, do some of the stuff that agents do - such as processing information and making decisions. Can we use the information processing as a definition?
Consider the following model. Our lovely Clippy wants to own a paperclip. They know that it exists behind one of a hundred doors; opening one of them seals all the others, for ever. In a few minutes, Clippy will be put to sleep, but it has a simple robot that it can program to go and open one of the doors and recuperate the paperclip for it.
Clippy currently doesn't know where the paperclip is, but it knows that its location will be announced a few seconds after Clippy sleeps. The robot includes a sound recording system inside it.
It seems there are two clear levels of agency the robot could have: either it goes to a random door, or it processes the announcement, to pick the correct door. In the first case, the robot and Clippy have a 1% chance of getting the paperclip; in the second, a 100% chance. The distributions of outcomes is clearly different.
But now suppose there is a human guard longing around, trying to prevent the robot from getting to any door. If the robot has a neutron bomb, it can just kill the human without looking. If it has a giant cannon, it can simply shoot the guard where they stand. If it has neither of these, then it can attempt to bargain with the guard, negotiating, seducing and tricking them in the course of a conversation. One the guard is removed, the situation reverts to the above.
Each of the three methods requires different level of information processing (different levels of optimisation power, in essence). We can add more guards, make them more or less competent, add other agents with other goals, add more ways of achieving the paperclip, and so on, to grade how much information processing the robot has.
To calibrate such a scale, we could use the upper bound as "Clippy moves into the robot and controls it fully" to represent the robot having maximal information processing power (and a very focused outcome space). And we could use some pre-prepared actions (such as the robot randomly choosing a door) to calibrate the lower end. The aim is to construct a definition of information processing that could be used to define the existence of (effective) subagents.
Still feels likely that this will fail, though, without something more.
Ask and ye shall be answered
A putative new idea for AI control; index here.
EDIT: the mathematics of this idea can be found here.
It would be useful to have a mechanism for getting an AI to provide an honest answer to a specific question. This post presents such a mechanism.
The question is: what is the impact of X/¬X on the expected utility of a utility function v? Here X/¬X is some well-defined binary outcome. Formally, for a constant c (positive or negative), we want to query the AI as to whether Q(v,X,c) is true, where
Q(v,X,c) = {E(v|X) - E(v|¬X) > c}.
How can we do that? Well, for a start, let M(u) be a u-maximising agent. We'll assume we can feed M() any utility function to maximise. We'll further assume that M() is boxed, that it has a single output bit Y (or ¬Y), and that we have cut any acausal trade possibilities. We'll add another binary process Z, which, if Z=1, erases the bit from Y before we can see it. In actual fact, P(Z=0)≈1.
The remaining assumptions are on the operations we can perform on v (without needing any further understanding of v). We replace v by v#, defined in the following way. First of all, we make v# indifferent to worlds in which X≠Y or Z=0 (note: there are subtleties to defining indifference to events that are partially under the AI's control; I will address these in a later post). Therefore any v# maximiser will behave as if it lived in a world where X=Y (its output and the outcome of X were the same) and Z=1 (its output was erased and never read). Then we add a special, one-off penalty of -c for outputting Y=1 (there is no effect for outputting Y=0). This defines v#, and we give our boxed AI v# to maximise.
So, how will M(v#) behave? Because it acts as if X=Y with certainty, it behaves as if it can determine the value of X by outputting Y. Because it acts as if Z=0 with certainty, it behaves as if the output of Y is never read, so it can determine the value of X without affecting anything else about the universe. It will now behave as if it was a v maximiser with control over X. So if it outputs ¬Y, it expects to get utility E(v|¬X). If it outputs Y, it expects to get utility E(v|X)-c (because of the -c penalty).
Therefore (ignoring ties) M(v#) will output Y iff it estimates Q(v,X,c) to be true. And, since in actual fact Z=0 almost certainly, we get to see its output, hence its conclusion. Voila!
Note: the advantage of this method is to allow us to get the truth out of an AI, without knowing how to define "truth".
Agents detecting agents: counterfactual versus influence
A putative new idea for AI control; index here. Status: still somewhat underdeveloped.
In this post, I want to explore another avenue for controlling the ongoing influence of an AI (through subagents or other means). It is the idea, touched upon here and here, of figuring out whether the AI's existence was important or not for predicting the future. Informally, if you knew the AI was turned on or not, would you consider that information valuable or almost irrelevant?
Agent vs agent
This post rests on two insights. The first is that, though it is hard to define what an agent or subagent is, we can produce examples of them. Or at least, at the point where we are capable of producing AIs that might be dangerous, we can likely produce examples of them. This can give us a scale for comparing the importance or impact of different agents.
The second is that if we produce an agent with certain motivations, it will likely expand and fill its light cone with... stuff (of some type or other). Those who don't agree with the AI risk thesis argument may still accept that we could program universe-optimising AIs (ie "we could make a paperclipper, if we were stupid enough to do so").
Why is this relevant? Well, consider an agent E (expansionist) that fills its light cone with stuff (likely including manufacturing and tracking devices), eventually at very high speed. And imagine that there is another agent H (hidden) inside that light cone. At some point, agent E will either detect agent H (and make plans to deal with it), or find its plans hampered by agent H, or agent H will go into deep hiding that does not visibly affect agent E's plans.
Therefore, an expansionist agent is a perfect tool for detecting the influence of other agents. Either agent E will have difficulty accomplishing its plans, or agent H's influence will be so limited that agent E cannot find any trace of it. Any influential agent will get detected, sooner or later, by an expansionist agent. So if E finds H unimportant, then H almost certainly is.
Of course, we don't really want expansionist agents like E - but these can be counterfactual.
The stage and the players
We'll now try and define precisely how we can use this idea to control the influence of an agent, setting out the axioms and assumptions needed along the way. Like some mathematics paper, it might be useful to skip lightly over this section and return later.
Let S(u) be the AI whose influence we want to control. The utility u is what gives it it's preferences, and S() indicates that we wish it to be a satisficer or reduced impact AI. Let M(v) be the counterfactual agent that we wish to use to detect S(u)'s influence, for some utility v. The M() designates the agent as a maximiser, without any control on its impact. If we can define S(), we can probably define M() by simply removing its reduced impact restrictions.
Let X be an event leading to S(u) being turned on, Y(v) an event leading to M(v) being turned on. We may as well assume that X and Y(v) are outside each other's light cones (in practice, we will set X≈1 and Y(v)≈0). Then the first two assumption is:
- A.1. We can define S(u), M(v), X and Y(v).
Another useful tool might be a weak maximising agent. One that attempts to maximise u, and will do so to some extent, but that can easily be thwarted by humans or others. It's likely that by the time we can build S(u), we can build a weak maximiser, designated by WM(v), with the event WY(v) denoting it's turning on:
- A.2. We can define WM(v) and WY(v).
We might also be able to define an abstract, Platonic superagent. This is an idealised agent with immense computing power (but that doesn't exist in the real world). Call this AM; obviously, there is no event leading to AM being 'turned on'. We would want AM's goals to be defined in terms of 'estimating the utility/probability/expectation of certain events'.
- A.3. We can define AM to estimate what we need to have estimated.
Of course, we need to be able to point to these various definitions from within S(u)'s code, so:
- A.4. If we can define concepts, we can also define then within the various agents.
We also need agents to be able to say sensible things about the behaviour of the other agents:
- A.5. If these agents are defined, S(u), WM(v), M(v) and AM can estimate each other's behaviours.
This could be done by using, for example, the reflexive probability of Paul Christiano and Benja Fallenstein.
Now, how about v? We cannot choose any v. We want M(v) to not think that X (hence S(u)'s existence) is an important fact to know. But if v is define by "if X, build staples; if ¬X, build paperclips", then X will be a very important! So we need to define a class V of utility functions that are indifferent to S(u) and X. Or not quite indifferent, but don't care intrinsically about them - they certainly should care if S(u) eats all their resources! See the ideas here for approaches to do this.
- A.6. We can define V, a class of utility functions that do not care intrinsically about S(u) and X.
Ideally we'd want this class to be as broad as possible - certainly too large for anything in this universe to parse element by element.
Now let's turn to the issue of measuring "importance". Define e to be evidence about the universe prior to X, Y(v), or WY(v) .We want X (hence the creation of S(u)) to be unimportant, ie of little information concerning the future. It might possible to have a universal definition of importance of certain facts:
- A.7. We have a universal measure I of importance, which, given background evidence e, can rank the importance of knowing certain facts versus knowing others.
If we do not have such an I (and I feel it's moderately unlikely that one exists, though we may be able to find an I' which would be good enough to work for our purposes), we can use a utility v to estimate the importance of an event W. The trick is to compare E(v|W) with E(v|¬W), which would give an (uncalibrated) estimation of the importance of W to v.
Right, we're now ready to start doing things.
Predicting... what?
What are we trying to predict here? Well, estimates of the importance of certain events. Specifically, the importance of X (which leads to the creation of S(u)) versus other events. For instance, if Y(v) designates the creation of M(v), we would want to compare the importance of of X and Y(v) in (at least) four ways:
- I(Y(v) | X∧e)
- I(Y(v) | ¬X∧e)
- I(X | Y(v)∧e)
- I(X | ¬Y(v)∧e)
These measure the importance of Y(v) (hence M(v)) and X (hence S(u)) in universes where the other agent does or does not exist. If we expect S(u) to be of low impact, we would want things like I(Y(v) | X∧e) >> I(X | Y(v)∧e) (S(u) is much less important than the maximiser M(v)) and |I(X | Y(v)∧e) - I(X | ¬Y(v)∧e)| ≈ 0 (the importance of S(u) is unrelated to whether M(v) exists or not).
Now, any particular v might be misleading, so we would want to check that this result held up for all v ∈ V. We could take a maximum over V, or a softmax.
If WM(v) can be defined, we would want to do the same estimates with WY(v) instead of Y(v), and desire I(WY(v) | X∧e) >> I(X | WY(v)∧e) (S(u) is less important than the weak maximiser WM(v)) and |I(X | WY(v)∧e) - I(X | ¬WY(v)∧e)| ≈ 0 (the importance of S(u) is unrelated to whether WM(v) exists or not), again maxing or softmaxing across V.
Who predicts, how?
In the previous section, we used I to compute the importance of S(u) versus other agents. But what if axiom A.7 is wrong, and we don't have such an I? Well, there we can measure the importance of these events, given a particular v, using AM, writing AMv(X|Y(v')∧e) for |EAM(v|X∧Y(v')∧e) - EAM(v|¬X∧Y(v')∧e)|, the expected difference (that AM estimates) in v given X versus ¬X (ie dependent on whether S(u) was turned on or not). More precisely, we'll want to use S(u)(AMv(X|Y(v')∧e)) - the expected difference (that S(u) estimates AM estimates) in v etc...
Here we'll want to do a double max/softmax. First over v', so that we consider all M(v') for v' ∈ V. Then over v, so that AM estimates the expected difference to all possible v. In this way, max/softmax AMv functions like an I. But not a perfect one: just one that detects disruptions to the multitude of v's in V. Therefore, we are essentially using the set V as the twenty million question method for the impact of S(u). We need to take extra care with this, to ensure we haven't left any gaping holes where S(u) can still have large impact.
Now, if axiom A.3 was false, we need to use M(v) instead of AMv to estimate this quantity. We need to do this with some care. M(v) is (counterfactually) a real agent. Why would it want to calculate the importance of M(v') and S(u), and why would S(u) care about this? Well, we can do so by constructing the scenario carefully. Maybe M(v') and S(u) are due to be created in the future, and M(v) is a boxed agent that can choose to prevent Y(v') and X. We modify v to v#, so that M(v#) wants to maximise v, but takes specific numerical penalties if it chooses to prevent Y(v') or X. By modifying these numerical penalties, M(v#) is motivated to figure out the importance of M(v') and S(u), and to reveal its opinions through its actions. If we create a S(u#), also boxed, that could override M(v#) at a certain cost and create a specific S(u), we can motivate it to model M(v#)'s actions. From this we can construct an importance metric using the M(v).
The WM(v), on the other hand, are probably too limited to be used effectively as predictors.
Arbitrary agents
One issue with this method is that Y(v) and WY(v) refer to the creation of specific agents, in the specific ways we've setup. For instance, |I(X | Y(v)∧e) - I(X | ¬Y(v)∧e)| ≈ 0 means that the importance of X doesn't vary much given Y(v) or its opposite. What this really means is that I(X | ¬Y(v)∧e) is bounded by I(X | Y(v)∧e) - but that might be quite high. There may be details of the setup that allow S(u) to have an important role given the specific M(v) created in this way - for instance, if it reached out to try and destroy it. Then I(X | Y(v)∧e) could be high, hence I(X | ¬Y(v)∧e) could be as well. The restrictions with >> could still contain S(u), but >> is ill-defined.
More secure would be if Y(v) could be defined as the creation of an agent M(v) in an abstract way. Maybe something like "is 'this object is a v maximiser' a good model of this object's properties and impact?". We would also want to give M(v) some starting resources. Again, note it is very hard to abstractly define what "resources" mean, but considerably easier to define a set of "useful resources" we could give to an agent. We would probably want to bound this definition in time and space as well.
This gives us the next axioms:
- A.8. Y(v) defines the generic existence of an agent M(v).
- A.9. WY(v) defines the generic existence of an agent WM(v).
With these axioms, our measure of importance becomes more general: S(u) is less important that abstract agents M(v) or WM(v), rather than certain specific ones. Note that we don't need to worry that S(u) snuffs out any hypothetical M(v): since the importance of S(u) is bounded below that if M(v), that would reduce S(u) to zero importance.
Extending to humans
What other agents do we know in the world? How about those 7+ billion ones wandering about on its surface? We can certainly define events Z(h) that might lead to the death of a particular human h - of course, we'd want Z(h) as low as feasibly possible!
- A.10. We can define events Z(h) that correspond to the death of a human h.
Then we can use Z(h) instead of Y(v) and WY(v), to ensure that S(u) has importance comparable with a human. Of course, the ideal would be to define, abstractly, the death (or birth+maturation) of a human in a generic way!
- A.11. We can define events Z(h) that correspond generically to the death or birth+maturation of a human h.
To summarise the various ways we can define the importance of S(u) (and hence, by integrating them into S(u)'s code, constrain S(u)'s importance and influence), depending on the axioms we have:

A toy model of the control problem
EDITED based on suggestions for improving the model
Jaan Tallinn has suggested creating a toy model of the control problem, so that it can be analysed without loaded concepts like "autonomy", "consciousness", or "intentionality". Here a simple (too simple?) attempt:
A controls B. B manipulates A.
Let B be a robot agent that moves in a two dimensional world, as follows:

How the virtual AI controls itself
A putative new idea for AI control; index here.
In previous posts, I posited AIs caring only about virtual worlds - in fact, being defined as processes in virtual worlds, similarly to cousin_it's idea. How could this go? We would want the AI to reject offers of outside help - be they ways of modifying its virtual world, or ways of giving it extra resources.
Let V be a virtual world, over which a utility function u is defined. The world accepts a single input string O. Let P be a complete specification of an algorithm, including the virtual machine it is run on, the amount of memory it has access to, and so on.
Fix some threshold T for u (to avoid the the subtle weeds of maximising). Define the statement:
r(P,O,V,T): "P(V) returns O, and either E(u|O)>T or O=∅"
And the string valued program:
Q(V,P,T): "If you can find that there exists a non-empty O such that r(P,O,V,T), return O. Else return ∅."
Here "find" and "E" are where the magic-super-intelligence-stuff happens.
Now, it seems to me that Q(V,Q,T) is the program we are looking for. It is uninterested in offers to modify the virtual world, because E(u|O)>T is defined over the unmodified virtual world. We can set it up so that the first thing it proves is something like "If I (ie Q) prove E(u|O)>T, then r(Q,O,V,T)." If we offer it more computing resources, it can no longer make use of that assumption, because "I" will no longer be Q.
Does this seem like a possible way of phrasing the self-containing requirements? For the moment, this seems to make it reject small offers of extra resources, and be indifferent to large offers.
The virtual AI within its virtual world
A putative new idea for AI control; index here.
In a previous post, I talked about an AI operating only on a virtual world (ideas like this used to be popular, until it was realised the AI might still want to take control of the real world to affect the virtual world; however, with methods like indifference, we can guard against this much better).
I mentioned that the more of the AI's algorithm that existed in the virtual world, the better it was. But why not go the whole way? Some people at MIRI and other places are working on agents modelling themselves within the real world. Why not have the AI model itself as an agent inside the virtual world? We can quine to do this, for example.
Then all the restrictions on the AI - memory capacity, speed, available options - can be specified precisely, within the algorithm itself. It will only have the resources of the virtual world to achieve its goals, and this will be specified within it. We could define a "break" in the virtual world (ie any outside interference that the AI could cause, were it to hack us to affect its virtual world) as something that would penalise the AI's achievements, or simply as something impossible according to its model or beliefs. It would really be a case of "given these clear restrictions, find the best approach you can to achieve these goals in this specific world".
It would be idea if the AI's motives were not given in terms of achieving anything in the virtual world, but in terms of making the decisions that, subject to the given restrictions, were most likely to achieve something if the virtual world were run in its entirety. That way the AI wouldn't care if the virtual world were shut down or anything similar. It should only seek to self modify in way that makes sense within the world, and understand itself existing completely within these limitations.
Of course, this would ideally require flawless implementation of the code; we don't want bugs developing in the virtual world that point to real world effects (unless we're really confident we have properly coded the "care only about the what would happen in the virtual world, not what actually does happen).
Any thoughts on this idea?
AI, cure this fake person's fake cancer!
A putative new idea for AI control; index here.
An idea for how an we might successfully get useful work out of a powerful AI.
The ultimate box
Assume that we have an extremely detailed model of a sealed room, with a human in it and enough food, drink, air, entertainment, energy, etc... for the human to survive for a month. We have some medical equipment in the room - maybe a programmable set of surgical tools, some equipment for mixing chemicals, a loud-speaker for communication, and anything else we think might be necessary. All these objects are specified within the model.
We also have some defined input channels into this abstract room, and output channels from this room.
The AI's preferences will be defined entirely with respect to what happens in this abstract room. In a sense, this is the ultimate AI box: instead of taking a physical box and attempting to cut it out from the rest of the universe via hardware or motivational restrictions, we define an abstract box where there is no "rest of the universe" at all.
Cure cancer! Now! And again!
What can we do with such a setup? Well, one thing we could do is to define the human in such a way that they have some from of advanced cancer. We define what "alive and not having cancer" counts as, as well as we can (the definition need not be fully rigorous). Then the AI is motivated to output some series of commands to the abstract room that results in the abstract human inside not having cancer. And, as a secondary part of its goal, it outputs the results of its process.
Against the internal locus of control
What do you think about these pairs of statements?
- People's misfortunes result from the mistakes they make
- Many of the unhappy things in people's lives are partly due to bad luck
- In the long run, people get the respect they deserve in this world.
- Unfortunately, an individual's worth often passes unrecognized no matter how hard he tries.
- Becoming a success is a matter of hard work; luck has little or nothing to do with it.
- Getting a good job mainly depends on being in the right place at the right time.
They have a similar theme: the first statement suggests that an outcome (misfortune, respect, or a good job) for a person are the result of their own action or volition. The second assigns the outcome to some external factor like bad luck.(1)
People who tend to think their own attitudes or efforts can control what happens to them are said to have an internal locus of control, those who don't, an external locus of control. (Call them 'internals' and 'externals' for short).
Internals seem to do better at life, pace obvious confounding: maybe instead of internals doing better by virtue of their internal locus of control, being successful inclines you to attribute success internal factors and so become more internal, and vice versa if you fail.(2) If you don't think the relationship is wholly confounded, then there is some prudential benefit for becoming more internal.
Yet internal versus external is not just a matter of taste, but a factual claim about the world. Do people, in general, get what their actions deserve, or is it generally thanks to matters outside their control?
Why the external view is right
Here are some reasons in favour of an external view:(3)
- Global income inequality is marked (e.g. someone in the bottom 10% of the US population by income is still richer than two thirds of the population - more here). The main predictor of your income is country of birth, it is thought to explain around 60% of the variance: not only more important than any other factor, but more important than all other factors put together.
- Of course, the 'remaining' 40% might not be solely internal factors either. Another external factor we could put in would be parental class. Include that, and the two factors explain 80% of variance in income.
- Even conditional on being born in the right country (and to the right class), success may still not be a matter of personal volition. One robust predictor of success (grades in school, job performance, income, and so on) is IQ. The precise determinants of IQ remain controversial, it is known to be highly heritable, and the 'non-genetic' factors of IQ proposed (early childhood environment, intra-uterine environment, etc.) are similarly outside one's locus of control.
On cursory examination the contours of how our lives are turned out are set by factors outside our control, merely by where we are born and who our parents are. Even after this we know various predictors, similarly outside (or mostly outside) of our control, that exert their effects on how our lives turn out: IQ is one, but we could throw in personality traits, mental health, height, attractiveness, etc.
So the answer to 'What determined how I turned out, compared to everyone else on the planet?', the answer surely has to by primarily about external factors, and our internal drive or will is relegated a long way down the list. Even if we want to look at narrower questions, like "What has made me turn out the way I am, versus all the other people who were likewise born in rich countries in comfortable circumstances?" It is still unclear whether the locus of control resides within our will: perhaps a combination of our IQ, height, gender, race, risk of mental illness and so on will still do the bulk of the explanatory work.(4)
Bringing the true and the prudentially rational together again
If it is the case that folks with an internal locus of control succeed more, yet also the external view being generally closer to the truth of the matter, this is unfortunate. What is true and what is prudentially rational seem to be diverging, such that it might be in your interests not to know about the evidence in support of an external locus of control view, as deluding yourself about an internal locus of control view would lead to your greater success.
Yet it is generally better not to believe falsehoods. Further, the internal view may have some costs. One possibility is fueling a just world fallacy: if one thinks that outcomes are generally internally controlled, then a corollary is when bad things happen to someone or they fail at something, it was primarily their fault rather than them being a victim of circumstance.
So what next? Perhaps the right view is to say that: although most important things are outside our control, not everything is. Insofar as we do the best with what things we can control, we make our lives go better. And the scope of internal factors - albeit conditional on being a rich westerner etc. - may be quite large: it might determine whether you get through medical school, publish a paper, or put in enough work to do justice to your talents. All are worth doing.

Acknowledgements
Inspired by Amanda MacAskill's remarks, and in partial response of Peter McIntyre. Neither are responsible for what I've written, and the former's agreement or the latter's disagreement with this post shouldn't be assumed.
1) Some ground-clearing: free will can begin to loom large here - after all, maybe my actions are just a result of my brain's particular physical state, and my brain's particular physical state at t depends on it's state at t-1, and so on and so forth all the way to the big bang. If so, there is no 'internal willer' for my internal locus of control to reside.
However, even if that is so, we can parse things in a compatibilist way: 'internal' factors are those which my choices can affect; external factors are those which my choices cannot affect. "Time spent training" is an internal factor as to how fast I can run, as (borrowing Hume), if I wanted to spend more time training, I could spend more time training, and vice versa. In contrast, "Hemiparesis secondary to birth injury" is an external factor, as I had no control over whether it happened to me, and no means of reversing it now. So the first set of answers imply support for the results of our choices being more important; whilst the second set assign more weight to things 'outside our control'.
2) In fairness, there's a pretty good story as to why there should be 'forward action': in the cases where outcome is a mix of 'luck' factors (which are a given to anyone), and 'volitional ones' (which are malleable), people inclined to think the internal ones matter a lot will work hard at them, and so will do better when this is mixed in with the external determinants.
3) This ignores edge cases where we can clearly see the external factors dominate - e.g. getting childhood leukaemia, getting struck by lightning etc. - I guess sensible proponents of an internal locus of control would say that there will be cases like this, but for most people, in most cases, their destiny is in their hands. Hence I focus on population level factors.
4) Ironically, one may wonder to what extent having an internal versus external view is itself an external factor.
Crude measures
A putative new idea for AI control; index here.
Partially inspired by as conversation with Daniel Dewey.
People often come up with a single great idea for AI, like "complexity" or "respect", that will supposedly solve the whole control problem in one swoop. Once you've done it a few times, it's generally trivially easy to start taking these ideas apart (first step: find a bad situation with high complexity/respect and a good situation with lower complexity/respect, make the bad very bad, and challenge on that). The general responses to these kinds of idea are listed here.
However, it seems to me that rather than constructing counterexamples each time, we should have a general category and slot these ideas into them. And not only have a general category with "why this can't work" attached to it, but "these are methods that can make it work better". Seeing the things needed to make their idea better can make people understand the problems, where simple counter-arguments cannot. And, possibly, if we improve the methods, one of these simple ideas may end up being implementable.
Crude measures
The category I'm proposing to define is that of "crude measures". Crude measures are methods that attempt to rely on non-fully-specified features of the world to ensure that an underdefined or underpowered solution does manage to solve the problem.
To illustrate, consider the problem of building an atomic bomb. The scientists that did it had a very detailed model of how nuclear physics worked, the properties of the various elements, and what would happen under certain circumstances. They ended up producing an atomic bomb.
The politicians who started the project knew none of that. They shovelled resources, money and administrators at scientists, and got the result they wanted - the Bomb - without ever understanding what really happened. Note that the politicians were successful, but it was a success that could only have been achieved at one particular point in history. Had they done exactly the same thing twenty years before, they would not have succeeded. Similarly, Nazi Germany tried a roughly similar approach to what the US did (on a smaller scale) and it went nowhere.
So I would define "shovel resources at atomic scientists to get a nuclear weapon" as a crude measure. It works, but it only works because there are other features of the environment that are making it work. In this case, the scientists themselves. However, certain social and human features about those scientists (which politicians are good at estimating) made it likely to work - or at least more likely to work than shovelling resources at peanut-farmers to build moon rockets.
In the case of AI, advocating for complexity is similarly a crude measure. If it works, it will work because of very contingent features about the environment, the AI design, the setup of the world etc..., not because "complexity" is intrinsically a solution to the FAI problem. And though we are confident that human politicians have some good enough idea about human motivations and culture that the Manhattan project had at least some chance of working... we don't have confidence that those suggesting crude measures for AI control have a good enough idea to make their idea works.
It should be evident that "crudeness" is on a sliding scale; I'd like to reserve the term for proposed solutions to the full FAI problem that do not in any way solve the deep questions about FAI.
More or less crude
The next question is, if we have a crude measure, how can we judge its chance of success? Or, if we can't even do that, can we at least improve the chances of it working?
The main problem is, of course, that of optimising. Either optimising in the sense of maximising the measure (maximum complexity!) or of choosing the measure that is most extreme fit to the definition (maximally narrow definition of complexity!). It seems we might be able to do something about this.
Let's start by having AI create sample a large class of utility functions. Require them to be around the same expected complexity as human values. Then we use our crude measure μ - for argument's sake, let's make it something like "approval by simulated (or hypothetical) humans, on a numerical scale". This is certainly a crude measure.
We can then rank all the utility functions u, using μ to measure the value of "create M(u), a u-maximising AI, with this utility function". Then, to avoid the problems with optimisation, we could select a certain threshold value and pick any u such that E(μ|M(u)) is just above the threshold.
How to pick this threshold? Well, we might have some principled arguments ("this is about as good a future as we'd expect, and this is about as good as we expect that these simulated humans would judge it, honestly, without being hacked").
One thing we might want to do is have multiple μ, and select things that score reasonably (but not excessively) on all of them. This is related to my idea that the best Turing test is one that the computer has not been trained or optimised on. Ideally, you'd want there to be some category of utilities "be genuinely friendly" that score higher than you'd expect on many diverse human-related μ (it may be better to randomly sample rather than fitting to precise criteria).
You could see this as saying that "programming an AI to preserve human happiness is insanely dangerous, but if you find an AI programmed to satisfice human preferences, and that other AI also happens to preserve human happiness (without knowing it would be tested on this preservation), then... it might be safer".
There are a few other thoughts we might have for trying to pick a safer u:
- Properties of utilities under trade (are human-friendly functions more or less likely to be tradable with each other and with other utilities)?
- If we change the definition of "human", this should have effects that seem reasonable for the change. Or some sort of "free will" approach: if we change human preferences, we want the outcome of u to change in ways comparable with that change.
- Maybe also check whether there is a wide enough variety of future outcomes, that don't depend on the AI's choices (but on human choices - ideas from "detecting agents" may be relevant here).
- Changing the observers from hypothetical to real (or making the creation of the AI contingent, or not, on the approval), should not change the expected outcome of u much.
- Making sure that the utility u can be used to successfully model humans (therefore properly reflects the information inside humans).
- Make sure that u is stable to general noise (hence not over-optimised). Stability can be measured as changes in E(μ|M(u)), E(u|M(u)), E(v|M(u)) for generic v, and other means.
- Make sure that u is unstable to "nasty" noise (eg reversing human pain and pleasure).
- All utilities in a certain class - the human-friendly class, hopefully - should score highly under each other (E(u|M(u)) not too far off from E(u|M(v))), while the over-optimised solutions - those scoring highly under some μ - must not score high under the class of human-friendly utilities.
This is just a first stab at it. It does seem to me that we should be able to abstractly characterise the properties we want from a friendly utility function, which, combined with crude measures, might actually allow us to select one without fully defining it. Any thoughts?
And with that, the various results of my AI retreat are available to all.
Intelligence modules
A putative new idea for AI control; index here.
This idea, due to Eric Drexler, is to separate out the different parts of an AI into modules. There would be clearly designated pieces, either physical or algorithmic, with this part playing a specific role: this module would contain the motivation, this module the probability estimator, this module the models of the outside world, this module the natural language understanding unit, etc...
It's obvious how such a decomposition would be useful for many of the methods I've been detailing here. We could also distil each module - reduce it to a smaller, weaker (?) and more understandable submodule, in order to better understand what is going on. In one scenario, an opaque AI gets to design its successor, in the form of a series of such modules.
This property seems desirable; the question is, how could we get it?
EDIT: part of the idea of "modules" is that AIs often need to do calculations or estimations that would be of great value to us if we could access them in isolation. This idea is developed more in these posts.
Designing in modules
The main threat here is that a given submodule would contain more than just the properties we want. After all, a natural language parser could consist of a general intelligence plus a motivation to understand language. Another possible worry is that the modules are overfitted to the problem or to each other: the language parser works perfectly, but only in this one AI design.
There are several ways we could try and combat this.
- Minimum length to the program (Eric's original main idea). The module must be as short as possible for its purpose.
- If possible, start with a modularised AI, and track how the modules change as the AI's intelligence grows.
- Have the AI extend the module to make it more transparent, without adding any extra functionality.
- Check that the input-output map the AI presents (maybe require it so present it as a simple primitive recursive function with clear and simple "for" loops) is of commensurable complexity to the task at hand.
- Add some noise to the situation: the new AI will not get exactly the same data as the old, the modules may get somewhat altered, and still have to work. Bugs will be introduced (this approach helps to combat the overfitting aspect of minimum length description).
- Require that the module be "distant", according to some metric of modification to the module's program, from having other capabilities. For instance, a general intelligence motivated to understand language is easy to make into a model-making module, simply by changing the motivation; more narrow modules are probably more distant from each other.
- Similarly, require that the performance of the modules becomes degraded if the world (or the world's descriptions) were substantially different, but not if they are the same information presented in a very different way.
- General anti-overfitting methods, calibrated to not generate a push towards general abilities in the modules. This would require some skill and a lot of thought.
- Focus on requiring statistical accuracy, rather than perfect performance.
If we allow false counterfactuals, then we can also:
- Imagine that the modules would be connected up to other modules, made by (hypothetical) different AIs, with potentially different motivations.
- Require that the modules not be useful as seeds for an AI of opposite motivation. For instance, an AI with opposite motivation would be allowed to flip a certain number of bits; if there was a general intelligence piece inside the module, then that would be something easier to co-opt, so dangerous to hand to an AI with opposite motivation. This is the "opposed" version of requiring modules to be "distant" from each other's capabilities.
- In general, all the reduced impact/satisficing requirements push towards having modules that cannot become general intelligences.
- The AI will dictate how to program these modules to a smart programmer, who will (counterfactually) not survive nor get to program the modules in the outside world.
Obviously anti-restriction-hacking would be useful to just module separation (and vice versa).
This is the beginning of the process of defining this, but it would be great to have a safe(ish) method of separating modules in this way.
Any suggestions?
AI-created pseudo-deontology
I'm soon going to go on a two day "AI control retreat", when I'll be without internet or family or any contact, just a few books and thinking about AI control. In the meantime, here is one idea I found along the way.
We often prefer leaders to follow deontological rules, because these are harder to manipulate by those whose interests don't align with ours (you could say the similar things about frequentist statistics versus Bayesian ones).
What about if we applied the same idea to AI control? Not giving the AI deontological restrictions, but programming with a similart goal: to prevent a misalignment of values to be disastrous. But who could do this? Well, another AI.
My rough idea goes something like this:
AI A is tasked with maximising utility function u - a utility function which, crucially, it doesn't know yet. Its sole task is to create AI B, which will be given a utility function v and act on it.
What will v be? Well, I was thinking of taking u and adding some noise - nasty noise. By nasty noise I mean v=u+w, not v=max(u,w). In the first case, you could maximise v while sacrificing u completely, it w is suitable. In fact, I was thinking of adding an agent C (which need not actually exist). It would be motivated to maximise -u, and it would have the code of B and the set of u+noise, and would choose v to be the worst possible option (form the perspective of a u-maximiser) in this set.
So agent A, which doesn't know u, is motivated to design B so that it follows its motivation to some extent, but not to extreme amounts - not in ways that might sacrifice some of the values of some sub-part of its utility function, because that might be part of the original u.
Do people feel this idea is implementable/improvable?
AI caught by a module that counterfactually doesn't exist
EDIT: Whenever I use colloquial phrases like "the AI believes a (false) X" I mean that we are using utility indifference to accomplish that goal, without actually giving the AI false beliefs.
A big problem with most AI control ideas is that the AI will route around them. Suppose you try to ensure that the AI stays friendly by - say - requiring it to ensure that the world's stock markets continue to function. Then a badly motivated AI will cause the destruction it was going to cause anyway - except it will also create some system that technically counts as "the world's stock markets" and ensure that it "continues to function".
I think we can use counterfactuals and corrigibility (AI-acceptable motivation changes) to partially get around the problem, in some circumstances. Suppose we had a candidate AI, with utility U, that we thought was friendly - but that we weren't sure. We have a subgoal, say SG="ensuring that the world's stock markets continue to function". We're pretty sure that if the AI is friendly, then SG will happen. We're pretty sure that if the AI were unfriendly, and the AI did not have SG added as an objective, then SG wouldn't happen. I deliberately chose a rather rough subgoal to illustrate this.
So we then add a module M to the AI. The task of M, which it has chosen to accept, is to look into the future, using the AI's predictive abilities, and see if SG is accomplished. Crucially, it will do so under the assumption that the module M does not exist (or wasn't activated, or will soon be overwritten, or will fail to do anything, or so on). If M finds that SG is not accomplished, it will turn the AI off (in a way compatible with the AI's motivations - see corrigibility).
How is this different from the usual setup? Well, a generic FAI would accomplish SG (by assumption - we may need to work a bit on this part). A generic UFAI without M wouldn't accomplish SG. A generic UFAI with M would accomplish SG, as this prevents it from being turned off.
But a generic UFAI with M which thought that M would have no impact, would behave exactly as a generic UFAI without M - and wouldn't aim to accomplish SG, and thus would be turned off by M.
Note that M can only be used as a last test for a putatively friendly AI. We cannot use it if we are continually refining AIs (either through evolution or simply redesign) as then we are optimising for SG, and SG is a poor goal to be aiming for (many, many UFAI have SG as a goal - it's just that a generic one won't). Similarly, we can't use a unconstrained search to find such an AI.
I wonder if this idea can be extended. Suggestions?
An Introduction to Control Markets
Control markets are systems where the control of resources and the system is determined by a market, the currency of that market is given out dependent upon how well the system as a whole is doing.
They have been discussed and explored in computer systems for a while with Agorics, Learning Classifier systems and Eric Baum's work being two notable examples (ZCS being the closest LCS to it). These are limited and constrained markets, in that the communication and computational expressiveness of them are limited. However unlimited control markets may be of use in the real world to control an organization in a flexible fashion. For example it might be useful for shareholders to control a board or possibly the whole company. Even if it isn't compatible with human motivational systems and real world conditions, discussing and thinking about these sorts of systems may enable us to find better organizational structures than our current ones.
Control market can be seen as trying to embed reinforcement learning into an organization.
The challenges of bringing up AIs
At the current AGI-12 conference, some designers have been proponents of keeping AGI's safe by bringing them up in human environments, providing them with interactions and feedback in a similar way to how we bring up human children. Obviously that approach would fail for a fully smart AGI with its own values - it would pretend to follow our values for as long as it needed, and then defect. However, some people have confidence if we started with a limited, dumb AGI, then we could successfully inculcate our values in this way (a more sophisticated position would be that though this method would likely fail, it's more likely to succeed than a top-down friendliness project!).
The major criticism of this approach is that it anthropomorphises the AGI - we have a theory of children's minds, constructed by evolution, culture, and our own child-rearing experience. And then we project this on the alien mind of the AGI, assuming that if the AGI presents behaviours similar to a well-behaved child, then it will become a moral AGI. The problem is that we don't know how alien the AGI's mind will be, and if our reinforcement is actually reinforcing the right thing. Specifically, we need to be able to find some way of distinguishing between:
- An AGI being trained to be friendly.
- An AGI being trained to lie and conceal.
- An AGI that will behave completely differently once out of the training/testing/trust-building environment.
- An AGI that forms the wrong categories and generalisations (what counts as "human" or "suffering", for instance), because it lacks human-shared implicit knowledge that was "too obvious" for us to even think of training it on.
Brain storm: What is the theory behind a good political mechanism?
Patrissimo argue that we should try to design good mechanisms for governance rather than try and use the current broken mechanisms.
I agree, however we don't have a theoretical framework that we can use to evaluate different systems that are proposed. Ideally we would be able to crunch some numbers and show that a Futarchy responds to the desires/needs of the populace better than "voting for politicians who then make decisions" or anything else we come up with.
= 783df68a0f980790206b9ea87794c5b6)
Subscribe to RSS Feed
= f037147d6e6c911a85753b9abdedda8d)