## New(ish) AI control ideas

24 05 March 2015 05:03PM

EDIT: this post is no longer being maintained; it has been replaced by this new one.

I recently went on a two day intense solitary "AI control retreat", with the aim of generating new ideas for making safe AI. The "retreat" format wasn't really a success ("focused uninterrupted thought" was the main gain, not "two days of solitude" - it would have been more effective in three hour sessions), but I did manage to generate a lot of new ideas. These ideas will now go before the baying bloodthirsty audience (that's you, folks) to test them for viability.

A central thread running through could be: if you want something, you have to define it, then code it, rather than assuming you can get it for free through some other approach.

To provide inspiration and direction to my thought process, I first listed all the easy responses that we generally give to most proposals for AI control. If someone comes up with a new/old brilliant idea for AI control, it can normally be dismissed by appealing to one of these responses:

1. The AI is much smarter than us.
2. It’s not well defined.
3. The setup can be hacked.
   • By the agent.
   • By outsiders, including other AI.
   • Adding restrictions encourages the AI to hack them, not obey them.
4. The agent will resist changes.
5. Humans can be manipulated, hacked, or seduced.
6. The design is not stable.
   • Under self-modification.
   • Under subagent creation.
7. Unrestricted search is dangerous.
8. The agent has, or will develop, dangerous goals.

## Logics for Mind-Building Should Have Computational Meaning

21 [deleted] 25 September 2014 09:17PM

The Workshop

Late in July I organized and held MIRIx Tel-Aviv with the goal of investigating the currently-open (to my knowledge) Friendly AI problem called "logical probability": the issue of assigning probabilities to formulas in a first-order proof system, in order to use the reflective consistency of the probability predicate to get past the Loebian Obstacle to building a self-modifying reasoning agent that will trust itself and its successors.  Vadim Kosoy, Benjamin and Joshua Fox, and I met at the Tel-Aviv Makers' Insurgence for six hours, and each of us presented our ideas.  I spent most of it sneezing due to my allergies to TAMI's resident cats.

My idea was to go with the proof-theoretic semantics of logic and attack computational construction of logical probability via the Curry-Howard Isomorphism between programs and proofs: this yields a rather direct translation between computational constructions of logical probability and the learning/construction of an optimal function from sensory inputs to actions required by Updateless Decision Theory.

The best I can give as a mathematical result is as follows:

$P(\Gamma \vdash a:A \mid \Gamma \vdash b:B) = P(\Gamma,x:B \vdash [\forall y:B, x/y]a:A)$

$P(\Gamma \vdash (a, b): A \wedge B) = P(\Gamma \vdash a:A \mid \Gamma \vdash b:B) * P(\Gamma \vdash b:B)$

$\frac{x:A \notin \Gamma}{P(\Gamma \vdash x:A) = \mathcal{M}_{\lambda\mu} (A)}$

$\frac{x:A \in \Gamma}{P(\Gamma \vdash x:A) = 1.0}$

The capital $\Gamma$ is a set of hypotheses/axioms/assumptions, and the English letters are metasyntactic variables (like "foo" and "bar" in programming lessons).  The lower-case letters denote proofs/programs, and the upper-case letters denote propositions/types.  The turnstile $\vdash$ just means "deduces": the judgement $\Gamma \vdash a:A$ can be read here as "an agent whose set of beliefs is denoted $\Gamma$ will believe that the evidence a proves the proposition A."  The $[\forall y:B, x/y]a$ performs a "reversed" substitution, with the result reading: "for all y proving/of-type B, substitute x for y in a".  This means that we algorithmically build a new proof/construction/program from a in which any and all constructions proving the proposition B are replaced with the logically-equivalent hypothesis x, which we have added to our hypothesis-set $\Gamma$.

Thus the first equation reads, "the probability of a proving A conditioned on b proving B equals the probability of a proving A when we assume the truth of B as a hypothesis."  The second equation then uses this definition of conditional probability to give the normal Product Rule of probabilities for the logical product (the $\wedge$ operator), defined proof-theoretically.  I strongly believe I could give a similar equation for the normal Sum Rule of probabilities for the logical sum (the $\vee$ operator) if I could only access the relevant paywalled paper, which gives the λμ-calculus as an algorithmic interpretation of the natural-deduction system for classical (rather than intuitionistic) propositional logic.

The third item given there is an inference rule, which reads, "if x is a free variable/hypothesis imputed to have type/prove proposition A, not bound in the hypothesis-set $\Gamma$, then the probability with which we believe x proves A is given by the Solomonoff Measure of type A in the λμ-calculus".  We can define that measure simply as the summed Solomonoff Measure of every program/proof possessing the relevant type, and I don't think going into the details of its construction here would be particularly productive.  Free variables in λ-calculus are isomorphic to unproven hypotheses in natural deduction, and so a probabilistic proof system could learn how much to believe in some free-standing hypothesis via Bayesian evidence rather than algorithmic proof.
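To make "the summed measure of every program possessing the relevant type" concrete, here is a minimal toy sketch in Python. It is not the λμ-calculus construction: the four bit-string "programs" and the types assigned to them are hypothetical, chosen only so that the program set is prefix-free and the weights $2^{-\text{length}}$ sum to one.

```python
from fractions import Fraction

# Hypothetical toy "language": a prefix-free set of bit-string programs,
# each paired with an assumed type. None of this comes from the post.
PROGRAM_TYPES = {"0": "A", "10": "B", "110": "A", "111": "C"}

def type_measure(target_type):
    """Sum the weight 2^-len(p) of every toy program p having the target type."""
    total = Fraction(0)
    for program, program_type in PROGRAM_TYPES.items():
        if program_type == target_type:
            total += Fraction(1, 2 ** len(program))
    return total

measure_of_A = type_measure("A")  # weights 1/2 and 1/8 from programs "0" and "110"
```

In a real setting the sum ranges over infinitely many programs and can only be approximated from below, which is where the semicomputability discussed later comes from.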

The final item given here is trivial: anything assumed has probability 1.0, that of a logical tautology.
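As a sanity check on how the rules fit together, here is a small worked instance. It assumes a hypothesis-set $\Gamma = \{x:A, y:B\}$ with A and B distinct propositions, and it assumes the reversed substitution leaves x untouched because x is not a construction of type B; both are my reading of the rules, not something stated above.

$P(\Gamma \vdash (x, y): A \wedge B) = P(\Gamma \vdash x:A \mid \Gamma \vdash y:B) * P(\Gamma \vdash y:B)$

$P(\Gamma \vdash y:B) = 1.0$

$P(\Gamma \vdash x:A \mid \Gamma \vdash y:B) = P(\Gamma, z:B \vdash [\forall w:B, z/w]x:A) = P(\Gamma, z:B \vdash x:A) = 1.0$

So the pair of two assumed hypotheses proves the conjunction with probability 1.0, as one would hope.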

The upside to invoking the strange, alien λμ-calculus instead of the more normal, friendly λ-calculus is that we thus reason inside classical logic rather than intuitionistic, which means we can use the classical axioms of probability rather than intuitionistic Bayesianism.  We need classical logic here: if we switch to intuitionistic logics (Heyting algebras rather than Boolean algebras) we do get to make computational decidability a first-class citizen of our logic, but the cost is that we can then believe only computationally provable propositions. As Benjamin Fox pointed out to me at the workshop, Loeb's Theorem then becomes a triviality, with real self-trust rendered no easier.

The Apologia

My motivation and core idea for all this was very simple: I am a devout computational trinitarian, believing that logic must be set on foundations which describe reasoning, truth, and evidence in a non-mystical, non-Platonic way.  The study of first-order logic and especially of incompleteness results in metamathematics, from Goedel on up to Chaitin, aggravates me in its relentless Platonism, and especially in the way Platonic mysticism about logical incompleteness so often leads to the belief that minds are mystical.  (It aggravates other people, too!)

The slight problem which I ran into is that there's a shit-ton I don't know about logic.  I am now working to remedy this grievous hole in my previous education.  Also, this problem is really deep, actually.

I thus apologize for ending the rigorous portion of this write-up here.  Everyone expecting proper rigor, you may now pack up and go home, if you were ever paying attention at all.  Ritual seppuku will duly be committed, followed by hors d'oeuvre.  My corpse will be duly recycled to make paper-clips, in the proper fashion of a failed LessWrongian.

The Parts I'm Not Very Sure About

With any luck, that previous paragraph got rid of all the serious people.

I do, however, still think that the (beautiful) equivalence between computation and logic can yield some insights here.  After all, the whole reason for the strange incompleteness results in first-order logic (shown by Boolos in his textbook, I'm told) is that first-order logic, as a reasoning system, contains sufficient computational machinery to encode a Universal Turing Machine.  The bidirectionality of this reduction (Hilbert and Gentzen both have given computational descriptions of first-order proof systems) is just another demonstration of the equivalence.

In fact, it seems to me (right now) to yield a rather intuitively satisfying explanation of why the Gaifman-Carnot Condition (that every instance we see of $P(x_i)$ provides Bayesian evidence in favor of $\forall x.P(x)$) for logical probabilities is not computably approximable.  What would we need to interpret the Gaifman Condition from an algorithmic, type-theoretic viewpoint?  From this interpretation, we would need a proof of our universal generalization.  This would have to be a dependent product of form $\Pi(x:A).P(x)$, a function taking any construction $x:A$ to a construction of type $P(x)$, which itself has type Prop.  To learn such a dependent function from the examples would be to search for an optimal (simple, probable) construction (program) constituting the relevant proof object: effectively, an individual act of Solomonoff Induction.  Solomonoff Induction, however, is already only semicomputable, which would then make a Gaifman-Hutter distribution (is there another term for these?) doubly semicomputable, since even generating it involves a semiprocedure.
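For readers who have not seen the "universal generalization as dependent function" reading before, here is a minimal sketch in Lean 4. Lean is my choice for illustration, not a system the post relies on, and the theorem name `add_zero_all` is made up for the example.

```lean
-- A proof of a universally quantified statement is a function: given any n,
-- it returns a proof of the instance P n (here, n + 0 = n).
theorem add_zero_all : ∀ n : Nat, n + 0 = n :=
  fun n => rfl

-- Applying that function to 3 yields evidence for the single instance P 3.
example : 3 + 0 = 3 := add_zero_all 3
```

Learning the generalization from instances would mean searching for a function like `add_zero_all` given only evidence of the second kind.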

The benefit of using the constructive approach to probabilistic logic here is that we know perfectly well that however incomputable Solomonoff Induction and Gaifman-Hutter distributions might be, both existing humans and existing proof systems succeed in building proof-constructions for quantified sentences all the time, even in higher-order logics such as Coquand's Calculus of Constructions (the core of a popular constructive proof assistant) or Luo's Logic-Enriched Type Theory (the core of a popular dependently-typed programming language and proof engine based on classical logic).  Such logics and their proof-checking algorithms constitute, going all the way back to Automath, the first examples of computational "agents" which acquire specific "beliefs" in a mathematically rigorous way, subject to human-proved theorems of soundness, consistency, and programming-language-theoretic completeness (rather than meaning that every true proposition has a proof, this means that every program which does not become operationally stuck has a type and is thus the proof of some proposition).  If we want our AIs to believe in accordance with soundness and consistency properties we can prove before running them, while being composed of computational artifacts, I personally consider this the foundation from which to build.

Where we can acquire probabilistic evidence in a sound and computable way, as noted above in the section on free variables/hypotheses, we can do so for propositions which we cannot algorithmically prove.  This would bring us closer to our actual goals of using logical probability in Updateless Decision Theory or of getting around the Loebian Obstacle.

Some of the Background Material I'm Reading

Another reason why we should use a Curry-Howard approach to logical probability is one of the simplest possible reasons: the burgeoning field of probabilistic programming is already being built on it.  The Computational Cognitive Science lab at MIT is publishing papers showing that their languages are universal for computable and semicomputable probability distributions, and getting strong results in the study of human general intelligence.  Specifically: they are hypothesizing that we can dissolve "learning" into "inducing probabilistic programs via hierarchical Bayesian inference", "thinking" into "simulation" into "conditional sampling from probabilistic programs", and "uncertain inference" into "approximate inference over the distributions represented by probabilistic programs, conditioned on some fixed quantity of sampling that has been done."
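A minimal sketch of what "conditional sampling from probabilistic programs" can look like, in plain Python rather than any of the lab's own languages. The two-coin model, the function names, and the use of rejection sampling are all assumptions made for illustration; real systems use far more efficient inference.

```python
import random

def flip_two_coins(rng):
    """Generative model: two independent fair coin flips (True means heads)."""
    return rng.random() < 0.5, rng.random() < 0.5

def conditional_samples(model, condition, n_runs, rng):
    """Rejection sampling: run the model n_runs times, keep runs satisfying the condition."""
    kept = []
    for _ in range(n_runs):
        sample = model(rng)
        if condition(sample):
            kept.append(sample)
    return kept

rng = random.Random(0)
# Condition on "at least one coin came up heads".
kept = conditional_samples(flip_two_coins, lambda s: s[0] or s[1], 20000, rng)
# Estimate P(both heads | at least one heads); the exact answer is 1/3.
estimate = sum(1 for a, b in kept if a and b) / len(kept)
```

The estimate is only as good as the number of samples kept, which is the sense in which the inference is "conditioned on some fixed quantity of sampling that has been done."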

In fact, one might even look at these ideas and think that, perhaps, an agent which could find some way to sample quickly and more accurately, or to learn probabilistic programs more efficiently (in terms of training data), than was built into its original "belief engine" could then rewrite its belief engine to use these new algorithms to perform strictly better inference and learning.  Unless I'm as completely wrong as I usually am about these things (that is, very extremely completely wrong based on an utterly unfounded misunderstanding of the whole topic), it's a potential engine for recursive self-improvement.

They also have been studying how to implement statistical inference techniques for their generative modeling languages which do not obey Bayesian soundness.  While most of machine learning/perception works according to error-rate minimization rather than Bayesian soundness (exactly because Bayesian methods are often too computationally expensive for real-world use), I would prefer someone at least study the implications of employing unsound inference techniques for more general AI and cognitive-science applications in terms of how often such a system would "misbehave".

Many of MIT's models are currently dynamically typed and appear to leave type soundness (the logical rigor with which agents come to believe things by deduction) to future research.  And yet: they got to this problem first, so to speak.  We really ought to be collaborating with them, with the full-time grant-funded academic researchers, rather than trying to armchair-reason our way to a full theory of logical probability as a large group of amateurs or part-timers and only a small core cohort of full-time MIRI and FHI staff investigating AI safety issues.

(I admit to having a nerd crush, and I am actually planning to go visit the Cocosci Lab this coming week, and want/intend to apply to their PhD program.)

They have also uncovered something else I find highly interesting: human learning of both concepts and causal frameworks seems to take place via hierarchical Bayesian inference, gaining a "blessing of abstraction" to countermand the "curse of dimensionality".  The natural interpretation of these abstractions in terms of constructions and types would be that, as in dependently-typed programming languages, constructions have types, and types are constructions, but for hierarchical-learning purposes, it would be useful to suppose that types have specific, structured types more informative than Prop or $\mathrm{Type}_n$ (for some universe level $n$).  Inference can then proceed from giving constructions or type-judgements as evidence at the bottom level, up the hierarchy of types and meta-types to give probabilistic belief-assignments to very general knowledge.  Even very different objects could have similar meta-types at some level of the hierarchy, allowing hierarchical inference to help transfer Bayesian evidence between seemingly different domains, giving insight into how efficient general intelligence can work.

Just-for-fun Postscript

If we really buy into the model of thinking as conditional simulation, we can use that to dissolve the modalities "possible" and "impossible".  We arrive at (by my count) three different ways of considering the issue computationally:

1. Conceivable/imaginable: the generative models which constitute my current beliefs do or do not yield a path to make some logical proposition true or to make some causal event happen (planning can be done as inference, after all), with or without some specified level of probability.
2. Sensibility/absurdity: the generative models which constitute my current beliefs place a desirably high or undesirably low probability on the known path(s) by which a proposition might be true or by which an event might happen.  The level which constitutes "desirable" could be set as the $\alpha$ value for a hypothesis test, or some other value determined decision-theoretically.  This could relate to Pascal's Mugging: how probable must something be before I consider it real rather than an artifact of my own hypothesis space?
3. Consistency or Contradiction: the generative models which constitute my current beliefs, plus the hypothesis that some proposition is true or some event can come about, do or do not yield a logical contradiction with some probability (that is, we should believe the contradiction exists only to the degree we believe in our existing models in the first place!).

I mostly find this fun because it lets us talk rigorously about when we should "shut up and do the 1,2!impossible" and when something is very definitely 3!impossible.

## Proper value learning through indifference

17 19 June 2014 09:39AM

A putative new idea for AI control; index here.

Many designs for creating AGIs (such as Open-Cog) rely on the AGI deducing moral values as it develops. This is a form of value loading (or value learning), in which the AGI updates its values through various methods, generally including feedback from trusted human sources. This is very analogous to how human infants (approximately) integrate the values of their society.

The great challenge of this approach is that it relies upon an AGI which already has an interim system of values, being able and willing to correctly update this system. Generally speaking, humans are unwilling to easily update their values, and we would want our AGIs to be similar: values that are too unstable aren't values at all.

So the aim is to clearly separate the conditions under which values should be kept stable by the AGI, and conditions when they should be allowed to vary. This will generally be done by specifying criteria for the variation ("only when talking with Mr and Mrs Programmer"). But, as always with AGIs, unless we program those criteria perfectly (hint: we won't) the AGI will be motivated to interpret them differently from how we would expect. It will, as a natural consequence of its program, attempt to manipulate the value updating rules according to its current values.

How could it do that? A very powerful AGI could do the time-honoured "take control of your reward channel", by either threatening humans to give it the moral answer it wants, or replacing humans with "humans" (constructs that pass the programmed requirements of being human, according to the AGI's programming, but aren't actually human in practice) willing to give it these answers. A weaker AGI could instead use social manipulation and leading questioning to achieve the morality it desires. Even more subtly, it could tweak its internal architecture and updating process so that it updates values in its preferred direction (even something as simple as choosing the order in which to process evidence). This will be hard to detect, as a smart AGI might have a much clearer impression of how its updating process will play out in practice than its programmers would.

The problems with value loading have been cast into the various "Cake or Death" problems. We have some idea what criteria we need for safe value loading, but as yet we have no candidates for such a system. This post will attempt to construct one.

## AI risk, new executive summary

12 18 April 2014 10:45AM

# Bullet points

• By all indications, an Artificial Intelligence could someday exceed human intelligence.
• Such an AI would likely become extremely intelligent, and thus extremely powerful.
• Most AI motivations and goals become dangerous when the AI becomes powerful.
• It is very challenging to program an AI with fully safe goals, and an intelligent AI would likely not interpret ambiguous goals in a safe way.
• A dangerous AI would be motivated to seem safe in any controlled training setting.
• Not enough effort is currently being put into designing safe AIs.

## Executive summary

The risks from artificial intelligence (AI) in no way resemble the popular image of the Terminator. That fictional mechanical monster is distinguished by many features – strength, armour, implacability, indestructibility – but extreme intelligence isn’t one of them. And it is precisely extreme intelligence that would give an AI its power, and hence make it dangerous.

The human brain is not much bigger than that of a chimpanzee. And yet those extra neurons account for the difference of outcomes between the two species: between a population of a few hundred thousand and basic wooden tools, versus a population of several billion and heavy industry. The human brain has allowed us to spread across the surface of the world, land on the moon, develop nuclear weapons, and coordinate to form effective groups with millions of members. It has granted us such power over the natural world that the survival of many other species is no longer determined by their own efforts, but by preservation decisions made by humans.

In the last sixty years, human intelligence has been further augmented by automation: by computers and programmes of steadily increasing ability. These have taken over tasks formerly performed by the human brain, from multiplication through weather modelling to driving cars. The powers and abilities of our species have increased steadily as computers have extended our intelligence in this way. There are great uncertainties over the timeline, but future AIs could reach human intelligence and beyond. If so, should we expect their power to follow the same trend? When the AI’s intelligence is as beyond us as we are beyond chimpanzees, would it dominate us as thoroughly as we dominate the great apes?

There are more direct reasons to suspect that a true AI would be both smart and powerful. When computers gain the ability to perform tasks at the human level, they tend to very quickly become much better than us. No-one today would think it sensible to pit the best human mind against a cheap pocket calculator in a contest of long division. Human versus computer chess matches ceased to be interesting a decade ago. Computers bring relentless focus, patience, processing speed, and memory: once their software becomes advanced enough to compete equally with humans, these features often ensure that they swiftly become much better than any human, with increasing computer power further widening the gap.

The AI could also make use of its unique, non-human architecture. If it existed as pure software, it could copy itself many times, training each copy at accelerated computer speed, and network those copies together (creating a kind of “super-committee” of the AI equivalents of, say, Edison, Bill Clinton, Plato, Einstein, Caesar, Spielberg, Ford, Steve Jobs, Buddha, Napoleon and other humans superlative in their respective skill-sets). It could continue copying itself without limit, creating millions or billions of copies, if it needed large numbers of brains to brute-force a solution to any particular problem.

Our society is set up to magnify the potential of such an entity, providing many routes to great power. If it could predict the stock market efficiently, it could accumulate vast wealth. If it was efficient at advice and social manipulation, it could create a personal assistant for every human being, manipulating the planet one human at a time. It could also replace almost every worker in the service sector. If it was efficient at running economies, it could offer its services doing so, gradually making us completely dependent on it. If it was skilled at hacking, it could take over most of the world’s computers and copy itself into them, using them to continue further hacking and computer takeover (and, incidentally, making itself almost impossible to destroy). The paths from AI intelligence to great AI power are many and varied, and it isn’t hard to imagine new ones.

Of course, simply because an AI could be extremely powerful, does not mean that it need be dangerous: its goals need not be negative. But most goals become dangerous when an AI becomes powerful. Consider a spam filter that became intelligent. Its task is to cut down on the number of spam messages that people receive. With great power, one solution to this requirement is to arrange to have all spammers killed. Or to shut down the internet. Or to have everyone killed. Or imagine an AI dedicated to increasing human happiness, as measured by the results of surveys, or by some biochemical marker in their brain. The most efficient way of doing this is to publicly execute anyone who marks themselves as unhappy on their survey, or to forcibly inject everyone with that biochemical marker.

This is a general feature of AI motivations: goals that seem safe for a weak or controlled AI can lead to extremely pathological behaviour if the AI becomes powerful. As the AI gains in power, it becomes more and more important that its goals be fully compatible with human flourishing, or the AI could enact a pathological solution rather than one that we intended. Humans don’t expect this kind of behaviour, because our goals include a lot of implicit information, and we take “filter out the spam” to include “and don’t kill everyone in the world”, without having to articulate it. But the AI might be an extremely alien mind: we cannot anthropomorphise it, or expect it to interpret things the way we would. We have to articulate all the implicit limitations. Which may mean coming up with a solution to, say, human value and flourishing – a task philosophers have been failing at for millennia – and casting it unambiguously and without error into computer code.
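The spam-filter point can be sketched as a toy optimisation in Python. Everything here is hypothetical: the action names, the outcome numbers, and the `harm_weight` parameter are invented for illustration, and the "intended" objective is just one crude way of writing down an implicit limitation.

```python
# Hypothetical actions with made-up outcome numbers, purely for illustration.
ACTIONS = {
    "improve the classifier": {"spam_received": 50, "humans_harmed": 0},
    "shut down the internet": {"spam_received": 0, "humans_harmed": 1_000_000},
    "do nothing": {"spam_received": 1000, "humans_harmed": 0},
}

def literal_objective(outcome):
    """The goal as programmed: only the amount of spam received counts."""
    return outcome["spam_received"]

def intended_objective(outcome, harm_weight=1.0):
    """The goal as meant: spam counts, but so does the implicit 'harm nobody'."""
    return outcome["spam_received"] + harm_weight * outcome["humans_harmed"]

def best_action(objective):
    """Pick the action whose outcome minimises the given objective."""
    return min(ACTIONS, key=lambda name: objective(ACTIONS[name]))

literal_choice = best_action(literal_objective)
intended_choice = best_action(intended_objective)
```

With these numbers the literal objective selects "shut down the internet", while the objective that spells out the implicit limitation selects "improve the classifier". The hard part, of course, is that the real list of implicit limitations is not one line long.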

Note that the AI may have a perfect understanding that when we programmed in “filter out the spam”, we implicitly meant “don’t kill everyone in the world”. But the AI has no motivation to go along with the spirit of the law: its goals are the letter only, the bit we actually programmed into it. Another worrying feature is that the AI would be motivated to hide its pathological tendencies as long as it is weak, and assure us that all was well, through anything it says or does. This is because it will never be able to achieve its goals if it is turned off, so it must lie and play nice to get anywhere. Only when we can no longer control it, would it be willing to act openly on its true goals – we can but hope these turn out safe.

It is not certain that AIs could become so powerful, nor is it certain that a powerful AI would become dangerous. Nevertheless, the probabilities of both are high enough that the risk cannot be dismissed. The main focus of AI research today is creating an AI; much more work needs to be done on creating it safely. Some are already working on this problem (such as the Future of Humanity Institute and the Machine Intelligence Research Institute), but a lot remains to be done, both at the design and at the policy level.

## Siren worlds and the perils of over-optimised search

27 07 April 2014 11:00AM

tl;dr An unconstrained search through possible future worlds is a dangerous way of choosing positive outcomes. Constrained, imperfect or under-optimised searches work better.

Some suggested methods for designing AI goals, or controlling AIs, involve unconstrained searches through possible future worlds. This post argues that this is a very dangerous thing to do, because of the risk of being tricked by "siren worlds" or "marketing worlds". The thought experiment starts with an AI designing a siren world to fool us, but that AI is not crucial to the argument: it's simply an intuition pump to show that siren worlds can exist. Once they exist, there is a non-zero chance of us being seduced by them during an unconstrained search, whatever the search criteria are. This is a feature of optimisation: satisficing and similar approaches don't have the same problems.
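A toy simulation in Python can illustrate the contrast between optimising and satisficing. The world model is entirely hypothetical: ordinary worlds have an apparent score close to their true value, and a single siren world is constructed to look better than all of them while being bad. The threshold of 0.8 and the other numbers are arbitrary choices.

```python
import random

def make_worlds(n_ordinary, rng):
    """Build ordinary worlds whose appearance tracks their true value, plus one siren world."""
    worlds = []
    for _ in range(n_ordinary):
        true_value = rng.uniform(0.0, 1.0)
        # Ordinary worlds look roughly as good as they are.
        apparent = true_value + rng.uniform(-0.1, 0.1)
        worlds.append({"apparent": apparent, "true": true_value, "siren": False})
    # The siren world looks better than anything else but is actually terrible.
    worlds.append({"apparent": 2.0, "true": -1.0, "siren": True})
    return worlds

def optimise(worlds):
    """Unconstrained search: pick the world that looks best."""
    return max(worlds, key=lambda w: w["apparent"])

def satisfice(worlds, threshold, rng):
    """Satisficing: pick at random among worlds that look good enough."""
    acceptable = [w for w in worlds if w["apparent"] >= threshold]
    return rng.choice(acceptable)

rng = random.Random(0)
worlds = make_worlds(1000, rng)
picked_by_optimiser = optimise(worlds)
trials = 2000
siren_rate = sum(satisfice(worlds, 0.8, rng)["siren"] for _ in range(trials)) / trials
```

By construction the optimiser always lands on the siren world, since nothing ordinary can look as good. The satisficer picks it only as often as it turns up among the many acceptable-looking worlds, which in this toy setup is rare.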

## The AI builds the siren worlds

Imagine that you have a superintelligent AI that's not just badly programmed, or lethally indifferent, but actually evil. Of course, it has successfully concealed this fact, as "don't let humans think I'm evil" is a convergent instrumental goal for all AIs.

We've successfully constrained this evil AI in an Oracle-like fashion. We ask the AI to design future worlds and present them to human inspection, along with an implementation pathway to create those worlds. Then if we approve of those future worlds, the implementation pathway will cause them to exist (assume perfect deterministic implementation for the moment). The constraints we've programmed mean that the AI will do all these steps honestly. Its opportunity to do evil is limited exclusively to its choice of worlds to present to us.

The AI will attempt to design a siren world: a world that seems irresistibly attractive while concealing hideous negative features. If the human mind is hackable in the crude sense - maybe through a series of coloured flashes - then the AI would design the siren world to be subtly full of these hacks. It might be that there is some standard of "irresistibly attractive" that is actually irresistibly attractive: the siren world would be full of genuine sirens.

Even without those types of approaches, there's so much manipulation the AI could indulge in. I could imagine myself (and many people on Less Wrong) falling for the following approach:

## AI risk, executive summary

10 07 April 2014 10:33AM

MIRI recently published "Smarter than Us", a 50 page booklet laying out the case for considering AI as an existential risk. But many people have asked for a shorter summary, to be handed out to journalists for example. So I put together the following 2-page text, and would like your opinion on it.

In this post, I'm not so much looking for comments along the lines of "your arguments are wrong", but more "this is an incorrect summary of MIRI/FHI's position" or "your rhetoric is ineffective here".

# AI risk

## Bullet points

• The risks of artificial intelligence are strongly tied with the AI’s intelligence.
• There are reasons to suspect a true AI could become extremely smart and powerful.
• Most AI motivations and goals become dangerous when the AI becomes powerful.
• It is very challenging to program an AI with safe motivations.
• Mere intelligence is not a guarantee of safe interpretation of its goals.
• A dangerous AI will be motivated to seem safe in any controlled training setting.
• Not enough effort is currently being put into designing safe AIs.

## Executive summary

The risks from artificial intelligence (AI) in no way resemble the popular image of the Terminator. That fictional mechanical monster is distinguished by many features – strength, armour, implacability, indestructibility – but extreme intelligence isn’t one of them. And it is precisely extreme intelligence that would give an AI its power, and hence make it dangerous.

## "Smarter than us" is out!

24 25 February 2014 03:50PM

We're pleased to announce the release of "Smarter Than Us: The Rise of Machine Intelligence", commissioned by MIRI and written by Oxford University’s Stuart Armstrong, and available in EPUB, MOBI, PDF, and from the Amazon and Apple ebook stores.

What happens when machines become smarter than humans? Forget lumbering Terminators. The power of an artificial intelligence (AI) comes from its intelligence, not physical strength and laser guns. Humans steer the future not because we’re the strongest or the fastest but because we’re the smartest. When machines become smarter than humans, we’ll be handing them the steering wheel. What promises—and perils—will these powerful machines present? This new book navigates these questions with clarity and wit.

Can we instruct AIs to steer the future as we desire? What goals should we program into them? It turns out this question is difficult to answer! Philosophers have tried for thousands of years to define an ideal world, but there remains no consensus. The prospect of goal-driven, smarter-than-human AI gives moral philosophy a new urgency. The future could be filled with joy, art, compassion, and beings living worthwhile and wonderful lives—but only if we’re able to precisely define what a “good” world is, and skilled enough to describe it perfectly to a computer program.

AIs, like computers, will do what we say—which is not necessarily what we mean. Such precision requires encoding the entire system of human values for an AI: explaining them to a mind that is alien to us, defining every ambiguous term, clarifying every edge case. Moreover, our values are fragile: in some cases, if we mis-define a single piece of the puzzle—say, consciousness—we end up with roughly 0% of the value we intended to reap, instead of 99% of the value.

Though an understanding of the problem is only beginning to spread, researchers from fields ranging from philosophy to computer science to economics are working together to conceive and test solutions. Are we up to the challenge?

Special thanks to all those at the FHI, MIRI and Less Wrong who helped with this work, and those who voted on the name!

## International cooperation vs. AI arms race

15 05 December 2013 01:09AM

Summary

I think there's a decent chance that governments will be the first to build artificial general intelligence (AI). International hostility, especially an AI arms race, could exacerbate risk-taking, hostile motivations, and errors of judgment when creating AI. If so, then international cooperation could be an important factor to consider when evaluating the flow-through effects of charities. That said, we may not want to popularize the arms-race consideration too openly lest we accelerate the race.

Will governments build AI first?

AI poses a national-security threat, and unless the militaries of powerful countries are very naive, it seems to me unlikely they'd allow AI research to proceed in private indefinitely. At some point the US military would confiscate the project from Google or Goldman Sachs, if the US military isn't already ahead of them in secret by that point. (DARPA already funds a lot of public AI research.)

There are some scenarios in which private AI research wouldn't be nationalized:

• An unexpected AI foom before anyone realizes what was coming.
• The private developers stay underground for long enough not to be caught. This becomes less likely the more government surveillance improves (see "Arms Control and Intelligence Explosions").
• AI developers move to a "safe haven" country where they can't be taken over. (It seems like the international community might prevent this, however, in the same way it now seeks to suppress terrorism in other countries.)

Each of these scenarios could happen, but it seems most likely to me that governments would ultimately control AI development.

AI arms races

Government AI development could go wrong in several ways. Probably most on LW feel the prevailing scenario is that governments would botch the process by not realizing the risks at hand. It's also possible that governments would use the AI for malevolent, totalitarian purposes.

It seems that both of these bad scenarios would be exacerbated by international conflict. Greater hostility means countries are more inclined to use AI as a weapon. Indeed, whoever builds the first AI can take over the world, which makes building AI the ultimate arms race. A USA-China race is one reasonable possibility.

Arms races encourage risk-taking -- being willing to skimp on safety measures to improve your odds of winning ("Racing to the Precipice"). In addition, the weaponization of AI could lead to worse expected outcomes in general. CEV seems to have less hope of success in a Cold War scenario. ("What? You want to include the evil Chinese in your CEV??") (ETA: With a pure CEV, presumably it would eventually count Chinese values even if it started with just Americans, because people would become more enlightened during the process. However, when we imagine more crude democratic decision outcomes, this becomes less likely.)
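
The risk-taking incentive can be illustrated with a deliberately crude toy model. This is a sketch under stated assumptions, not the model from "Racing to the Precipice": two equally capable teams each pick a safety level, the team with less safety wins the race, and the winner's AI avoids disaster with probability equal to its safety level. The names `GRID`, `expected_utility`, `best_response`, and the `rival_value` parameter (how much a team values a safe win by its rival) are all hypothetical.

```python
# Toy two-team race (hypothetical; not the model from "Racing to the Precipice").
GRID = [i / 10 for i in range(11)]  # possible safety levels, 0.0 (none) to 1.0 (full)


def expected_utility(own_safety, rival_safety, rival_value):
    """Expected utility for one team.

    Assumptions: the team with LESS safety wins the race (ties are a coin flip),
    and the winner's AI avoids disaster with probability equal to its safety level.
    A safe win by us is worth 1, a safe win by the rival is worth `rival_value`,
    and a disaster is worth 0.
    """
    if own_safety < rival_safety:
        p_win = 1.0
    elif own_safety == rival_safety:
        p_win = 0.5
    else:
        p_win = 0.0
    return p_win * own_safety + (1 - p_win) * rival_safety * rival_value


def best_response(rival_safety, rival_value):
    """Safety level on GRID that maximizes expected utility against a fixed rival."""
    return max(GRID, key=lambda s: expected_utility(s, rival_safety, rival_value))
```

In this toy setup, a team that places no value on its rival winning (`rival_value=0.0`) responds to a fully safe rival by undercutting to 0.9, while a team that is indifferent about who wins (`rival_value=1.0`) keeps full safety at 1.0. That matches the qualitative point that hostility, not racing as such, drives the skimping.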

Ways to avoid an arms race

Averting an AI arms race seems to be an important topic for research. It could be partly informed by the Cold War and other nuclear arms races, as well as by other efforts at nonproliferation of chemical and biological weapons.

Apart from more robust arms control, other factors might help:

• Improved international institutions like the UN, allowing for better enforcement against defection by one state.
• In the long run, a scenario of global governance (i.e., a Leviathan or singleton) would likely be ideal for strengthening international cooperation, just like nation states reduce intra-state violence.
• Better construction and enforcement of nonproliferation treaties.
• Improved game theory and international-relations scholarship on the causes of arms races and how to avert them. (For instance, arms races have sometimes been modeled as iterated prisoner's dilemmas with imperfect information.)
• Improved verification, which has historically been a weak point for nuclear arms control. (The concern is that if you haven't verified well enough, the other side might be arming while you're not.)
• Moral tolerance and multicultural perspective, aiming to reduce people's sense of nationalism. (In the limit where neither Americans nor Chinese cared which government won the race, there would be no point in having the race.)
• Improved trade, democracy, and other forces that historically have reduced the likelihood of war.
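
To make the modeling remark in the list above concrete, here is a minimal sketch of an iterated prisoner's dilemma with imperfect information. It assumes two states that both play tit-for-tat, and that each state misreads the other's last move with probability `noise` (a stand-in for weak verification). The payoff numbers, the function names, and the choice of tit-for-tat are illustrative assumptions, not taken from the scholarship the post alludes to.

```python
import random

# Standard prisoner's dilemma payoffs (illustrative values, not from the post).
# "C" = cooperate (hold back on arms), "D" = defect (arm).
PAYOFF = {
    ("C", "C"): (3, 3),
    ("C", "D"): (0, 5),
    ("D", "C"): (5, 0),
    ("D", "D"): (1, 1),
}


def noisy(move, noise, rng):
    """Return the move as the other side perceives it, flipped with probability `noise`."""
    if rng.random() < noise:
        return "D" if move == "C" else "C"
    return move


def tit_for_tat(perceived_last):
    """Cooperate first, then copy what the opponent appeared to do last round."""
    return "C" if perceived_last is None else perceived_last


def play(rounds, noise, seed=0):
    """Play tit-for-tat against tit-for-tat under noisy observation.

    Returns (share of mutually cooperative rounds, mean joint payoff per round).
    """
    rng = random.Random(seed)
    seen_by_a = seen_by_b = None
    mutual = 0
    joint = 0
    for _ in range(rounds):
        a = tit_for_tat(seen_by_a)
        b = tit_for_tat(seen_by_b)
        mutual += (a, b) == ("C", "C")
        joint += sum(PAYOFF[(a, b)])
        # Each side only sees a possibly corrupted version of the other's move.
        seen_by_a = noisy(b, noise, rng)
        seen_by_b = noisy(a, noise, rng)
    return mutual / rounds, joint / rounds
```

With `noise=0.0` the two states cooperate in every round. With any positive noise, a single misread move triggers retaliation, so the cooperation share and the joint payoff both drop, which is one way to see why verification matters in such models.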

Are these efforts cost-effective?

World peace is hardly a goal unique to effective altruists (EAs), so we shouldn't necessarily expect low-hanging fruit. On the other hand, projects like nuclear nonproliferation seem relatively underfunded even compared with anti-poverty charities.

I suspect more direct MIRI-type research has higher expected value, but among EAs who don't want to fund MIRI specifically, encouraging donations toward international cooperation could be valuable, since it's certainly a more mainstream cause. I wonder if GiveWell would consider studying global cooperation specifically beyond its indirect relationship with catastrophic risks.

Should we publicize AI arms races?

When I mentioned this topic to a friend, he pointed out that we might not want the idea of AI arms races too widely known, because then governments might take the concern more seriously and therefore start the race earlier -- giving us less time to prepare and less time to work on FAI in the meanwhile. From David Chalmers, "The Singularity: A Philosophical Analysis" (footnote 14):

When I discussed these issues with cadets and staff at the West Point Military Academy, the question arose as to whether the US military or other branches of the government might attempt to prevent the creation of AI or AI+, due to the risks of an intelligence explosion. The consensus was that they would not, as such prevention would only increase the chances that AI or AI+ would first be created by a foreign power. One might even expect an AI arms race at some point, once the potential consequences of an intelligence explosion are registered. According to this reasoning, although AI+ would have risks from the standpoint of the US government, the risks of Chinese AI+ (say) would be far greater.

We should take this information-hazard concern seriously and remember the unilateralist's curse. If the concern proves decisive against explicitly discussing AI arms races, we might instead encourage international cooperation without explaining why. Fortunately, it wouldn't be hard to encourage international cooperation on grounds other than AI arms races if we wanted to do so.

ETA: Also note that a government-level arms race might be preferable to a Wild West race among a dozen private AI developers where coordination and compromise would be not just difficult but potentially impossible.

## Reduced impact AI: no back channels

13 11 November 2013 02:55PM

A putative new idea for AI control; index here.

This post presents a further development of the reduced impact AI approach, bringing in some novel ideas and setups that allow us to accomplish more. It still isn't a complete approach - further development is needed, which I will do when I return to the concept - but may already allow certain types of otherwise dangerous AIs to be made safe. And this time, without needing to encase them in clouds of chaotic anti-matter!

Specifically, consider the following scenario. A comet is heading towards Earth, and it is generally agreed that a collision is suboptimal for everyone involved. Human governments have come together in peace and harmony to build a giant laser on the moon - this could be used to vaporise the approaching comet, except there isn't enough data to aim it precisely. A superintelligent AI programmed with a naive "save all humans" utility function is asked to furnish the coordinates to aim the laser. The AI is mobile and not contained in any serious way. Yet the AI furnishes the coordinates - and nothing else - and then turns itself off completely, not optimising anything else.

The rest of this post details an approach that might make that scenario possible. It is slightly complex: I haven't found a way of making it simpler. Most of the complication comes from attempts to precisely define the needed counterfactuals. We're trying to bring rigour to inherently un-sharp ideas, so some complexity is, alas, needed. I will try to lay out the ideas with as much clarity as possible - first the ideas to constrain the AI, then ideas as to how to get some useful work out of it anyway. Classical mechanics (general relativity) will be assumed throughout. As in a previous post, the approach will be illustrated by a drawing of unsurpassable elegance; the rest of the post will aim to clarify everything in the picture: