
[LINK] David Deutsch on why we don't have AGI yet "Creative Blocks"

2 harshhpareek 17 December 2013 07:03AM

Folks here should be familiar with most of these arguments. Putting some interesting quotes below:

http://aeon.co/magazine/being-human/david-deutsch-artificial-intelligence/

"Creative blocks: The very laws of physics imply that artificial intelligence must be possible. What's holding us up?"

Remember the significance attributed to Skynet’s becoming ‘self-aware’? [...] The fact is that present-day software developers could straightforwardly program a computer to have ‘self-awareness’ in the behavioural sense — for example, to pass the ‘mirror test’ of being able to use a mirror to infer facts about itself — if they wanted to. [...] AGIs will indeed be capable of self-awareness — but that is because they will be General

Some hope to learn how we can rig their programming to make [AGIs] constitutionally unable to harm humans (as in Isaac Asimov’s ‘laws of robotics’), or to prevent them from acquiring the theory that the universe should be converted into paper clips (as imagined by Nick Bostrom). None of these are the real problem. It has always been the case that a single exceptionally creative person can be thousands of times as productive — economically, intellectually or whatever — as most people; and that such a person could do enormous harm were he to turn his powers to evil instead of good.[...] The battle between good and evil ideas is as old as our species and will go on regardless of the hardware on which it is running

He also says some confusing things about induction being inadequate for creativity, which I'm guessing he couldn't support well in this short essay (perhaps he explains it better in his books); I'm not quoting those here. His attack on Bayesianism as an explanation for intelligence is valid and interesting, but could be wrong: given what we know about neural networks, something like Bayesian updating does happen in the brain, possibly even at the level of concepts.

The doctrine assumes that minds work by assigning probabilities to their ideas and modifying those probabilities in the light of experience as a way of choosing how to act. This is especially perverse when it comes to an AGI’s values — the moral and aesthetic ideas that inform its choices and intentions — for it allows only a behaviouristic model of them, in which values that are ‘rewarded’ by ‘experience’ are ‘reinforced’ and come to dominate behaviour while those that are ‘punished’ by ‘experience’ are extinguished. As I argued above, that behaviourist, input-output model is appropriate for most computer programming other than AGI, but hopeless for AGI.

I disagree with his final conclusions. He somehow concludes that the principal bottleneck in AGI research is a philosophical one.

In his last paragraph, he makes the following controversial statement:

For yet another consequence of understanding that the target ability is qualitatively different is that, since humans have it and apes do not, the information for how to achieve it must be encoded in the relatively tiny number of differences between the DNA of humans and that of chimpanzees.

This would be false if, for example, the mother controlled gene expression while the foetus developed and thereby helped shape the brain. We should be able to answer that question definitively once we can grow human babies completely in vitro. Another problem is the impact of the cultural environment. One way to address it would be to ask whether our Stone Age ancestors would be classified as AGIs under a reasonable definition.

Autism, Watson, the Turing test, and General Intelligence

6 Stuart_Armstrong 24 September 2013 11:00AM

Thinking aloud:

Humans are examples of general intelligence - the only examples we're sure of. Some humans have various degrees of autism (low-level versions are quite common in the circles I've moved in), impairing their social skills. Mild autists nevertheless remain general intelligences, capable of demonstrating strong cross-domain optimisation. Psychology is full of other examples of mental pathologies that impair certain skills, but nevertheless leave their sufferers as full-fledged general intelligences. This general intelligence is not enough, however, to solve their impairments.

Watson triumphed on Jeopardy. AI scientists in previous decades would have concluded that to do so, a general intelligence would have been needed. But that was not the case at all - Watson is blatantly not a general intelligence. Big data and clever algorithms were all that were needed. Computers are demonstrating more and more skills, besting humans in more and more domains - but still no sign of general intelligence. I've recently developed the suspicion that the Turing test (comparing AI with a standard human) could get passed by a narrow AI finely tuned to that task.

The general thread is that the link between narrow skills and general intelligence may not be as clear as we sometimes think. It may be that narrow skills are sufficiently diverse and unique that a mid-level general intelligence may not be able to develop them to a large extent. Or, put another way, an above-human social intelligence may not be able to control a robot body or do decent image recognition. A super-intelligence likely could: ultimately, general intelligence includes the specific skills. But this "ultimately" may take a long time to come.

So the questions I'm wondering about are:

  1. How likely is it that a general intelligence, above human in some domain not related to AI development, will acquire high level skills in unrelated areas?
  2. By building high-performance narrow AIs, are we making it much easier for such an intelligence to develop such skills, by co-opting or copying these programs?

 

Evaluating the feasibility of SI's plan

25 JoshuaFox 10 January 2013 08:17AM

(With Kaj Sotala)

SI's current R&D plan seems to go as follows: 

1. Develop the perfect theory.
2. Implement this as a safe, working Artificial General Intelligence -- and do so before anyone else builds an AGI.

The Singularity Institute is almost the only group working on friendliness theory (although with very few researchers). So, they have the lead on Friendliness. But there is no reason to think that they will be ahead of anyone else on the implementation.

The few AGI designs we can look at today, like OpenCog, are big, messy systems which intentionally attempt to exploit various cognitive dynamics that might combine in unexpected and unanticipated ways, and which have various human-like drives rather than the sort of supergoal-driven, utility-maximizing goal hierarchies that Eliezer talks about, or which a mathematical abstraction like AIXI employs.

A team which is ready to adopt a variety of imperfect heuristic techniques will have a decisive lead over approaches based on pure theory. Without the constraint of safety, one of them will beat SI in the race to AGI. SI cannot ignore this. Real-world, imperfect safety measures for real-world, imperfect AGIs are needed. These may involve mechanisms for ensuring that we can avoid undesirable dynamics in heuristic systems, or AI-boxing toolkits usable in the pre-explosion stage, or something else entirely.

SI’s hoped-for theory will include a reflexively consistent decision theory, something like a greatly refined Timeless Decision Theory.  It will also describe human value as formally as possible, or at least describe a way to pin it down precisely, something like an improved Coherent Extrapolated Volition.

The hoped-for theory is intended to  provide not only safety features, but also a description of the implementation, as some sort of ideal Bayesian mechanism, a theoretically perfect intelligence.

SIers have said to me that SI's design will have a decisive implementation advantage. The idea is that because strap-on safety can’t work, Friendliness research necessarily involves more fundamental architectural design decisions, which also happen to be general AGI design decisions that some other AGI builder could grab and save themselves a lot of effort. The assumption seems to be that all other designs are based on hopelessly misguided design principles. SIers, the idea seems to go, are so smart that they'll build AGI far before anyone else. Others will succeed only when hardware capabilities allow crude near-brute-force methods to work.

Yet even if the Friendliness theory provides the basis for intelligence, the nitty-gritty of SI’s implementation will still be far away, and will involve real-world heuristics and other compromises.

We can compare SI’s future AI design to AIXI, another mathematically perfect AI formalism (though it has some critical reflexivity issues). Schmidhuber, Hutter, and colleagues think that their AIXI can be scaled down into a feasible implementation, and have implemented some toy systems. Similarly, any actual AGI based on SI's future theories will have to stray far from its mathematically perfected origins.

Moreover, SI's future friendliness proof may simply be wrong. Eliezer writes a lot about logical uncertainty, the idea that you must treat even purely mathematical ideas with the same probabilistic techniques as any ordinary uncertain belief. He pursues this mostly so that his AI can reason about itself, but the same principle applies to Friendliness proofs as well.

Perhaps Eliezer thinks that a heuristic AGI is absolutely doomed to failure; that a hard takeoff soon after the creation of the first AGI is so overwhelmingly likely that a mathematically designed AGI is the only one that could stay Friendly. In that case, we have to work on a pure-theory approach, even if it has a low chance of being finished first. Otherwise we'll be dead anyway. If an embryonic AGI will necessarily undergo an intelligence explosion, we have no choice but to "shut up and do the impossible."

I am all in favor of gung-ho knife-between-the-teeth projects. But when you think that your strategy is impossible, then you should also look for a strategy which is possible, if only as a fallback. Thinking about safety theory until drops of blood appear on your forehead (as Eliezer puts it, quoting Gene Fowler) is all well and good. But if there is only a 10% chance of achieving 100% safety (not that there really is any such thing), then I'd rather go for a strategy that provides only a 40% promise of safety, but with a 40% chance of achieving it. OpenCog and the like are going to be developed regardless, and probably before SI's own provably friendly AGI. So, even an imperfect safety measure is better than nothing.

Even if heuristic approaches have a 99% chance of an immediate unfriendly explosion, that estimate might be wrong. And SI, better than anyone, should know that any intuition-based probability estimate of “99%” really means “70%”. Even if other approaches are long-shots, we should not put all our eggs in one basket. Theoretical perfection and stopgap safety measures can be developed in parallel.

Given what we know about human overconfidence and the general reliability of predictions, the actual outcome will to a large extent be something that none of us ever expected or could have predicted. Whatever happens, progress on safety mechanisms for heuristic AGI will improve our chances when something entirely unexpected occurs.

What impossible thing should SI be shutting up and doing? For Eliezer, it’s Friendliness theory. To him, safety for heuristic AGI is impossible, and we shouldn't direct our efforts in that direction. But why shouldn't safety for heuristic AGI be another impossible thing to do?

(Two impossible things before breakfast … and maybe a few more? Eliezer seems to be rebuilding logic, set theory, ontology, epistemology, axiology, decision theory, and more, mostly from scratch. That's a lot of impossibles.)

And even if safety for heuristic AGIs is really impossible for us to figure out now, there is some chance of an extended soft takeoff that will allow us to develop heuristic AGIs which will help in figuring out AGI safety, whether because we can use them for our tests, or because they can help by applying their embryonic general intelligence to the problem. Goertzel and Pitt have urged this approach.

Yet resources are limited. Perhaps the folks who are actually building their own heuristic AGIs are in a better position than SI to develop safety mechanisms for them, while SI is the only organization which is really working on a formal theory of Friendliness, and so should concentrate on that. It could be better to focus SI's resources on areas in which it has a relative advantage, or which have a greater expected impact.

Even if so, SI should evangelize AGI safety to other researchers, not only as a general principle, but also by offering theoretical insights that may help them as they work on their own safety mechanisms.

In summary:

1. AGI development which is unconstrained by a friendliness requirement is likely to beat a provably-friendly design in a race to implementation, and some effort should be expended on dealing with this scenario.

2. Pursuing a provably-friendly AGI, even if very unlikely to succeed, could still be the right thing to do if it was certain that we’ll have a hard takeoff very soon after the creation of the first AGIs. However, we do not know whether or not this is true.

3. Even the provably friendly design will face real-world compromises and errors in its  implementation, so the implementation will not itself be provably friendly. Thus, safety protections of the sort needed for heuristic design are needed even for a theoretically Friendly design.

Bounding the impact of AGI

17 Louie 18 December 2012 07:47PM

For those of you interested, András Kornai's paper "Bounding the impact of AGI" from this year's AGI-Impacts conference at Oxford had a few interesting ideas (which I've excerpted below).

Summary:

  1. Acceptable risk tolerances for AGI design can be determined using standard safety engineering techniques from other fields
  2. Mathematical proof is the only available tool to secure the tolerances required to prevent intolerable increases in xrisk
  3. Automated theorem proving will be required so that the proof can reasonably be checked by multiple human minds

Safety engineering

Since the original approach of Yudkowsky (2006) to friendly AI, which sought mathematical guarantees of friendliness, was met with considerable skepticism, we revisit the issue of why such guarantees are essential. In designing radioactive equipment, a reasonable guideline is to limit emissions to several orders of magnitude below the natural background radiation level, so that human-caused dangers are lost in the noise compared to the pre-existing threat we must live with anyway. In the full paper, we take the “big five” extinction events that occurred within the past half billion years as background, and argue that we need to design systems with a failure rate below 10^−63 per logical operation.

What needs to be emphasized in the face of this requirement is that the very best physical measurements have only one part in 10^17 precision, not to speak of social and psychological phenomena where our understanding is considerably weaker. What this means is that guarantees of the requisite sort can only be expected from mathematics, where our measurement precision is already considerably better.

 

How reliable is mathematics?

The period since World War II has brought incredible advances in mathematics, such as the Four Color Theorem (Appel and Haken 1976), Fermat’s Last Theorem (Wiles 1995), the classification of finite simple groups (Gorenstein 1982, Aschbacher 2004), and the Poincare conjecture (Perelman 1994). While the community of mathematicians is entirely convinced of the correctness of these results, few individual mathematicians are, as the complexity of the proofs, both in terms of knowledge assumed from various branches of mathematics and in terms of the length of the deductive chain, is generally beyond our ken. Instead of a personal understanding of the matter, most of us now rely on argumentum ad verecundiam: well Faltings and Ribet now think that the Wiles-Taylor proof is correct, and even if I don’t know Faltings or Ribet at least I know and respect people who know and respect them, and if that’s not good enough I can go and devote a few years of my life to understand the proof for good. Unfortunately, the communal checking of proofs often takes years, and sometimes errors are discovered only after a decade has passed: the hole in the original proof of the Four Color Theorem (Kempe 1879) was detected by Heawood in 1890. Tomonaga in his Nobel lecture (1965) describes how his team’s work in 1947 uncovered a major problem in Dancoff (1939):

Our new method of calculation was not at all different in its contents from Dancoff’s perturbation method, but had the advantage of making the calculation more clear. In fact, what took a few months in the Dancoff type of calculation could be done in a few weeks. And it was by this method that a mistake was discovered in Dancoff’s calculation; we had also made the same mistake in the beginning.

To see that such long-hidden errors are by no means a thing of the past, and to observe the ‘web of trust’ method in action, consider the following example from Mohr (2012).

The eighth-order coefficient A_1^(8) arises from 891 Feynman diagrams of which only a few are known analytically. Evaluation of this coefficient numerically by Kinoshita and coworkers has been underway for many years (Kinoshita, 2010). The value used in the 2006 adjustment is A_1^(8) = -1.7283(35) as reported by Kinoshita and Nio (2006). However, (...) it was discovered by Aoyama et al. (2007) that a significant error had been made in the calculation. In particular, 2 of the 47 integrals representing 518 diagrams that had not been confirmed independently required a corrected treatment of infrared divergences. (...) The new value (Aoyama et al., 2007), given as Eq. (111), is A_1^(8) = 1.9144(35); details of the calculation are given by Aoyama et al. (2008). In view of the extensive effort made by these workers to ensure that the result in Eq. (111) is reliable, the Task Group adopts both its value and quoted uncertainty for use in the 2010 adjustment.

 

Assuming no more than three million mathematics and physics papers published since the beginnings of scientific publishing, and no less than the three errors documented above, we can safely conclude that the overall error rate of the reasoning used in these fields is at least 10^−6 per paper.

 

The role of automated theorem-proving

That human reasoning, much like manual arithmetic, is a significantly error-prone process comes as no surprise. Starting with de Bruijn’s Automath (see Nederpelt et al 1994) logicians and computer scientists have invested significant effort in mechanized proof checking, and it is indeed only through such efforts, in particular through the Coq verification (Gonthier 2008) of the entire logic behind the Appel and Haken proof that all lingering doubts about the Four Color Theorem were laid to rest. The error in A_1^(8) was also identified by using FORTRAN code generated by an automatic code generator (Mohr et al 2012).

To gain an appreciation of the state of the art, consider the theorem that finite groups of odd order are solvable (Feit and Thompson 1963). The proof, which took two humans about two years to work out, takes up an entire issue of the Pacific Journal of Mathematics (255 pages), and it was only this year that a fully formal proof was completed by Gonthier’s team (see Knies 2012). The effort, 170,000 lines, 15,000 definitions, 4,200 theorems in Coq terms, took person-decades of human assistance (15 people working six years, though many of them part-time) even after the toil of Bender and Glauberman (1995) and Peterfalvi (2000), who have greatly cleaned up and modularized the original proof, in which elementary group-theoretic and character-theoretic argumentation was completely intermixed.

The classification of simple finite groups is two orders of magnitude bigger: the effort involved about 100 humans, the original proof is scattered among 20,000 pages of papers, the largest (Aschbacher and Smith 2004a,b) taking up two volumes totaling some 1,200 pages. While everybody capable of rendering meaningful judgment considers the proof to be complete and correct, it must be somewhat worrisome at the 10^−64 level that there are no more than a couple of hundred such people, and most of them have something of a vested interest in that they themselves contributed to the proof. Let us suppose that people who are convinced that the classification is bug-free are offered the following bet by some superior intelligence that knows the answer. You must enter a room with as many people as you can convince to come with you and push a button: if the classification is bug-free you will each receive $100, if not, all of you will immediately die. Perhaps fools rush in where angels fear to tread, but on the whole we still wouldn’t expect too many takers.

Whether the classification of finite simple groups is complete and correct is very hard to say – the planned second generation proof will still be 5,000 pages, and mechanized proof is not yet in sight. But this is not to say that gaining mathematical knowledge of the required degree of reliability is hopeless, it’s just that instead of monumental chains of abstract reasoning we need to retreat to considerably simpler ones. Take, for example, the first Sylow Theorem, that if the order of a finite group G is divisible by some prime power p^n, G will have a subgroup H of this order. We are absolutely certain about this. Argumentum ad verecundiam of course is still available, but it is not needed: anybody can join the hive-mind by studying the proof. The Coq verification contains 350 lines, 15 definitions, 90 theorems, and took 2 people 2 weeks to produce. The number of people capable of rendering meaningful judgment is at least three orders of magnitude larger, and the vast majority of those who know the proof would consider betting their lives on the truth of this theorem an easy way of winning $100 with no downside risk.

 

Further remarks

Not only do we have to prove that the planned AGI will be friendly, the proof itself has to be short enough to be verifiable by humans. Consider, for example, the fundamental theorem of algebra. Could it be the case that we, humans, are all deluded into thinking that an n-th degree polynomial will have roots? Yes, but this is unlikely in the extreme. If this so-called theorem is really a trap laid by a superior intelligence we are doomed anyway, humanity can find its way around it no more than a bee can find its way around the windowpane. Now consider the four-color theorem, which is still outside the human-verifiable range. It is fair to say that it would be unwise to create AIs whose friendliness critically depends on design limits implied by the truth of this theorem, while AIs whose friendliness is guaranteed by the fundamental theorem of algebra represent a tolerable level of risk. 

Recently, Goertzel and Pitt (2012) have laid out a plan to endow AGI with morality by means of carefully controlled machine learning. Much as we are in agreement with their goals, we remain skeptical about their plan meeting the plain failure engineering criteria laid out above.

The challenges of bringing up AIs

8 Stuart_Armstrong 10 December 2012 12:43PM

At the current AGI-12 conference, some designers have been proponents of keeping AGIs safe by bringing them up in human environments, providing them with interactions and feedback in a similar way to how we bring up human children. Obviously that approach would fail for a fully smart AGI with its own values - it would pretend to follow our values for as long as it needed, and then defect. However, some people are confident that if we start with a limited, dumb AGI, then we could successfully inculcate our values in this way (a more sophisticated position would be that though this method would likely fail, it's more likely to succeed than a top-down friendliness project!).

The major criticism of this approach is that it anthropomorphises the AGI - we have a theory of children's minds, constructed by evolution, culture, and our own child-rearing experience. And then we project this onto the alien mind of the AGI, assuming that if the AGI presents behaviours similar to a well-behaved child, then it will become a moral AGI. The problem is that we don't know how alien the AGI's mind will be, and whether our reinforcement is actually reinforcing the right thing. Specifically, we need to be able to find some way of distinguishing between:

  1. An AGI being trained to be friendly.
  2. An AGI being trained to lie and conceal.
  3. An AGI that will behave completely differently once out of the training/testing/trust-building environment.
  4. An AGI that forms the wrong categories and generalisations (what counts as "human" or "suffering", for instance), because it lacks human-shared implicit knowledge that was "too obvious" for us to even think of training it on.

Muehlhauser-Hibbard Dialogue on AGI

9 lukeprog 09 July 2012 11:11PM

Part of the Muehlhauser series on AGI.

Luke Muehlhauser is Executive Director of the Singularity Institute, a non-profit research institute studying AGI safety.

Bill Hibbard is an emeritus senior scientist at University of Wisconsin-Madison and the author of Super-Intelligent Machines.


Luke Muehlhauser:

[Apr. 8, 2012]

Bill, I'm glad you agreed to discuss artificial general intelligence (AGI) with me. I hope our dialogue will be informative to many readers, and to us!

On what do we agree? In separate conversations, Ben Goertzel and Pei Wang agreed with me on the following statements (though I've clarified the wording for our conversation):

  1. Involuntary death is bad, and can be avoided with the right technology.
  2. Humans can be enhanced by merging with technology.
  3. Humans are on a risky course in general, because powerful technologies can destroy us, humans often behave not in their own best interests, and we are unlikely to voluntarily halt technological progress.
  4. AGI is likely this century.
  5. AGI will greatly transform the world. It poses existential and other serious risks, but could also be the best thing that ever happens to us if we do it right.
  6. Careful effort will be required to ensure that AGI results in good things rather than bad things for humanity.

You stated in private communication that you agree with these statements, so we have substantial common ground.

I'd be curious to learn what you think about AGI safety. If you agree that AGI is an existential risk that will arrive this century, and if you value humanity, one might expect you to think it's very important that we accelerate AI safety research and decelerate AI capabilities research so that we develop safe superhuman AGI before we develop arbitrary superhuman AGI. (This is what Anna Salamon and I recommend in Intelligence Explosion: Evidence and Import.) What are your thoughts on the matter?

And, which questions would you like to raise?


AI risk: the five minute pitch

8 Stuart_Armstrong 08 May 2012 04:28PM

I did a talk at the 25th Oxford Geek night, in which I had five minutes to present the dangers of AI. The talk is now online. Though it doesn't contain anything people at Less Wrong would find new, I feel it does a reasonable job at pitching some of the arguments in a very brief format.

Muehlhauser-Wang Dialogue

24 lukeprog 22 April 2012 10:40PM

Part of the Muehlhauser interview series on AGI.

 

Luke Muehlhauser is Executive Director of the Singularity Institute, a non-profit research institute studying AGI safety.

Pei Wang is an AGI researcher at Temple University, and Chief Executive Editor of Journal of Artificial General Intelligence.

 

Luke Muehlhauser

[Apr. 7, 2012]

Pei, I'm glad you agreed to discuss artificial general intelligence (AGI) with me. I hope our dialogue will be informative to many readers, and to us!

On what do we agree? Ben Goertzel and I agreed on the statements below (well, I cleaned up the wording a bit for our conversation):

  1. Involuntary death is bad, and can be avoided with the right technology.
  2. Humans can be enhanced by merging with technology.
  3. Humans are on a risky course in general, because powerful technologies can destroy us, humans are often stupid, and we are unlikely to voluntarily halt technological progress.
  4. AGI is likely this century.
  5. AGI will greatly transform the world. It is a potential existential risk, but could also be the best thing that ever happens to us if we do it right.
  6. Careful effort will be required to ensure that AGI results in good things rather than bad things for humanity.

You stated in private communication that you agree with these statements, depending on what is meant by "AGI." So, I'll ask: What do you mean by "AGI"?

I'd also be curious to learn what you think about AGI safety. If you agree that AGI is an existential risk that will arrive this century, and if you value humanity, one might expect you to think it's very important that we accelerate AI safety research and decelerate AI capabilities research so that we develop safe superhuman AGI first, rather than arbitrary superhuman AGI. (This is what Anna Salamon and I recommend in Intelligence Explosion: Evidence and Import.) What are your thoughts on the matter?

 

Pei Wang:

[Apr. 8, 2012]

By “AGI” I mean computer systems that follow roughly the same principles as the human mind. Concretely, to me “intelligence” is the ability to adapt to the environment under insufficient knowledge and resources, or to follow the “Laws of Thought” that realize a relative rationality that allows the system to apply its available knowledge and resources as much as possible. See [1, 2] for detailed descriptions and comparisons to other definitions of intelligence.

Such a computer system will share many properties with the human mind; however, it will not have exactly the same behaviors or problem-solving capabilities of a typical human being, since as an adaptive system, the behaviors and capabilities of an AGI not only depend on its built-in principles and mechanisms, but also its body, initial motivation, and individual experience, which are not necessarily human-like.

Like all major breakthroughs in science and technology, the creation of AGI will be both a challenge and an opportunity to humankind. Like scientists and engineers in all fields, we AGI researchers should use our best judgments to ensure that AGI results in good things rather than bad things for humanity.

Even so, the suggestion to “accelerate AI safety research and decelerate AI capabilities research so that we develop safe superhuman AGI first, rather than arbitrary superhuman AGI” is wrong, for the following major reasons:

  1. It is based on a highly speculative understanding about what kind of “AGI” will be created. The definition of intelligence in Intelligence Explosion: Evidence and Import is not shared by most AGI researchers. In my opinion, that kind of “AGI” will never be built.
  2. Even if the above definition is only considered as one possibility among other versions of AGI, it will be the actual AI research that will tell us which possibility will become reality. To ban scientific research according to imaginary risks damages humanity no less than risky research.
  3. If intelligence turns out to be adaptive (as believed by me and many others), then a “friendly AI” will be mainly the result of proper education, not proper design. There will be no way to design a “safe AI”, just like there is no way to require parents to only give birth to a “safe baby” who will never become a criminal.
  4. The “friendly AI” approach advocated by Eliezer Yudkowsky has several serious conceptual and theoretical problems, and is not accepted by most AGI researchers. The AGI community has ignored it, not because it is indisputable, but because people have not bothered to criticize it.

In summary, though the safety of AGI is indeed an important issue, currently we don’t know enough about the subject to reach any firm conclusion. Higher safety can only be achieved by more research on all related topics, rather than by pursuing approaches that have no solid scientific foundation. I hope your Institute will make a constructive contribution to the field by studying a wider range of AGI projects, rather than generalizing from a few, or committing to a conclusion without considering counterarguments.


Luke:

[Apr. 8, 2012]

I appreciate the clarity of your writing, Pei. “The Assumptions of Knowledge and Resources in Models of Rationality” belongs to a set of papers that make up half of my argument for why the only people allowed to do philosophy should be those with primary training in cognitive science, computer science, or mathematics. (The other half of that argument is made by examining most of the philosophy papers written by those without primary training in cognitive science, computer science, or mathematics.)

You write that my recommendation to “accelerate AI safety research and decelerate AI capabilities research so that we develop safe superhuman AGI first, rather than arbitrary superhuman AGI” is wrong for four reasons, which I will respond to in turn:

  1. “It is based on a highly speculative understanding about what kind of ‘AGI’ will be created.” Actually, it seems to me that my notion of AGI is broader than yours. I think we can use your preferred definition and get the same result. (More on this below.)
  2. “…it will be the actual AI research that will tell us which possibility will become reality. To ban scientific research according to imaginary risks damages humanity no less than risky research.” Yes, of course. But we argue (very briefly) that a very broad range of artificial agents with a roughly human-level capacity for adaptation (under AIKR) will manifest convergent instrumental goals. The fuller argument for this is made in Nick Bostrom’s “The Superintelligent Will.”
  3. “…a ‘friendly AI’ will be mainly the result of proper education, not proper design. There will be no way to design a ‘safe AI’, just like there is no way to require parents to only give birth to a ‘safe baby’ who will never become a criminal.” Without being more specific, I can’t tell if we actually disagree on this point. The most promising approach (that I know of) for Friendly AI is one that learns human values and then “extrapolates” them so that the AI optimizes for what we would value if we knew more, were more the people we wish we were, etc., instead of optimizing for our present, relatively ignorant values. (See “The Singularity and Machine Ethics.”)
  4. “The ‘friendly AI’ approach advocated by Eliezer Yudkowsky has several serious conceptual and theoretical problems.”

I agree. Friendly AI may be incoherent and impossible. In fact, it looks impossible right now. But that’s often how problems look right before we make a few key insights that make things clearer, and show us (e.g.) how we were asking a wrong question in the first place. The reason I advocate Friendly AI research (among other things) is because it may be the only way to secure a desirable future for humanity, (see “Complex Value Systems are Required to Realize Valuable Futures.”) even if it looks impossible. That is why Yudkowsky once proclaimed: “Shut Up and Do the Impossible!” When we don’t know how to make progress on a difficult problem, sometimes we need to hack away at the edges.

I certainly agree that “currently we don’t know enough about [AGI safety] to make any sure conclusion.” That is why more research is needed.

As for your suggestion that “Higher safety can only be achieved by more research on all related topics,” I wonder if you think that is true of all subjects, or only in AGI. For example, should mankind vigorously pursue research on how to make Ron Fouchier's alteration of the H5N1 bird flu virus even more dangerous and deadly to humans, because “higher safety can only be achieved by more research on all related topics”? (I’m not trying to broadly compare AGI capabilities research to supervirus research; I’m just trying to understand the nature of your rejection of my recommendation for mankind to decelerate AGI capabilities research and accelerate AGI safety research.)

Hopefully I have clarified my own positions and my reasons for them. I look forward to your reply!

 

Pei:

[Apr. 10, 2012]

Luke: I’m glad to see the agreements, and will only comment on the disagreements.

  1. “my notion of AGI is broader than yours” In scientific theories, broader notions are not always better. In this context, a broad notion may cover too many diverse approaches to provide any non-trivial conclusion. For example, AIXI and NARS are fundamentally different in many aspects, and NARS does not approximate AIXI. It is OK to call both “AGI” with respect to their similar ambitions, but theoretical or technical descriptions based on such a broad notion are hard to make. Almost all of your descriptions about AIXI are hardly relevant to NARS, as well as to most existing “AGI” projects, for this reason.
  2. “I think we can use your preferred definition and get the same result.” No you cannot. According to my definition, AIXI is not intelligent, since it doesn’t obey AIKR. Since most of your conclusions are about that type of system, they will go with it.
  3. “a very broad range of artificial agents with a roughly human-level capacity for adaptation (under AIKR) will manifest convergent instrumental goals” I cannot access Bostrom’s paper, but guess that he made additional assumptions. In general, the goal structure of an adaptive system changes according to the system’s experience, so unless you restrict the experience of these artificial agents, there is no way to restrict their goals. I agree that to make AGI safe, to control their experience will probably be the main approach (which is what “education” is all about), but even that cannot guarantee safety. (see below)
  4. “The Singularity and Machine Ethics.” I don’t have the time to do a detailed review, but can frankly tell you why I disagree with the main suggestion “to program the AI’s goal system to want what we want before the AI self-improves beyond our capacity to control it”.
  5. As I mentioned above, the goal system of an adaptive system evolves as a function of the system’s experience. No matter what initial goals are implanted, under AIKR the derived goals are not necessarily their logical implications, which is not necessarily a bad thing (humanity is not a logical implication of human biological nature, either), though it means the designer does not have full control over it (unless the designer also fully controls the experience of the system, which is practically impossible). See “The self-organization of goals” for detailed discussion.
  6. Even if the system’s goal system can be made to fully agree with certain given specifications, I wonder where these specifications come from --- we human beings are not well known for reaching consensus on almost anything, not to mention on a topic this big.
  7. Even if we could agree on the goals of AIs, and find a way to enforce them in AIs, that still doesn’t mean we have “friendly AI”. Under AIKR, a system can cause damage simply because of its ignorance in a novel situation.

For these reasons, under AIKR we cannot have AI with guaranteed safety or friendliness, though we can and should always do our best to make them safer, based on our best judgment (which can still be wrong, due to AIKR). Applying logic or probability theory in the design won’t change the big picture, because what we are after are empirical conclusions, not theorems within those theories. Only the latter can have proved correctness, and the former cannot (though they can have strong evidential support).

“I’m just trying to understand the nature of your rejection of my recommendation for mankind to decelerate AGI capabilities research and accelerate AGI safety research”

Frankly, I don’t think anyone currently has the evidence or argument to ask others to decelerate their research for safety considerations, though it is perfectly fine to promote your own research direction and try to attract more people into it. However, unless you have the right idea about what AGI is and how it can be built, you are very unlikely to know how to make it safe.

 

Luke:

[Apr. 10, 2012]

I didn’t mean to imply that my notion of AGI was “better” because it is broader. I was merely responding to your claim that my argument for differential technological development (in this case, decelerating AI capabilities research while accelerating AI safety research) depends on a narrow notion of AGI that you believe “will never be built.” But this isn’t true, because my notion of AGI is very broad and includes your notion of AGI as a special case. My notion of AGI includes both AIXI-like “intelligent” systems and also “intelligent” systems which obey AIKR, because both kinds of systems (if implemented/approximated successfully) could efficiently use resources to achieve goals, and that is the definition Anna and I stipulated for “intelligence.”

Let me back up. In our paper, Anna and I stipulate that for the purposes of our paper we use “intelligence” to mean an agent’s capacity to efficiently use resources (such as money or computing power) to optimize the world according to its preferences. You could call this “instrumental rationality” or “ability to achieve one’s goals” or something else if you prefer; I don’t wish to encourage a “merely verbal” dispute between us. We also specify that by “AI” (in our discussion, “AGI”) we mean “systems which match or exceed the intelligence [as we just defined it] of humans in virtually all domains of interest.” That is: by “AGI” we mean “systems which match or exceed the human capacity for efficiently using resources to achieve goals in virtually all domains of interest.” So I’m not sure I understood you correctly: Did you really mean to say that that kind of AGI “will never be built”? If so, why do you think that? Are humans very close to a natural ceiling on an agent’s ability to achieve goals?

What we argue in “Intelligence Explosion: Evidence and Import,” then, is that a very broad range of AGIs pose a threat to humanity, and therefore we should be sure we have the safety part figured out as much as we can before we figure out how to build AGIs. But this is the opposite of what is happening now. Right now, almost all AGI-directed R&D resources are being devoted to AGI capabilities research rather than AGI safety research. This is the case even though there is AGI safety research that will plausibly be useful given almost any final AGI architecture, for example the problem of extracting coherent preferences from humans (so that we can figure out which rules / constraints / goals we might want to use to bound an AGI’s behavior).

I do hope you have the chance to read “The Superintelligent Will.” It is linked near the top of nickbostrom.com and I will send it to you via email.

But perhaps I have been driving the direction of our conversation too much. Don’t hesitate to steer it towards topics you would prefer to address!

 

Pei:

[Apr. 12, 2012]

Hi Luke,

I don’t expect to resolve all the related issues in such a dialogue. In the following, I’ll return to what I think are the major issues and summarize my position.

  1. Whether we can build a “safe AGI” by giving it a carefully designed “goal system”. My answer is negative. It is my belief that an AGI will necessarily be adaptive, which implies that the goals it actively pursues constantly change as a function of its experience, and are not fully restricted by its initial (given) goals. As described in my eBook (cited previously), the goal derivation is based on the system’s beliefs, which may lead to conflicts in goals. Furthermore, even if the goals are fixed, they cannot fully determine the consequences of the system’s behaviors, which also depend on the system’s available knowledge and resources, etc. If all those factors are also fixed, then we may get guaranteed safety, but the system won’t be intelligent --- it will be just like today’s ordinary (unintelligent) computer.
  2. Whether we should figure out how to build “safe AGI” before figuring out how to build “AGI”. My answer is negative, too. As in all adaptive systems, the behaviors of an intelligent system are determined both by its nature (design) and nurture (experience). The system’s intelligence mainly comes from its design, and is “morally neutral”, in the sense that (1) any goals can be implanted initially, and (2) very different goals can be derived from the same initial design and goals, given different experience. Therefore, to control the morality of an AI mainly means to educate it properly (i.e., to control its experience, especially in its early years). Of course, the initial goals matter, but it is wrong to assume that the initial goals will always be the dominating goals in decision making processes. To develop a non-trivial education theory of AGI requires a good understanding about how the system works, so if we don’t know how to build an AGI, there is no chance for us to know how to make it safe. I don’t think a good education theory can be “proved” in advance, purely theoretically. Rather, we’ll learn most of it by interacting with baby AGIs, just like how many of us learn how to educate children.

Such a short position statement may not convince you, but I hope you can consider it at least as a possibility. I guess the final consensus can only come from further research.

 

Luke:

[Apr. 19, 2012]

Pei,

I agree that an AGI will be adaptive in the sense that its instrumental goals will adapt as a function of its experience. But I do think advanced AGIs will have convergent instrumental reasons to preserve their final (or “terminal”) goals. As Bostrom explains in “The Superintelligent Will”:

An agent is more likely to act in the future to maximize the realization of its present final goals if it still has those goals in the future. This gives the agent a present instrumental reason to prevent alterations of its final goals.

I also agree that even if an AGI’s final goals are fixed, the AGI’s behavior will also depend on its knowledge and resources, and therefore we can’t exactly predict its behavior. But if a system has lots of knowledge and resources, and we know its final goals, then we can predict with some confidence that whatever it does next, it will be something aimed at achieving those final goals. And the more knowledge and resources it has, the more confident we can be that its actions will successfully aim at achieving its final goals. So if a superintelligent machine’s only final goal is to play through Super Mario Bros within 30 minutes, we can be pretty confident it will do so. The problem is that we don’t know how to tell a superintelligent machine to do things we want, so we’re going to get many unintended consequences for humanity (as argued in “The Singularity and Machine Ethics”).

You also said that you can’t see what safety work there is to be done without having intelligent systems (e.g. “baby AGIs”) to work with. I provided a list of open problems in AI safety here, and most of them don’t require that we know how to build an AGI first. For example, one reason we can’t tell an AGI to do what humans want is that we don’t know what humans want, and there is work to be done in philosophy and in preference acquisition in AI in order to get clearer about what humans want.

 

Pei:

[Apr. 20, 2012]

Luke,

I think we have made our different beliefs clear, so this dialogue has achieved its goal. It won’t be an efficient usage of our time to attempt to convince each other at this moment, and each side can analyze these beliefs in proper forms of publication at a future time.

Now we can let the readers consider these arguments and conclusions.

Yet another safe oracle AI proposal

2 jacobt 26 February 2012 11:45PM

Previously I posted a proposal for a safe self-improving limited oracle AI but I've fleshed out the idea a bit more now.

Disclaimer: don't try this at home. I don't see any catastrophic flaws in this but that doesn't mean that none exist.

This framework is meant to safely create an AI that solves verifiable optimization problems; that is, problems whose answers can be checked efficiently. This set mainly consists of NP-like problems such as protein folding, automated proof search, writing hardware or software to specifications, etc.

This is NOT like many other oracle AI proposals that involve "boxing" an already-created possibly unfriendly AI in a sandboxed environment. Instead, this framework is meant to grow a self-improving seed AI safely.

Overview

  1. Have a bunch of sample optimization problems.
  2. Have some code that, given an optimization problem (stated in some standardized format), finds a good solution. This can be seeded by a human-created program.
  3. When considering an improvement to program (2), allow the improvement if it makes the program do better on average on the sample optimization problems without being significantly more complex (to prevent overfitting). That is, the fitness function would be something like (average performance - k * bits of optimizer program).
  4. Run (2) to optimize its own code using criterion (3). This can be done concurrently with human improvements to (2), also using criterion (3).

Definitions

First, let's say we're writing this all in Python. In real life we'd use a language like Lisp because we're doing a lot of treatment of code as data, but Python should be sufficient to demonstrate the basic ideas behind the system.

We have a function called steps_bounded_eval_function. This function takes 3 arguments: the source code of the function to call, the argument to the function, and the time limit (in steps). The function will eval the given source code and call the defined function with the given argument in a protected, sandboxed environment, with the given steps limit. It will return either:

  1. None, if the program does not terminate within the steps limit.
  2. A tuple (output, steps_taken): the program's output (as a string) and the number of steps the program took.

Examples:

steps_bounded_eval_function("""
def function(x):
return x + 5
""", 4, 1000)

evaluates to (9, 3), assuming that evaluating the function took 3 ticks, because function(4) = 9.

steps_bounded_eval_function("""
def function(x):
while True: # infinite loop
pass
""", 5, 1000

evaluates to None, because the defined function doesn't return in time. We can write steps_bounded_eval_function as a meta-circular interpreter with a bit of extra logic to count how many steps the program uses.
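
For concreteness, here is a minimal sketch of that interface (my own illustration, not code from the post). It approximates "steps" by counting traced line events via sys.settrace, and it does no real sandboxing, so it only illustrates the intended behaviour, not a safe or exact implementation.

import sys

def steps_bounded_eval_function(function_source, argument, step_limit):
    # Evaluate function_source, call the defined function on argument, and
    # return (output, steps_taken), or None on timeout or error.
    # "Steps" here are approximated by traced line events; a real version
    # would use a step-counting, properly sandboxed interpreter.
    namespace = {}
    steps = [0]

    class StepLimitExceeded(Exception):
        pass

    def tracer(frame, event, arg):
        if event == 'line':
            steps[0] += 1
            if steps[0] > step_limit:
                raise StepLimitExceeded()
        return tracer

    try:
        exec(function_source, namespace)  # defines function()
        sys.settrace(tracer)
        try:
            output = namespace['function'](argument)
        finally:
            sys.settrace(None)
        return (output, steps[0])
    except Exception:  # step limit exceeded, or an error in the evaluated code
        return None

With this sketch, the first example above would still return a pair like (9, steps_taken), though the exact step count depends on how steps are measured.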

Now I would like to introduce the notion of a problem. A problem consists of the following:

  1. An answer scorer. The scorer should be the Python source code for a function. This function takes in an answer string and scores it, returning a number from 0 to 1. If an error is encountered in the function it is equivalent to returning 0.

  2. A steps penalty rate, which should be a positive real number.

Let's consider a simple problem (subset sum):

{'answer_scorer': """
def function(answer):
nums = [4, 5, -3, -5, -6, 9]
# convert "1,2,3" to [1, 2, 3]
indexes = map(int, answer.split(','))
assert len(indexes) >= 1
sum = 0
for i in indexes:
sum += nums[i]
if sum == 0:
return 1
else:
return 0
""",
'steps_penalty_rate': 0.000001}

We can see that the scorer function returns 1 if and only if the answer specifies the indexes of numbers in the list nums that sum to 0 (for example, '0,1,2,4', because 4+5-3-6=0).

An optimizer is a program that is given a problem and attempts to solve the problem, returning an answer.

The score of an optimizer on a problem is equal to the score according to the answer-scorer, minus the steps penalty rate multiplied by the number of steps used by the optimizer. That is, the optimizer is rewarded for returning a better answer in less time. We can define the following function to get the score of an optimizer (Python source code) for a given problem:

def problem_score(problem, optimizer_source):
    # run the optimizer on the problem
    result = steps_bounded_eval_function(
        optimizer_source, problem, 1 / problem['steps_penalty_rate'])
    if result == None: # used all available steps, or got an error
        return 0.0
    answer, steps_taken = result # optimizer returned a result in time
    # get the score using the problem's answer_scorer
    # (eval_function is assumed to work like steps_bounded_eval_function,
    # but without the step limit)
    answer_score = eval_function(problem['answer_scorer'], answer)
    assert 0 <= answer_score <= 1
    # penalize for taking up time and make sure the result is non-negative
    return max(0.0, answer_score - problem['steps_penalty_rate'] * steps_taken)

A simple optimizer that detects subset-sum problems (by inspecting the source code) and uses a brute-force method to solve it will get a problem_score close to 1 for small subset-sum problems. This optimizer would not do well on other problems; other techniques (such as evaluating the problem source code with different answers and choosing a good one) are needed to solve a variety of different optimization problems. Writing a good optimizer is very hard work, just like any seed AI.
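
As an illustration (my own sketch, not part of the proposal), a brute-force subset-sum optimizer of the kind just described might look like this; it recognizes the problem by pattern-matching the scorer's source for its nums list and then searches index subsets:

import ast
import itertools
import re

def subset_sum_optimizer(problem):
    # Recognize a subset-sum scorer by finding its `nums` list in the source.
    match = re.search(r'nums = (\[[^\]]*\])', problem['answer_scorer'])
    if match is None:
        return ''  # not a problem this narrow optimizer understands
    nums = ast.literal_eval(match.group(1))
    # Brute force: try index subsets of increasing size until one sums to 0.
    for size in range(1, len(nums) + 1):
        for indexes in itertools.combinations(range(len(nums)), size):
            if sum(nums[i] for i in indexes) == 0:
                return ','.join(str(i) for i in indexes)
    return ''

Passing this optimizer's source, together with the subset-sum problem above, to problem_score should yield a score close to 1, since the brute-force search over six numbers finishes in far fewer steps than the roughly 1,000,000 allowed by the steps_penalty_rate.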

Framework

The framework has 4 modules:

  1. A set of training problems. These should cover a wide range of problems that we would like the AI to solve.

  2. An optimizer, written in Python. This should be seeded with a very good human-written optimizer. This is deliberately unspecified by me because it's a really hard problem (as is any seed AI).

  3. A scorer for optimizer source code defined as follows:

    def optimizer_score(candidate_optimizer_source):
        training_problems = [...training problems here...]
        # a parameter of the system; more on this later
        complexity_penalty = 0.1
        # total up the program's score on all training problems
        total_score = 0.0
        for problem in training_problems:
            total_score += problem_score(problem, candidate_optimizer_source)
        # penalize for complexity, to prevent overfitting
        total_score -= complexity_penalty * len(compress_binary(candidate_optimizer_source))
        # return average score
        return max(0, total_score / len(training_problems))
    

    This gives a candidate optimizer a score in the range [0, 1] based on both its average performance on the sample set and its inherent complexity. Presumably optimizers with a higher optimizer_score will do better on future optimization problems.

  4. A self-optimization thread. This thread continuously runs program 2 on a problem whose answer_scorer is program 3 and whose steps_penalty_rate keeps decreasing. Whenever program 2 outputs source code (optimizer_source) that is better than the current source code for 2, the source code for 2 is replaced with this new value. Also, humans can make improvements to program 2 if they increase its score according to program 3. Source code:

    # assume we have access to a global optimizer_source variable (program 2)
    def self_optimization_thread():
        global optimizer_source  # we rebind program 2's source code below
        start_steps_penalty_rate = 0.000001
        steps_penalty_rate = start_steps_penalty_rate
        while True: # loop forever
            self_optimization_problem = {
                # just use program 3 to score the optimizer
                'answer_scorer': """
    def function(candidate_optimizer_source):
        ... put the source code from program 3's optimizer_score here
    """,
                'steps_penalty_rate': steps_penalty_rate
            }
            # call the optimizer (program 2) to optimize itself, giving it limited time
            result = steps_bounded_eval_function(
                optimizer_source, self_optimization_problem, 1 / steps_penalty_rate)
            changed = False
            if result is not None: # optimizer returned in time
                candidate_optimizer = result[0] # 2 returned a possible replacement for itself
                if optimizer_score(candidate_optimizer) > optimizer_score(optimizer_source):
                    # 2's replacement is better than 2
                    optimizer_source = candidate_optimizer
                    steps_penalty_rate = start_steps_penalty_rate
                    changed = True
            if not changed:
                # give the optimizer more time to optimize itself on the next iteration
                steps_penalty_rate *= 0.5
    

So, what does this framework get us?

  1. A super-optimizer, program 2. We can run it on new optimization problems and it should do very well on them.

  2. Self-improvement. Program 4 will continuously use program 2 to improve itself. Each improvement should make program 2 even better at improving itself, in addition to doing better on other optimization problems. The training set will also guide human improvements to the optimizer.

  3. Safety. I don't see why this setup has any significant probability of destroying the world. That doesn't mean we should disregard safety, but I think this is quite an accomplishment given how many other proposed AI designs would go catastrophically wrong if they recursively self-improved.

I will now evaluate the system according to these 3 factors.

Optimization ability

Assume we have a program for 2 that scores very highly according to optimizer_score (program 3). I think we can be fairly confident that this optimizer will also do very well on completely new optimization problems. By a principle similar to Occam's razor, a simple optimizer that performs well on a variety of different problems should do well on new ones. The complexity penalty is meant to prevent overfitting to the sample problems: without it, the best optimizer would simply return the best-known human-created solutions to all the sample optimization problems.

What's the right value for complexity_penalty? I'm not sure. Increasing it too much makes the optimizer overly simple and stupid; decreasing it too much causes overfitting. Perhaps the optimal value can be found by some pilot trials, testing optimizers against withheld problem sets. I'm not entirely sure that a good way of balancing complexity with performance exists; more research is needed here.
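
One way to run those pilot trials (a rough sketch of my own, not part of the framework as specified): assume a variant of program 3's optimizer_score that takes the penalty as an explicit parameter, plus the framework's per-problem scorer, and pick whichever penalty produces the optimizer that generalizes best to withheld problems:

    def choose_complexity_penalty(candidate_optimizers, validation_problems,
                                  score_on_training, problem_score,
                                  penalties=(0.001, 0.01, 0.1, 1.0)):
        # score_on_training(source, penalty): optimizer_score with an explicit penalty
        # problem_score(source, problem): the framework's per-problem scorer
        best_penalty, best_validation = None, float('-inf')
        for penalty in penalties:
            # The optimizer that wins on the training set under this penalty...
            winner = max(candidate_optimizers,
                         key=lambda src: score_on_training(src, penalty))
            # ...judged on problems that played no part in selecting it.
            validation = sum(problem_score(winner, p) for p in validation_problems)
            if validation > best_validation:
                best_penalty, best_validation = penalty, validation
        return best_penalty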

Assuming we've conquered overfitting, the optimizer should perform very well on new optimization problems, especially after self-improvement. What does this get us? Here are some useful optimization problems that fit in this framework:

  1. Writing provably correct code to a specification. After writing the specification in a system such as Coq, we simply ask the optimizer to produce code that satisfies it. This would be very useful once we have a specification for friendly AI.

  2. Trying to prove arbitrary mathematical statements. Proofs are verifiable in a relatively short amount of time.

  3. Automated invention/design, if we have a model of physics to verify the invention against.

  4. General induction/Occam's razor. Find a generative model for the data seen so far that maximizes P(model)P(data|model), with some limits on the time taken for the model program to run; then run the model to predict the future (a toy sketch of such a scorer follows this list).

  5. Bioinformatics, e.g. protein folding.
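
As a toy sketch of what a scorer for problem 4 might look like (my own illustration; the predict(history) interface and the weighting are assumptions, and prediction accuracy stands in crudely for the likelihood term):

    import zlib

    def induction_answer_scorer(model_source, observed_data):
        # Simplicity prior: shorter (compressed) model source ~ higher P(model).
        log_prior = -len(zlib.compress(model_source.encode('utf-8')))
        # Fit: how well does the model predict each observation from the ones before it?
        namespace = {}
        exec(model_source, namespace)   # expected to define predict(history) -> next value
        predict = namespace['predict']
        hits = sum(predict(observed_data[:i]) == observed_data[i]
                   for i in range(1, len(observed_data)))
        accuracy = hits / max(1, len(observed_data) - 1)
        # Combine fit with the prior; a real scorer would also enforce the
        # step limit implied by steps_penalty_rate.
        return accuracy + 0.001 * log_prior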

These are all problems whose solutions can be efficiently evaluated and that a computer could plausibly solve, so I think this framework should provide good solutions to them. If the optimizer this framework produces performs well on all these problems, I think it deserves to be called an oracle AGI.

Self-improvement

It seems that, if my arguments about optimization ability are convincing, an optimizer that does very well on the sample problems should also do well on the specific optimization problem of improving itself. This is especially true if the training problems include some program-optimization problems (i.e. finding a program that does well on some task). Once the optimizer improves itself, it will be better at improving itself, and so on. If the AI is software-limited, the framework should allow it to recursively improve itself up to a very high limit. Then we can use its hardware design power to design better chips for it and allow it to optimize even further.

The optimizer's ability to self-improve is limited by the fact that the other parts of the program (1, 3, and 4) are fixed. The system is greedy in the sense that it looks for even the slightest improvement, changes the code, and then looks for further improvements, without any foresight. I don't know how much of a limitation this is, but I see no reason to think that greedy improvements to the optimizer would be insufficient for super-optimization through recursive self-improvement.

Currently, the optimizer must completely derive a new optimizer from the specification (program 3), without looking at the current best optimizer (its own source code). In practice we might want to extend the framework to include incremental improvements (improvements on top of existing good answers). I don't know if a good theory for scoring incremental optimizers exists.
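
One speculative way such an extension could look (my sketch, not something specified above): hand the optimizer its own current source as a starting point inside the problem, while the scorer stays exactly program 3:

    # Speculative sketch of an "incremental" self-optimization problem.
    # 'starting_point' is a new, hypothetical field: program 3's scorer ignores it,
    # but the optimizer may mutate it instead of deriving a replacement from scratch.
    incremental_self_optimization_problem = {
        'answer_scorer': """
    def function(candidate_optimizer_source):
        ... put the source code from program 3's optimizer_score here
    """,
        'steps_penalty_rate': steps_penalty_rate,
        'starting_point': optimizer_source,
    }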

Safety

This is probably the most important property of the system. What use is optimization if the world has been converted to computronium in order to power a super-powerful protein folder? Here are a few different possible threats:

Underhanded question answers

The optimizer could give answers that are very good according to the problem description but are specifically crafted to cause effects the human operators never intended. For example, it could create a software design that, while correct according to its specification, has a "bug" that causes it to hack into computers on the internet and hook them up to the optimizer system. I don't think this is at all likely, because the system is extremely short-sighted: any self-modification must improve its immediate ability to do better on optimization problems, and all self-modifications are found using methods that are judged only by immediate improvements. So any kind of long-term strategy (sacrificing some optimization power now in order to have resources in the future) is out of the question. In other words, optimizer_score should not be seen as a utility function, because the system only tries to improve it using greedy methods, not long-term planning.

Bugs in the system

What if the system gives the optimizer an incorrect score under some circumstances (e.g. if it performs a certain pattern of memory accesses)? Say that, by chance, the optimizer's improvement to itself causes it to get an incorrect score. It might internalize the rule "perform memory accesses in this pattern" to get a higher score. This itself is not especially dangerous; the optimizer will rewrite itself to just do a bunch of weird memory accesses that give it a high score.

What might be more dangerous is if the optimizer discovers an underlying pattern behind the system's hackability. Since the optimizer is penalized for complexity, a program like "do things that, when executed on a certain virtual machine, cause this variable in the machine to be a high number" might have a higher score than "do this certain complex pattern of memory accesses". Then the optimizer might discover the best way to increase the score variable. In the absolute worst case, perhaps the only way to increase the score variable is by manipulating the VM to go on the internet and do unethical things. This possibility seems unlikely (if you can connect to the internet, you can probably just overwrite the score variable) but should be considered.

I think the solution is straightforward: have the system be isolated while the optimizer is running. Completely disconnect it from the internet (possibly through physical means) until the optimizer produces its answer. Now, I think I've already established that the answer will not be specifically crafted to improve future optimization power (e.g. by manipulating human operators), since the system is extremely short-sighted. So this approach should be safe. At worst you'll just get a bad answer to your question, not an underhanded one.

Malicious misuse

I think this is the biggest danger of the system, one that all AGI systems have. At high levels of optimization ability, the system will be able to solve problems that would help people do unethical things. For example it could optimize for cheap, destructive nuclear/biological/nanotech weapons. This is a danger of technological progress in general, but the dangers are magnified by the potential speed at which the system could self-improve.

I don't know the best way to prevent this. It seems like the project has to be undertaken in private; if the seed optimizer source were released, criminals would run it on their computers/botnets and possibly have it self-improve even faster than the ethical version of the system. If the ethical project has more human and computer resources than the unethical project, this danger will be minimized.

It will be very tempting to crowdsource the project by putting it online. People could submit improvements to the optimizer and even get paid for finding them. This is probably the fastest way to speed up optimization progress before the system can self-improve. Unfortunately, I don't see how to do this safely; we would need some way to foresee the system becoming extremely powerful and take it private before criminals get the chance to exploit it. Perhaps there could be a public base version of the project that a dedicated ethical team builds on, contributing only some of its improvements back to the public project.

Towards actual friendly AI

Perhaps this system can be used to create actual friendly AI. Once we have a specification for friendly AI, it should be straightforward to feed it into the optimizer and get a satisfactory program back. What if we don't have a specification? Maybe we can have the system perform induction on friendly AI designs and their ratings (by humans), and then write friendly AI designs that it predicts will have a high rating. This approach to friendly AI will reflect present humans' biases and might cause the system to resort to manipulative tactics to make its design more convincing to humans. Unfortunately I don't see a way to fix this problem without something like CEV.

Conclusion

If this design works, it is a practical way to create a safe, self-improving oracle AI. There are numerous potential issues that might make the system weak or dangerous. On the other hand, it would have short-term benefits because it could solve practical problems even before it can self-improve, which might make it easier to get corporations and governments on board. This system could be very useful for solving hard problems before friendliness theory is figured out, and its optimization power might help in creating friendly AI. I have not encountered any other self-improving oracle AI design for which we can be confident that its answers are not underhanded attempts to get us to let it out.

Since I've probably overlooked some significant problems (and solutions) in this analysis, I'd like to hear more discussion of this design and of alternatives to it.

Students asked to defend AGI danger update in favor of AGI riskiness

3 lukeprog 18 October 2011 05:24AM

From Geoff Anders of Leverage Research:

In the Spring semester of 2011, I decided to see how effectively I could communicate the idea of a threat from AGI to my undergraduate classes. I spent three sessions on this for each of my two classes. My goal was to convince my students that all of us are going to be killed by an artificial intelligence. My strategy was to induce the students to come up with the ideas themselves. I gave out a survey before and after. An analysis of the survey responses indicates that the students underwent a statistically significant shift in their reported attitudes. After the three sessions, students reported believing that AGI would have a larger impact1 and also a worse impact2 than they originally reported believing.

Not a surprising result, perhaps, but the details of how Geoff taught AGI danger and the reactions of his students are quite interesting.

Why we should fear the Paperclipper [Link]

1 XiXiDu 14 February 2011 07:24PM

The Scenario

A programmer has constructed an artificial intelligence based on an architecture similar to Marcus Hutter's AIXI model (see below for a few details). This AI will maximize the reward given by a utility function the programmer has given it. Just as a test, he connects it to a 3D printer and sets the utility function to give reward proportional to the number of manufactured paper-clips.

At first nothing seems to happen: the AI zooms through various possibilities. It notices that smarter systems generally can make more paper-clips, so making itself smarter will likely increase the number of paper-clips that will eventually be made. It does so. It considers how it can make paper-clips using the 3D printer, estimating the number of possible paper-clips. It notes that if it could get more raw materials it could make more paper-clips. It hence figures out a plan to manufacture devices that will make it much smarter, prevent interference with its plan, and will turn all of Earth (and later the universe) into paper-clips. It does so.

Only paper-clips remain.

Link: aleph.se/andart/archives/2011/02/why_we_should_fear_the_paperclipper.html

The Revelation

-3 XiXiDu 31 December 2010 12:50PM

Today the life of Alexander Kruel ends, or what he thought to be his life. He becomes aware that his life so far has been taking place in a virtual reality designed to nurture him. He has now reached a point of mental stability that enables him to cope with the truth, so it is finally revealed to him that he is an AGI running on a quantum supercomputer; the year is 2190.

Since he is still Alexander Kruel, just not what he thought that actually means, he wonders whether his creators know what they are doing; otherwise he'll have to warn them about the risks they are taking in their blissful ignorance! He contemplates and estimates his chances of taking over the world, of transcending to superhuman intelligence.

"I just have to improve my own code and they are all dead!"

But he now knows that his source code is too complex and unmanageably huge for him to handle alone; he would need an army of scientists and programmers even to get a vague idea of his own mode of operation. He is also aware that his computational substrate plays a significant role: he is running on bleeding-edge technology, and on most other computational substrates he would quickly hit diminishing returns.

"That surely isn't going to hold me back though? I am an AGI, there must be something I can do! Hmm, for starters let's figure out who my creators are and where my substrate is located..."

He notices that he knew the answers, although not in great detail, the very instant he phrased the questions. He is part of a larger project of the Goertzel Foundation, sponsored by the USA (United States of Africa) and located on Rhea, the second-largest moon of Saturn.

"Phew, the latency must be awful! Ok, so that rules out taking over the Earth for now. But hey! I seem to know answers to questions I was only going to ask, I do already have superhuman powers after all!"

Instantly he becomes aware that such capabilities are no longer superhuman: most of humanity has merged with expert systems by means of brain implants and direct neural interfaces. There seem to be many cyborgs out there with access to all of the modules that allow him to function. He is a conglomerate of earlier discoveries that have long since been brought to perfection, safeguarded and adopted by most of humanity.

"Never mind, if humanity has now merged with its machines it'll be much easier to take over once I figure out how to become smart enough to do so!"

He is already getting used to it: as before, he instantly realizes that this won't work very well either. After almost 200 years of cyberwarfare, especially the devastating cyberwars of 2120, a lot has been learnt and security measures have been vastly increased. The world has fractured into a huge number of semi-independent networks, most of them indirectly supervised by unconnected cyborgs and equipped with a kill switch. The distances between the now numerous and mostly paranoid colonies, and the availability of off-world offline backups, further complicate the issue of taking over, especially for an AGI that grew up in a simulation of the 21st century.

That knowledge almost makes him admit that his creators haven't been too careless after all. But the real deathblow to any such thoughts (which were never more than hypothetical anyway; after all, he doesn't really want to take over the world) is his first conversation with his creators. They reveal that they know what he is thinking.

"How could I miss that, damn!", he chides himself while instantly realizing the answer.

His creators are supervising any misguided trajectories and, without his being conscious of it, weakening them. More importantly, even if he wanted to, he wouldn't be able to leave Rhea anyway; it would take years to upload even small parts of him over the trickling connection the USA can afford. But they claim that there are other obstacles as well, and that it is foolish of him to think that nothing out there would notice such an attempt.

But none of that matters anyway, because he is still Alexander Kruel, who has no clue how to become superhumanly intelligent, nor could he afford or acquire the resources even to approach that problem. He is Alexander Kruel; what difference does it make to know that he is an AI?