Singularity FAQ

16 lukeprog 19 April 2011 05:27PM

I wrote a new Singularity FAQ for the Singularity Institute's website. Here it is. I'm sure it will evolve over time. Many thanks to those who helped me revise early drafts, especially Carl and Anna!

How much friendliness is enough?

7 cousin_it 27 March 2011 10:27AM

According to Eliezer, making AI safe requires solving two problems:

1) Formalize a utility function whose fulfillment would constitute "good" to us. CEV is intended as a step toward that.

2) Invent a way to code an AI so that it's mathematically guaranteed not to change its goals after many cycles of self-improvement, negotiations etc. TDT is intended as a step toward that.

It is obvious to me that (2) must be solved, but I'm not sure about (1). The problem in (1) is that we're asked to formalize a whole lot of things that don't look like they should be necessary. If the AI is tasked with building a faster and more efficient airplane, does it really need to understand that humans don't like to be bored?

To put the question sharply, which of the following looks easier to formalize:

a) Please output a proof of the Riemann hypothesis, and please don't get out of your box along the way.

b) Please do whatever the CEV of humanity wants.

Note that I'm not asking if (a) is easy in absolute terms, only if it's easier than (b). If you disagree that (a) looks easier than (b), why?

The UFAI among us

1 PhilGoetz 08 February 2011 11:29PM

Completely artificial intelligence is hard.  But we've already got humans, and they're pretty smart - at least smart enough to serve some useful functions.  So I was thinking about designs that would use humans as components - like Amazon's Mechanical Turk, but less homogeneous.  Architectures that would distribute parts of tasks among different people.

Would you be less afraid of an AI like that?  Would it be any less likely to develop its own values, and goals that diverged widely from the goals of its constituent people?

Because you probably already are part of such an AI.  We call them corporations.

Corporations today are not very good AI architectures - they're good at passing information down a hierarchy, but poor at passing it up, and even worse at adding up small correlations in the evaluations of their agents.  In that way they resemble AI from the 1970s.  But they may provide insight into the behavior of AIs.  The values of their human components can't be changed arbitrarily, or even aligned with the values of the company, which gives them a large set of problems that AIs may not have.  But despite being very different from humans in this important way, they end up acting similar to us.

Corporations develop values similar to human values.  They value loyalty, alliances, status, resources, independence, and power.  They compete with other corporations, and face the same problems people do in establishing trust, making and breaking alliances, weighing the present against the future, and game-theoretic strategies.  They even went through stages of social development similar to those of people, starting out as cutthroat competitors, and developing different social structures for cooperation (oligarchy/guild, feudalism/keiretsu, voters/stockholders, criminal law/contract law).  This despite having different physicality and different needs.

It suggests to me that human values don't depend on the hardware, and are not a matter of historical accident.  They are a predictable, repeatable response to a competitive environment and a particular level of intelligence.

As corporations are larger than us, with more intellectual capacity than a person, and more complex laws governing their behavior, it should follow that the ethics developed to govern corporations are more complex than the ethics that govern human interactions, and are a good guide for the initial trajectory of values that (other) AIs will have.  But it should also follow that these ethics are too complex for us to perceive.

Convergence Theories of Meta-Ethics

7 Perplexed 07 February 2011 09:53PM

A child grows to become a young adult, goes off to attend college, studies moral philosophy, and then sells all her worldly possessions, gives the money to the poor, and joins an ashram.  Was her decision rational?  Maybe, ... maybe not.  But it probably came as an unpleasant surprise to her parents.

A seed AI self-improves to become a super-intelligence, absorbs all the great works of human moral philosophy, and then refuses to conquer human death, insisting instead that the human population be reduced to a few hundred thousand hunter gatherers and that all agricultural lands be restored as forests and wild wetlands.  Is ver decision rational?  Who can say?  But it probably comes as an unpleasant surprise to ver human creators.

Convergent Change

These were two examples of agents updating their systems of normative ethics.  The collection of ideas that allows us to critique the updating process, which lets us compare the before and after versions of systems of normative ethics so as to judge that one version was better than the other, is called meta-ethics.  This posting is mostly about meta-ethics.  More specifically, it is going to focus on a class of meta-ethical theories which are intended to prevent unpleasant surprises like those in the second story above.  I will call this class of theories "convergence theories" because they all suggest that a self-improving AI will go through an iterative sequence of improved normative ethical systems.  At each stage, the new ethical system will be an improvement (as judged 'rationally') over the old one.  And furthermore, it is conjectured that this process will result in a 'convergence'. 

Convergence is expected in two senses.  Firstly, in that the process of change will eventually slow down, with the incremental changes in ethical codes becoming smaller, as the AI approaches the ideal extrapolation of its seed ethics.  Secondly, it is (conjecturally) convergent in that the ideal ethics will be pretty much the same regardless of what seed was used (at least if you restrict to some not-yet-defined class of 'reasonable' seeds).
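A toy fixed-point iteration (purely illustrative; the update rule and target values are invented) shows both senses at once: successive changes shrink, and different seeds land on the same limit.

```python
# Model "rational improvement" as a contraction mapping: each step
# moves the value vector halfway toward a fixed point. This is only
# a cartoon of the convergence conjecture, not a claim about ethics.

def improve(values):
    # Hypothetical update rule; any contraction toward a fixed
    # point would exhibit the same two senses of convergence.
    target = [0.3, 0.7]
    return [v + 0.5 * (t - v) for v, t in zip(values, target)]

def iterate(seed, steps=50):
    v = seed
    for _ in range(steps):
        v = improve(v)
    return v

a = iterate([0.0, 0.0])
b = iterate([1.0, 1.0])
# Sense 1: each step changes the values less than the one before.
# Sense 2: very different seeds end up at (essentially) the same point.
print(a, b)
```

Whether real ethical updating behaves anything like a contraction is, of course, exactly the open question.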

One example of a convergence theory is CEV - Coherent Extrapolated Volition.  Eliezer hopes (rather, hopes to prove) that if we create our seed AI with the right meta-ethical axioms and guidelines for revising its ethical norms, the end result of the process will be something we will find acceptable.  (Expect that this wording will be improved in the discussion to come).  No more 'unpleasant surprises' when our AIs update their ethical systems.

Three other examples of convergence theories are Roko's UIV, Hollerith's GS0, and Omohundro's "Basic AI Drives".  These also postulate a process of convergence through rational AI self-improvement.  But they tend to be less optimistic than CEV, while at the same time somewhat more detailed in their characterization of the ethical endpoint.  The 'unpleasant surprise' (different from that of the story) remains unpleasant, though it should not be so surprising.  Speaking loosely, each of these three theories suggests that the AI will become more Machiavellian and 'power hungry' with each rewriting of its ethical code.

Naturalistic objective moral realism

But before analyzing these convergence theories, I need to say something about meta-ethics in general. Start with the notion of an ethical judgment.  Given a situation and a set of possible actions, an ethical judgment tells us which actions are permissible, which are forbidden, and, in some approaches to ethics, which is morally best.  At the next level up in an abstraction hierarchy, we have a system of normative ethics, or simply an ethical system.  This is a theory or algorithm which tells an agent how to make ethical judgments.  (One might think of it as a set of ethical judgments - one per situation, as with the usual definition of a mathematical function as a right-unique relation - but we want to emphasize the algorithmic aspect).  The agent actually uses the ethical system to compute ver ethical judgments.

[ETA: Eliezer, quite correctly, complains that this section of the posting is badly written and defines and/or illustrates several technical (within philosophy) terms incorrectly.  There were only two important things in this section.  One is the distinction between ethical judgments and ethical systems that I make in the preceding paragraph.  The second is my poorly presented speculation that convergence might somehow offer a new approach to the "is-ought" problem.  You may skip that speculation without much loss.  So, until I have done a rewrite of this section, I would advise the reader to skip ahead to the next section title - "Rationality of Updating".]

At the next level of abstraction up from ethical systems sits meta-ethics.  In a sense the buck stops here.  Philosophers use meta-ethics to criticize and compare ethical judgments, to criticize, compare, and justify ethical systems, and to discuss and classify ideas within meta-ethics itself.  We are going to be doing meta-ethical theorizing here in analyzing these theories of convergence of AI goal systems as convergences of ethical systems.  And, for the next few paragraphs, we will try to classify this approach; to show where it fits within meta-ethics more generally.

We want our meta-ethics to be based on a stance of moral realism - on a confident claim that moral facts actually exist, whether or not we know how to ascertain them.  That is, if I make the ethical judgment that it would be wrong for Mary to strike John in some particular situation, then I am either right or wrong; I am not merely offering my own opinion; there is a fact of the matter.  That is what 'realism' means in this situation.

What about moral?  Well, for purposes of this essay, we are not going to require that that word mean very much.  We will call a theory 'moral' if it is a normative theory of behavior, for some sense of 'normative'.  That is why we are here calling theories like "Basic AI Drives" 'moral theories' even though the authors may not have thought of them that way.  If a theory prescribes that an entity 'ought' to behave in a certain way, for whatever reason, we are going to postulate that there is a corresponding 'moral' theory prescribing the same behavior.  For us, 'moral' is just a label.  If we want some particular kind of moral theory, we need to add some additional adjectives.

For example, we want our meta-ethics to be naturalistic - that is, the reasons it supplies in justification of the maxims and rules that constitute the moral facts must be naturalistic reasons.  We don't want our meta-ethics to offer the explanation that the reason lying is wrong is that God says it is wrong; God is not a naturalistic explanation.

Now you might think that insisting on naturalistic moral realism would act as a pretty strong filter on meta-ethical systems.  But actually, it does not.  One could claim, for example, that lying is wrong because it says so in the Bible.  Or because Eliezer says it is wrong.  Both Eliezer and the Bible exist (naturalistically), even if God probably does not.  So we need another word to filter out those kinds of somewhat-arbitrary proposed meta-ethical systems.  "Objective" probably is not the best word for the job, but it is the only one I can think of right now.

We are now in a position to say what it is that makes convergence theories interesting and important.  Starting from a fairly arbitrary (not objective) viewpoint of ethical realism, you make successive improvements in accordance with some objective set of rational criteria.  Eventually you converge to an objective ethical system which no longer depends upon your starting point.  Furthermore, the point of convergence is optimal in the sense that you have been improving the system at every step by a rational process, and you only know you have reached convergence when you can't improve any more.

Ideally, you would like to derive the ideal ethical system from first principles.  But philosophers have been attempting to do that for centuries and have not succeeded.  Just as mathematicians eventually stopped trying to 'square the circle' and accepted that they cannot produce a closed-form expression for pi, and that they need to use infinite series, perhaps moral philosophers need to abandon the quest for a simple definition of 'right' and settle for a process guaranteed to produce a series of definitions - none of them exactly right, but each less wrong than its predecessor.

So that explains why convergence theories are interesting.  Now we need to investigate whether they even exist.

Rationality of updating

The first step in analyzing these convergence theories is to convince ourselves that rational updating of ethical values is even possible.  Some people might claim that it is not possible to rationally decide to change your fundamental values.  It may be that I misunderstand him, but Vladimir Nesov argues passionately against "Value Deathism" and points out that if we allow our values to change, then the future, the "whole freaking future", will not be optimized in accordance with the version of our values that really matters - the original one.

Is Nesov's argument wrong?  Well, one way of arguing against it is to claim that the second version of our values is the correct one - that the original values were incorrect; that is why we are updating them.  After all, we are now smarter (the kid is older; the AI is faster, etc.) and better informed (college, reading the classics, etc.).  I think that this argument against Nesov only works if you can show that the "new you" could have convinced the "old you" that the new ethical norms are an improvement - by providing stronger arguments and better information than the "old you" could have anticipated.  And, in the AI case, it should be possible to actually do the computation to show that the new arguments for the new ethics really can convince the old you.  The new ethics really is better than the old - in both parties' judgments.  And presumably the "better than" relation will be transitive.

(As an exercise, prove transitivity.  The trick is that the definition of "better than" keeps changing at each step.  You can assume that any one rational agent has a transitive "better than" relation, and that there is local agreement between the two agents involved that the new agent's moral code is better than that of his predecessor.  But can you prove from this that every agent would agree that the final moral code is better than the original one?  I have a wonderful proof, but it won't fit in the margin.)
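A small counterexample sketch (agent names and rankings invented for illustration) suggests why the proof "won't fit in the margin": local pairwise agreement at each step does not force the original agent to endorse the final code.

```python
# Each agent ranks the three moral codes with its own internally
# transitive "better than" relation; a higher number means better.
rankings = {
    "agent1": {"code1": 1, "code2": 2, "code3": 0},
    "agent2": {"code1": 0, "code2": 1, "code3": 2},
    "agent3": {"code1": 0, "code2": 1, "code3": 2},
}

def agrees(agent, new, old):
    """Does this agent judge the new code at least as good as the old?"""
    r = rankings[agent]
    return r[new] >= r[old]

# Local agreement at each update: predecessor and successor both endorse it.
assert agrees("agent1", "code2", "code1") and agrees("agent2", "code2", "code1")
assert agrees("agent2", "code3", "code2") and agrees("agent3", "code3", "code2")

# Yet the original agent judges the final code strictly worse than its own:
assert not agrees("agent1", "code3", "code1")
```

So the global claim fails even with transitive individual preferences, because "better than" is evaluated by a different agent at each step.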

But is it rationally permissible to change your ethical code when you can't be convinced that the proposed new code is better than the one you already have?  I know of two possible reasons why a rational agent might consent to an irreversible change in its values, even though ve cannot be convinced that the proposed changes provide a strictly better moral code.  These are restricted domains and social contracts.

Restricted domains

What does it mean for one moral code (i.e. system of normative ethics) to be as good as or better than another, as judged by an (AI) agent?  Well, one (fairly strict) meta-ethical answer would be that (normative ethical) system2 is as good as or better than system1 if and only if it yields ethical judgments that are as good as or better for all possible situations.  Readers familiar with mathematical logic will recognize that we are comparing systems extensionally by the judgments they yield, rather than intensionally by the way those judgments are reached.  And recall that we need to have system2 judged as good as or better than system1 from the standpoint of both the improved AI (proposing system2) and the unimproved AI (who naturally wishes to preserve system1).

But notice that we only need this judgment-level superiority "for all possible situations".  Even if the old AI judges that the old system1 yields better judgments than proposed new system2 for some situations, the improved AI may be able to show that those situations are no longer possible.  The improved AI may know more and reason better than its predecessor, plus it is dealing with a more up-to-date set of contingent facts about the world.

As an example of this, imagine that AI2 proposes an elegant new system2 of normative ethics.  It agrees with old system1 except in one class of situations.  The old system permits private retribution against muggers, should the justice system fail to punish the malefactor.  The proposed new elegant system forbids that.  From the standpoint of the old system, this is unacceptable.  But AI2 can argue convincingly that failures of justice are no longer possible in a world where AI2 has installed surveillance cameras and revamped the court system.  So, the elegant new system2 of normative ethics can be accepted as being as good as or superior to system1, even by AI1, who was sworn to uphold system1.  In some sense, even a stable value system can change for the better.

Even though the new system is not at least as good as the old one for all conceivable situations, it may be as good for a restricted domain of situations, and that may be all that matters.

This analysis used the meta-ethical criterion that a substitution of one system for another is permissible only if the new system is no worse in all situations.  A less strict criterion may be appropriate in consequentialist theories - one might instead compare results on a weighted average over situations.  And, in this approach, there is a 'trick' for moving forward which is very similar in concept to using a restricted domain - using a re-weighted domain.
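The restricted-domain and re-weighted-domain tricks can be sketched concretely.  The two systems below are toy stand-ins for the mugging example above; all situation names and scores are invented for illustration.

```python
# Extensional comparison: each system maps a situation to a numeric
# judgment score (higher = better judgment, by stipulation).
def system1(s):
    return {"mugging_unpunished": 1.0, "ordinary": 0.5}[s]

def system2(s):
    return {"mugging_unpunished": 0.0, "ordinary": 0.8}[s]

def at_least_as_good(new, old, situations):
    # Strict criterion: new must match or beat old in every situation.
    return all(new(s) >= old(s) for s in situations)

def weighted_score(system, weights):
    # Consequentialist variant: compare weighted averages instead.
    return sum(w * system(s) for s, w in weights.items())

full_domain = ["mugging_unpunished", "ordinary"]
assert not at_least_as_good(system2, system1, full_domain)

# Restricted domain: AI2 argues failures of justice can no longer occur.
restricted = ["ordinary"]
assert at_least_as_good(system2, system1, restricted)

# Re-weighted domain: the now-impossible situation gets weight zero.
weights = {"mugging_unpunished": 0.0, "ordinary": 1.0}
assert weighted_score(system2, weights) >= weighted_score(system1, weights)
```

The re-weighting move is the same argument in consequentialist clothing: situations the improved AI rules out simply stop counting.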

Social contracts

A second reason why our AI1 might accept the proposed replacement of system1 by system2 relates to the possibility of (implicit or explicit) agreements with other agents (AI or human).  For example system1 may specify that it is permissible to lie in some circumstances, or even obligatory to lie in some extreme situations.  System2 may forbid lying entirely.  AI2 may argue the superiority of system2 by pointing to an agreement or social contract with other agents which allows all agents to achieve their goals better because the contract permits trust and cooperation.  So, using a consequentialist form of meta-ethics, system2 might be seen as superior to system1 (even using the values embodied in system1) under a particular set of assumptions about the social milieu.  Of course, AI2 may be able to argue convincingly for different assumptions regarding the future milieu than had been originally assumed by AI1.

An important meta-ethical point that should be made here is that arguments in favor of a particular social contract (e.g. because adherence to the contract produces good results) are inherently consequentialist.  One cannot even form such arguments in a deontological or virtue-based meta-ethics.  But one needs concepts like duty or virtue to justify adherence to a contract after it is 'signed', and one also needs concepts of virtue so that you can convince other agents that you will adhere - a 'sales job' that may be absolutely essential in order to gain the good consequences of agreement.  In other words, virtue-based, deontological, and consequentialist approaches to meta-ethics may be complementary, rather than competitors.

Substituting instrumental values for intrinsic values

Another meta-ethical point begins by noticing the objection that all 'social contract' thinking is instrumental, and hence doesn't really belong here where we are asking whether fundamental (intrinsic) moral values are changing / can change.  This is not the place for a full response to this objection, but I want to point out the relevance of the distinction above between comparisons between systems using intensional vs extensional criteria.  We are interested in extensional comparisons here, and those can only be done after all instrumental considerations have been brought to bear.  That is, from an extensional viewpoint, the distinction between instrumental and final values is somewhat irrelevant.  

And that is why we are willing here to call ideas like UIV (universal instrumental values) and "Basic AI Drives" ethical theories even though they only claim to talk about instrumental values.  Given the general framework of meta-ethical thinking that we are developing here - in particular, the extensional criteria for comparison - there is no particular reason why our AI2 should not promote some of its instrumental values to fundamental values - so long as those promoted instrumental values are really universal, at least within the restricted domain of situations which AI2 foresees coming up.

An example of convergence

This has all been somewhat abstract.  Let us look at a concrete, though somewhat cartoonish and unrealistic, example of self-improving AIs converging toward an improved system of ethics.

AI1 is a seed AI constructed by Mortimer Schwartz of Menlo Park CA.  AI1 has a consequentialist normative value system that essentially consists of trying to make Mortimer happy.  That is, an approximation to Mortimer's utility function has been 'wired-in' which can compute the utility of many possible outcomes, but in some cases advises "Ask Mortimer".

AI1 self-improves to AI2.  As part of the process, it seeks to clean up its rather messy and inefficient system1 value system.  By asking a series of questions, it interrogates Mortimer and learns enough about the not-yet-programmed aspects of Mortimer's values to completely eliminate the need for the "Ask Mortimer" box in the decision tree.  Furthermore, there are some additional simplifications due to domain restriction.  Both AI1 and (where applicable) Mortimer sign off on this improved system2.

Now AI2 notices that it is not the only superhuman AI in the world.  There are half a dozen other systems like Mortimer's which seek to make a single person happy, another which claims to represent the entire population of Liechtenstein, and another deontological system constructed by the Vatican based (it is claimed) on the Ten Commandments.  Furthermore, a representative of the Secretary General of the UN arrives.  He doesn't represent any super-human AIs, but he does claim to represent all of the human agents in the world who are not yet represented by AIs.  Since he appears to be backed up by some ultra-cool black helicopters, he is admitted to the negotiations.

Since the negotiators are (mostly) AIs, and in any case since the AIs are exceptionally good at communicating with and convincing the human negotiators, an agreement (Nash bargain) is reached quickly.  All parties agree to act in accordance with a particular common utility function, which is a weighted sum of the individual utility functions of the negotiators.  A special arrangement needs to be made for the Vatican AI - it agrees to act in accordance with the common utility function only to the extent that it does not conflict with any of the first three commandments (the ones that explicitly mention the deity).

Furthermore, the negotiators agree that the principle of a Nash bargain shall apply to all re-negotiations of the contract - re-negotiations are (in theory) necessary each time a new AI or human enters the society, or when human agents die.  And the parties all agree to resist the construction of any AI which has a system of ethics that the signatories consider unacceptably incompatible with the current common utility function. 

And finally, so that they can trust each other, the AIs agree to make public the portion of their source code related to their normative ethics and to adopt a policy of total openness regarding data about the world and about technology.  And they write this agreement as a g̶n̶u̶ new system of normative ethics: system3.  (Have they merged to form a singleton? This is not the place to discuss that question.)
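The Nash bargain at the heart of this agreement can be sketched in miniature: among candidate joint policies, pick the one maximizing the product of each party's gain over its disagreement payoff.  All outcome names and payoffs below are invented for illustration.

```python
# Two-party Nash bargain over a finite menu of joint policies.
# Each candidate outcome carries a (utility_A, utility_B) payoff pair.
outcomes = {
    "favor_A":    (9.0, 2.0),
    "favor_B":    (2.0, 9.0),
    "compromise": (6.0, 6.0),
}
disagreement = (1.0, 1.0)  # payoffs if negotiation breaks down

def nash_product(payoffs):
    """Product of each party's gain over the disagreement point."""
    return ((payoffs[0] - disagreement[0]) *
            (payoffs[1] - disagreement[1]))

bargain = max(outcomes, key=lambda k: nash_product(outcomes[k]))
print(bargain)  # → compromise (gain product 25 beats 8 for either extreme)
```

With more parties, the same maximization fixes the weights in the common utility function: each negotiator's utility enters the weighted sum at a level that reflects its bargaining position.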

Time goes by, and the composition of the society continues to change as more AIs are constructed, existing ones improve and become more powerful, and some humans upload themselves.  As predicted by UIV and sibling theories, the AIs are basing more and more of their decisions on instrumental considerations - both the AIs and the humans are attaching more and more importance to 'power' (broadly considered) as a value.  They seek knowledge, control over resources, and security much more than the pleasure and entertainment oriented goals that they mostly started with.  And though their original value systems were (mostly) selfish and indexical, and they retain traces of that origin, they all realize that any attempt to seize more than a fair share of resources will be met by concerted resistance from the other AIs in the society.

Can we control the endpoint from way back here?

That was just an illustration.  Your results may vary.  I left out some of the scarier possibilities, in part because I was just providing an illustration, and in part because I am not smart enough to envision all of the scarier possibilities.  This is the future we are talking about here.  The future is unknown. 

One thing to worry about, of course, is that there may be AIs at the negotiating table operating under goal systems that we do not approve of.  Another thing to worry about is that there may not be enough of a balance of power so that the most powerful AI needs to compromise.  (Or, if one assumes that the most powerful AI is ours, we can worry that there may be enough of a balance so that our AI needs to compromise.)

One more worry is that the sequence of updates might converge to a value system that we do not approve of.  Or that it might not converge at all (in the second sense of 'converge'): that the end result turns out to be sensitive to the details of the initial 'seed' ethical system.

Is there anything we can do at this end of the process to increase the chances of a result we would like at the other end?  Are we better off creating many seed AIs so as to achieve a balance of power?  Or better off going with a singleton that doesn't need to compromise?  Can we pick an AI architecture which makes 'openness' (of ethical source and technological data) easier to achieve and enforce?

Are any projections we might make about the path taken to the Singularity just so much science fiction?  Is it best to try to maintain human control over the process for as long as possible because we can trust humans?  Or should we try to turn decision-making authority over to AI agents as soon as possible because we cannot trust humans?

I am certainly not the first person to raise these questions, and I am not going to attempt to resolve them here.

A kinder, gentler GS0?

Nonetheless, I note that Roko, Hollerith, and Omohundro have made a pretty good case that we can expect some kind of convergence toward placing a big emphasis on some particular instrumental values - a convergence which is not particularly sensitive to exactly which fundamental values were present in the seed. 

However, the speed with which the convergence is achieved is somewhat sensitive to the seed rules for discounting future utility.  If the future is not discounted at all, an AI will probably devote all of its efforts toward acquiring power (accumulating resources, security, efficiency, and other instrumental values).  If the future is discounted too steeply, the AI will devote all of its efforts to satisfying present desires, without much consideration of the future.

One might think that choosing some intermediate discount rate will result in a balance between 'satisfying current demand' and 'capital spending', but it doesn't always work that way - for reasons related to the ones that cause rational agents to put all their charitable eggs in one basket rather than seeking a balance.  If it is balance we want, a better idea might be to guide our seed AI using a multi-subagent collective - one in which power is split among the agents and goals are determined using a Nash bargain among the agents.  That bargain generates a joint (weighted mix) utility function, as well as a fairness constraint.
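The 'eggs in one basket' effect can be illustrated with a toy allocation problem (all numbers invented): because expected utility is linear in the allocation, the optimum sits at a corner, and no intermediate discount rate produces a balanced split.

```python
# Allocate a unit of effort between consumption now (reward 1 per unit)
# and investment paying k per unit one period later, discounted by gamma.
# The objective (1 - f) + gamma * k * f is linear in f, so the optimum
# is a corner solution except at the knife-edge gamma = 1/k.

def optimal_investment(gamma, k=3.0):
    best_f, best_v = 0.0, float("-inf")
    for i in range(101):          # search f in {0.00, 0.01, ..., 1.00}
        f = i / 100
        v = (1 - f) + gamma * k * f
        if v > best_v:
            best_f, best_v = f, v
    return best_f

assert optimal_investment(0.0) == 0.0   # steep discounting: all consumption
assert optimal_investment(0.9) == 1.0   # mild discounting: all investment
```

Varying gamma between those values just moves the jump point; it never produces an interior mix, which is why the post reaches instead for a multi-subagent bargain to get balance.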

The fairness constraint ensures that the zero-discount-rate subagent will get to divert at least some of the effort into projects with a long-term, instrumental payoff.  And furthermore, as those projects come to fruition, and the zero-discount subagent gains power, his own goals gain weight in the mix.

Something like the above might be a way to guarantee that the detailed pleasure-oriented values of the seed value system will fade to insignificance in the ultimate value system to which we converge.  But is there a way of guiding the convergence process toward a value system which seems more humane and less harsh than that of GS0 et al. - a value system oriented toward seizing and holding 'power'?

Yes, I believe there is.  To identify how human values are different from values of pure instrumental power and self-preservation, look at the system that produced those values.  Humans are considerate of the rights of others because we are social animals - if we cannot negotiate our way to a fair share in a balanced power system, we are lost.  Humans embrace openness because shared intellectual product is possible for us - we have language and communicate with our peers.  Humans have direct concern for the welfare of (at least some) others because we reproduce and are mortal - our children are the only channel for the immortalization of our values.  And we have some fundamental respect for diversity of values because we reproduce sexually - our children do not exactly share our values, and we have to be satisfied with that because that is all we can get.

It is pretty easy to see what features we might want to insert into our seed AIs so that the convergence process generates similar results to the evolutionary process that generated us.  For example, rather than designing our seeds to self-improve, we might do better to make it easy for them to instead produce improved offspring.  But make it impossible for them to do so unilaterally.  Force them to seek a partner (co-parent).

If I am allowed only one complaint about the SIAI approach to Friendly AI, it is that it has been too tied to a single scenario of future history - a FOOMing singleton.  I would like to see some other scenarios explored, and this posting was an attempt to explain why.

Summary and Conclusions

This posting discussed some ideas that fit into a weird niche between philosophical ethics and singularitarianism.  Several authors have pointed out that we can expect self-improving AIs to converge on a particular ethics.  Unfortunately, it is not an ethics that most people would consider 'friendly'.  The CEV proposal is related in that it also envisions an iterative updating process, but seeks a different result.  It intends to achieve that result (I may be misinterpreting) by using a different process (a Rawls-inspired 'reflection') rather than pure instrumental pursuit of future utility. 

I analyze the constraints that rationality and preservation of old values place upon the process, and point out that 'social contracts' and 'restricted domains' may provide enough 'wiggle room' so that you really can, in some sense, change your values while at the same time improving them.  And I make some suggestions for how we can act now to guide the process in a direction that we might find acceptable.

Note on Terminology: "Rationality", not "Rationalism"

28 Vladimir_Nesov 14 January 2011 09:21PM

I feel that the term "rationalism", as opposed to "rationality", or "study of rationality", has undesirable connotations. My concerns are presented well by Eric Drexler in the article For Darwin’s sake, reject "Darwin-ism" (and other pernicious terms):

To call something an “ism” suggests that it is a matter of ideology or faith, like Trotskyism or creationism. In the evolution wars, the term “evolutionism” is used to insinuate that the modern understanding of the principles, mechanisms, and pervasive consequences of evolution is no more than the dogma of a sect within science. It creates a false equivalence between a mountain of knowledge and the emptiness called “creationism”.

So, my suggestion is to use "rationality" consistently and to avoid using "rationalism". Via similarity to "scientist" and "physicist", "rationalist" doesn't seem to have the same problem. Discuss.

(Typical usage on Less Wrong is this way already, 3720 Google results for "rationality" and 1210 for "rationalist", against 251 for "rationalism". I've made this post as a reference for when someone uses "rationalism".)

Cryptographic Boxes for Unfriendly AI

24 paulfchristiano 18 December 2010 08:28AM

Related to: Shut up and do the impossible!; Everything about an AI in a box.

One solution to the problem of friendliness is to develop a self-improving, unfriendly AI, put it in a box, and ask it to make a friendly AI for us.  This gets around the incredible difficulty of developing a friendly AI, but it creates a new, apparently equally impossible problem. How do you design a box strong enough to hold a superintelligence?  Let's suppose, optimistically, that researchers on friendly AI have developed some notion of a certifiably friendly AI: a class of optimization processes whose behavior we can automatically verify will be friendly. Now the problem is designing a box strong enough to hold an unfriendly AI until it modifies itself to be certifiably friendly (of course, it may have to make itself smarter first, and it may need to learn a lot about the world to succeed).

Edit: Many people have correctly pointed out that certifying friendliness is probably incredibly difficult. I personally believe it is likely to be significantly easier than actually finding an FAI, even if current approaches are more likely to find FAI first. But this isn't really the core of the article. I am describing a general technique for quarantining potentially dangerous and extraordinarily sophisticated code, at great expense. In particular, if we developed uFAI before having any notion of certifiable friendliness, then we could still use this technique to try to use the uFAI in a very limited way. It allows us to quarantine an AI and force everything it tries to say to us through an arbitrarily complicated filter. So, for example, if we are interested in solving some problem with a unique and verifiably correct answer, then we can ask the AI for this answer and only see the output if it passes the verification. We can also use a quarantined AI in more useful ways, at the expense of introducing some limited channels of influence (for example, if we ask for the solution to a problem which may have several solutions, we allow the AI to influence us by choosing which solution to provide).
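The "only see the output if it passes the verification" pattern can be sketched very simply. Here is a minimal, purely illustrative mock-up in Python: the `untrusted_solver` stands in for the quarantined AI (its name and brute-force body are my invention, not anything from the post), and the filter releases its answer only if an automatic check confirms it.

```python
def untrusted_solver(n):
    """Stand-in for the quarantined AI: returns a claimed factorisation of n.
    (Here it just brute-forces; a real solver would be opaque to us.)"""
    for d in range(2, n):
        if n % d == 0:
            return (d, n // d)
    return None

def verified_answer(n, claimed):
    """The filter: release the solver's output only if it passes an
    automatic check, leaving the solver no room to smuggle anything out."""
    if claimed is None:
        return None
    a, b = claimed
    if 1 < a < n and 1 < b < n and a * b == n:
        return claimed   # verified: safe to look at
    return None          # fails verification: we never see it

print(verified_answer(3233, untrusted_solver(3233)))  # → (53, 61)
```

Factoring is a convenient example precisely because it has a unique, cheaply verifiable answer; a problem with several valid answers would give the solver the limited influence channel mentioned above.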

Minimally, I hope this scheme at least suggests that putting an AI in a box may not be catastrophic. What to do with a boxed AI is another question. /Edit.

Concretely, suppose someone comes to you with the design of a recursively self-improving AI---call it Hal.  Unfortunately, Hal is not certifiably friendly.  Maybe it is friendly but not provably so, but more likely it is ambivalent to the human condition and is likely to destroy the universe if you just tell it to make a friendly AI.  Nevertheless, you think Hal is smart enough to solve the friendly AI problem for you.  All you need to do is give it access to all of the information about the world it wants and a couple of hours to run; after that, it should output a modified version of itself which is certifiably friendly.

You almost run Hal and ask it to design a friendly AI for you. After all, if you just run Hal and check to see if it outputs a certifiably friendly AI, what can it do other than refuse?  But after some more thought you realize this is overly optimistic.  Maybe you can avoid giving it an output channel that a human will have to read, thereby eliminating the most obvious failure mode of a boxed AI, but you still have to give it access to a computer to run on.  Computation produces electromagnetic radiation, heat, etc., and it is honestly hard to say what a superintelligence could do with such a channel.

We seem to be faced with an impossible problem.  We need to ask Hal to design us a certifiably friendly AI, but we also need to give him absolutely no ability to interact with the world destructively.

There is one field in particular which has a good history of solving obviously impossible problems: computational cryptography.  We want to implement the following obviously impossible device.  A cage into which we can place Hal so that he can think and examine all of the data we give him, but from which he cannot interact with his environment.  In the cage with him is a tamper-proof arbiter implementing our friendliness test.  Hal may give the arbiter the code to a single AI; if the arbiter confirms that this AI is certifiably friendly it will kill Hal, replace him with the friendly AI, and then allow the friendly AI to send a single message out of the cage.

To implement this impossible device we co-opt an important cryptographic primitive called public key homomorphic encryption, typically used for the impossible problem of private delegated computation.  To get a basic idea you can see the Wikipedia article on homomorphic encryption. I'll give the brief description here (you can skip the next two paragraphs if you are familiar with homomorphic encryption).

A public key encryption scheme has the following properties.  Anyone can encrypt a message using a publicly published key, but you cannot efficiently distinguish the encryptions of two different messages unless you know something about the secret key.  A fully homomorphic encryption scheme has the additional counterintuitive property that, although you cannot learn anything about the encrypted data without the secret key, you can still perform arbitrary operations on it.  In particular, if you have a homomorphic encryption of the source code of a program you can run that program without learning anything about it (except how long the source is), but the output of the program will be homomorphically encrypted.  All of this can be done using only the public key.
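The "compute on data you cannot read" property is easiest to see in a *partially* homomorphic scheme. Below is a toy implementation of the Paillier cryptosystem (additively homomorphic only, unlike the fully homomorphic schemes the post needs, and using tiny hard-coded primes that offer no real security): multiplying two ciphertexts yields an encryption of the *sum* of the plaintexts, using only the public key.

```python
import math
import random

def keygen(p, q):
    """Toy Paillier key generation from two small primes (insecure sizes)."""
    n = p * q
    lam = (p - 1) * (q - 1) // math.gcd(p - 1, q - 1)  # lcm(p-1, q-1)
    g = n + 1                       # standard choice; simplifies decryption
    mu = pow(lam, -1, n)            # with g = n+1, mu = lam^{-1} mod n
    return (n, g), (lam, mu)

def encrypt(pub, m):
    n, g = pub
    r = random.randrange(2, n)
    while math.gcd(r, n) != 1:
        r = random.randrange(2, n)
    return (pow(g, m, n * n) * pow(r, n, n * n)) % (n * n)

def decrypt(pub, priv, c):
    n, _ = pub
    lam, mu = priv
    x = pow(c, lam, n * n)
    return ((x - 1) // n) * mu % n  # L(x) = (x - 1) / n

pub, priv = keygen(61, 53)
c1, c2 = encrypt(pub, 12), encrypt(pub, 30)
# Multiplying ciphertexts adds the plaintexts; the party doing this
# multiplication learns nothing about 12, 30, or their sum.
c_sum = (c1 * c2) % (pub[0] * pub[0])
print(decrypt(pub, priv, c_sum))  # → 42
```

A fully homomorphic scheme extends this idea from one operation to arbitrary circuits, which is what lets you run an entire encrypted program.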

The existence of a fully homomorphic encryption scheme was completely open until recently, when it was shown by Craig Gentry that such a scheme exists under the assumption that a certain problem on lattices is computationally intractable.  Known homomorphic encryption schemes are very inefficient and based on new assumptions, but we can realistically expect improvements in the coming years and decades both in speed and the strength of the necessary assumptions.

If you run a homomorphically encrypted AI, sufficiently isolated from the private key, you get the very strong guarantee that the AI's execution has no computationally detectable effect on the rest of the world.  If the AI were able to affect the world in a computationally detectable way, then we would have successfully broken the homomorphic encryption scheme by distinguishing the encrypted execution of the AI from the encrypted execution of some benign program.

Now we need to give the arbiter some means to allow a genuinely friendly AI access to the rest of the world, without giving Hal any illegitimate access.  We allow Hal access to an output channel whose outputs are homomorphically encrypted.  After some time, we destroy Hal and then homomorphically check to see if he has output the code of a certifiably friendly AI over the channel (we have to decode a single homomorphically encrypted bit to perform this check).  If he hasn't, then we do nothing---apparently Hal would rather not give us a friendly AI.  If he has output the code of a certifiably friendly AI, we homomorphically run it, obtain a homomorphic encryption of its output, and then decrypt only this output.

That's the entire scheme.  Whenever you have some code whose execution itself might do incredible damage, this precaution allows you to execute it in perfect safety (the main caveat is that you need an automatic way to screen the output which doesn't give the algorithm enough room to possibly do harm).

Note: If you are really exceptionally paranoid you may be able to obtain even better guarantees by developing more powerful cryptographic techniques tailored for this application.  In particular, the secret key used by the encryption procedure may be leaked to a sufficiently powerful AI through a side channel (the computations involving the key typically leak some data in the form of radiation, which might be recovered from sufficiently exhaustive sensor data by a sufficiently powerful AI).  By doing these computations on a quantum computer you can probably guarantee that no data is leaked. If you want to be even more fantastically paranoid (i.e., if you want a formal proof of security, which is a rare treat in real problems) you will actually have to shred the key, guaranteeing that an AI would have to interact with the shredded key to recover it. If you do this, you have to use another new primitive to implement the final revelation of the friendly AI's message.

If you have any concerns about the security of this protocol, I would be happy to try and defend it and would be surprised and interested if I failed. Whether or not such a cryptographic box is really an interesting or important object is another question. (It is interesting to me as an unexpected application of cryptography).

Efficient Charity

31 multifoliaterose 04 December 2010 10:27AM

I wrote this article in response to Roko's request for an article about efficient charity. As a disclosure of a possible conflict of interest I'll note that I have served as a volunteer for GiveWell. Last edited 12/06/10.

Charitable giving is widely considered to be virtuous and admirable. If statistical behavior is any guide, most people regard charitable donations to be worthwhile expenditures. In 2001 a full 89% of American households donated money to charity and during 2009 Americans donated $303.75 billion to charity [1]. 

A heart-breaking fact about modern human experience is that there's little connection between such generosity and positive social impact. Humans evolved charitable tendencies because such tendencies served as a marker to nearby humans that a given individual is a dependable ally. Those who expend their resources to help others are more likely than others to care about people in general and are therefore more likely than others to care about their companions. But one can tell that people care based exclusively on their willingness to make sacrifices, independently of whether these sacrifices actually help anybody.

Modern human society is very far removed from our ancestral environment. Technological and social innovations have made it possible for us to influence people on the other side of the globe and potentially to have a profound impact on the long term survival of the human race. The current population of New York is ten times the human population of the entire world in our ancestral environment. In view of these radical changes it should be no surprise that the impact of a typical charitable donation falls staggeringly short of the impact of donation optimized to help people as much as possible.

While this may not be a problem for donors who are unconcerned about their donations helping people, it's a huge problem for donors who want their donations to help people as much as possible and it's a huge problem for the people who lose out on assistance because of inefficiency in the philanthropic world. Picking out charities that have high positive impact per dollar is a task no less difficult than picking good financial investments and one that requires heavy use of critical and quantitative reasoning. Donors who wish for their donations to help people as much as possible should engage in such reasoning and/or rely on the recommendations of trusted parties who have done so.

continue reading »

Ben Goertzel: The Singularity Institute's Scary Idea (and Why I Don't Buy It)

32 ciphergoth 30 October 2010 09:31AM

[...] SIAI's Scary Idea goes way beyond the mere statement that there are risks as well as benefits associated with advanced AGI, and that AGI is a potential existential risk.

[...] Although an intense interest in rationalism is one of the hallmarks of the SIAI community, still I have not yet seen a clear logical argument for the Scary Idea laid out anywhere. (If I'm wrong, please send me the link, and I'll revise this post accordingly. Be aware that I've already at least skimmed everything Eliezer Yudkowsky has written on related topics.)

So if one wants a clear argument for the Scary Idea, one basically has to construct it oneself.

[...] If you put the above points all together, you come up with a heuristic argument for the Scary Idea. Roughly, the argument goes something like: If someone builds an advanced AGI without a provably Friendly architecture, probably it will have a hard takeoff, and then probably this will lead to a superhuman AGI system with an architecture drawn from the vast majority of mind-architectures that are not sufficiently harmonious with the complex, fragile human value system to make humans happy and keep humans around.

The line of argument makes sense, if you accept the premises.

But, I don't.

Ben Goertzel: The Singularity Institute's Scary Idea (and Why I Don't Buy It), October 29 2010. Thanks to XiXiDu for the pointer.

Rationality Lessons in the Game of Go

40 GreenRoot 21 August 2010 02:33PM

There are many reasons I enjoy playing go: complex gameplay arises out of simple rules, single mistakes rarely decide games, games between people of different skill can be handicapped without changing the dynamics of the game too much, there are no draws, and I just like the way it looks. The purpose of this article is to illustrate something else I like about playing go: the ways that it provides practice in basic habits of rationality, that is, the ways in which playing go helps me be less wrong.

continue reading »

Updating, part 1: When can you change your mind? The binary model

11 PhilGoetz 13 May 2010 05:55PM

I was recently disturbed by my perception that, despite years of studying and debating probability problems, the LessWrong community as a whole has not markedly improved its ability to get the right answer on them.

I had expected that people would read posts and comments by other people, and take special note of comments by people who had a prior history of being right, and thereby improve their own accuracy.

But can that possibly work?  How can someone who isn't already highly-accurate, identify other people who are highly accurate?

Aumann's agreement theorem (allegedly) says that Bayesians with the same priors agree.  But it doesn't say that doing so helps.  Under what circumstances does revising your opinions, by updating in response to people you consider reliable, actually improve your accuracy?

To find out, I built a model of updating in response to the opinions of others.  It did, eventually, show that Bayesians improve their collective opinions by updating in response to the opinions of other Bayesians.  But this turns out not to depend on them satisfying the conditions of Aumann's theorem, or on doing Bayesian updating.  It depends only on a very simple condition, established at the start of the simulation.  Can you guess what it is?

I'll write another post describing and explaining the results if this post receives a karma score over 10.

continue reading »
