Safety engineering, target selection, and alignment theory

17 So8res 31 December 2015 03:43PM

This post is the latest in a series introducing the basic ideas behind MIRI's research program. To contribute, or learn more about what we've been up to recently, see the MIRI fundraiser page. Our 2015 winter funding drive concludes tonight (31 Dec 15) at midnight.

Artificial intelligence capabilities research is aimed at making computer systems more intelligent — able to solve a wider range of problems more effectively and efficiently. We can distinguish this from research specifically aimed at making AI systems at various capability levels safer, or more "robust and beneficial." In this post, I distinguish three kinds of direct research that might be thought of as "AI safety" work: safety engineering, target selection, and alignment theory.

Imagine a world where humans somehow developed heavier-than-air flight before developing a firm understanding of calculus or celestial mechanics. In a world like that, what work would be needed in order to safely transport humans to the Moon?

In this case, we can say that the main task at hand is one of engineering a rocket and refining fuel such that the rocket, when launched, accelerates upwards and does not explode. The boundary of space can be compared to the boundary between narrowly intelligent and generally intelligent AI. Both boundaries are fuzzy, but have engineering importance: spacecraft and aircraft have different uses and face different constraints.

Paired with this task of developing rocket capabilities is a safety engineering task. Safety engineering is the art of ensuring that an engineered system provides acceptable levels of safety. When it comes to achieving a soft landing on the Moon, there are many different roles for safety engineering to play. One team of engineers might ensure that the materials used in constructing the rocket are capable of withstanding the stress of a rocket launch with significant margin for error. Another might design escape systems that ensure the humans in the rocket can survive even in the event of failure. Another might design life support systems capable of supporting the crew in dangerous environments.

A separate important task is target selection, i.e., picking where on the Moon to land. In the case of a Moon mission, targeting research might entail things like designing and constructing telescopes (if they didn't exist already) and identifying a landing zone on the Moon. Of course, only so much targeting can be done in advance, and the lunar landing vehicle may need to be designed so that it can alter the landing target at the last minute as new data comes in; this again would require feats of engineering.

Beyond the task of (safely) reaching escape velocity and figuring out where you want to go, there is one more crucial prerequisite for landing on the Moon. This is rocket alignment research, the technical work required to reach the correct final destination. We'll use this as an analogy to illustrate MIRI's research focus, the problem of artificial intelligence alignment.

Travel Through Time to Increase Your Effectiveness

38 tanagrabeast 23 August 2015 01:32AM

I am a time traveler.

I hold this belief not because it is true, but because it is useful. That it also happens to be true -- we are all time travelers, swept along by the looping chrono-currents of reality that only seem to flow in one direction -- is largely beside the point.

In the literature of instrumental rationality, I am struck by a pattern in which tips I find useful often involve reframing an issue from a different temporal perspective. For instance, when questioning whether it is worth continuing an ongoing commitment, we are advised to ask ourselves "Knowing what I know now, if I could go back in time, would I make the same choice?" Also, when embarking on a new venture, we are advised to perform a "pre-mortem", imagining ourselves in a future where it didn't pan out and identifying what went wrong.[2] This type of thinking has a long tradition. Whenever we use visualization as a tool for achieving goals, or for steeling ourselves against the worst-case scenarios,[3] we are, in a sense, stepping outside the present.

To the degree that intelligence is the ability to model the universe and "search out paths through probability to any desired future", we should not be surprised that mental time travel comes naturally to us. And to the degree that playing to this strength has already produced so many useful tips, I think it is worth experimenting with it in search of other tools and exploits.

Below are a few techniques I've been developing over the last two years that capitalize on how easy it is to mentally travel through time. I fully admit that they simply "re-skin" existing advice and techniques. But it's possible that you, my fellow traveler, may find, as I do, that these skins are easier to slip into.

Help Build a Landing Page for Existential Risk?

12 Mass_Driver 30 July 2015 06:03AM

The Big Orange Donate Button

Traditional charities, like Oxfam, Greenpeace, and Amnesty International, almost all have a big orange button marked "Donate" right on the very first page that loads when you go to their websites. The landing page for a major charity usually also has vivid graphics and some short, easy-to-read text that tells you about an easy-to-understand project that the charity is currently working on.

I assume that part of why charities have converged on this design is that potential donors often have short attention spans, and that one of the best ways to maximize donations is to make it as easy as possible for casual visitors to the website to (a) confirm that they approve of the charity's work, and (b) actually make a donation. The more obstacles you put between google-searching on the name of a charity and the 'donate' button, the more people will get bored or distracted, and the fewer donations you'll get.

Unfortunately, there doesn't seem to be any such streamlined interface for people who want to learn about existential risks and maybe donate some money to help prevent them. The website on existential risk run by the Future of Humanity Institute reads more like a syllabus or a CV than like an advertisement or a brochure -- there's nowhere to donate money; it's just a bunch of citations. The Less Wrong wiki page on x-risk is more concerned with defining and analyzing existential risks than it is with explaining, in simple concrete language, what problems currently threaten to wipe out humanity. The Center for the Study of Existential Risk has a landing page that focuses on a video of a TED talk that goes on for a full minute before mentioning any specific existential risks, and if you want to make a donation you have to click through three separate links and then fill out a survey. Heck, even the Skoll Global Threats Fund, which you would think would be, you know, designed to raise funds to combat global threats, has neither a donate button nor (so far as I can tell) a link to a donation page. These websites are *not* optimized for encouraging casual visitors to learn basic facts or make a donation.

A Landing Page for Casual Donors

That's fine with me; I imagine the leading x-risk websites are accomplishing other purposes that their owners feel are more important than catering to casual visitors -- but there ought to be at least one website that's meant for your buddy from high school who doesn't know or care about effective altruism, who expressed concern one night over a couple of beers that the world might be in some trouble, and who had a brief urge to do something about it. I want to help capture your buddy's urge to take action.

To that end, I've registered x-risk.com as a domain name, and I'm building a very simple website that will feature roughly 100 words of text about 10 of the most important existential risks, together with a photo or graphic that illustrates each risk, a "donate" button that takes you straight to a webpage that lets you donate to an organization working to prevent the risk, and a "learn more" button that takes you to a website with more detailed info on the risk. I will pay to host the website for one year, and if the website generates significant traffic, then I'll take up a collection to keep it going indefinitely.

Blurbs, Photos, and URLs

I would like your help generating content for the website -- if you are willing to write a 100-word blurb, if you own a useful photo (or can create one, or know of one in the public domain), or if you have the URL handy for a webpage that lets you donate money to mitigating or preventing a specific x-risk, please post it in the comments! I can, in theory, do all of that work myself, but I would prefer to make this more of a community project, and there is a significant risk that I will get bored and give up if I have to literally do it all myself.

Important: to avoid mind-killing debates, please do NOT contribute opinions about which risks are the most important unless you are ALSO contributing a blurb, photo, or URL in the same comment. Let's get the website built and launched first, and then we can always edit some of the pages later if there's a consensus in favor of including an additional x-risk. If you see someone sharing an opinion about the relative priority of risk and the opinion isn't right next to a useful resource, please vote that comment down until it disappears.

Thank you very much for your help! I hope to see you all in the future. :-)

Cooperating with agents with different ideas of fairness, while resisting exploitation

38 Eliezer_Yudkowsky 16 September 2013 08:27AM

There's an idea from the latest MIRI workshop which I haven't seen in informal theories of negotiation, and I want to know if this is a known idea.

(Old well-known ideas:)

Suppose a standard Prisoner's Dilemma matrix where (3, 3) is the payoff for mutual cooperation, (2, 2) is the payoff for mutual defection, and (0, 5) is the payoff if you cooperate and they defect.

Suppose we're going to play a PD iterated for four rounds.  We have common knowledge of each other's source code so we can apply modal cooperation or similar means of reaching a binding 'agreement' without other enforcement methods.

If we mutually defect on every round, our net mutual payoff is (8, 8).  This is a 'Nash equilibrium' because neither agent can unilaterally change its action and thereby do better, if the opponent's actions stay fixed.  If we mutually cooperate on every round, the result is (12, 12) and this result is on the 'Pareto boundary' because neither agent can do better unless the other agent does worse.  It would seem a desirable principle for rational agents (with common knowledge of each other's source code / common knowledge of rationality) to find an outcome on the Pareto boundary, since otherwise they are leaving value on the table.

But (12, 12) isn't the only possible result on the Pareto boundary.  Suppose that running the opponent's source code, you find that they're willing to cooperate on three rounds and defect on one round, if you cooperate on every round, for a payoff of (9, 14) slanted their way.  If they use their knowledge of your code to predict you refusing to accept that bargain, they will defect on every round for the mutual payoff of (8, 8).
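The totals quoted so far can be checked with a few lines of Python. This is only a sketch: the function name `total_payoff` and the move strings are illustrative, and the (5, 0) entry for "you defect, they cooperate" is an assumption (the mirror image of the stated (0, 5)), since the post does not list it.

```python
# Per-round payoffs as (you, them), taken from the matrix described above.
# ASSUMPTION: the ("D", "C") entry is the mirror image of the stated (0, 5).
PAYOFF = {
    ("C", "C"): (3, 3),  # mutual cooperation
    ("D", "D"): (2, 2),  # mutual defection
    ("C", "D"): (0, 5),  # you cooperate, they defect
    ("D", "C"): (5, 0),  # you defect, they cooperate (assumed by symmetry)
}


def total_payoff(your_moves, their_moves):
    """Sum the per-round payoffs over two equal-length move sequences."""
    you = them = 0
    for mine, theirs in zip(your_moves, their_moves):
        a, b = PAYOFF[(mine, theirs)]
        you += a
        them += b
    return you, them


all_defect = total_payoff("DDDD", "DDDD")     # (8, 8): the Nash equilibrium
all_cooperate = total_payoff("CCCC", "CCCC")  # (12, 12): the 'fair' point
skewed = total_payoff("CCCC", "CCCD")         # (9, 14): they defect once
```

Which round the opponent defects on does not change the totals; only the count of defections matters here.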

I would consider it obvious that a rational agent should refuse this unfair bargain.  Otherwise agents with knowledge of your source code will offer you only this bargain, instead of the (12, 12) of mutual cooperation on every round; they will exploit your willingness to accept a result on the Pareto boundary in which almost all of the gains from trade go to them.

(Newer ideas:)

Generalizing:  Once you have a notion of a 'fair' result - in this case (12, 12) - then an agent which accepts any outcome in which it does worse than the fair result, while the opponent does better, is 'exploitable' relative to this fair bargain.  Like the Nash equilibrium, the only way you should do worse than 'fair' is if the opponent also does worse.

So we wrote down on the whiteboard an attempted definition of unexploitability in cooperative games as follows:

"Suppose we have a [magical] definition N of a fair outcome.  A rational agent should only do worse than N if its opponent does worse than N, or else [if bargaining fails] should only do worse than the Nash equilibrium if its opponent does worse than the Nash equilibrium."  (Note that this definition precludes giving in to a threat of blackmail.)

(Key possible-innovation:)

It then occurred to me that this definition opened the possibility for other, intermediate bargains between the 'fair' solution on the Pareto boundary, and the Nash equilibrium.

Suppose the other agent has a slightly different definition of fairness and they think that what you consider to be a payoff of (12, 12) favors you too much; they think that you're the one making an unfair demand.  They'll refuse (12, 12) with the same feeling of indignation that you would apply to (9, 14).

Well, if you give in to an arrangement with an expected payoff of, say, (11, 13) as you evaluate payoffs, then you're giving other agents an incentive to skew their definitions of fairness.

But it does not create poor incentives (AFAICT) to accept instead a bargain with an expected payoff of, say, (10, 11) which the other agent thinks is 'fair'.  Though they're sad that you refused the truly fair outcome of (as you count utilons) (11, 13) and that you couldn't reach the Pareto boundary together, still, this is better than the Nash equilibrium of (8, 8).  And though you think the bargain is unfair, you are not creating incentives to exploit you.  By insisting on this definition of fairness, the other agent has done worse for themselves than under (12, 12).  The other agent probably thinks that (10, 11) is 'unfair' slanted your way, but they likewise accept that this does not create bad incentives, since you did worse than the 'fair' outcome of (11, 13).
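One way to make the whiteboard definition concrete is a small predicate, sketched here in Python. It is one possible reading of the definition, not a formalization from the workshop: the name `is_exploited`, the `me` index convention, and the use of strict "does worse" comparisons are all assumptions. Payoffs are written as (you, them), as you count utilons.

```python
def is_exploited(outcome, reference, me):
    """Return True if agent `me` (0 = you, 1 = them) does worse than
    `reference` while the other agent does NOT also do worse."""
    other = 1 - me
    i_do_worse = outcome[me] < reference[me]
    they_do_worse = outcome[other] < reference[other]
    return i_do_worse and not they_do_worse


YOUR_FAIR = (12, 12)   # what you consider fair
THEIR_FAIR = (11, 13)  # what they consider fair, in your utilons
NASH = (8, 8)          # mutual defection on all four rounds

checks = {
    # Your perspective: (9, 14) and (11, 13) reward the other side.
    "you_9_14": is_exploited((9, 14), YOUR_FAIR, me=0),     # True
    "you_11_13": is_exploited((11, 13), YOUR_FAIR, me=0),   # True
    # (10, 11) is unfair to you, but they also fall below 12.
    "you_10_11": is_exploited((10, 11), YOUR_FAIR, me=0),   # False
    # Their perspective: (12, 12) looks like you exploiting them...
    "them_12_12": is_exploited((12, 12), THEIR_FAIR, me=1),  # True
    # ...while (10, 11) leaves you below what they call fair for you.
    "them_10_11": is_exploited((10, 11), THEIR_FAIR, me=1),  # False
    # Falling back to Nash exploits nobody relative to your fair point.
    "you_nash": is_exploited(NASH, YOUR_FAIR, me=0),         # False
}
```

The same function can be called with `NASH` as the reference to express the fallback clause: if bargaining fails, an agent should only do worse than Nash if its opponent does too.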

There could be many acceptable negotiating equilibria between what you think is the 'fair' point on the Pareto boundary, and the Nash equilibrium, so long as each step down in what you think is 'fairness' reduces the total payoff to the other agent, even if it reduces your own payoff even more.  This resists exploitation and avoids creating an incentive for claiming that you have a different definition of fairness, while still holding open the possibility of some degree of cooperation with agents who honestly disagree with you about what's fair and are trying to avoid exploitation themselves.

This translates into an informal principle of negotiations:  Be willing to accept unfair bargains, but only if (you make it clear) both sides are doing worse than what you consider to be a fair bargain.

I haven't seen this advocated before even as an informal principle of negotiations.  Is it in the literature anywhere?  Someone suggested Schelling might have said it, but didn't provide a chapter number.

ADDED:

Clarification 1:  Yes, utilities are invariant up to a positive affine transformation so there's no canonical way to split utilities evenly.  Hence the part about "Assume a magical solution N which gives us the fair division."  If we knew the exact properties of how to implement this magical solution, taking it at first for magical, that might give us some idea of what N should be, too.

Clarification 2:  The way this might work is that you pick a series of increasingly unfair-to-you, increasingly worse-for-the-other-player outcomes whose first element is what you deem the fair Pareto outcome:  (100, 100), (98, 99), (96, 98).  Perhaps stop well short of Nash if the skew becomes too extreme.  Drop to Nash as the last resort.  The other agent does the same, starting with their own ideal of fairness on the Pareto boundary.  Unless one of you has a completely skewed idea of fairness, you should be able to meet somewhere in the middle.  Both of you will do worse against a fixed opponent's strategy by unilaterally adopting more self-favoring ideas of fairness.  Both of you will do worse in expectation against potentially exploitive opponents by unilaterally adopting looser ideas of fairness.  This gives everyone an incentive to obey the Galactic Schelling Point and be fair about it.  You should not be picking the descending sequence in an agent-dependent way that incentivizes, at cost to you, skewed claims about fairness.
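As a rough sketch of the property such a descending sequence needs, the Python below checks the example sequence from Clarification 2. The helper names are hypothetical, and the check covers only the stated requirement (every step down must cost the other player something); it does not model how two agents' sequences meet.

```python
def opponent_strictly_loses(offers):
    """offers: list of (mine, theirs) payoffs, starting at my fair point.
    True if every step down strictly reduces the other agent's payoff."""
    return all(nxt[1] < cur[1] for cur, nxt in zip(offers, offers[1:]))


def skew(offers):
    """How far each offer tilts toward the other agent (theirs - mine)."""
    return [theirs - mine for mine, theirs in offers]


example = [(100, 100), (98, 99), (96, 98)]  # sequence from the text
valid = opponent_strictly_loses(example)    # True
tilt = skew(example)                        # [0, 1, 2]

# A sequence that rewards the other side for disagreeing fails the check.
bad = [(100, 100), (98, 101)]
bad_valid = opponent_strictly_loses(bad)    # False
```

The growing skew is what "increasingly unfair-to-you" looks like numerically; the strictly falling second coordinate is what removes the incentive to claim a skewed notion of fairness.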

Clarification 3:  You must take into account the other agent's costs and other opportunities when ensuring that the net outcome, in terms of final utilities, is worse for them than the reward offered for 'fair' cooperation.  Offering them the chance to buy half as many paperclips at a lower, less fair price, does no good if they can go next door, get the same offer again, and buy the same number of paperclips at a lower total price.

The Affect Heuristic

35 Eliezer_Yudkowsky 27 November 2007 07:58AM

The affect heuristic is when subjective impressions of goodness/badness act as a heuristic—a source of fast, perceptual judgments.  Pleasant and unpleasant feelings are central to human reasoning, and the affect heuristic comes with lovely biases—some of my favorites.

Let's start with one of the relatively less crazy biases.  You're about to move to a new city, and you have to ship an antique grandfather clock.  In the first case, the grandfather clock was a gift from your grandparents on your 5th birthday.  In the second case, the clock was a gift from a remote relative and you have no special feelings for it.  How much would you pay for an insurance policy that paid out $100 if the clock were lost in shipping?  According to Hsee and Kunreuther (2000), subjects stated willingness to pay more than twice as much in the first condition.  This may sound rational—why not pay more to protect the more valuable object?—until you realize that the insurance doesn't protect the clock, it just pays if the clock is lost, and pays exactly the same amount for either clock.  (And yes, it was stated that the insurance was with an outside company, so it gives no special motive to the movers.)

All right, but that doesn't sound too insane.  Maybe you could get away with claiming the subjects were insuring affective outcomes, not financial outcomes—purchase of consolation.

Then how about this?  Yamagishi (1997) showed that subjects judged a disease as more dangerous when it was described as killing 1,286 people out of every 10,000, versus a disease that was 24.14% likely to be fatal.  Apparently the mental image of a thousand dead bodies is much more alarming, compared to a single person who's more likely to survive than not.
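Putting the two descriptions on the same scale makes the size of the reversal explicit. A quick arithmetic check in Python (variable names are illustrative):

```python
# Disease judged MORE dangerous: 1,286 deaths per 10,000 people.
frequency_framed = 1286 / 10000    # 0.1286, i.e. 12.86%

# Disease judged LESS dangerous: 24.14% likely to be fatal.
probability_framed = 24.14 / 100   # 0.2414

# The "less alarming" disease is nearly twice as deadly.
ratio = probability_framed / frequency_framed
```

Here `ratio` comes out a little under 1.9.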

But wait, it gets worse.

Human Evil and Muddled Thinking

40 Eliezer_Yudkowsky 13 September 2007 11:43PM

Followup to: Rationality and the English Language

George Orwell saw the descent of the civilized world into totalitarianism, the conversion or corruption of one country after another; the boot stamping on a human face, forever, and remember that it is forever.  You were born too late to remember a time when the rise of totalitarianism seemed unstoppable, when one country after another fell to secret police and the thunderous knock at midnight, while the professors of free universities hailed the Soviet Union's purges as progress.  It feels as alien to you as fiction; it is hard for you to take seriously.  Because, in your branch of time, the Berlin Wall fell.  And if Orwell's name is not carved into one of those stones, it should be.

Orwell saw the destiny of the human species, and he put forth a convulsive effort to wrench it off its path.  Orwell's weapon was clear writing.  Orwell knew that muddled language is muddled thinking; he knew that human evil and muddled thinking intertwine like conjugate strands of DNA:

In our time, political speech and writing are largely the defence of the indefensible. Things like the continuance of British rule in India, the Russian purges and deportations, the dropping of the atom bombs on Japan, can indeed be defended, but only by arguments which are too brutal for most people to face, and which do not square with the professed aims of the political parties. Thus political language has to consist largely of euphemism, question-begging and sheer cloudy vagueness. Defenceless villages are bombarded from the air, the inhabitants driven out into the countryside, the cattle machine-gunned, the huts set on fire with incendiary bullets: this is called pacification...

Scope Insensitivity

45 Eliezer_Yudkowsky 14 May 2007 02:53AM

Once upon a time, three groups of subjects were asked how much they would pay to save 2,000 / 20,000 / 200,000 migrating birds from drowning in uncovered oil ponds. The groups respectively answered $80, $78, and $88 [1]. This is scope insensitivity or scope neglect: the number of birds saved - the scope of the altruistic action - had little effect on willingness to pay.
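Dividing each answer by the number of birds shows how flat the response is. A small sketch in Python, using only the figures above (variable names are illustrative):

```python
birds = [2000, 20000, 200000]
willingness_to_pay = [80, 78, 88]  # dollars, one answer per group

# Implied dollars per bird for each group.
per_bird = [wtp / n for wtp, n in zip(willingness_to_pay, birds)]

# Scope grows 100-fold from the first group to the last...
scope_growth = birds[-1] / birds[0]
# ...while stated willingness to pay grows by about 10%.
payment_growth = willingness_to_pay[-1] / willingness_to_pay[0]
```

The implied value of a single bird drops from about 4 cents to about 0.04 cents across the three groups.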

Similar experiments showed that Toronto residents would pay little more to clean up all polluted lakes in Ontario than polluted lakes in a particular region of Ontario [2], or that residents of four western US states would pay only 28% more to protect all 57 wilderness areas in those states than to protect a single area [3].

Bayes' Theorem Illustrated (My Way)

126 komponisto 03 June 2010 04:40AM

(This post is elementary: it introduces a simple method of visualizing Bayesian calculations. In my defense, we've had other elementary posts before, and they've been found useful; plus, I'd really like this to be online somewhere, and it might as well be here.)

I'll admit, those Monty-Hall-type problems invariably trip me up. Or at least, they do if I'm not thinking very carefully -- doing quite a bit more work than other people seem to have to do.

What's more, people's explanations of how to get the right answer have almost never been satisfactory to me. If I concentrate hard enough, I can usually follow the reasoning, sort of; but I never quite "see it", and nor do I feel equipped to solve similar problems in the future: it's as if the solutions seem to work only in retrospect. 

Minds work differently, illusion of transparency, and all that.

Fortunately, I eventually managed to identify the source of the problem, and I came up with a way of thinking about -- visualizing -- such problems that suits my own intuition. Maybe there are others out there like me; this post is for them.

The Robbers Cave Experiment

40 Eliezer_Yudkowsky 10 December 2007 06:18AM

Did you ever wonder, when you were a kid, whether your inane "summer camp" actually had some kind of elaborate hidden purpose—say, it was all a science experiment and the "camp counselors" were really researchers observing your behavior?

Me neither.

But we'd have been more paranoid if we'd read Intergroup Conflict and Cooperation:  The Robbers Cave Experiment by Sherif, Harvey, White, Hood, and Sherif (1954/1961).  In this study, the experimental subjects—excuse me, "campers"—were 22 boys between 5th and 6th grade, selected from 22 different schools in Oklahoma City, of stable middle-class Protestant families, doing well in school, median IQ 112.  They were as well-adjusted and as similar to each other as the researchers could manage. 

The experiment, conducted in the bewildered aftermath of World War II, was meant to investigate the causes—and possible remedies—of intergroup conflict.  How would they spark an intergroup conflict to investigate?  Well, the 22 boys were divided into two groups of 11 campers, and—

—and that turned out to be quite sufficient.
