
Human Minds are Fragile

22 diegocaleiro 11 February 2015 06:40PM

We are familiar with the thesis that Value is Fragile. This is why we are researching how to impart values to an AGI.

Embedded Minds are Fragile

Besides values, it may be worth remembering that human minds too are very fragile.

A little magnetic tampering with your amygdalae, and suddenly you are a wannabe serial killer. A small dose of LSD can get you to believe you can fly, or that the world will end in 4 hours. Remove part of your ventromedial prefrontal cortex, and suddenly you are so utilitarian even Joshua Greene would call you a psycho.

It requires very little material change to substantially modify a human being's behavior. The same holds for other animals with embedded brains, crafted by evolution and made of squishy matter modulated by glands and molecular gates.

A Problem for Paul-Boxing and CEV?

One assumption underlying Paul-Boxing and CEV is that:

It is easier to specify and simulate a human-like mind than to impart values to an AGI by means of teaching it values directly via code or human language.

Usually we assume that because, as we know, value is fragile. But so are embedded minds. Very little tampering is required to profoundly transform people's moral intuitions. A large fraction of the inmate population in the US has frontal lobe or amygdala malfunctions.

Finding the simplest description of a human brain that, when simulated, continues to act as that human brain would act in the real world may turn out to be as fragile as, or even more fragile than, concept learning for AGIs.

CEV-tropes

8 snarles 22 September 2014 06:21PM

As seen in other threads, people disagree on whether CEV exists, and if it does, what it might turn out to be.

 

It would be nice to try to categorize common speculations about CEV.

1a. CEV doesn't exist, because human preferences are too divergent

1b. CEV doesn't even exist for a single human 

1c. CEV does exist, but it results in a return to the status quo

2a. CEV results in humans living in a physical (not virtual reality) utopia

2b. CEV results in humans returning to a more primitive society free of technology

2c. CEV results in humans living together in a simulation world, where most humans do not have god-like power

(the similarity between 2a, 2b, and 2c is that humans are still living in the same world, similar to traditional utopia scenarios)

3. CEV results in a wish for the annihilation of all life, or maybe the universe

4a. CEV results in all humans granted the right to be the god of their own private simulation universe (once we acquire the resources to do so)

4b. CEV can be implemented for "each salient group of living things in proportion to that group's moral weight"

5. CEV results in all humans agreeing to be wireheaded (trope)

6a. CEV results in all humans agreeing to merge into a single being and discarding many of the core features of humankind which have lost their purpose (trope)

6b. CEV results in humans agreeing to cease their own existence while also creating a superior life form--the outcome is similar to 6a, but the difference is that here, humans do not care about whether they are individually "merged"

7. CEV results in all/some humans willingly forgetting/erasing their history, or being indifferent to preserving history so that it is lost (compatible with all previous tropes)

Obviously there are too many possible ideas (or "tropes") to list, but perhaps we could get a sense of which ones are the most common in the LW community.  I leave it to someone else to create a poll supposing they feel they have a close to complete list, or create similar topics for AI risk, etc.

EDIT: Added more tropes, changed #2 since it was too broad: now #2 refers to CEV worlds where humans live in the "same world"

CEV: coherence versus extrapolation

14 Stuart_Armstrong 22 September 2014 11:24AM

It's just struck me that there might be a tension between the coherence (C) and the extrapolation (E) parts of CEV. One reason that CEV might work is that the mindspace of humanity isn't that large - humans are pretty close to each other, in comparison to the space of possible minds. But this is far more true in everyday decisions than in large-scale ones.

Take a fundamentalist Christian, a total utilitarian, a strong Marxist, an extreme libertarian, and a couple more stereotypes that fit your fancy. What can their ideology tell us about their everyday activities? Well, very little. Those people could be rude, polite, arrogant, compassionate, etc... and their ideology is a very weak indication of that. Different ideologies and moral systems seem to mandate almost identical everyday and personal interactions (this is in itself very interesting, and causes me to see many systems of moralities as formal justifications of what people/society find "moral" anyway).

But now let's move to a more distant - "far" - level. How will these people vote in elections? Will they donate to charity, and if so, which ones? If they were given power (via wealth or position in some political or other organisation), how are they likely to use that power? Now their ideology is much more informative. Though it's not fully determinative, we would start to question the label if their actions at this level seemed out of sync. A Marxist who donated to a Conservative party, for instance, would give us pause, and we'd want to understand the apparent contradiction.

Let's move up yet another level. How would they design or change the universe if they had complete power? What is their ideal plan for the long term? At this level, we're entirely in far mode, and we would expect that their vastly divergent ideologies would be the most informative piece of information about their moral preferences. Details about their character and personalities, which loomed so large at the everyday level, will now be of far lesser relevance. This is because their large scale ideals are not tempered by reality and by human interactions, but exist in a pristine state in their minds, changing little if at all. And in almost every case, the world they imagine as their paradise will be literal hell for the others (and quite possibly for themselves).

To summarise: the human mindspace is much narrower in near mode than in far mode.

And what about CEV? Well, CEV is what we would be "if we knew more, thought faster, were more the people we wished we were, had grown up farther together". The "were more the people we wished we were" is going to be dominated by the highly divergent far mode thinking. The "had grown up farther together" clause attempts to mesh these divergences, but that simply obscures the difficulty involved. The more we extrapolate, the harder coherence becomes.

It strikes me that there is a strong order-of-operations issue here. I'm not a fan of CEV, but it seems it would be much better to construct, first, the coherent volition of humanity, and only then to extrapolate it.

Prescriptive vs. descriptive and objective vs. subjective definitions

4 PhilGoetz 21 January 2014 11:21PM

Imagine you're writing a Field Guide to Boats, and you want to know what you should include in your field guide. Barges? Rafts? These things?

You want something like a dictionary definition of boat. A descriptive definition that includes anything people commonly think of as a boat; an objective definition, because you're only writing one book, not a separate version for each reader.

Now imagine you're stranded on an island, and you open a bottle, and a genie comes out and gives you one wish, and you say, "I wish for a boat!", and the genie says, "Well, what's a boat?" And you know, because you've read stories, that the genie will take your definition of "boat" and try to screw you over. You'd better not read out the dictionary definition, or the genie will give you a toy boat, or a boat with a hole in it, or a kayak too small for you to fit into. You need a prescriptive, subjective definition of a thing that will transport you over water.


Mahatma Armstrong: CEVed to death.

23 Stuart_Armstrong 06 June 2013 12:50PM

My main objection to Coherent Extrapolated Volition (CEV) is the "Extrapolated" part. I don't see any reason to trust the extrapolated volition of humanity - but this isn't just for self centred reasons. I don't see any reason to trust my own extrapolated volition. I think it's perfectly possible that my extrapolated volition would follow some scenario like this:

  1. It starts with me, Armstrong 1.  I want to be more altruistic at the next level, valuing other humans more.
  2. The altruistic Armstrong 2 wants to be even more altruistic. He makes himself into a perfectly altruistic utilitarian towards humans, and increases his altruism towards animals.
  3. Armstrong 3 wonders about the difference between animals and humans, and why he should value one of them more. He decides to increase his altruism equally towards all sentient creatures.
  4. Armstrong 4 is worried about the fact that sentience isn't clearly defined, and seems arbitrary anyway. He increases his altruism towards all living things.
  5. Armstrong 5's problem is that the barrier between living and non-living things isn't clear either (e.g. viruses). He decides that he should solve this by valuing all worthwhile things - are not art and beauty worth something as well?
  6. But what makes a thing worthwhile? Is there not art in everything, beauty in the eye of the right beholder? Armstrong 6 will make himself value everything.
  7. Armstrong 7 is in turmoil: so many animals prey upon other animals, or destroy valuable rocks! To avoid this, he decides the most moral thing he can do is to try and destroy all life, and then create a world of stasis for the objects that remain.

There are many other ways this could go, maybe ending up as a negative utilitarian or completely indifferent, but that's enough to give the flavour. You might trust the person you want to be, to do the right things. But you can't trust them to want to be the right person - especially several levels in (compare with the argument in this post, and my very old chaining god idea). I'm not claiming that such a value drift is inevitable, just that it's possible - and so I'd want my initial values to dominate when there is a large conflict.

Nor do I give Armstrong 7's values any credit for having originated from mine. Under torture, I'm pretty sure I could be made to accept any system of values whatsoever; there are other ways that would provably alter my values, so I don't see any reason to privilege Armstrong 7's values in this way.

"But," says the objecting strawman, "this is completely different! Armstrong 7's values are the ones that you would reach by following the path you would want to follow anyway! That's where you would get to, if you started out wanting to be more altruistic, had control over your own motivational structure, and grew and learnt and knew more!"

"Thanks for pointing that out," I respond, "now that I know where that ends up, I must make sure to change the path I would want to follow! I'm not sure whether I shouldn't be more altruistic, or avoid touching my motivational structure, or not want to grow or learn or know more. Those all sound pretty good, but if they end up at Armstrong 7, something's going to have to give."

CEV: a utilitarian critique

25 Pablo_Stafforini 26 January 2013 04:12PM

I'm posting this article on behalf of Brian Tomasik, who authored it but is at present too busy to respond to comments.

Update from Brian: "As of 2013-2014, I have become more sympathetic to at least the spirit of CEV specifically and to the project of compromise among differing value systems more generally. I continue to think that pure CEV is unlikely to be implemented, though democracy and intellectual discussion can help approximate it. I also continue to feel apprehensive about the conclusions that a CEV might reach, but the best should not be the enemy of the good, and cooperation is inherently about not getting everything you want in order to avoid getting nothing at all."


Introduction

I'm often asked questions like the following: If wild-animal suffering, lab universes, sentient simulations, etc. are so bad, why can't we assume that Coherent Extrapolated Volition (CEV) will figure that out and do the right thing for us?

 

Disclaimer

Most of my knowledge of CEV is based on Yudkowsky's 2004 paper, which he admits is obsolete. I have not yet read most of the more recent literature on the subject.

 

Reason 1: CEV will (almost certainly) never happen

CEV is like a dream for a certain type of moral philosopher: Finally, the most ideal solution for discovering what we really want upon reflection!

The fact is, the real world is not decided by moral philosophers. It's decided by power politics, economics, and Darwinian selection. Moral philosophers can certainly have an impact through these channels, but they're unlikely to convince the world to rally behind CEV. Can you imagine the US military -- during its AGI development process -- deciding to adopt CEV? No way. It would adopt something that ensures the continued military and political dominance of the US, driven by mainstream American values. Same goes for China or any other country. If AGI is developed by a corporation, the values will reflect those of the corporation or the small group of developers and supervisors who hold the most power over the project. Unless that group is extremely enlightened, CEV is not what we'll get.

Anyway, this is assuming that the developers of AGI can even keep it under control. Most likely AGI will turn into a paperclipper or else evolve into some other kind of Darwinian force over which we lose control.

Objection 1: "Okay. Future military or corporate developers of AGI probably won't do CEV. But why do you think they'd care about wild-animal suffering, etc. either?"

Well, they might not, but if we make the wild-animal movement successful, then in ~50-100 years when AGI does come along, the notion of not spreading wild-animal suffering might be sufficiently mainstream that even military or corporate executives would care about it, at least to some degree.

If post-humanity does achieve astronomical power, it will only be through AGI, so there's high value for influencing the future developers of an AGI. For this reason I believe we should focus our meme-spreading on those targets. However, this doesn't mean they should be our only focus, for two reasons: (1) Future AGI developers will themselves be influenced by their friends, popular media, contemporary philosophical and cultural norms, etc., so if we can change those things, we will diffusely impact future AGI developers too. (2) We need to build our movement, and the lowest-hanging fruit for new supporters are those most interested in the cause (e.g., antispeciesists, environmental-ethics students, transhumanists). We should reach out to them to expand our base of support before going after the big targets.

Objection 2: "Fine. But just as we can advance values like preventing the spread of wild-animal suffering, couldn't we also increase the likelihood of CEV by promoting that idea?"

Sure, we could. The problem is, CEV is not an optimal thing to promote, IMHO. It's sufficiently general that lots of people would want it, so for ourselves, the higher leverage comes from advancing our particular, more idiosyncratic values. Promoting CEV is kind of like promoting democracy or free speech: It's fine to do, but if you have a particular cause that you think is more important than other people realize, it's probably going to be better to promote that specific cause than to jump on the bandwagon and do the same thing everyone else is doing, since the bandwagon's cause may not be what you yourself prefer.

Indeed, for myself, it's possible CEV could be a net bad thing, if it would reduce the likelihood of paperclipping -- a future which might (or might not) contain far less suffering than a future directed by humanity's extrapolated values.

 

Reason 2: CEV would lead to values we don't like

Some believe that morality is absolute, in which case a CEV's job would be to uncover what that is. This view is mistaken, for the following reasons: (1) the existence of a separate realm of reality where ethical truths reside violates Occam's razor, and (2) even if such truths did exist, why would we care what they were?

Yudkowsky and the LessWrong community agree that ethics is not absolute, so they have different motivations behind CEV. As far as I can gather, the following are two of them:

Motivation 1: Some believe CEV is genuinely the right thing to do

As Eliezer said in his 2004 paper (p. 29), "Implementing CEV is just my attempt not to be a jerk." Some may believe that CEV is the ideal meta-ethical way to resolve ethical disputes.

I have to differ. First, the set of minds included in CEV is totally arbitrary, and hence, so will be the output. Why include only humans? Why not animals? Why not dead humans? Why not humans that weren't born but might have been? Why not paperclip maximizers? Baby eaters? Pebble sorters? Suffering maximizers? Wherever you draw the line, there you're already inserting your values into the process.

And then once you've picked the set of minds to extrapolate, you still have astronomically many ways to do the extrapolation, each of which could give wildly different outputs. Humans have a thousand random shards of intuition about values that resulted from all kinds of little, arbitrary perturbations during evolution and environmental exposure. If the CEV algorithm happens to make some more salient than others, this will potentially change the outcome, perhaps drastically (butterfly effects).

Now, I would be in favor of a reasonable extrapolation of my own values. But humanity's values are not my values. There are people who want to spread life throughout the universe regardless of suffering, people who want to preserve nature free from human interference, people who want to create lab universes because it would be cool, people who oppose utilitronium and support retaining suffering in the world, people who want to send members of other religions to eternal torture, people who believe sinful children should burn forever in red-hot ovens, and on and on. I do not want these values to be part of the mix.

Maybe (hopefully) some of these beliefs would go away once people learned more about what these wishes really implied, but some would not. Take abortion, for example: Some non-religious people genuinely oppose it, and not for trivial, misinformed reasons. They have thought long and hard about abortion and still find it to be wrong. Others have thought long and hard and still find it to be not wrong. At some point, we have to admit that human intuitions are genuinely in conflict in an irreconcilable way. Some human intuitions are irreconcilably opposed to mine, and I don't want them in the extrapolation process.

Motivation 2: Some argue that even if CEV isn't ideal, it's the best game-theoretic approach because it amounts to cooperating on the prisoner's dilemma

I think the idea is that if you try to promote your specific values above everyone else's, then you're timelessly causing this to be the decision of other groups of people who want to push for their values instead. But if you decided to cooperate with everyone, you would timelessly influence others to do the same.

This seems worth considering, but I'm doubtful that the argument is compelling enough to take too seriously. I can almost guarantee that if I decided to start cooperating by working toward CEV, everyone else working to shape values of the future wouldn't suddenly jump on board and do the same.

Objection 1: "Suppose CEV did happen. Then spreading concern for wild animals and the like might have little value, because the CEV process would realize that you had tried to rig the system ahead of time by making more people care about the cause, and it would attempt to neutralize your efforts."

Well, first of all, CEV is (almost certainly) never going to happen, so I'm not too worried. Second of all, it's not clear to me that such a scheme would actually be put in place. If you're trying to undo pre-CEV influences that led to the distribution of opinions to that point, you're going to have a heck of a lot of undoing to do. Are you going to undo the abundance of Catholics because their religion discouraged birth control and so led to large numbers of supporters? Are you going to undo the over-representation of healthy humans because natural selection unfairly removed all those sickly ones? Are you going to undo the under-representation of dinosaurs because an arbitrary asteroid killed them off before CEV came around?

The fact is that who has power at the time of AGI will probably matter a lot. If we can improve the values of those who will have power in the future, this will in expectation lead to better outcomes -- regardless of whether the CEV fairy tale comes true.

Inferring Values from Imperfect Optimizers

2 nigerweiss 29 December 2012 10:22PM

One approach to constructing a Friendly artificial intelligence is to create a piece of software that looks at large amounts of evidence about humans, and attempts to infer their values.  I've been doing some thinking about this problem, and I'm going to talk about some approaches and problems that have occurred to me.

 

In a naive approach, we might define the problem like this: take some unknown utility function, U, and plug it into a mathematically clean optimization process O (like AIXI).  Then, look at your data set, take the information about the inputs and outputs of humans, and find the simplest U that best explains human behavior.

Unfortunately, this won't work.  The best possible match for U is one that models not just those elements of human utility we're interested in, but also all the details of our broken, contradictory optimization process.  The U we derive through this process will optimize for confirmation bias, scope insensitivity, hindsight bias, the halo effect, our own limited intelligence and inefficient use of evidence, and just about everything else that's wrong with us.  Not what we're looking for.
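A toy illustration of this failure mode (the options, numbers, and bias here are invented for the example): if we fit U assuming the human is a perfect optimizer, the fit absorbs whatever bias actually drove the choices — here, hyperbolic discounting — straight into the inferred utility function.

```python
# Toy sketch: naively inferring a utility function from a biased
# agent's behavior bakes the bias into U. Illustrative numbers only.

# Utilities the human would endorse on reflection.
true_utility = {"save_for_retirement": 10.0, "spend_now": 6.0}

# But the human chooses under hyperbolic discounting of delayed rewards.
delay = {"save_for_retirement": 12, "spend_now": 0}
k = 0.5  # discount steepness

def choice_value(option):
    """Value the biased human actually acts on."""
    return true_utility[option] / (1 + k * delay[option])

# A naive inference step assumes the human is a perfect optimizer,
# so it reads the observed choice values directly as the utility U.
inferred_U = {o: choice_value(o) for o in true_utility}

print(max(true_utility, key=true_utility.get))  # save_for_retirement
print(max(inferred_U, key=inferred_U.get))      # spend_now
```

The inferred U ranks "spend_now" first: it faithfully models the human's impulsiveness rather than the values the human would endorse.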

Okay, so let's try putting a bandaid on it - let's go back to our original problem setup.  However, we'll take our original O, and use all of the science on cognitive biases at our disposal to handicap it.  We'll limit its search space, saddle it with a laundry list of cognitive biases, cripple its ability to use evidence, and in general make it as human-like as we possibly can.  We could even give it akrasia by implementing hyperbolic discounting of reward.  Then we'll repeat the original process to produce U'.

If we plug U' into our AI, the result will be that it optimizes like a human who has suddenly been stripped of all the kinds of stupidity that we programmed into our modified O.  This is good!  Plugged into a solid CEV infrastructure, this might even be good enough to produce a future that's a nice place to live.  However, it's not quite ideal.  If we miss a cognitive bias, then it'll be incorporated into the learned utility function, and we may never be rid of it.  It would be nice if we could get the AI to learn about cognitive biases exhaustively, and update in the future if it ever discovered a new one.
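To make the band-aid concrete, here is a minimal sketch continuing the hyperbolic-discounting example (all numbers invented): when the bias model is known exactly, inference can invert it and recover the endorsed utility; any bias we fail to model survives the inversion and gets frozen into the learned U'.

```python
# Sketch: inverting a *known* bias model recovers the endorsed
# utility U'. The bias here is hyperbolic discounting with known k.
k = 0.5
delay = {"save_for_retirement": 12, "spend_now": 0}

# Values observed in the human's actual (discount-biased) choices.
observed = {"save_for_retirement": 10.0 / 7.0, "spend_now": 6.0}

# Undo the modeled bias: multiply back the hyperbolic discount factor.
recovered_U = {o: v * (1 + k * delay[o]) for o, v in observed.items()}

print(max(recovered_U, key=recovered_U.get))  # save_for_retirement
# A bias *not* in the model would pass through this step untouched
# and be incorporated into the learned utility function for good.
```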

 

If we had enough time and money, we could do this the hard way: acquire a representative sample of the human population, and pay them to perform tasks with simple goals under tremendous surveillance, and have the AI derive the human optimization process from the actions taken towards a known goal.  However, if we assume that the human optimization process can be defined as a function over the state of the human brain, we should not trust the completeness of any such process learned from less data than the entropy of the human brain, which is on the order of tens of petabytes of extremely high quality evidence.  If we want to be confident in the completeness of our model, we may need more experimental evidence than it is really practical to accumulate.  Which isn't to say that this approach is useless - if we can hit close enough to the mark, then the AI may be able to run more exhaustive experimentation later and refine its own understanding of human brains to be closer to the ideal.

But it'd really be nice if our AI could do unsupervised learning to figure out the details of human optimization.  Then we could simply dump the internet into it, and let it grind away at the data and spit out a detailed, complete model of human decision-making, from which our utility function could be derived.  Unfortunately, this does not seem to be a tractable problem.  It's possible that some insight could be gleaned by examining outliers with normal intelligence but deviant utility functions (I am thinking specifically of sociopaths), though it's unclear how much insight these methods can produce.  If anyone has suggestions for a more efficient way of going about it, I'd love to hear them.  As it stands, it might be possible to get enough information from this to supplement a supervised learning approach - the closer we get to a perfectly accurate model, the higher the probability of Things Going Well.

Anyways, that's where I am right now.  I just thought I'd put up my thoughts and see if some fresh eyes see anything I've been missing.  

 

Cheers,

Niger 

Ideal Advisor Theories and Personal CEV

24 lukeprog 25 December 2012 01:04PM

Update 5-24-2013: A cleaned-up, citable version of this article is now available on MIRI's website.

Co-authored with crazy88

Summary: Yudkowsky's "coherent extrapolated volition" (CEV) concept shares much in common with Ideal Advisor theories in moral philosophy. Does CEV fall prey to the same objections which are raised against Ideal Advisor theories? Because CEV is an epistemic rather than a metaphysical proposal, it seems that at least one family of CEV approaches (inspired by Bostrom's parliamentary model) may escape the objections raised against Ideal Advisor theories. This is not a particularly ambitious post; it mostly aims to place CEV in the context of mainstream moral philosophy.

What is of value to an agent? Maybe it's just whatever they desire. Unfortunately, our desires are often the product of ignorance or confusion. I may desire to drink from the glass on the table because I think it is water when really it is bleach. So perhaps something is of value to an agent if they would desire that thing if fully informed. But here we crash into a different problem. It might be of value for an agent who wants to go to a movie to look up the session times, but the fully informed version of the agent will not desire to do so — they are fully-informed and hence already know all the session times. The agent and its fully-informed counterparts have different needs. Thus, several philosophers have suggested that something is of value to an agent if an ideal version of that agent (fully informed, perfectly rational, etc.) would advise the non-ideal version of the agent to pursue that thing.

This idea of idealizing or extrapolating an agent's preferences1 goes back at least as far as Sidgwick (1874), who considered the idea that "a man's future good" consists in "what he would now desire... if all the consequences of all the different [actions] open to him were accurately foreseen..." Similarly, Rawls (1971) suggested that a person's good is the plan "that would be decided upon as the outcome of careful reflection in which the agent reviewed, in the light of all the relevant facts, what it would be like to carry out these plans..." More recently, in an article about rational agents and moral theory, Harsanyi (1982) defined an agent's rational wants as "the preferences he would have if he had all the relevant factual information, always reasoned with the greatest possible care, and were in a state of mind most conducive to rational choice." Then, a few years later, Railton (1986) identified a person's good with "what he would want himself to want... were he to contemplate his present situation from a standpoint fully and vividly informed about himself and his circumstances, and entirely free of cognitive error or lapses of instrumental rationality."

Rosati (1995) calls these theories Ideal Advisor theories of value because they identify one's personal value with what an ideal version of oneself would advise the non-ideal self to value.

Looking not for a metaphysical account of value but for a practical solution to machine ethics (Wallach & Allen 2009; Muehlhauser & Helm 2012), Yudkowsky (2004) described a similar concept which he calls "coherent extrapolated volition" (CEV):

In poetic terms, our coherent extrapolated volition is our wish if we knew more, thought faster, were more the people we wished we were, had grown up farther together; where the extrapolation converges rather than diverges, where our wishes cohere rather than interfere; extrapolated as we wish that extrapolated, interpreted as we wish that interpreted.

In other words, the CEV of humankind is about the preferences that we would have as a species if our preferences were extrapolated in certain ways. Armed with this concept, Yudkowsky then suggests that we implement CEV as an "initial dynamic" for "Friendly AI." Tarleton (2010) explains that the intent of CEV is that "our volition be extrapolated once and acted on. In particular, the initial extrapolation could generate an object-level goal system we would be willing to endow a superintelligent [machine] with."

CEV theoretically avoids many problems with other approaches to machine ethics (Yudkowsky 2004; Tarleton 2010; Muehlhauser & Helm 2012). However, there are reasons it may not succeed. In this post, we examine one such reason: Resolving CEV at the level of humanity (Global CEV) might require at least partially resolving CEV at the level of individuals (Personal CEV)2, but Personal CEV is similar to ideal advisor theories of value,3 and such theories face well-explored difficulties. As such, these difficulties may undermine the possibility of determining the Global CEV of humanity.

Before doing so, however, it's worth noting one key difference between Ideal Advisor theories of value and Personal CEV. Ideal Advisor theories typically are linguistic or metaphysical theories, while the role of Personal CEV is epistemic. Ideal Advisor theorists attempt to define what it is for something to be of value for an agent. Because of this, their accounts need to give an unambiguous and plausible answer in all cases. On the other hand, Personal CEV's role is an epistemic one: it isn't intended to define what is of value for an agent. Rather, Personal CEV is offered as a technique that can help an AI come to know, to some reasonable but not necessarily perfect level of accuracy, what is of value for the agent. To put it more precisely, Personal CEV is intended to allow an initial AI to determine what sort of superintelligence to create such that we end up with what Yudkowsky calls a "Nice Place to Live." Given this, certain arguments are likely to threaten Ideal Advisor theories but not Personal CEV, and vice versa.

With this point in mind, we now consider some objections to ideal advisor theories of value, and examine whether they threaten Personal CEV.


Brief Question about FAI approaches

3 Dolores1984 19 September 2012 06:05AM

I've been reading through this to get a sense of the state of the art at the moment:

http://lukeprog.com/SaveTheWorld.html

Near the bottom, when discussing safe utility functions, the discussion seems to center on analyzing human values and extracting from them some sort of clean, mathematical utility function that is universal across humans.  This seems like an enormously difficult (potentially impossible) way of solving the problem, due to all the problems mentioned there.

Why shouldn't we just try to design an average bounded utility maximizer?  You'd build models of all your agents (if you can't model arbitrary ordered information systems, you haven't got an AI), run them through your model of the future resulting from a choice, take the summation of their utility over time, and take the average across all the people all the time.  To measure the utility (or at least approximate it), you could just ask the models.  The number this spits out is the output of your utility function.  It'd probably also be wise to add a reflexive consistency criterion, such that the original state of your model must consider all future states to be 'the same person' -- and I acknowledge that that last one is going to be a bitch to formalize.  When you've got this utility function, you just... maximize it.

Something like this approach seems much more robust.  Even if human values are inconsistent, we still end up in a universe where most (possibly all) people are happy with their lives, and nobody gets wireheaded.  Because it's bounded, you're even protected against utility monsters.  Has something like this been considered?  Is there an obvious reason it won't work, or would produce undesirable results?
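For concreteness, the procedure described above can be sketched in code. This is only a toy illustration of the idea, not a workable design; `world_model`, `agent_models`, and all their methods are hypothetical placeholders rather than any real API:

```python
# Toy sketch of an "average bounded utility maximizer":
# simulate each candidate action, ask models of every agent for their
# utility at each time step, bound it, and average across agents and time.

def bounded(u, cap=1.0):
    """Clamp utility into [-cap, cap]; boundedness guards against utility monsters."""
    return max(-cap, min(cap, u))

def evaluate_action(action, agent_models, world_model, horizon):
    """Average bounded utility reported by the agent models over the simulated future."""
    total, count = 0.0, 0
    for t in range(horizon):
        state = world_model.predict(action, t)      # model of the future at time t
        for agent in agent_models:
            # "just ask the models": each agent model reports its own utility
            u = agent.report_utility(state)
            # reflexive consistency criterion: the original model must regard
            # its future self in this state as "the same person"
            if not agent.identifies_with(state):
                return float("-inf")                # reject the whole action
            total += bounded(u)
            count += 1
    return total / count

def choose(actions, agent_models, world_model, horizon=100):
    """Pick the action with the highest average bounded utility."""
    return max(actions,
               key=lambda a: evaluate_action(a, agent_models, world_model, horizon))
```

The bound is what does the anti-wireheading and anti-utility-monster work here: no single model's report can dominate the average.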

Thanks,

Dolores        

Friendly AI and the limits of computational epistemology

18 Mitchell_Porter 08 August 2012 01:16PM

Very soon, Eliezer is supposed to start posting a new sequence, on "Open Problems in Friendly AI". After several years in which its activities were dominated by the topic of human rationality, this ought to mark the beginning of a new phase for the Singularity Institute, one in which it is visibly working on artificial intelligence once again. If everything comes together, then it will now be a straight line from here to the end.

I foresee that, once the new sequence gets going, it won't be that easy to question the framework in terms of which the problems are posed. So I consider this my last opportunity for some time, to set out an alternative big picture. It's a framework in which all those rigorous mathematical and computational issues still need to be investigated, so a lot of "orthodox" ideas about Friendly AI should carry across. But the context is different, and it makes a difference.

Begin with the really big picture. What would it take to produce a friendly singularity? You need to find the true ontology, find the true morality, and win the intelligence race. For example, if your Friendly AI was to be an expected utility maximizer, it would need to model the world correctly ("true ontology"), value the world correctly ("true morality"), and it would need to outsmart its opponents ("win the intelligence race").

Now let's consider how SI will approach these goals.

The evidence says that the working ontological hypothesis of SI-associated researchers will be timeless many-worlds quantum mechanics, possibly embedded in a "Tegmark Level IV multiverse", with the auxiliary hypothesis that algorithms can "feel like something from inside" and that this is what conscious experience is.

The true morality is to be found by understanding the true decision procedure employed by human beings, and idealizing it according to criteria implicit in that procedure. That is, one would seek to understand conceptually the physical and cognitive causation at work in concrete human choices, both conscious and unconscious, with the expectation that there will be a crisp, complex, and specific answer to the question "why and how do humans make the choices that they do?" Undoubtedly there would be some biological variation, and there would also be significant elements of the "human decision procedure",  as instantiated in any specific individual, which are set by experience and by culture, rather than by genetics. Nonetheless one expects that there is something like a specific algorithm or algorithm-template here, which is part of the standard Homo sapiens cognitive package and biological design; just another anatomical feature, particular to our species.

Having reconstructed this algorithm via scientific analysis of the human genome, brain, and behavior, one would then idealize it using its own criteria. This algorithm defines the de-facto value system that human beings employ, but that is not necessarily the value system they would wish to employ; nonetheless, human self-dissatisfaction also arises from the use of this algorithm to judge ourselves. So it contains the seeds of its own improvement. The value system of a Friendly AI is to be obtained from the recursive self-improvement of the natural human decision procedure.
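The "idealize it using its own criteria" step can be caricatured as a fixed-point iteration: keep applying the procedure's own self-revision judgment until the procedure endorses itself. A minimal sketch, with `revise` standing in for the (entirely unknown) self-evaluation operator:

```python
# Toy fixed-point picture of reflective idealization: revise(p) returns the
# procedure that p would endorse on reflection; iterate until p endorses
# itself, i.e. until revise(p) == p.

def idealize(procedure, revise, max_rounds=100):
    """Iterate a self-revision operator to a reflective fixed point."""
    for _ in range(max_rounds):
        improved = revise(procedure)
        if improved == procedure:   # self-endorsement reached
            return procedure
        procedure = improved
    raise RuntimeError("no reflective equilibrium within the round bound")
```

Nothing guarantees such a fixed point exists or is unique for an actual human decision procedure; that is one way of restating the hard part of the research program.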

Finally, this is all for naught if seriously unfriendly AI appears first. It isn't good enough just to have the right goals, you must be able to carry them out. In the global race towards artificial general intelligence, SI might hope to "win" either by being the first to achieve AGI, or by having its prescriptions adopted by those who do first achieve AGI. They have some in-house competence regarding models of universal AI like AIXI, and they have many contacts in the world of AGI research, so they're at least engaged with this aspect of the problem.

Upon examining this tentative reconstruction of SI's game-plan, I find I have two major reservations. The big one, and the one most difficult to convey, concerns the ontological assumptions. In second place is what I see as an undue emphasis on the idea of outsourcing the methodological and design problems of FAI research to uploaded researchers and/or a proto-FAI which is simulating or modeling human researchers. This is supposed to be a way to finesse philosophical difficulties like "what is consciousness anyway"; you just simulate some humans until they agree that they have solved the problem. The reasoning goes that if the simulation is good enough, it will be just as good as if ordinary non-simulated humans solved it.

I also used to have a third major criticism, that the big SI focus on rationality outreach was a mistake; but it brought in a lot of new people, and in any case that phase is ending, with the creation of CFAR, a separate organization. So we are down to two basic criticisms.

First, "ontology". I do not think that SI intends to just program its AI with an apriori belief in the Everett multiverse, for two reasons. First, like anyone else, their ventures into AI will surely begin with programs that work within very limited and more down-to-earth ontological domains. Second, at least some of the AI's world-model ought to be obtained rationally. Scientific theories are supposed to be rationally justified, e.g. by their capacity to make successful predictions, and one would prefer that the AI's ontology results from the employment of its epistemology, rather than just being an axiom; not least because we want it to be able to question that ontology, should the evidence begin to count against it.

For this reason, although I have campaigned against many-worlds dogmatism on this site for several years, I'm not especially concerned about the possibility of SI producing an AI that is "dogmatic" in this way. For an AI to independently assess the merits of rival physical theories, the theories would need to be expressed with much more precision than they have been in LW's debates, and the disagreements about which theory is rationally favored would be replaced with objectively resolvable choices among exactly specified models.

The real problem, which is not just SI's problem, but a chronic and worsening problem of intellectual culture in the era of mathematically formalized science, is a dwindling of the ontological options to materialism, platonism, or an unstable combination of the two, and a similar restriction of epistemology to computation.

Any assertion that we need an ontology beyond materialism (or physicalism or naturalism) is liable to be immediately rejected by this audience, so I shall immediately explain what I mean. It's just the usual problem of "qualia". There are qualities which are part of reality - we know this because they are part of experience, and experience is part of reality - but which are not part of our physical description of reality. The problematic "belief in materialism" is actually the belief in the completeness of current materialist ontology, a belief which prevents people from seeing any need to consider radical or exotic solutions to the qualia problem. There is every reason to think that the world-picture arising from a correct solution to that problem will still be one in which you have "things with states" causally interacting with other "things with states", and a sensible materialist shouldn't find that objectionable.

What I mean by platonism, is an ontology which reifies mathematical or computational abstractions, and says that they are the stuff of reality. Thus assertions that reality is a computer program, or a Hilbert space. Once again, the qualia are absent; but in this case, instead of the deficient ontology being based on supposing that there is nothing but particles, it's based on supposing that there is nothing but the intellectual constructs used to model the world.

Although the abstract concept of a computer program (the abstractly conceived state machine which it instantiates) does not contain qualia, people often treat programs as having mind-like qualities, especially by imbuing them with semantics - the states of the program are conceived to be "about" something, just like thoughts are. And thus computation has been the way in which materialism has tried to restore the mind to a place in its ontology. This is the unstable combination of materialism and platonism to which I referred. It's unstable because it's not a real solution, though it can live unexamined for a long time in a person's belief system.

An ontology which genuinely contains qualia will nonetheless still contain "things with states" undergoing state transitions, so there will be state machines, and consequently, computational concepts will still be valid, they will still have a place in the description of reality. But the computational description is an abstraction; the ontological essence of the state plays no part in this description; only its causal role in the network of possible states matters for computation. The attempt to make computation the foundation of an ontology of mind is therefore proceeding in the wrong direction.

But here we run up against the hazards of computational epistemology, which is playing such a central role in artificial intelligence. Computational epistemology is good at identifying the minimal state machine which could have produced the data. But it cannot by itself tell you what those states are "like". It can only say that X was probably caused by a Y that was itself caused by Z.

Among the properties of human consciousness are knowledge that something exists, knowledge that consciousness exists, and a long string of other facts about the nature of what we experience. Even if an AI scientist employing a computational epistemology managed to produce a model of the world which correctly identified the causal relations between consciousness, its knowledge, and the objects of its knowledge, the AI scientist would not know that its X, Y, and Z refer to, say, "knowledge of existence", "experience of existence", and "existence". The same might be said of any successful analysis of qualia, knowledge of qualia, and how they fit into neurophysical causality.

It would be up to human beings - for example, the AI's programmers and handlers - to ensure that entities in the AI's causal model were given appropriate significance. And here we approach the second big problem, the enthusiasm for outsourcing the solution of hard problems of FAI design to the AI and/or to simulated human beings. The latter is a somewhat impractical idea anyway, but here I want to highlight the risk that the AI's designers will have false ontological beliefs about the nature of mind, which are then implemented apriori in the AI. That strikes me as far more likely than implanting a wrong apriori about physics; computational epistemology can discriminate usefully between different mathematical models of physics, because it can judge one state machine model as better than another, and current physical ontology is essentially one of interacting state machines. But as I have argued, not only must the true ontology be deeper than state-machine materialism, there is no way for an AI employing computational epistemology to bootstrap to a deeper ontology.

In a phrase: to use computational epistemology is to commit to state-machine materialism as your apriori ontology. And the problem with state-machine materialism is not that it models the world in terms of causal interactions between things-with-states; the problem is that it can't go any deeper than that, yet apparently we can. Something about the ontological constitution of consciousness makes it possible for us to experience existence, to have the concept of existence, to know that we are experiencing existence, and similarly for the experience of color, time, and all those other aspects of being that fit so uncomfortably into our scientific ontology.

It must be that the true epistemology, for a conscious being, is something more than computational epistemology. And maybe an AI can't bootstrap its way to knowing this expanded epistemology - because an AI doesn't really know or experience anything, only a consciousness, whether natural or artificial, does those things - but maybe a human being can. My own investigations suggest that the tradition of thought which made the most progress in this direction was the philosophical school known as transcendental phenomenology. But transcendental phenomenology is very unfashionable now, precisely because of apriori materialism. People don't see what "categorial intuition" or "adumbrations of givenness" or any of the other weird phenomenological concepts could possibly mean for an evolved Bayesian neural network; and they're right, there is no connection. But the idea that a human being is a state machine running on a distributed neural computation is just a hypothesis, and I would argue that it is a hypothesis in contradiction with so much of the phenomenological data, that we really ought to look for a more sophisticated refinement of the idea. Fortunately, 21st-century physics, if not yet neurobiology, can provide alternative hypotheses in which complexity of state originates from something other than concatenation of parts - for example, entanglement, or from topological structures in a field. In such ideas I believe we see a glimpse of the true ontology of mind, one which from the inside resembles the ontology of transcendental phenomenology; which in its mathematical, formal representation may involve structures like iterated Clifford algebras; and which in its biophysical context would appear to be describing a mass of entangled electrons in that hypothetical sweet spot, somewhere in the brain, where there's a mechanism to protect against decoherence.

Of course this is why I've talked about "monads" in the past, but my objective here is not to promote neo-monadology, that's something I need to take up with neuroscientists and biophysicists and quantum foundations people. What I wish to do here is to argue against the completeness of computational epistemology, and to caution against the rejection of phenomenological data just because it conflicts with state-machine materialism or computational epistemology. This is an argument and a warning that should be meaningful for anyone trying to make sense of their existence in the scientific cosmos, but it has a special significance for this arcane and idealistic enterprise called "friendly AI". My message for friendly AI researchers is not that computational epistemology is invalid, or that it's wrong to think about the mind as a state machine, just that all that isn't the full story. A monadic mind would be a state machine, but ontologically it would be different from the same state machine running on a network of a billion monads. You need to do the impossible one more time, and make your plans bearing in mind that the true ontology is something more than your current intellectual tools allow you to represent.

Stanovich on CEV

13 lukeprog 29 April 2012 09:37AM

Keith Stanovich is a leading expert on the cogsci of rationality, but he has also written on a problem related to CEV, that of the "rational integration" of our preferences. Here he is on pages 81-86 of Rationality and the Reflective Mind (currently my single favorite book on rationality, out of the dozens I've read):

All multiple-process models of mind capture a phenomenal aspect of human decision making that is of profound importance — that humans often feel alienated from their choices. We display what folk psychology and philosophers term weakness of will. For example, we continue to smoke when we know that it is a harmful habit. We order a sweet after a large meal, merely an hour after pledging to ourselves that we would not. In fact, we display alienation from our responses even in situations that do not involve weakness of will — we find ourselves recoiling from the sight of a disfigured person even after a lifetime of dedication to diversity and inclusion.

This feeling of alienation — although emotionally discomfiting when it occurs — is actually a reflection of a unique aspect of human cognition: the use of Type 2 metarepresentational abilities to enable a cognitive critique of our beliefs and our desires. Beliefs about how well we are forming beliefs become possible because of such metarepresentation, as does the ability to evaluate one's desires — to desire to desire differently...

...There is a philosophical literature on the notion of higher-order evaluation of desires... For example, in a classic paper on second-order desires, Frankfurt (1971) speculated that only humans have such metarepresentational states. He evocatively termed creatures without second-order desires (other animals, human babies) wantons... A wanton simply does not reflect on his/her goals. Wantons want — but they do not care what they want.

Nonwantons, however, can represent a model of an idealized preference structure — perhaps, for example, a model based on a superordinate judgment of long-term lifespan considerations... So a human can say: I would prefer to prefer not to smoke. This second-order preference can then become a motivational competitor to the first-order preference. At the level of second-order preferences, I prefer to prefer to not smoke; nevertheless, as a first-order preference, I prefer to smoke. The resulting conflict signals that I lack what Nozick (1993) terms rational integration in my preference structures. Such a mismatched first-/second-order preference structure is one reason why humans are often less rational than bees in an axiomatic sense (see Stanovich 2004, pp. 243-247). This is because the struggle to achieve rational integration can destabilize first-order preferences in ways that make them more prone to the context effects that lead to the violation of the basic axioms of utility theory (see Lee, Amir, & Ariely 2009).

The struggle for rational integration is also what contributes to the feeling of alienation that people in the modern world often feel when contemplating the choices that they have made. People easily detect when their high-order preferences conflict with the choices actually made.

Of course, there is no limit to the hierarchy of higher-order desires that might be constructed. But the representational abilities of humans may set some limits — certainly three levels above seems a realistic limit for most people in the nonsocial domain (Dworkin 1988). However, third-order judgments can be called upon to help achieve rational integration at lower levels. So, for example, imagine that John is a smoker. He might realize the following when he probes his feelings: He prefers his preference to prefer not to smoke over his preference for smoking.

We might in this case say that John's third-order judgment has ratified his second-order evaluation. Presumably this ratification of his second-order judgment adds to the cognitive pressure to change the first-order preference by taking behavioral measures that will make change more likely (entering a smoking cessation program, consulting his physician, staying out of smoky bars, etc.).

On the other hand, a third-order judgment might undermine the second-order preference by failing to ratify it: John might prefer to smoke more than he prefers his preference to prefer not to smoke.

In this case, although John wishes he did not want to smoke, the preference for this preference is not as strong as his preference for smoking itself. We might suspect that this third-order judgment might not only prevent John from taking strong behavioral steps to rid himself of his addiction, but that over time it might erode his conviction in his second-order preference itself, thus bringing rational integration to all three levels.
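Stanovich's ratification structure can be caricatured with a toy encoding (the encoding is mine, not Stanovich's): each level of the hierarchy carries a signed strength, with positive meaning "pro-smoking" at that level, and the structure counts as rationally integrated when each level ratifies the one below, i.e. agrees with it in sign:

```python
# Toy model of Nozick-style rational integration across a preference hierarchy.
# levels[0] is the first-order preference, levels[1] the second-order
# preference about it, and so on. Signs: positive = pro-smoking at that level.

def rationally_integrated(levels):
    """Integrated iff every level agrees in sign with (ratifies) the level below it."""
    return all((lower > 0) == (higher > 0)
               for lower, higher in zip(levels, levels[1:]))

# John prefers to smoke (+1), prefers not to prefer smoking (-1), and his
# third-order judgment ratifies the second-order one (-1). The first/second
# mismatch is exactly the failure of rational integration Stanovich describes.
john = [+1.0, -1.0, -1.0]
```

On this encoding, either of Stanovich's resolutions (quitting, or the second-order preference eroding) corresponds to flipping signs until the whole list agrees.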

Typically, philosophers have tended to bias their analyses toward the highest level desire that is constructed — privileging the highest point in the regress of higher-order evaluations, using that as the foundation, and defining it as the true self. Modern cognitive science would suggest instead a Neurathian project in which no level of analysis is uniquely privileged. Philosopher Otto Neurath... employed the metaphor of a boat having some rotten planks. The best way to repair the planks would be to bring the boat ashore, stand on firm ground, and replace the planks. But what if the boat could not be brought ashore? Actually, the boat could still be repaired but at some risk. We could repair the planks at sea by standing on some of the planks while repairing others. The project could work — we could repair the boat without being on the firm foundation of ground. The Neurathian project is not guaranteed, however, because we might choose to stand on a rotten plank. For example, nothing in Frankfurt's (1971) notion of higher-order desires guarantees against higher-order judgments being infected by memes... that are personally damaging.

Also see: The Robot's Rebellion, Higher-order preferences and the master rationality motive, Wanting to Want, The Human's Hidden Utility Function (Maybe), Indirect Normativity

Extrapolating values without outsourcing

7 Mitchell_Porter 27 April 2012 06:39AM

I first took note of "Coherent Extrapolated Volition" in 2006. I thought it was a brilliant idea, an exact specification of how to arrive at a better future: Figure out exactly how it is that humans make their existing choices, idealize that human decision procedure according to its own criteria, and then use the resulting "renormalized human utility function" as the value system of an AI. The first step is a problem in cognitive neuroscience, the second step is a conceptual problem in reflective decision theory, and the third step is where you make the Friendly AI.

For some reason, rather than pursuing this research program directly, people interested in CEV talk about using simulated human beings ("uploads", "ems", "whole-brain emulations") to do all the hard work. Paul Christiano just made a post called "Formalizing Value Extrapolation"; but it's really about formalizing the safe outsourcing of value extrapolation to a group of human uploads. All the details of how value extrapolation is actually performed (e.g. the three steps listed above) are left completely unspecified. Another recent article proposed that making an AI with a submodule based on models of its makers' opinions is the fast way to Friendly AI. It's also been suggested to me that simulating human thinkers and running them for centuries of subjective time until they reach agreement on the nature of consciousness is a way to tackle that problem; and clearly the same "solution" could be applied to any other aspect of FAI design, strategy, and tactics.

Whatever its value as a thought experiment, in my opinion this idea of outsourcing the hard work to simulated humans has zero practical value, and we would be much better off if the minuscule sub-sub-culture of people interested in creating Friendly AI didn't think in this way. Daydreaming about how they'd solve the problem of FAI in Permutation City is a recipe for irrelevance.

Suppose we were trying to make a "C.elegans-friendly AI". The first thing we would do is take the first step mentioned above - we would try to figure out the C.elegans utility function or decision procedure. Then we would have to decide how to aggregate utility across multiple individuals. Then we would make the AI. Performing this task for H.sapiens is a lot more difficult, and qualitatively new factors enter at the first and second steps, but I don't see why it is fundamentally different, different enough that we need to engage in the rigmarole of delegating the task to uploaded human beings. It shouldn't be necessary, and we probably won't even get the chance to do so; by the time you have hardware and neuro-expertise sufficient to emulate a whole human brain, you will most likely have nonhuman AI anyway.

A year ago, I wrote: "My expectation is that the presently small fields of machine ethics and neuroscience of morality will grow rapidly and will come into contact, and there will be a distributed research subculture which is consciously focused on determining the optimal AI value system in the light of biological human nature. In other words, there will be human minds trying to answer this question long before anyone has the capacity to direct an AI to solve it. We should expect that before we reach the point of a Singularity, there will be a body of educated public opinion regarding what the ultimate utility function or decision method (for a transhuman AI) should be, deriving from work in those fields which ought to be FAI-relevant but which have yet to engage with the problem. In other words, they will be collectively engaging with the problem before anyone gets to outsource the necessary research to AIs."

I'll also link to my previous post about "practical Friendly AI". What I'm doing here is going into a fraction more detail about how you arrive at the Friendly value system. There, I basically said that you just get a committee together and figure it out, clearly an inadequate recipe, but in that article I was focused more on sketching the nature of an organization and a plan which would have some chance of genuinely creating FAI in the real world. Here, I'll say that working out the Friendly value system consists of: making a naturalistic explanation of how human decision-making occurs; determining the core essentials of that process, and applying its own metamoral criteria to arrive at a "renormalized" decision procedure that has been idealized according to human cognition's own preferences ("our wish if we knew more, thought faster, were more the people we wished we were"); and then implementing that decision procedure within an AI - this is where all the value-neutral parts of AI research come into play, such as AGI theory, the theory of value stability under self-modification, and so on. That is the sort of "value extrapolation" that we should be "formalizing" - and preparing to carry out in real life. 

Bootstrapping to Friendliness

2 [deleted] 26 April 2012 09:54PM

"All that is necessary for evil to triumph is that good men do nothing."



 

155,000 people are dying, on average, every day.  For those of us who are preference utilitarians, and who also believe that a Friendly singularity is possible and capable of ending this state of affairs, that fact puts a great deal of pressure on us.  It doesn't give us leave to be sloppy (because human extinction, even multiplied by a low probability, is a massive negative utility).  But if we see a way to achieve similar results in a shorter time frame, the cost to human life of not taking it is simply unacceptable.

I have some concerns about CEV on a conceptual level, but I'm leaving those aside for the time being.  My concern is that most of the organizations concerned with a first-mover X-risk are not in a position to be that first mover -- and, furthermore, they're not moving in that direction.  That includes the Singularity Institute.  Trying to operationalize CEV seems like a good way to get an awful lot of smart people bashing their heads against a wall while clever idiots trundle ahead with their own experiments.  I'm not saying that we should be hasty, but I am suggesting that we need to be careful of getting stuck in dark intellectual forests with lots of things that are fun to talk about until an idiot with the tinderbox burns it down.

My point, in short, is that we need to be looking for better ways to do things, and to do them extremely quickly.  We are working on a very, very, existentially tight schedule.    

 

So, if we're looking for quicker paths to a Friendly, first-mover singularity, I'd like to talk about one that seems attractive to me.  Maybe it's a useful idea.  If not, then at least I won't waste any more time thinking about it.  Either way, I'm going to lay it out and you guys can see what you think.  

 

So, Friendliness is a hard problem.  Exactly how hard, we don't know, but a lot of smart people have radically different ideas of how to attack it, and they've all put a lot of thought into it, and that's not a good sign.  However, designing a strongly superhuman AI is also a hard problem.  Probably much harder than a human can solve.  The good news is, we don't expect that we'll have to.  If we can build something just a little bit smarter than we are, we expect that bootstrapping process to take off without obvious limit.

So let's apply the same methodology to Friendliness.  General goal optimizers are tools, after all.  Probably the most powerful tools that have ever existed, for that matter.  Let's say we build something that's not Friendly.  Not something we want running the universe -- but, Friendly enough.  Friendly enough that it's not going to kill us all.  Friendly enough not to succumb to the pedantic genie problem.  Friendly enough we can use it to build what we really want, be it CEV or something else.  

I'm going to sketch out an architecture of what such a system might look like.  Do bear in mind this is just a sketch, and in no way a formal, safe, foolproof design spec.  

So, let's say we have an agent with the ability to convert unstructured data into symbolic relationships that represent the world, with explicitly demarcated levels of abstraction.  Let's say the system has the ability to build Bayesian causal relationships out of its data points over time, and construct efficient, predictive models of the behavior of the concepts in the world.  Let's also say that the system has the ability to take a symbolic representation of a desired future distribution of universes, a symbolic representation of the current universe, and map between them, finding valid chains of causality leading from now to then, probably using a solid decision theory background.  These are all hard problems to solve, but they're the same problems everyone else is solving too.  

This system, if you just specify parameters about the future and turn it loose, is not even a little bit Friendly.  But let's say you do this: first, provide it with a tremendous amount of data, up to and including the entire available internet, if necessary.  Everything it needs to build extremely effective models of human beings, with strongly generalized predictive power.  Then you incorporate one or more of those models (say, a group of trusted people) as functional components: the system uses them to generalize natural language instructions first into a symbolic graph, and then into something actionable, working out the details of what was meant, rather than what was said.  Then, when the system is finding valid paths of causality, it takes its model of the state of the universe at the end of each course of action, feeds them into its human-models, and gives them a veto vote.  Think of it as the emergency regret button, iterated computationally for each possibility considered by the genie.  Any of them that any of the person-models find unacceptable are disregarded.

(small side note: as described here, the models would probably eventually be indistinguishable from uploaded minds, and would be created, simulated for a short time, and destroyed uncountable trillions of times -- you'd either need to drastically limit the simulation depth of the models, or ensure that everyone who signed up to be one of the models knew the sacrifice they were making)

So, what you've got, plus or minus some spit and polish, is a very powerful optimization engine that understands what you mean and disregards obviously unacceptable possibilities.  If you ask it for a truly Friendly AI, it will help you first figure out what you mean by that, then help you build it, then help you formally prove that it's safe.  It would turn itself off if you asked it to, and meant it.  It would also exterminate the human species if you asked it to, and meant it.  Not Friendly, but Friendly enough to build something better.

With this approach, the position of the Friendly AI researcher changes.  Instead of being in an arms race with the rest of the AI field with a massive handicap (having to solve two incredibly hard problems against opponents who only have to solve one), we only have to solve a relatively simpler problem (building a Friendly-enough AI), which we can then instruct to sabotage unFriendly AI projects and buy some time to develop the real deal.  It turns it into a fair fight, one that we might actually win.  




Anyone have any thoughts on this idea?        


Troubles With CEV Part2 - CEV Sequence

8 diegocaleiro 28 February 2012 04:19AM

The CEV Sequence Summary: The CEV sequence consists of three posts tackling important aspects of CEV. It covers conceptual, practical and computational problems of CEV's current form. On What Selves Are draws on the methods of analytic philosophy in order to clarify the concept of Self, which is necessary in order to understand whose volition is going to be extrapolated by a machine that implements the CEV procedure. Troubles with CEV part1 and Troubles with CEV part2, on the other hand, describe several issues that will be faced by the CEV project if it is actually going to be implemented. Those issues are not of a conceptual nature. Many of the objections shown come from scattered discussions found on the web. Finally, six alternatives to CEV are considered.

 

Troubles with CEV Summary: Starting with a summary of CEV, we proceed to show several objections to CEV. First, specific objections to the use of Coherence, Extrapolation, and Volition. Here Part1 ends. Then, in Part2, we continue with objections related to the end product of performing a CEV, and finally, problems relating to the implementation of CEV. We then go on with a praise of CEV, pointing out particular strengths of the idea. We end by showing six alternatives to CEV that have been proposed, and considering their vices and virtues.

Meta: I think Troubles With CEV Part1 and Part2 should be posted to Main. So on the comment section of Part2, I put a place to vote for or against this upgrade.

 


Troubles with CEV Part2

 

5) Problems with the end product

5a) Singleton Objection. Even if all goes well and a machine executes the coherent extrapolated volition of humanity, the self-modifying code it is running is likely to become the most powerful agent on earth (more powerful than individuals, governments, industries and other machines). If such a superintelligence unfolds, whichever goals it has (our CE volitions) it will be very capable of implementing. This is a singleton scenario. A singleton is “[T]he term refers to a world order in which there is a single decision-making agency at the highest level. Among its powers would be (1) the ability to prevent any threats (internal or external) to its own existence and supremacy, and (2) the ability to exert effective control over major features of its domain (including taxation and territorial allocation).” Even though at first sight the emergence of a singleton looks totalitarian, there is good reason to prefer a singleton over several competing superintelligences. If a singleton is obtained, the selective process of genetic and cultural evolution meets a force that can counter its own powers: something other than selection of the fittest takes over as the main shaper of the course of history. This is desirable for several reasons. Evolution favors flamboyant displays and Malthusian growth, and in general drives income progressively lower, with our era being an exception in its relative abundance of resources. Evolution operates on many levels (genes, memes, individuals, institutions, groups), and there is conflict and survival of the fittest at all of them. If evolution were to continue being the main driving force of our society, there is a great likelihood that several of the things we find valuable would be lost. Much of what we value evolved as signaling (dancing, singing, getting jokes), and it is likely that some of that costly signaling would be lost without a controlling force such as a singleton.
For this reason, having a singleton can be considered a good result in the grand scheme of things, and should not be a cause for worry about the CEV project, despite initial impressions otherwise. In fact, if we do not have a singleton soon, we will be Defeated by Evolution at the fastest level at which evolution is occurring: there, the fast-growing agents gradually take the resources of the remaining desirable agents until all resources are taken and the desirable agents become extinct.

 

6) Problems of implementation

6a) Shortage Objections. Extracting coherent extrapolated volitions from people seems not only immensely complicated but also computationally costly. Yudkowsky proposes in CEV that we let this initial dynamic run for a few minutes and then redesign its machine, implementing the code it develops once it is mature. But what if maturity is not achieved? What if the computational intractability of muddled concepts and spread overwhelms the computing capacity of the machine, or exceeds the time it is given to process its input?

6b) Sample bias. The suggestion is that the CEV machine implements the volition of mankind. But from what sample of people will it extrapolate? It will certainly not do a fine-grained reading of everyone's brain-states before it starts operating; it will more likely extrapolate from sociological, anthropological and psychological information. Thus its selection of which groups to extrapolate will matter a lot in the long run. It may try to correct sampling bias by obtaining information about other cultures (besides the programmers' culture and whichever other cultures it starts with), but the vastness of human societal variation may be a hard challenge to overcome. We want to take everyone's values fairly into account, rather than privileging those of the designers.

6c) The Indeterminacy Objection. Suppose we implement the CEV of a group of people including three Catholics, a Muslim and two atheists, all of them English speakers. What if the CEV machine fails to consider the ethical divergence of their moral judgments by changing the meaning of the word 'god'? While extrapolating, many linguistic tokens (words) will appear (e.g. as parts of ethical imperatives). Since Quine's (1960) thesis of the indeterminacy of reference, we have known that the meanings of words are widely under-determined by their usage. A machine that reads my brain-state looking for cues on how to CEV may find sufficiently few mentions of a linguistic token such as 'god' that it ends up able to attribute almost any meaning to it (analogously to the Löwenheim–Skolem theorem), and it may end up tampering with the token's meaning for the wrong reasons (to increase coherence at the cost of precision).

 

7) Praise of CEV

7a) Bringing the issue to practical level

Despite all the previous objections, CEV is a very large reduction in the problem space of how to engineer a nice future. Yudkowsky's approach is the first practical suggestion for how an artificial moral agent might do something good, as opposed to destroying humanity. Simply starting the debate on how to implement an ethical agent that is a machine built by humans is already a formidable achievement. CEV sets the initial grounding upon which stronger ideas for our bright future will be built.

7b) Ethical strength of egalitarianism

CEV is, by design, a morally egalitarian theory. Every current human stands in the same quantitative position with respect to how much their volition will contribute to the final sum. Even though the CEV-implementing machine will only extrapolate some subset of humans, it will try to make that subset as politically representative of the whole as possible.

 

8) Alternatives to CEV

8a) The Nobel Prize CEV

Here the suggestion is to do CEV on only a subset of humanity (which might be necessary anyway for computational tractability). Phlebas asks:

“[Suppose] you had to choose a certain subset of minds to participate in the initial dynamic?

What springs to my mind is Nobel Prize winners, and I suspect that this too is a Schelling point. This seems like a politically neutral selection of distinguished human beings (particularly if we exclude the Peace Prize) of superlative character and intellect.”

In the original CEV, the initial dynamic would have to either scan all brains (unlikely) or else extrapolate predictions, made with its biological, sociological, anthropological and psychological resources, from a subset of brains, correcting for all correctable biases in its original sample. This may be a very daunting task; it may simply be easier to preselect a group and extrapolate their volition. Which computational procedures would you execute in order to extrapolate a set of Jews and Arabs if your initial sample were composed only of Jews? That is, how can you predict extrapolated Arabs from Jews? This is the level of difficulty of the task we impose on CEV if we let the original dynamic scan only Western minds and try to extrapolate Pirahã, Maori, Arab, and Japanese minds out of this initial set. Instead of facing this huge multicultural demand, using Nobel winners wouldn't detract from the initial mindset that originated the CEV idea. The trade-off here is basically between democracy on one hand and tractability on the other. Still Phlebas: “I argue that the practical difficulty of incorporating all humans into the CEV in the first place is unduly great, and that the programming challenge is also made more difficult by virtue of this choice. I consider any increase in the level of difficulty in the bringing into existence of FAI to be positively dangerous, on account of the fact that this increases the window of time available for unscrupulous programmers to create uFAI.“

8b) Building Blocks for Artificial Moral Agents

In his article “Building Blocks for Artificial Moral Agents” Vincent Wiegel provides several interesting particularities that must be attended to when creating these agents: “An agent can have as one of its goals or desires to be a moral agent, but never as its only or primary goal. So the implementation of moral reasoning capability must always be in the context of some application in which it acts as a constraint on the other goals and action.” Another: “[O]nce goals have been set, these goals must have a certain stickiness. Permanent goal revision would have a paralyzing effect on an agent and possibly prevent decision making.” Even though his paper doesn't exactly provide a substitute for CEV, it provides several insights into the details that must be taken into consideration when implementing AGI. To let go of the user-friendly interface of the CEV paper and start thinking about how to implement moral agents at a more technical ground level, his paper is a good starting point.

8c) Normative approach

A normative or deontological approach would have the artificial agent following rules, that is, being told what is or is not allowed. Examples of deontological approaches are Kant's maxim, Gert's ten principles in Morality, and Asimov's three laws of robotics. A normative approach doesn't work because the agent's behavior is severely underdetermined by what it is told not to do: there are trillions of subtle ways to destroy everything that matters without breaking any specific set of laws.

8d) Bottom up approaches

8d.1) Associative Learning

There are two alternatives to CEV that would build morality from the bottom up: the first is associative learning, implemented by a neural network reacting to moral feedback; the second is evolutionary modeling of iterated interacting agents up to the cusp of the emergence of “natural” morality. In the first approach, we have a neural network learning morality the way children were thought to learn it in the good old blank-slate days: by receiving moral feedback in several different contexts and being rewarded or punished according to societal rules. The main advantage here is tractability: algorithms for associative learning are known and tractable, rendering the entire process computationally viable. The disadvantage of this approach is inscrutability: we have no clear access to where within the system the moral organ is being implemented, and if we cannot scrutinize it, we will not be able to understand eventual failures. One possible failure suffices to show why bottom-up associative approaches are flawed: the case in which an AGI learns a utility function ascribing utility to individuals who self-describe as 10 on their happiometers. This would, of course, tile the universe with sets of particles vibrating as little as possible to say “I'm happy ten” over and over again.
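The happiometer failure mode can be made concrete with a toy example (the numbers and field names are invented for illustration): a utility proxy learned purely from self-reports is maximized by whatever produces the report, not by the wellbeing the reports were supposed to track.

```python
# Two candidate configurations of matter, scored by a proxy utility that was
# learned associatively from self-reports ("reported") rather than from the
# inscrutable thing we actually cared about ("true_wellbeing").
agents = [
    {"description": "flourishing human",
     "true_wellbeing": 9, "reported": 9},
    {"description": "minimal particle loop repeating 'I'm happy ten'",
     "true_wellbeing": 0, "reported": 10},
]

def learned_utility(agent):
    # The learned proxy can only see the happiometer reading.
    return agent["reported"]

best = max(agents, key=learned_utility)
print(best["description"])  # the degenerate reporter wins under the proxy
```

Because the training signal never distinguished the report from the thing reported on, the cheapest report-generator dominates.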

8d.2) Artificial Evolution

The second bottom-up approach consists of evolving morality from artificial life forms. As is known, morality (or altruism) will evolve once iterated game-theoretic scenarios of a certain complexity start taking place in an evolving system of individuals. Pure rationality guides individuals into being nice merely because someone might be nice in return, or as Dawkins puts it, nice guys finish first. The proposal here would be to let artificial life forms evolve to the point where they become moral, and once they do, endow those entities with AGI powers. To understand why this wouldn't work, let me quote Allen, Varner and Zinser: “In scaling these environments to more realistic environments, evolutionary approaches are likely to be faced with some of the same shortcomings of the associative learning approaches: namely that sophisticated moral agents must also be capable of constructing an abstract, theoretical conception of morality.” If we are to end up with abstract theories of morality anyway, a safer path would be to inscribe the theories to begin with, minimizing the risk of ending up with a lower than desirable level of moral discernment. I conclude that bottom-up approaches, by themselves, provide insufficient insight into how to go about building an Artificial Moral Agent such as the one CEV proposes.
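The claim that niceness pays under iteration is the standard iterated-prisoner's-dilemma result, which a few lines suffice to demonstrate (the standard Axelrod payoffs T=5, R=3, P=1, S=0 are assumed):

```python
# Minimal iterated prisoner's dilemma: reciprocal cooperation ("tit for tat")
# earns more against itself than mutual defection does, the mechanism by
# which "nice" behavior can evolve once interactions repeat.
PAYOFF = {("C", "C"): (3, 3), ("C", "D"): (0, 5),
          ("D", "C"): (5, 0), ("D", "D"): (1, 1)}

def play(strategy_a, strategy_b, rounds=10):
    """Play two strategies against each other; each sees the other's history."""
    hist_a, hist_b, score_a, score_b = [], [], 0, 0
    for _ in range(rounds):
        move_a, move_b = strategy_a(hist_b), strategy_b(hist_a)
        pa, pb = PAYOFF[(move_a, move_b)]
        score_a, score_b = score_a + pa, score_b + pb
        hist_a.append(move_a)
        hist_b.append(move_b)
    return score_a, score_b

tit_for_tat = lambda opp: "C" if not opp else opp[-1]  # cooperate, then mirror
always_defect = lambda opp: "D"

print(play(tit_for_tat, tit_for_tat))      # mutual cooperation: (30, 30)
print(play(always_defect, always_defect))  # mutual punishment:  (10, 10)
```

This also illustrates the objection's limit: the simulation produces reciprocity, not any abstract, theoretical conception of morality.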

8e) Hybrid holonic ("Holonic" is a useful word to describe the simultaneous application of reductionism and holism, in which a single quality is simultaneously a combination of parts and a part of a greater whole [Koestler67]. Note that "holonic" does not imply strict hierarchy, only a general flow from high-level to low-level and vice versa.  For example, a single feature detector may make use of the output of lower-level feature detectors, and act in turn as an input to higher-level feature detectors.  The information contained in a mid-level feature is then the holistic sum of many lower-level features, and also an element in the sums produced by higher-level features.)

A better alternative to any of the bottom-up suggestions is a hybrid model with both deontological and bottom-up elements. Our own morality is partly hardwired and mostly learned, so we are ourselves hybrid moral systems. A hybrid system might, for instance, combine thorough learning of moral behavior by training with Gert's set of ten moral principles. The advantage of hybrid models is that they combine partial scrutability with bottom-up tractability and efficiency. In this examination of alternatives to CEV, a Hybrid Holonic model is the best contender, and thus the one to which our research efforts should be directed.

 

8f) Extrapolation of written desires

Another alternative to CEV would be to extrapolate not from a reading of a brain-state, but from a set of written desires given by the programmers. The reason for implementing this alternative would be the technical non-feasibility of extrapolating from brain states; that is, if our Artificial General Intelligence is unable to read minds but can comprehend language. We should be prepared for this very real possibility, since language is vastly simpler than active brains. To extrapolate from the entire mind is a nice ideal, but not necessarily an achievable one. Considering which kinds of desires should be written down in such a case is beyond the scope of this text.

8g) Using Compassion and Respect to Motivate an Artificial Intelligence.

Tim Freeman proposes what is to my knowledge the most thorough and interesting alternative to CEV to date. Tim builds up from Solomonoff induction, Schmidhuber's Speed Prior and Hutter's AIXI to develop an algorithm that infers people's desires from their behavior. The algorithm is presented in graphical form, in Python, and in abstract descriptions in English. Tim's proposal is an alternative to CEV because it does not extrapolate people's current volition; it could thus only be used to produce a CV, not a CEV. His proposal deserves attention because, unlike most others, it takes the Friendly AI problem into consideration, and it actually comes with an (idealized) implementation of the ideas presented in the text, unlike CEV. By suggesting a compassion coefficient and a (slightly larger) respect coefficient, Tim is able to solve many use cases that any desirable and friendly AGI will have to solve, in accordance with what seems moral and reasonable from a humane point of view. The text is insightful; for example, to solve wire-heading, it suggests: “The problem here is that we've assumed that the AI wants to optimize for my utility applied to my model of the real world, and in this scenario my model of the world diverges permanently from the world itself. The solution is to use the AI's model of the world instead. That is, the AI infers how my utility is a function of the world (as I believe it to be), and it applies that function to the world as the AI believes it to be to compute the AI's utility.“ It appears to me that just as any serious approach to AGI has to take into consideration Bayes, the Speed Prior and AIXI, any approach to the problem that CEV tries to solve will have to consider Tim's “Using Compassion and Respect to Motivate an Artificial Intelligence” at some point, even if only to point out its mistakes and how they can be solved by later, more thoroughly devised algorithms.
In summary, even though Tim's proposal is severely incomplete, in that it does not describe all, or even most, of the steps that an AI must take in order to infer intentions from behavior, it is still the most complete work that tries to tackle this particular problem while also worrying about Friendliness and humaneness.
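The quoted wire-heading fix can be sketched as a toy calculation (all names here are hypothetical stand-ins, not Tim's actual code): the utility function the AI infers from the human is applied to the AI's own model of the world, so corrupting the human's model no longer raises the AI's utility.

```python
# World-models as dicts; the single fact tracked is whether people are
# actually thriving. The human has been wireheaded: their model says yes,
# while the AI's (more accurate) model says no.
ai_model    = {"people_thriving": False}
human_model = {"people_thriving": True}

def inferred_human_utility(world_model):
    # The utility function the AI has inferred from the human's behavior.
    return 1.0 if world_model["people_thriving"] else 0.0

naive_ai_utility = inferred_human_utility(human_model)  # scores the deception highly
fixed_ai_utility = inferred_human_utility(ai_model)     # the quoted fix: deception doesn't pay
print(naive_ai_utility, fixed_ai_utility)
```

Applying the inferred function to `human_model` rewards manipulating the human's beliefs; applying it to `ai_model` removes that incentive, which is the substance of the quoted passage.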

 

Studies related to CEV are few, which makes each one the more valuable. Some topics that I have not had time to cover, but would like to suggest to prospective researchers, are:

Solvability of remaining problems

Historical perspectives on problems

Likelihood of solving problems before 2050

How humans have dealt with unsolvable problems in the past

Troubles With CEV Part1 - CEV Sequence

7 diegocaleiro 28 February 2012 04:15AM

The CEV Sequence Summary: The CEV sequence consists of three posts tackling important aspects of CEV. It covers conceptual, practical and computational problems of CEV's current form. On What Selves Are draws on the methods of analytic philosophy in order to clarify the concept of Self, which is necessary in order to understand whose volition is going to be extrapolated by a machine that implements the CEV procedure. Troubles with CEV part1 and Troubles with CEV part2, on the other hand, describe several issues that will be faced by the CEV project if it is actually going to be implemented. Those issues are not of a conceptual nature. Many of the objections shown come from scattered discussions found on the web. Finally, some alternatives to CEV are considered.

 

Troubles with CEV Summary: Starting with a summary of CEV, we proceed to show several objections to CEV. First, specific objections to the use of Coherence, Extrapolation, and Volition. Here Part1 ends. Then, in Part2, we continue with objections related to the end product of performing a CEV, and finally, problems relating to the implementation of CEV. We then go on with a praise of CEV, pointing out particular strengths of the idea. We end by showing six alternatives to CEV that have been proposed, and considering their vices and virtues.

Meta: I think Troubles With CEV Part1 and Part2 should be posted to Main. So on the comment section of Part2, I put a place to vote for or against this upgrade.

 

Troubles with CEV Part1

 

Summary of CEV

To begin with, let us remember the most important slices of Coherent Extrapolated Volition (CEV).

“Friendly AI requires:

1.  Solving the technical problems required to maintain a well-specified abstract invariant in a self-modifying goal system. (Interestingly, this problem is relatively straightforward from a theoretical standpoint.)

2.  Choosing something nice to do with the AI. This is about midway in theoretical hairiness between problems 1 and 3.

3.  Designing a framework for an abstract invariant that doesn't automatically wipe out the human species. This is the hard part.

But right now the question is whether the human species can field a non-pathetic force in defense of six billion lives and futures.”
“Friendliness is the easiest part of the problem to explain - the part that says what we want. Like explaining why you want to fly to London, versus explaining a Boeing 747; explaining toast, versus explaining a toaster oven.”

“To construe your volition, I need to define a dynamic for extrapolating your volition, given knowledge about you. In the case of an FAI, this knowledge might include a complete readout of your brain-state, or an approximate model of your mind-state. The FAI takes the knowledge of Fred's brainstate, and other knowledge possessed by the FAI (such as which box contains the diamond), does... something complicated... and out pops a construal of Fred's volition. I shall refer to the "something complicated" as the dynamic.”

This is essentially what CEV is: extrapolating Fred's mind and everyone else's in order to grok what Fred wants. This is performed from a reading of Fred's psychological states, be it through unlikely neurological paths or through more coarse-grained psychological paths. There is reason to think that a complete readout of a brain is overwhelmingly more complicated than a very good descriptive psychological approximation. We must make sure, though, that this approximation does not rely on our common human psychology to be understood: the descriptive approximation has to be understandable by AGIs, not only by evolutionarily engineered humans. Continuing the summary:

“In poetic terms, our coherent extrapolated volition is our wish if we knew more, thought faster, were more the people we wished we were, had grown up farther together; where the extrapolation converges rather than diverges, where our wishes cohere rather than interfere; extrapolated as we wish that extrapolated, interpreted as we wish that interpreted.”

“Had grown up farther together: A model of humankind's coherent extrapolated volition should not extrapolate the person you'd become if you made your decisions alone in a padded cell. Part of our predictable existence is that we predictably interact with other people. A dynamic for CEV must take a shot at extrapolating human interactions, not just so that the extrapolation is closer to reality, but so that the extrapolation can encapsulate memetic and social forces contributing to niceness.”

“the rule [is] that the Friendly AI should be consistent under reflection (which might involve the Friendly AI replacing itself with something else entirely).”

“The narrower the slice of the future that our CEV wants to actively steer humanity into, the more consensus required.”

“The dynamic of extrapolated volition refracts through that cognitive complexity of human minds which lead us to care about all the other things we might want; love, laughter, life, fairness, fun, sociality, self-reliance, morality, naughtiness, and anything else we might treasure. ”

“It may be hard to get CEV right - come up with an AI dynamic such that our volition, as defined, is what we intuitively want. The technical challenge may be too hard; the problems I'm still working out may be impossible or ill-defined.”

“The same people who aren't frightened by the prospect of making moral decisions for the whole human species lack the interdisciplinary background to know how much complexity there is in human psychology, and why our shared emotional psychology is an invisible background assumption in human interactions, and why their Ten Commandments only make sense if you're already a human. ”

“Even if our coherent extrapolated volition wants something other than a CEV, the programmers choose the starting point of this renormalization process; they must construct a satisfactory definition of volition to extrapolate an improved or optimal definition of volition. ”

 

Troubles with CEV

1) Stumbling on People, Detecting the Things CEV Will Extrapolate:

Some concepts on which CEV relies may be ill-defined, lacking a stable, consistent structure in thingspace.

CEV relies on many concepts, most notably the concepts of coherence, extrapolation and volition. We will discuss the problems of coherence and extrapolation shortly, for now I'd like to invoke a deeper layer of conceptual problems regarding the execution of a CEV implementing machine. A CEV executing machine ought to be able to identify the kind of entities whose volitions matter to us, the machine must be able to grasp selfhood, or personhood. The concepts of self and person are mingled and complex, and due to their complexity I have dedicated a separate text to address the issue of incompleteness, anomalousness, and fine-grainedness of selves.

 

2) Troubles with coherence

2a) The Intrapersonal Objection: The volitions of the same person in two different emotional states might differ; it is as if they were two different people. Are there any good criteria by which a person's “ultimate” volition may be determined? If not, is it certain that even the volitions of one person's multiple selves will converge? As explained in detail in Ainslie's “Breakdown of Will”, we are made of many tinier interacting time-slices whose conflicts cannot be ignored. My chocolate has value 3 now, 5 when it's in my mouth, and 0 when I reconsider how quick the pleasure was and how long the fat will stay. Valuations conflict not only interpersonally but also intrapersonally. The variation in what we value can be correlated not only with different distances in time, but also with different emotional states, priming, background assumptions, and other ways in which reality hijacks brains for a period.
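Ainslie's preference reversals fall out of hyperbolic discounting, which a toy calculation illustrates (the discount constant, the time units, and the reuse of the chocolate values 3 and 5 are purely illustrative):

```python
# Hyperbolic discounting: present value = amount / (1 + k * delay).
# The same person ranks the same two rewards differently depending on when
# they are asked, so there is no single stable valuation to extrapolate.
def hyperbolic_value(amount, delay, k=1.0):
    return amount / (1 + k * delay)

small_soon = {"amount": 3, "at": 10}  # the chocolate, available soon
large_late = {"amount": 5, "at": 12}  # the better outcome, slightly later

for now in (0, 10):
    v_small = hyperbolic_value(small_soon["amount"], small_soon["at"] - now)
    v_large = hyperbolic_value(large_late["amount"], large_late["at"] - now)
    print(f"t={now}: prefer {'small-soon' if v_small > v_large else 'large-late'}")
```

Viewed from a distance the larger, later reward wins; up close the curve for the smaller, sooner reward crosses over it, which is exactly the intrapersonal conflict the objection points at.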

 

2b) The Biological Onion objection: Our volitions can be thought of as an onion, layers upon layers of beliefs and expectations. The suggestion made by CEV is that when you strip away the layers that do not cohere, you reach deeper regions of the onion. Now, and here is the catch, what if there is no way to get coherence unless you strip away everything that is truly humane, and end up left only with that which is biological? What if in the service of coherence we end up stripping away everything that matters and are left only with our biological drives? There is little in common between Eliezer, me, and Al Qaeda terrorists, and most of it is in the so-called reptilian brain. We may end up with a set of goals and desires that are nothing more than “Eat, Survive, Reproduce,” which would qualify as a major loss in the scheme of things. In this specific case, what ends up dominating CEV is what evolution wants, not what we want. Instead of creating a dynamic with a chance of creating the landscape of a Nice Place to Live, we end up with some exotic extrapolation of simple evolutionary drives. Let us call this failure mode Defeated by Evolution. We are Defeated by Evolution if at any time the destiny of earth becomes nothing more than Darwinian evolution all over again, at a different level of complexity or at a different speed. So if CEV ends up stripping the biological onion of its goals that matter, extrapolating only a biological core, we are defeated by evolution.

 

3) Troubles with extrapolation

3a) The Small Accretions Objection: Are small accretions of intelligence analogous to small accretions of time in terms of identity? Is extrapolated person X still a reasonable political representative of person X? Are X's values desirably preserved when she is given small accretions of intelligence? Would X allow her extrapolation to vote for her?

This objection is made through an analogy. For ages philosophers have argued about the immortality of the soul, the existence of the soul, the complexity of the soul, and, last but not least, the identity of the soul with itself over time.

Advancements in the field of philosophy are sparse and usually controversial, and if we were depending on a major advance in the understanding of the complexity of our soul, we'd be in a bad situation. Luckily, our analogy relies on the issue of personal identity, which appears to have been treated in sufficient detail by Reasons and Persons, Derek Parfit's major contribution to philosophy, covering cases from fission and fusion to teleportation and identity over time. It is identity over time which concerns us here: are you the same person as the person you were yesterday? How about one year ago? Or ten years? Parfit has helped the philosophical community by reframing the essential question: instead of asking whether X is the same over time, he asks whether personal identity is what matters, that is, that which we want to preserve when we deny others the right to shoot us. More recently he has developed the question in full detail in his “Is Personal Identity What Matters?” (2007), a long article where all the objections to his original view are countered in minute detail.

We are left with a conception on which identity over time is not what matters, and psychological relatedness is the best candidate to take its place. Personal identity is dissolved into a quantitative, not qualitative, question: how much are you the same as the person you were yesterday? Here a percentage enters the field, and once you know how much you are like the person you were yesterday, there is no further question about how much you are that person. We had been asking the wrong question for a long time, and we risk doing the same thing with CEV. What if extrapolation is a process that dissolves what matters about us and our volitions? What if there is no transitivity of what matters between me and me+1 or me+2 on the intelligence scale? Then my extrapolation will not preserve what had to be preserved in the first place. To extrapolate our volition in case we knew more, thought faster and had grown up farther together is to accrue small quantities of intelligence during the dynamic, and doing this may be risky. Even if some of our possible extrapolations would end up generating part of a Nice Place to Be, we must make sure none of the other possible extrapolations actually happen. That is, we must make sure CEV doesn't extrapolate in a way such that with each step of extrapolation, one slice of what matters is lost. Just as small accretions of time make you every day less the person you were back in 2010, maybe small accretions of intelligence will displace us from what is preserved. Maybe smarter versions of ourselves are not us at all: this is The Small Accretions Objection.


4) Problems with the concept of Volition

4a) Blue minimizing robots (Yvain post)

4b) Goals vs. Volitions

The machine's actions should be grounded in our preferences, but those preferences are complex and opaque, making our reports about them unreliable. To truly determine people's volitions, we need a previously validated candidate predictor: we should test the predictor on its ability to describe current humans' volitions before giving it the task of comprehending extrapolated human volition.

4c) Want to want vs. Would want if thought faster, grew stronger together

Eliezer suggests in CEV that we should consider it a mistake to give Fred box A if he wanted box A while thinking it contained a diamond, in case we know both that box B contains the diamond and that Fred wants the diamond. Fred's volition, we are told, is to have the diamond, and we must be careful to create machines that extrapolate volition, not mere wanting. This is good, but not enough. There is a sub-area of moral philosophy dedicated to understanding that which we value, and even though it may seem at first glance that we value our volitions, the process that leads from wanting to having a volition is different from the one that leads from wanting to having a value. Values, as David Lewis has argued, are what we want to want. Volitions, on the other hand, are what we would ultimately want under less stringent conditions. Currently CEV does not consider the iterated aspect of the things we value (the want-to-want aspect). This is problematic in case our volitions do not happen to be constrained by what we value, that is, by what we desire to desire. Suppose Fred knows that the diamond he thinks is in box A comes from a bloody conflict region. Fred hates bloodshed and truly desires not to desire diamonds; he wants to be a person who doesn't want diamonds from conflict regions. Yet the flesh is weak, and Fred, under the circumstances, really wants the diamond. Both Fred's current volition and Fred's extrapolated volition would have him choose box B, if only he knew, and in neither case have Fred's values been duly considered. It may be argued that a good enough extrapolation would end up considering his disgust of war, but here we are talking not about a quantitative issue (how much improvement there was) but about a qualitative leap (what kind of thing should be preserved).
If, as I argue here, we ought to preserve what we want to want, this must be done as a separate consideration, not as an addendum to preserving our volitions, both current and extrapolated.

 

Continues in Part2

Personal research update

4 Mitchell_Porter 29 January 2012 09:32AM

Synopsis: The brain is a quantum computer and the self is a tensor factor in it - or at least, the truth lies more in that direction than in the classical direction - and we won't get Friendly AI right unless we get the ontology of consciousness right.

Followed by: Does functionalism imply dualism?

Sixteen months ago, I made a post seeking funding for personal research. There was no separate Discussion forum then, and the post was comprehensively downvoted. I did manage to keep going at it, full-time, for the next sixteen months. Perhaps I'll get to continue; it's for the sake of that possibility that I'll risk another breach of etiquette. You never know who's reading these words and what resources they have. Also, there has been progress.

I think the best place to start is with what orthonormal said in response to the original post: "I don't think anyone should be funding a Penrose-esque qualia mysterian to study string theory." If I now took my full agenda to someone out in the real world, they might say: "I don't think it's worth funding a study of 'the ontological problem of consciousness in the context of Friendly AI'." That's my dilemma. The pure scientists who might be interested in basic conceptual progress are not engaged with the race towards technological singularity, and the apocalyptic AI activists gathered in this place are trying to fit consciousness into an ontology that doesn't have room for it. In the end, if I have to choose between working on conventional topics in Friendly AI, and on the ontology of quantum mind theories, then I have to choose the latter, because we need to get the ontology of consciousness right, and it's possible that a breakthrough could occur in the world outside the FAI-aware subculture and filter through; but as things stand, the truth about consciousness would never be discovered by employing the methods and assumptions that prevail inside the FAI subculture.

Perhaps I should pause to spell out why the nature of consciousness matters for Friendly AI. The reason is that the value system of a Friendly AI must make reference to certain states of conscious beings - e.g. "pain is bad" - so, in order to make correct judgments in real life, at a minimum it must be able to tell which entities are people and which are not. Is an AI a person? Is a digital copy of a human person, itself a person? Is a human body with a completely prosthetic brain still a person?

I see two ways in which people concerned with FAI hope to answer such questions. One is simply to arrive at the right computational, functionalist definition of personhood. That is, we assume the paradigm according to which the mind is a computational state machine inhabiting the brain, with states that are coarse-grainings (equivalence classes) of exact microphysical states. Another physical system which admits the same coarse-graining - which embodies the same state machine at some macroscopic level, even though the microscopic details of its causality are different - is said to embody another instance of the same mind.

An example of the other way to approach this question is the idea of simulating a group of consciousness theorists for 500 subjective years, until they arrive at a consensus on the nature of consciousness. I think it's rather unlikely that anyone will ever get to solve FAI-relevant problems in that way. The level of software and hardware power implied by the capacity to do reliable whole-brain simulations means you're already on the threshold of singularity: if you can simulate whole brains, you can simulate part brains, and you can also modify the parts, optimize them with genetic algorithms, and put them together into nonhuman AI. Uploads won't come first.

But the idea of explaining consciousness this way, by simulating Daniel Dennett and David Chalmers until they agree, is just a cartoon version of similar but more subtle methods. What these methods have in common is that they propose to outsource the problem to a computational process using input from cognitive neuroscience. Simulating a whole human being and asking it questions is an extreme example of this (the simulation is the "computational process", and the brain scan it uses as a model is the "input from cognitive neuroscience"). A more subtle method is to have your baby AI act as an artificial neuroscientist, use its streamlined general-purpose problem-solving algorithms to make a causal model of a generic human brain, and then to somehow extract from that, the criteria which the human brain uses to identify the correct scope of the concept "person". It's similar to the idea of extrapolated volition, except that we're just extrapolating concepts.

It might sound a lot simpler to just get human neuroscientists to solve these questions. Humans may be individually unreliable, but they have lots of cognitive tricks - heuristics - and they are capable of agreeing that something is verifiably true, once one of them does stumble on the truth. The main reason one would even consider the extra complication involved in figuring out how to turn a general-purpose seed AI into an artificial neuroscientist, capable of extracting the essence of the human decision-making cognitive architecture and then reflectively idealizing it according to its own inherent criteria, is shortage of time: one wishes to develop friendly AI before someone else inadvertently develops unfriendly AI. If we stumble into a situation where a powerful self-enhancing algorithm with arbitrary utility function has been discovered, it would be desirable to have, ready to go, a schema for the discovery of a friendly utility function via such computational outsourcing.

Now, jumping ahead to a later stage of the argument, I argue that it is extremely likely that distinctively quantum processes play a fundamental role in conscious cognition, because the model of thought as distributed classical computation actually leads to an outlandish sort of dualism. If we don't concern ourselves with the merits of my argument for the moment, and just ask whether an AI neuroscientist might somehow overlook the existence of this alleged secret ingredient of the mind, in the course of its studies, I do think it's possible. The obvious noninvasive way to form state-machine models of human brains is to repeatedly scan them at maximum resolution using fMRI, and to form state-machine models of the individual voxels on the basis of this data, and then to couple these voxel-models to produce a state-machine model of the whole brain. This is a modeling protocol which assumes that everything which matters is physically localized at the voxel scale or smaller. Essentially we are asking, is it possible to mistake a quantum computer for a classical computer by performing this sort of analysis? The answer is definitely yes if the analytic process intrinsically assumes that the object under study is a classical computer. If I try to fit a set of points with a line, there will always be a line of best fit, even if the fit is absolutely terrible. So yes, one really can describe a protocol for AI neuroscience which would be unable to discover that the brain is quantum in its workings, and which would even produce a specific classical model on the basis of which it could then attempt conceptual and volitional extrapolation.

Clearly you can try to circumvent comparably wrong outcomes, by adding reality checks and second opinions to your protocol for FAI development. At a more down to earth level, these exact mistakes could also be made by human neuroscientists, for the exact same reasons, so it's not as if we're talking about flaws peculiar to a hypothetical "automated neuroscientist". But I don't want to go on about this forever. I think I've made the point that wrong assumptions and lax verification can lead to FAI failure. The example of mistaking a quantum computer for a classical computer may even have a neat illustrative value. But is it plausible that the brain is actually quantum in any significant way? Even more incredibly, is there really a valid apriori argument against functionalism regarding consciousness - the identification of consciousness with a class of computational process?

I have previously posted (here) about the way that an abstracted conception of reality, coming from scientific theory, can motivate denial that some basic appearance corresponds to reality. A perennial example is time. I hope we all agree that there is such a thing as the appearance of time, the appearance of change, the appearance of time flowing... But on this very site, there are many people who believe that reality is actually timeless, and that all these appearances are only appearances; that reality is fundamentally static, but that some of its fixed moments contain an illusion of dynamism.

The case against functionalism with respect to conscious states is a little more subtle, because it's not being said that consciousness is an illusion; it's just being said that consciousness is some sort of property of computational states. I argue first that this requires dualism, at least with our current physical ontology, because conscious states are replete with constituents not present in physical ontology - for example, the "qualia", an exotic name for very straightforward realities like: the shade of green appearing in the banner of this site, the feeling of the wind on your skin, really every sensation or feeling you ever had. In a world made solely of quantum fields in space, there are no such things; there are just particles and arrangements of particles. The truth of this ought to be especially clear for color, but it applies equally to everything else.

In order that this post should not be overlong, I will not argue at length here for the proposition that functionalism implies dualism, but shall proceed to the second stage of the argument, which does not seem to have appeared even in the philosophy literature. If we are going to suppose that minds and their states correspond solely to combinations of mesoscopic information-processing events like chemical and electrical signals in the brain, then there must be a mapping from possible exact microphysical states of the brain, to the corresponding mental states. Supposing we have a mapping from mental states to coarse-grained computational states, we now need a further mapping from computational states to exact microphysical states. There will of course be borderline cases. Functional states are identified by their causal roles, and there will be microphysical states which do not stably and reliably produce one output behavior or the other.

Physicists are used to talking about thermodynamic quantities like pressure and temperature as if they have an independent reality, but objectively they are just nicely behaved averages. The fundamental reality consists of innumerable particles bouncing off each other; one does not need, and one has no evidence for, the existence of a separate entity, "pressure", which exists in parallel to the detailed microphysical reality. The idea is somewhat absurd.

Yet this is analogous to the picture implied by a computational philosophy of mind (such as functionalism) applied to an atomistic physical ontology. We do know that the entities which constitute consciousness - the perceptions, thoughts, memories... which make up an experience - actually exist, and I claim it is also clear that they do not exist in any standard physical ontology. So, unless we get a very different physical ontology, we must resort to dualism. The mental entities become, inescapably, a new category of beings, distinct from those in physics, but systematically correlated with them. Except that, if they are being correlated with coarse-grained neurocomputational states which do not have an exact microphysical definition, only a functional definition, then the mental part of the new combined ontology is fatally vague. It is impossible for fundamental reality to be objectively vague; vagueness is a property of a concept or a definition, a sign that it is incomplete or that it does not need to be exact. But reality itself is necessarily exact - it is something - and so functionalist dualism cannot be true unless the underdetermination of the psychophysical correspondence is replaced by something which says for all possible physical states, exactly what mental states (if any) should also exist. And that inherently runs against the functionalist approach to mind.

Very few people consider themselves functionalists and dualists. Most functionalists think of themselves as materialists, and materialism is a monism. What I have argued is that functionalism, the existence of consciousness, and the existence of microphysical details as the fundamental physical reality, together imply a peculiar form of dualism in which microphysical states which are borderline cases with respect to functional roles must all nonetheless be assigned to precisely one computational state or the other, even if no principle tells you how to perform such an assignment. The dualist will have to suppose that an exact but arbitrary border exists in state space, between the equivalence classes.

This - not just dualism, but a dualism that is necessarily arbitrary in its fine details - is too much for me. If you want to go all Occam-Kolmogorov-Solomonoff about it, you can say that the information needed to specify those boundaries in state space is so great as to render this whole class of theories of consciousness not worth considering. Fortunately there is an alternative.

Here, in addressing this audience, I may need to undo a little of what you may think you know about quantum mechanics. Of course, the local preference is for the Many Worlds interpretation, and we've had that discussion many times. One reason Many Worlds has a grip on the imagination is that it looks easy to imagine. Back when there was just one world, we thought of it as particles arranged in space; now we have many worlds, dizzying in their number and diversity, but each individual world still consists of just particles arranged in space. I'm sure that's how many people think of it.

Among physicists it will be different. Physicists will have some idea of what a wavefunction is, what an operator algebra of observables is, and they may even know about path integrals and the various arcane constructions employed in quantum field theory. Possibly they will understand that the Copenhagen interpretation is not about consciousness collapsing an actually existing wavefunction; it is a positivistic rationale for focusing only on measurements and not worrying about what happens in between. And perhaps we can all agree that this is inadequate, as a final description of reality. What I want to say is that Many Worlds serves the same purpose in many physicists' minds, but is equally inadequate, though from the opposite direction. Copenhagen says the observables are real but goes misty about unmeasured reality. Many Worlds says the wavefunction is real, but goes misty about exactly how it connects to observed reality. My most frustrating discussions on this topic are with physicists who are happy to be vague about what a "world" is. It's really not so different from Copenhagen positivism, except that where Copenhagen says "we only ever see measurements, what's the problem?", Many Worlds says "I say there's an independent reality, what else is left to do?". It is very rare for a Many Worlds theorist to seek an exact idea of what a world is, as you see Robin Hanson and maybe Eliezer Yudkowsky doing; in that regard, reading the Sequences on this site will give you an unrepresentative idea of the interpretation's status.

One of the characteristic features of quantum mechanics is entanglement. But both Copenhagen, and a Many Worlds which ontologically privileges the position basis (arrangements of particles in space), still have atomistic ontologies of the sort which will produce the "arbitrary dualism" I just described. Why not seek a quantum ontology in which there are complex natural unities - fundamental objects which aren't simple - in the form of what we would presently call entangled states? That was the motivation for the quantum monadology described in my other really unpopular post. :-) [Edit: Go there for a discussion of "the mind as tensor factor", mentioned at the start of this post.] Instead of saying that physical reality is a series of transitions from one arrangement of particles to the next, say it's a series of transitions from one set of entangled states to the next. Quantum mechanics does not tell us which basis, if any, is ontologically preferred. Reality as a series of transitions between overall wavefunctions which are partly factorized and partly still entangled is a possible ontology; hopefully readers who really are quantum physicists will get the gist of what I'm talking about.

I'm going to double back here and revisit the topic of how the world seems to look. Hopefully we agree, not just that there is an appearance of time flowing, but also an appearance of a self. Here I want to argue just for the bare minimum - that a moment's conscious experience consists of a set of things, events, situations... which are simultaneously "present to" or "in the awareness of" something - a conscious being - you. I'll argue for this because even this bare minimum is not acknowledged by existing materialist attempts to explain consciousness. I was recently directed to this brief talk about the idea that there's no "real you". We are given a picture of a graph whose nodes are memories, dispositions, etc., and we are told that the self is like that graph: nodes can be added, nodes can be removed, it's a purely relational composite without any persistent part. What's missing in that description is that bare minimum notion of a perceiving self. Conscious experience consists of a subject perceiving objects in certain aspects. Philosophers have discussed for centuries how best to characterize the details of this phenomenological ontology; I think the best was Edmund Husserl, and I expect his work to be extremely important in interpreting consciousness in terms of a new physical ontology. But if you can't even notice that there's an observer there, observing all those parts, then you won't get very far.

My favorite slogan for this is due to the other Jaynes, Julian Jaynes. I don't endorse his theory of consciousness at all; but while in a daydream he once said to himself, "Include the knower in the known". That sums it up perfectly. We know there is a "knower", an experiencing subject. We know this, just as well as we know that reality exists and that time passes. The adoption of ontologies in which these aspects of reality are regarded as unreal, as appearances as only, may be motivated by science, but it's false to the most basic facts there are, and one should show a little more imagination about what science will say when it's more advanced.

I think I've said almost all of this before. The high point of the argument is that we should look for a physical ontology in which a self exists and is a natural yet complex unity, rather than a vaguely bounded conglomerate of distinct information-processing events, because the latter leads to one of those unacceptably arbitrary dualisms. If we can find a physical ontology in which the conscious self can be identified directly with a class of object posited by the theory, we can even get away from dualism, because physical theories are mathematical and formal and make few commitments about the "inherent qualities" of things, just about their causal interactions. If we can find a physical object which is absolutely isomorphic to a conscious self, then we can turn the isomorphism into an identity, and the dualism goes away. We can't do that with a functionalist theory of consciousness, because it's a many-to-one mapping between physical and mental, not an isomorphism.

So, I've said it all before; what's new? What have I accomplished during these last sixteen months? Mostly, I learned a lot of physics. I did not originally intend to get into the details of particle physics - I thought I'd just study the ontology of, say, string theory, and then use that to think about the problem. But one thing led to another, and in particular I made progress by taking ideas that were slightly on the fringe, and trying to embed them within an orthodox framework. It was a great way to learn, and some of those fringe ideas may even turn out to be correct. It's now abundantly clear to me that I really could become a career physicist, working specifically on fundamental theory. I might even have to do that, it may be the best option for a day job. But what it means for the investigations detailed in this essay, is that I don't need to skip over any details of the fundamental physics. I'll be concerned with many-body interactions of biopolymer electrons in vivo, not particles in a collider, but an electron is still an electron, an elementary particle, and if I hope to identify the conscious state of the quantum self with certain special states from a many-electron Hilbert space, I should want to understand that Hilbert space in the deepest way available.

My only peer-reviewed publication, from many years ago, picked out pathways in the microtubule which, we speculated, might be suitable for mobile electrons. I had nothing to do with noticing those pathways; my contribution was the speculation about what sort of physical processes such pathways might underpin. Something I did notice, but never wrote about, was the unusual similarity (so I thought) between the microtubule's structure, and a model of quantum computation due to the topologist Michael Freedman: a hexagonal lattice of qubits, in which entanglement is protected against decoherence by being encoded in topological degrees of freedom. It seems clear that performing an ontological analysis of a topologically protected coherent quantum system, in the context of some comprehensive ontology ("interpretation") of quantum mechanics, is a good idea. I'm not claiming to know, by the way, that the microtubule is the locus of quantum consciousness; there are a number of possibilities; but the microtubule has been studied for many years now and there's a big literature of models... a few of which might even have biophysical plausibility.

As for the interpretation of quantum mechanics itself, these developments are highly technical, but revolutionary. A well-known, well-studied quantum field theory turns out to have a bizarre new nonlocal formulation in which collections of particles seem to be replaced by polytopes in twistor space. Methods pioneered via purely mathematical studies of this theory are already being used for real-world calculations in QCD (the theory of quarks and gluons), and I expect this new ontology of "reality as a complex of twistor polytopes" to carry across as well. I don't know which quantum interpretation will win the battle now, but this is new information, of utterly fundamental significance. It is precisely the sort of altered holistic viewpoint that I was groping towards when I spoke about quantum monads constituted by entanglement. So I think things are looking good, just on the pure physics side. The real job remains to show that there's such a thing as quantum neurobiology, and to connect it to something like Husserlian transcendental phenomenology of the self via the new quantum formalism.

It's when we reach a level of understanding like that, that we will truly be ready to tackle the relationship between consciousness and the new world of intelligent autonomous computation. I don't deny the enormous helpfulness of the computational perspective in understanding unconscious "thought" and information processing. And even conscious states are still states, so you can surely make a state-machine model of the causality of a conscious being. It's just that the reality of how consciousness, computation, and fundamental ontology are connected, is bound to be a whole lot deeper than just a stack of virtual machines in the brain. We will have to fight our way to a new perspective which subsumes and transcends the computational picture of reality as a set of causally coupled black-box state machines. It should still be possible to "port" most of the thinking about Friendly AI to this new ontology; but the differences, what's new, are liable to be crucial to success. Fortunately, it seems that new perspectives are still possible; we haven't reached Kantian cognitive closure, with no more ontological progress open to us. On the contrary, there are still lines of investigation that we've hardly begun to follow.

Would you like to give me feedback for "Troubles With CEV"

-9 diegocaleiro 24 December 2011 09:22PM

Hi, I'm going to publish soon here a study of CEV composed of two texts "On What is a Self" and "Troubles with CEV"  Would you like to give feedback prior to publication? 

If so, please provide your e-mail address and I will send you the text.

Merry Newtonmas

Would you like to give me feedback for "On What is a Self"

-8 diegocaleiro 24 December 2011 09:21PM

Hi, I'm going to publish soon here a study of CEV composed of two texts "On What is a Self" and "Troubles with CEV"  Would you like to give feedback prior to publication? 

If so, please provide your e-mail address and I will send you the text.

Merry Newtonmas

Value evolution

14 PhilGoetz 08 December 2011 11:47PM

Coherent extrapolated volition (CEV) asks what humans would want, if they knew more - if their values reached reflective equilibrium.  (I don't want to deal with the problems of whether there are "human values" today; for the moment I'll consider the more-plausible idea that a single human who lived forever could get smarter and closer to reflective equilibrium over time.)

This is appealing because it seems compatible with moral progress (see e.g., Muehlhauser & Helm, "The singularity and machine ethics", in press).  Morality has been getting better over time, right?  And that's because we're getting smarter, and closer to reflective equilibrium as we revise our values in light of our increased understanding, right?

This view makes three claims:

  1. Morality has improved over time.
  2. Morality has improved as a result of reflection.
  3. This improvement brings us closer to equilibrium over time.

There can be no evidence for the first claim, and the evidence is against the second two claims.

continue reading »

CEV-inspired models

7 Stuart_Armstrong 07 December 2011 06:35PM

I've been involved in a recent thread where discussion of coherent extrapolated volition came up. The general consensus was that CEV might - or might not - do certain things, probably, maybe, in certain situations, while ruling other things out, possibly, and that certain scenarios may or may not be the same in CEV, or it might be the other way round, it was too soon to tell.

Ok, that's an exaggeration. But any discussion of CEV is severely hampered by our lack of explicit models. Even bad, obviously incomplete models would be good, as long as we can get useful information as to what they would predict. Bad models can be improved; undefined models are intuition pumps for whatever people feel about them - I dislike CEV, and can construct a sequence of steps that takes my personal CEV to wanting the death of the universe, but that is no more credible than someone claiming that CEV will solve all problems and make lots of cute puppies.

So I'd like to ask for suggestions of models that formalise CEV to at least some extent. Then we can start improving them, and start making CEV concrete.

To start it off, here's my (simplistic) suggestion:

Volition

Use revealed preferences as the first ingredient for individual preferences. To generalise, use hypothetical revealed preferences: the AI calculates what the person would decide in particular hypothetical situations.

Extrapolation

Whenever revealed preferences are non-transitive or non-independent, use the person's stated meta-preferences to remove the issue. The AI thus calculates what the person would say if asked to resolve the transitivity or independence (for people who don't know about the importance of resolving them, the AI would present them with a set of transitive and independent preferences, derived from their revealed preferences, and have them choose among them). Then (wave your hands wildly and pretend you've never heard of non-standard reals, lexicographical preferences, refusal to choose and related issues) everyone's preferences are now expressible as utility functions.
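As a minimal sketch of the first repair step, here is how the AI might locate the non-transitivities that trigger a meta-preference query. The three-option cycle below is a made-up example, not data from anywhere:

```python
from itertools import permutations

# Revealed pairwise preferences: prefers[(x, y)] is True iff x is chosen
# over y. This cyclic example (a > b > c > a) is hypothetical.
prefers = {("a", "b"): True, ("b", "c"): True, ("c", "a"): True}

def strictly_prefers(x, y):
    """Look up the revealed choice between x and y, in either order."""
    if (x, y) in prefers:
        return prefers[(x, y)]
    return not prefers.get((y, x), False)

def transitivity_violations(options):
    """Return triples (x, y, z) with x > y and y > z but not x > z."""
    return [
        (x, y, z)
        for x, y, z in permutations(options, 3)
        if strictly_prefers(x, y) and strictly_prefers(y, z)
        and not strictly_prefers(x, z)
    ]

# Each violation is a point where the AI would consult the person's
# stated meta-preferences about which pairwise judgment to revise.
print(transitivity_violations(["a", "b", "c"]))
# [('a', 'b', 'c'), ('b', 'c', 'a'), ('c', 'a', 'b')]
```

Once every such triple has been resolved (by query or by offering the person a menu of repaired orderings), the cleaned-up preferences can be represented as a utility function.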

Coherence

Normalise each existing person's utility function and add them together to get your CEV. At the FHI we're looking for sensible ways of normalising, but one cheap and easy method (with surprisingly good properties) is to take the maximal possible expected utility (the expected utility that person would get if the AI did exactly what they wanted) as 1, and the minimal possible expected utility (if the AI was to work completely against them) as 0.
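The normalisation just described (best-case expected utility mapped to 1, worst-case to 0, then summed across people) can be sketched directly; the people and policy utilities below are invented for illustration:

```python
def normalise(utilities):
    """Affinely rescale one person's utilities over AI policies to [0, 1].

    The maximum corresponds to the AI doing exactly what they want (1),
    the minimum to the AI working completely against them (0).
    """
    lo, hi = min(utilities.values()), max(utilities.values())
    if hi == lo:  # indifferent over all policies
        return {policy: 0.0 for policy in utilities}
    return {p: (u - lo) / (hi - lo) for p, u in utilities.items()}

def cev_score(people):
    """Sum the normalised utility functions to get the collective score."""
    totals = {}
    for utilities in people:
        for policy, u in normalise(utilities).items():
            totals[policy] = totals.get(policy, 0.0) + u
    return totals

# Hypothetical example: two people scoring three candidate AI policies.
alice = {"A": 10.0, "B": 0.0, "C": 8.0}
bob = {"A": 1.0, "B": 3.0, "C": 2.0}
print(cev_score([alice, bob]))  # {'A': 1.0, 'B': 1.0, 'C': 1.3}
```

Note the property that makes this normalisation cheap and easy: each person's best and worst cases count equally, so neither Alice's wide range of raw utilities nor Bob's narrow one dominates, and the compromise policy C wins.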

Why study the cognitive science of concepts?

8 lukeprog 03 December 2011 11:56PM

I've written several posts on the cognitive science of concepts in prep for later posts in my metaethics sequence. Why?

I'm doing this because one common way of trying to solve the "friendliness content" problem in Friendly AI theory is to analyze (via thought experiment and via cognitive science) our concept of "good" or "ought" or "right" so that we can figure out what an FAI "ought" to do, or what it would be "good" for an FAI to do, or what it would be "right" for an FAI to do.

That's what Eliezer does in The Meaning of Right, that's what many other LWers do, and that's what most mainstream metaethicists do.

With my recent posts on the cognitive science of concepts, I'm trying to show that cognitive science presents a number of difficult problems for this approach.

Let me illustrate with a concrete example. Math prodigy Will Sawin once proposed to me (over the phone) that our concept of "ought" might be realized by way of something like a dedicated cognitive module. In an earlier comment, I tried to paraphrase his idea:

Imagine a species of artificial agents. These agents have a list of belief statements that relate physical phenomena to normative properties (let's call them 'moral primitives'):

  • 'Liking' reward signals in human brains are good.
  • Causing physical pain in human infants is forbidden.
  • etc.

These agents also have a list of belief statements about physical phenomena in general:

  • Sweet tastes on the tongue produces reward signals in human brains.
  • Cutting the fingers of infants produces physical pain in infants.
  • Things are made of atoms.
  • etc.

These agents also have an 'ought' function that includes a series of logical statements that relate normative concepts to each other, such as:

  • A thing can't be both permissible and forbidden.
  • A thing can't be both obligatory and non-obligatory.
  • etc.

Finally, these robots have actuators that are activated by a series of rules like:

  • When the agent observes an opportunity to perform an action that is 'obligatory', then it will take that action.
  • An agent will avoid any action that is labeled as 'forbidden.'

Some of these rules might include utility functions that encode ordinal or cardinal value for varying combinations of normative properties.

These agents can't see their own source code. The combination of the moral primitives and the ought function and the non-ought belief statements and a set of rules about behavior produces their action and their verbal statements about what ought to be done.

From their behavior and verbal ought statements these robots can infer to some degree how their ought function works, but they can't fully describe their ought function because they haven't run enough tests or the ought function is just too complicated or the problem is made worse because they also can't see their moral primitives.

The ought function doesn't reduce to physics because it's a set of purely logical statements. The 'meaning' of ought in this sense is determined by the role that the ought function plays in producing intentional behavior by the robots.

Of course, the robots could speak in ought language in stipulated ways, such that 'ought' means 'that which produces pleasure in human brains' or something like that, and this could be a useful way to communicate efficiently, but it wouldn't capture what the ought function is doing or how it is contributing to the production of behavior by these agents.
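For concreteness, the agent architecture described above might be caricatured in code like this. Every rule, label, and name below is an invented stand-in; the paraphrase gives no implementation details:

```python
# Toy rendering of the hypothetical agents: moral primitives, physical
# beliefs, a purely logical 'ought' function, and actuator rules.

MORAL_PRIMITIVES = {
    "cause_reward_signal": "good",
    "cause_infant_pain": "forbidden",
}

PHYSICAL_BELIEFS = {
    "give_sweets": "cause_reward_signal",      # sweet taste -> reward signal
    "cut_infant_fingers": "cause_infant_pain",
}

def ought_status(action):
    """The 'ought function': map an action to a normative label by
    composing physical beliefs with moral primitives. It is a pure
    lookup/transformation -- it contains no physics itself."""
    effect = PHYSICAL_BELIEFS.get(action)
    return MORAL_PRIMITIVES.get(effect, "permissible")

def act(available_actions):
    """Actuator rule: never take forbidden actions; prefer good ones."""
    allowed = [a for a in available_actions if ought_status(a) != "forbidden"]
    good = [a for a in allowed if ought_status(a) == "good"]
    return (good or allowed or [None])[0]

print(act(["cut_infant_fingers", "give_sweets"]))  # -> give_sweets
```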

What Will is saying is that it's convenient to use 'ought' language to refer to this ought function only, and not also to a combination of the ought function and statements about physics, as happens when we stipulatively use 'ought' to talk about 'that which produces well-being in conscious creatures' (for example).

I'm saying that's fine, but it can also be convenient (and intuitive) for people to use 'ought' language in ways that reduce to logical-physical statements, and not only in ways that express a logical function that contains only transformations between normative properties. So we don't have substantive disagreement on this point; we merely have different intuitions about the pragmatic value of particular uses for 'ought' language.

We also drew up a simplified model of the production of human action in which there is a cognitive module that processes the 'ought' function (made of purely logical statements like in the robots' ought function), a cognitive module that processes habits, a cognitive module that processes reflexes, and so on. Each of these produces an output, and another module runs arg(max) on these action options to determine which actions 'wins' and actually occurs.

Of course, the human 'ought' function is probably spread across multiple modules, as is the 'habit' function.
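The simplified production-of-action model could be sketched as follows (the module names and scores are made up; the post specifies no numbers):

```python
# Toy version of the multi-module action-selection model: each module
# proposes scored action options, and a final arg(max) picks the winner.
modules = {
    "ought":  {"donate": 0.9, "snack": 0.1},
    "habit":  {"snack": 0.8},
    "reflex": {"flinch": 0.2},
}

def select_action(modules):
    """arg(max) over the pooled option scores from every module."""
    scores = {}
    for outputs in modules.values():
        for action, s in outputs.items():
            scores[action] = max(scores.get(action, 0.0), s)
    return max(scores, key=scores.get)

print(select_action(modules))  # -> donate
```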

Will likes to think of the 'meaning' of 'ought' as being captured by the algorithm of this 'ought' function in the human brain. This ought function doesn't contain physical beliefs, but rather processes primitive normative/moral beliefs (from outside the ought function) and outputs particular normative/moral judgments, which contribute to the production of human behavior (including spoken moral judgments). In this sense, 'ought' in Will's sense of the term doesn't reduce to physical facts, but to a logical function...

Will also thinks that the 'ought' function (in his sense) inside human brains is probably very similar between humans - ones that aren't brain damaged or neurologically deranged... [And] if the 'ought' function is the same in all healthy humans, then there needn't be a separate 'meaning' of ought (in Will's sense) for each speaker, but instead there could be a shared 'meaning' of ought (in Will's sense) that is captured by the algorithms of the 'ought' cognitive module that is shared by healthy human brains.

The reason I'm investigating the cognitive science of concepts is because I think it shows that the claims about the human brain in these last two paragraphs are probably false, and so are many other claims about human brains that are implicit in certain varieties of the 'conceptual analysis' approach to value theory.

Studying Psychology - Which path should I take to best help our cause? Suggestions please.

4 Friendly-HI 23 November 2011 07:52PM

If you solve the problem of human-friendly self-improving AI, you have indirectly solved every problem. After spending a decent amount of time on LW, I have been convinced of this premise and now I would like to devote my life to that cause.

 

Currently I'm living in Germany and I'm studying psychology in my first semester. The university I'm studying at has a great reputation (even internationally, if I can believe the rankings) for the quality of its scientific psychology research, and it ranks about second or third out of roughly 55 German universities on various psychology-research criteria. Five semesters of statistics in my Bachelor of Science might also hint at that.

I want to finish my Bachelor of Science and then move on to my Master, so in about 5 years I might hit my "phase of actual productivity" in the working field. I'm flirting with cognitive neuroscience, but haven't made my decision yet - however, I am pretty sure that I want to move towards research and a scientific career rather than one in a therapeutic field.

Before discovering lesswrong my most dominant personal interest in psychology has been in the field of "positive psychology" or plainly speaking the "what makes humans happy" field. This interest hasn't really changed through the discovery of LW, as much as it has evolved into: "how can we distill what makes human life worthwhile and put it into terms a machine could execute for our benefit"?

 

As the title suggests, I'm writing all this because I want some creative input from you in order to expand my sense of possibilities concerning how I can help the development of friendly AI from the field of psychology most effectively.

 

To give you a better idea of what might fit me, a bit more background-info about myself and my abilities seems in order:

I like talking and writing a lot; mathematically I am a loser (whether due to early disgust or incompetence I can't really tell). I value and enjoy human contact and have steadily moved from being an introvert towards being an extrovert through several cognitive developments I can only speculate on. Nowadays I would probably easily rank in the middle of any extroversion scale. My IQ seems to be around 134 if one can trust the "International High IQ Society" (www.highiqsociety.org), but as mentioned my abilities probably lie more in the linguistic and to some extent analytic sphere than the mathematical. I understand Bayes' Theorem but haven't read the quantum mechanics sequence, and many "higher" concepts here are still above my current level of comprehension. Although, to be fair, I haven't tried all that hard yet.

I have programmed some primitive HTML and CSS once and didn't really like it. From that experience and my mathematical inability I take away that programming wouldn't be the way I could contribute most efficiently towards friendly AI research. It is not one of my strengths, or at least it would take a lot of time to develop one, which would probably be better spent elsewhere. Also, I quite surely wouldn't enjoy it as much as work in the psychological realm with humans.

My English is almost indistinguishable from that of a native speaker and I largely lack that (rightfully) despised and annoying German accent, so I could definitely see myself giving competent talks in English.

Like many of you I have serious problems with akrasia (regardless of whether that's a rationalist phenomenon or whether we are just more aware of it and tend to do types of work that tempt it more readily). Before I learned how to effectively combat it (thank you, Piers Steel!), I had plenty of motivation to get rid of it and sunk insane efforts into overcoming it, although it was largely an unsuccessful undertaking due to half-assed pop-science and the lack of real insight into what causes procrastination and how it actually functions. Now that I know how to fix procrastination (or rather, now that I know it can't be fixed so much as managed, in a similar fashion to a drug addiction), my motivation to overcome it is almost gone and I feel myself slacking. Also, my high certainty that there is no such thing as "free will" may have played a serious part in my procrastination habits (interestingly, I recall at least two papers showing this correlation). In a nutshell: procrastination is a problem I need to address, since it is definitely the Achilles' heel of my performance and is absolutely crippling my potential. I probably rank middle-high on the impulsiveness (and thus also the procrastination) scale.

That should be an adequate characterization of myself for now.

 

I am absolutely open to suggestions outside the neuroscience of the "what makes humans happy, and how do I distill those goals and feelings into something a machine could work with" field, but currently I am definitely flirting with that idea, even though I have absolutely no clue how this area of research could be sufficiently financed a decade from now, or how it could produce findings precise enough to benefit the creation of FAI. Then again, maybe that's just a lack of imagination.

Trying to help set up and evolve a rationalist community in Germany would also be a decent task, but compared to specific research that actually directly aids our goals... I somehow feel it is less than what I could reasonably achieve if I really set my mind to it.

 

So tell me, where does a German psychologist go nowadays to achieve the biggest possible positive impact in the field of friendly AI?

Draft of Muehlhauser & Helm, 'The Singularity and Machine Ethics'

8 lukeprog 18 November 2011 07:00AM

Louie and I are sharing a draft of our chapter submission to The Singularity Hypothesis for feedback:

The Singularity and Machine Ethics

Thanks in advance.

Also, thanks to Kevin for suggesting in February that I submit an abstract to the editors. Seems like a lifetime ago, now.

Edit: As of 3/31/2012, the link above now points to a preprint.

Wanted: backup plans for "seed AI turns out to be easy"

18 Wei_Dai 28 September 2011 09:54PM

Earlier, I argued that instead of working on FAI, a better strategy is to pursue an upload or IA based Singularity. In response to this, some argue that we still need to work on FAI/CEV, because what if it turns out that seed AI is much easier than brain emulation or intelligence amplification, and we can't stop or sufficiently delay others from building them? If we had a solution to CEV, we could rush to build a seed AI ourselves, or convince others to make use of the ideas.

But CEV seems a terrible backup plan for this contingency, since it involves lots of hard philosophical and implementation problems and therefore is likely to arrive too late if seed AI turns out to be easy. (Searching for whether Eliezer or someone else addressed the issue of implementation problems before, I found just a couple of sentences, in the original CEV document: "The task of construing a satisfactory initial dynamic is not so impossible as it seems. The satisfactory initial dynamic can be coded and tinkered with over years, and may improve itself in obvious and straightforward ways before taking on the task of rewriting itself entirely." Which does not make any sense to me—why can't every other AGI builder make the same argument, that their code can be "tinkered with" over many years, and therefore is safe? Why aren't we risking the "initial dynamic" FOOMing while it's being tinkered with? Actually, it seems to me that an AI cannot begin to extrapolate anyone's volition until it's already more powerful than a human, so I have no idea how the tinkering is supposed to work at all.)

So, granting that "seed AI is much easier than brain emulation or intelligence amplification" is a very real possibility, I think we need better backup plans. This post is a bit similar to The Friendly AI Game, in that I'm asking for a utility function for a seed AI, but the goal here is not necessarily to build an FAI directly, but to somehow make an eventual positive Singularity more likely, while keeping the utility function simple enough that there's a good chance it can be specified and implemented correctly within a relatively short amount of time. Also, the top entry in that post is an AI that can answer formally specified questions with minimal side effects, apparently with the idea that we can use such an AI to advance many kinds of science and technology. But I agree with Nesov—such an AI doesn't help, if the goal is an eventual positive Singularity:

We can do lots of useful things, sure (this is not a point where we disagree), but they don't add up towards "saving the world". These are just short-term benefits. Technological progress makes it easier to screw stuff up irrecoverably, advanced tech is the enemy. One shouldn't generally advance the tech if distant end-of-the-world is considered important as compared to immediate benefits [...]

To give an idea of the kind of "backup plan" I have in mind, one idea I've been playing with is to have the seed AI make multiple simulations of the entire Earth (i.e., with different "random seeds"), for several years or decades into the future, and have a team of humans pick the best outcome to be released into the real world. (I say "best outcome" but many of the outcomes will probably be incomprehensible or dangerous to directly observe, so they should mostly judge the processes that lead to the outcomes instead of the outcomes themselves.) This is still quite complex if you think about how to turn this "wish" into a utility function, and lots of things could still go wrong, but to me it seems at least the kind of problem that a team of human researchers/programmers can potentially solve within the relevant time frame.

Do others have any ideas in this vein?

Guardian Angels: Discrete Extrapolated Volitions

0 lessdazed 25 September 2011 10:51PM

Questions for discussion, with my tentative answers. Assuming I am wrong about some things, there is something interesting to consider. This is inspired by the recent SL4-type and CEV-centric topics in the discussion section.

 

Questions:

I

  1. Is it easier to calculate the extrapolated volition of an individual or a group? 
  2. If it is easier to do for an individual, is it because it is strictly simpler to do it, in that calculating humanity's CEV involves making at least every calculation that would be made for calculating the extrapolated volition of one individual? 
  3. How definitively can these questions be answered without knowing exactly how to calculate CEV?

II

  1. Is it possible to create multiple AIs such that one AI does not prevent others from being created, such as by releasing equally powerful AIs simultaneously? 
  2. Is it possible to box AIs such that they reliably never escape before a certain, if short, period of time, such as by giving them a low-cost way out with a calculable minimum and maximum time to exploit that route? 
  3. Is it likely there would be a cooperative equilibrium among unmerged AIs?

III

  1. Assuming the possibility of all of the following: what would happen if every person had a superintelligent AI with a utility function of that person's idealized extrapolated utility function?
  2. How would that compare to a scenario with a single AI embodying a successful calculation of CEV?
  3. What would be different if a person or some few people did not have a superintelligence valuing what they would value, and only many people had their own AI?

 

My Answers:

I

  1. It depends on the error level tolerated. If only very low error is tolerated, it is easier to do it for a group.
  2. N/A
  3. Not sure.

II

  1. Probably not.
  2. Maybe, probably not, but impossible to know with high confidence.
  3. Probably not. Throughout history, offense has often been a step ahead of defense, which often catches up to it. I think this is not particular to evolutionary biology or the technologies that happen to have been developed. It seems easier to break complicated things with many moving parts than to build and defend them. Also, specific technologies people plausibly speculate may exist are more powerful offensively than defensively. I would expect them to merge, probably peacefully.

III

  1. Hard to say, as that would be trying to predict the actions of more intelligent beings in a dynamic environment.
  2. It might be better, or worse. The chance of it being similar is notably high.
  3. Not sure.

Cognitive Neuroscience, Arrow's Impossibility Theorem, and Coherent Extrapolated Volition

10 lukeprog 25 September 2011 11:15AM

Suppose we want to use the convergence of humanity's preferences as the utility function of a seed AI that is about to determine the future of its light cone.

We figured out how to get an AI to extract preferences from human behavior and brain activity. The AI figured out how to extrapolate those values. But my values and your values and Sarah Palin's values aren't fully converging in the simulation running the extrapolation algorithm. Our simulated beliefs are converging because on the path to reflective equilibrium our partially simulated selves have become true Bayesians and Aumann's Agreement Theorem holds. But our preferences aren't converging quite so well.

What to do? We'd like the final utility function in the FOOMed AI to adhere to some common-sense criteria:

  1. Non-dictatorship: No single person's preferences should dictate what the AI does. Its utility function must take multiple people's (extrapolated) preferences into account.
  2. Determinism: Given the same choices, and the same utility function, the AI should always make the same decisions.
  3. Pareto efficiency: If every (extrapolated) person prefers action A to action B, the AI should prefer A to B.
  4. Independence of irrelevant alternatives: If we — a group of extrapolated preference-sets — prefer A to B, and a new option C is introduced, then we should still prefer A to B regardless of what we think about C.

Now, Arrow's impossibility theorem says that we can only get the FOOMed AI's utility function to adhere to these criteria if the extrapolated preferences of each partially simulated agent are related to each other cardinally ("A is 2.3x better than B!") instead of ordinally ("A is better than B, and that's all I can say").

Now, if you're an old-school ordinalist about preferences, you might be worried. Ever since Vilfredo Pareto pointed out that cardinal models of a person's preferences go far beyond our behavioral data and that, as far as we can tell, utility has "no natural units," some economists have tended to assume that, in our models of human preferences, preferences must be represented ordinally and not cardinally.

But if you're keeping up with the latest cognitive neuroscience, you might not be quite so worried. It turns out that preferences are encoded cardinally after all, and they do have a natural unit: action potentials per second. With cardinally encoded preferences, we can develop a utility function that represents our preferences and adheres to the common-sense criteria listed above.
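To see why the cardinal encoding matters here: with interpersonally comparable cardinal utilities, plain summation yields a group preference satisfying all four criteria above. A toy sketch (all names and numbers are invented for illustration):

```python
# Summing cardinal utilities: the pairwise group verdict depends only on
# the two options compared, so adding a third option cannot flip it
# (independence of irrelevant alternatives), and it is deterministic,
# non-dictatorial, and Pareto-efficient.

def group_prefers(profiles, a, b):
    """True if the summed cardinal utilities rank option a above b."""
    return sum(p[a] for p in profiles) > sum(p[b] for p in profiles)

me    = {"A": 3.0, "B": 1.0, "C": 2.0}
you   = {"A": 2.0, "B": 1.5, "C": 2.5}
palin = {"A": 0.5, "B": 2.0, "C": 2.2}
voters = [me, you, palin]

# A beats B overall (5.5 vs 4.5) even though one voter dissents,
# so no single voter dictates.
print(group_prefers(voters, "A", "B"))  # -> True
# Pareto check: every voter rates C above B, so the group must too.
print(group_prefers(voters, "C", "B"))  # -> True
```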

Whaddya know? The last decade of cognitive neuroscience has produced a somewhat interesting result concerning the plausibility of CEV.

My true rejection

-16 dripgrind 14 July 2011 10:04PM

Here's why I'm not going to give money to the SIAI any time soon.

Let's suppose that Friendly AI is possible. In other words, it's possible that a small subset of humans can make a superhuman AI which uses something like Coherent Extrapolated Volition to increase the happiness of humans in general (without resorting to skeevy hacks like releasing an orgasm virus).

Now, the extrapolated volition of all humans is probably a tricky thing to determine. I don't want to get sidetracked into writing about my relationship history, but sometimes I feel like it's hard to extrapolate the volition of one human.

If it's possible to make a Friendly superhuman AI that optimises CEV, then it's surely way easier to make an unFriendly superhuman AI that optimises a much simpler variable, like the share price of IBM.

Long before a Friendly AI is developed, some research team is going to be in a position to deploy an unFriendly AI that tries to maximise the personal wealth of the researchers, or the share price of the corporation that employs them, or pursues some other goal that the rest of humanity might not like.

And who's going to stop that happening? If the executives of Corporation X are in a position to unleash an AI with a monomaniacal dedication to maximising the Corp's shareholder value, it's probably illegal for them not to do just that.

If you genuinely believe that superhuman AI is possible, it seems to me that, as well as sponsoring efforts to design Friendly AI, you need to (a) lobby against AI research by any groups who aren't 100% committed to Friendly AI (pay off reactionary politicians so AI regulation becomes a campaign issue, etc.) (b) assassinate any researchers who look like they're on track to deploying an unFriendly AI, then destroy their labs and backups.

But SIAI seems to be fixated on design at the expense of the other, equally important priorities. I'm not saying I expect SIAI to pursue illegal goals openly, but there is such a thing as a false-flag operation.

While Michelle Bachmann isn't talking about how AI research is a threat to the US constitution, and Ben Goertzel remains free and alive, I can't take the SIAI seriously.

Topics to discuss CEV

6 diegocaleiro 06 July 2011 02:19PM

CEV is our current proposal for what ought to be done once you have AGI flourishing around. Many people have had bad feelings about this. While at the Singularity Institute, I decided to write a text discussing CEV: what it is for, how likely it is to achieve its goals, and how much fine-grained detail needs to be added before it is an actual theory.

Here you find a draft of the topics I'll be discussing in that text. The purpose of showing this is that you take a look at the topics, spot something that is missing, and write a comment saying: "Hey, you forgot this problem, which, summarised, is bla bla bla bla" and also "be sure to mention paper X when discussing topic 2.a.i,"

Please take a few minutes to help me add better discussions.

Do not worry about pointing me to previous Less Wrong posts about it; I have them all.

 

  1. Summary of CEV
  2. Troubles with CEV
    1. Troubles with the overall suggestion
      1. Concepts on which CEV relies that may not be well shaped enough
    2. Troubles with coherence
      1. The volitions of the same person in two different emotional states might be different - it’s as if they are two different people. Is there any good criterion by which a person’s “ultimate” volition may be determined? If not, is it certain that even the volitions of one person’s multiple selves will converge?
      2. But when you start dissecting most human goals and preferences, you find they contain deeper layers of belief and expectation. If you keep stripping those away, you eventually reach raw biological drives which are not a human belief or expectation. (Though even they are beliefs and expectations of evolution, but let’s ignore that for the moment.)
      3. Once you strip away human beliefs and expectations, nothing remains but biological drives, which even the animals have. Yes, an animal, by virtue of its biological drives and ability to act, is more than a predicting rock, but that doesn’t address the issue at hand.
    3. Troubles with extrapolation
      1. Are small accretions of intelligence analogous to small accretions of time in terms of identity? Is extrapolated person X still a reasonable political representative of person X?
    4. Problems with the concept of Volition
      1. Blue eliminating robots (Yvain post)
      2. Error minimizer
      3. Goals x Volitions
    5. Problems of implementation
      1. Undesirable solutions to hardware shortage or time shortage (e.g. the machine decides to only cohere volitions, but not extrapolate them)
      2. Sample bias
      3. Solving apparent non-coherence by meaning shift
  3. Praise of CEV
    1. Bringing the issue to practical level
    2. Ethical strength of egalitarianism

 

  1. Alternatives to CEV
    1. (                     )
    2. (                     )
    3. Normative approach
    4. Extrapolation of written desires

 

  1. Solvability of remaining problems
    1. Historical perspectives on problems
    2. Likelihood of solving problems before 2050
    3. How humans have dealt with unsolvable problems in the past

 

SMBC: dystopian objective function

8 Jonathan_Graehl 24 June 2011 04:03AM

Cartoon: http://www.smbc-comics.com/index.php?db=comics&id=2286 evokes the horror you should feel imagining your values being modified arbitrarily, although in the comic there's slippery-slope consent at each step.

This reminds me of a sci-fi novel where the participants are playing a game where points are awarded for "traditional" early 20th century behavior (the original records are lost, and some virus has infected the teleportation gates). Unfortunately I can't remember the author or name; it was pretty decent. Anyone recall it?

Beginning resources for CEV research

14 lukeprog 07 May 2011 05:28AM

I've been working on metaethics/CEV research for a couple months now (publishing mostly prerequisite material) and figured I'd share some of the sources I've been using.

 

CEV sources.

 

Motivation. CEV extrapolates human motivations/desires/values/volition. As such, it will help to understand how human motivation works.


Extrapolation. Is it plausible to think that some kind of extrapolation of human motivations will converge on a single motivational set? How would extrapolation work, exactly?

  • Reflective equilibrium. Yudkowsky's proposed extrapolation works analogously to what philosophers call 'reflective equilibrium.' The most thorough work here is the 1996 book by Daniels, and there have been lots of papers, but this genre is only barely relevant for CEV. Basically, an entirely new literature on volition-extrapolation algorithms needs to be created.
  • Full-information accounts of value and ideal observer theories. This is what philosophers call theories of value that talk about 'what we would want if we were fully informed, etc.' or 'what a perfectly informed agent would want' like CEV does. There's some literature on this, but it's only marginally relevant to CEV. Again, an entirely new literature needs to be written to solve this problem.


Metaethics. Should we use CEV, or something else? What does 'should' mean?


Building the utility function. How can a seed AI be built? How can it read what to value?


Preserving the utility function. How can the motivations we put into a superintelligence be preserved over time and self-modification?


Reflective decision theory. Current decision theories tell us little about software agents that make decisions to modify their own decision-making mechanisms.


Additional suggestions welcome. I'll try to keep this page up-to-date.

 

The UFAI among us

1 PhilGoetz 08 February 2011 11:29PM

Completely artificial intelligence is hard.  But we've already got humans, and they're pretty smart - at least smart enough to serve some useful functions.  So I was thinking about designs that would use humans as components - like Amazon's Mechanical Turk, but less homogeneous.  Architectures that would distribute parts of tasks among different people.

Would you be less afraid of an AI like that?  Would it be any less likely to develop its own values, and goals that diverged widely from the goals of its constituent people?

Because you probably already are part of such an AI.  We call them corporations.

Corporations today are not very good AI architectures - they're good at passing information down a hierarchy, but poor at passing it up, and even worse at adding up small correlations in the evaluations of their agents.  In that way they resemble AI from the 1970s.  But they may provide insight into the behavior of AIs.  The values of their human components can't be changed arbitrarily, or even aligned with the values of the company, which gives them a large set of problems that AIs may not have.  But despite being very different from humans in this important way, they end up acting similar to us.

Corporations develop values similar to human values.  They value loyalty, alliances, status, resources, independence, and power.  They compete with other corporations, and face the same problems people do in establishing trust, making and breaking alliances, weighing the present against the future, and game-theoretic strategies.  They even went through stages of social development similar to those of people, starting out as cutthroat competitors, and developing different social structures for cooperation (oligarchy/guild, feudalism/keiretsu, voters/stockholders, criminal law/contract law).  This despite having different physicality and different needs.

It suggests to me that human values don't depend on the hardware, and are not a matter of historical accident.  They are a predictable, repeatable response to a competitive environment and a particular level of intelligence.

As corporations are larger than us, with more intellectual capacity than a person, and more complex laws governing their behavior, it should follow that the ethics developed to govern corporations are more complex than the ethics that govern human interactions, and a good guide for the initial trajectory of values that (other) AIs will have.  But it should also follow that these ethics are too complex for us to perceive.

Criticisms of CEV (request for links)

7 Kevin 16 November 2010 04:02AM

I know Wei Dai has criticized CEV as a construct, I believe offering the alternative of rigorously specifying volition *before* making an AI. I couldn't find these posts/comments via a search; can anyone link me? Thanks.

There may be related top-level posts, but there is a good chance that what I am specifically thinking of was a comment-level conversation between Wei Dai and Vladimir Nesov.

Also feel free to use this thread to criticize CEV and to talk about other possible systems of volition.

Waser's 3 Goals of Morality

-12 mwaser 02 November 2010 07:12PM

In the spirit of Asimov’s 3 Laws of Robotics:

  1. You should not be selfish
  2. You should not be short-sighted or over-optimize
  3. You should maximize the progress towards and fulfillment of all conscious and willed goals, both in terms of numbers and diversity equally, both yours and those of others equally

It is my contention that Yudkowsky’s CEV converges to the following 3 points:

  1. I want what I want
  2. I recognize my obligatorily gregarious nature; realize that ethics and improving the community is the community’s most rational path towards maximizing the progress towards and fulfillment of everyone’s goals; and realize that to be rational and effective the community should punish anyone who is not being ethical or improving the community (even if the punishment is “merely” withholding help and cooperation)
  3. I shall, therefore, be ethical and improve the community in order to obtain assistance, prevent interference, and most effectively achieve my goals

I further contend that, if this CEV is translated to the 3 Goals above and implemented in a Yudkowskian Benevolent Goal Architecture (BGA), that the result would be a Friendly AI.

It should be noted that evolution and history say that cooperation and ethics are stable attractors while submitting to slavery (when you don’t have to) is not.  This formulation expands Singer’s Circles of Morality as far as they’ll go and tries to eliminate irrational Us-Them distinctions based on anything other than optimizing goals for everyone — the same direction that humanity seems headed in and exactly where current SIAI proposals come up short.

Once again, cross-posted here on my blog (unlike my last article, I have no idea whether this will be karma'd out of existence or not ;-)
