eli_sennesh comments on Why I haven't signed up for cryonics - Less Wrong
Good news: this one's remarkably unlikely, since almost all existing Friendly AI approaches are indirect ("look at some samples of real humans and optimize for the output of some formally-specified epistemic procedure for determining their values") rather than direct ("choirs of angels sing to the Throne of God").
Not sure how that helps. Would you prefer scenario 2b, with "[...] because its formally-specified epistemic procedure for determining the values of its samples of real humans results in a concept of value-maximization that has a hideously unfortunate implication"?
You're saying that enacting the endorsed values of real people taken at reflective equilibrium has an unfortunate implication? To whom? Surely not to the people whose values you're enacting. Which does leave population-ethics a biiiiig open question for FAI development, but it at least means the people whose values you feed to the Seed AI get what they want.
No, I'm saying that (in scenario 2b) enacting the result of a formally-specified epistemic procedure has an unfortunate implication. Unfortunate to everyone, including the people who were used as the sample against which that procedure ran.
Why? The whole point of a formally-specified epistemic procedure is that, with respect to the people taken as samples, it is right by definition.
Wonderful. Then the unfortunate implication will be right, by definition.
So what?
I'm not sure what the communication failure here is. The whole point is to construct algorithms that extrapolate the value-set of the input people. By doing so, you thus extrapolate a moral code that the input people can definitely endorse, hence the phrase "right by definition". So where is the unfortunate implication coming from?
A third-party guess: It's coming from a flaw in the formal specification of the epistemic procedure. That it is formally specified is not a guarantee that it is the specification we would want. It could rest on a faulty assumption, or take a step that appears justified but in actuality is slightly wrong.
Basically, formal specification is a good idea, but not a get-out-of-trouble-free card.
Replying elsewhere. Suffice it to say, nobody would call it a "get out of trouble free" card. More like a get-out-of-trouble-after-decades-of-prerequisite-hard-work card, which is precisely why various forms of the hard work are being done now, decades before any kind of AGI is invented, let alone foom-flavored ultra-AI.
I have no idea if this is the communication failure, but I certainly would agree with this comment.
Thanks!
I'm not sure either. Let me back up a little... from my perspective, the exchange looks something like this:
ialdabaoth: what if FAI is incorrectly implemented and fucks things up?
eli_sennesh: that won't happen, because the way we produce FAI will involve an algorithm that looks at human brains and reverse-engineers their values, which then get implemented.
theOtherDave: just because the target specification is being produced by an algorithm doesn't mean its results won't fuck things up
e_s: yes it does, because the algorithm is a formally-specified epistemic procedure, which means its results are right by definition.
tOD: wtf?
So perhaps the problem is that I simply don't understand why it is that a formally-specified epistemic procedure running on my brain to extract the target specification for a powerful optimization process should be guaranteed not to fuck things up.
Ah, ok. I'm going to have to double-reply here, and my answer should be taken as a personal perspective. This is actually an issue I've been thinking about and discussing with an FHI guy; I'd welcome any thoughts others might have.
Basically, we want to extract a coherent set of terminal goals from human beings. So far, the approach to this problem is from two angles:
1) Neuroscience/neuroethics/neuroeconomics: look at how the human brain actually makes choices, and attempt to describe where and how in the brain terminal values are rooted. See: Paul Christiano's "indirect normativity" write-up.
2) Pure ethics: there are lots of impulses in the brain that feed into choice, so instead of just picking one of those, let's sit down and do the moral philosophy on how to "think out" our terminal values. See: CEV, "reflective equilibrium", "what we want to want", concepts like that.
My personal opinion is that we also need to add:
3) Population ethics: given the ability to extract values from one human, we now need to sample lots of humans and come up with an ethically sound way of combining the resulting goal functions ("where our wishes cohere rather than interfere", blah blah blah) into an optimization metric that works for everyone, even if it's not quite maximally perfect for every single individual. That is, Shlomo might prefer that everyone be Jewish, Abed might prefer that everyone be Muslim, and John likes being secular just fine; the combined and extrapolated goal function performs mandatory religious conversions on no one.
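To make the "cohere rather than interfere" idea concrete, here's a toy sketch (entirely my own names and simplifications, not anyone's actual proposal): represent each person's extracted values as weights over outcome features, keep the features no one actively opposes, and drop the ones where preferences conflict in sign.

```python
# Toy sketch of combining extracted goal functions. Each person's values
# are a dict mapping an outcome feature to a weight (positive = wants it,
# negative = opposes it). Features where nonzero weights all point the
# same way "cohere" and are averaged in; features with conflicting signs
# "interfere" and are dropped from the combined metric.

def combine_values(value_sets):
    """Aggregate per-person weight dicts into a single optimization metric."""
    features = set().union(*value_sets)
    combined = {}
    for f in features:
        weights = [v.get(f, 0.0) for v in value_sets]
        signs = {w > 0 for w in weights if w != 0}
        if len(signs) <= 1:  # wishes cohere: no one actively opposes f
            combined[f] = sum(weights) / len(weights)
    return combined
```

On this toy model, Shlomo's "mandatory Judaism" feature interferes with Abed's opposition and gets dropped, while a universally shared feature like "less suffering" survives into the combined metric. A real aggregation scheme would of course have to be far subtler than sign-matching.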
Now! Here's where we get to the part where we avoid fucking things up! At least in my opinion, and as a proposal I've put forth myself: if we really have an accurate model of human morality, then we should be able to implement the value-extraction process on some experimental subjects, predictively generate a course of action through our model behind closed doors, and then run an experiment on serious moral decision-making. Afterwards we should find that (without having seen the generated proposals beforehand) our subjects' real decisions either match the predicted ones, or our subjects endorse the predicted ones.
That is, ideally, we should be able to test our notion of how to epistemically describe morality before we ever make that epistemic procedure or its outputs the goal metric for a Really Powerful Optimization Process. Short of things like bugs in the code or cosmic rays, we would thus (assuming we have time to carry out all the research before $YOUR_GEOPOLITICAL_ENEMY unleashes a paper-clipper For the Evulz) have a good idea what's going to happen before we take a serious risk.
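In code, the validation experiment I have in mind might look something like this (a minimal sketch under my own assumed interfaces — `predict`, `decide`, and `endorses` are hypothetical names, not an actual API):

```python
# Hypothetical sketch of the proposed validation experiment: generate the
# model's predicted moral decisions in advance ("behind closed doors"),
# then compare them against what the subjects actually decide, counting a
# mismatch as a success if the subject endorses the prediction on reflection.

def validate_model(model, subjects, dilemmas, agreement_threshold=0.95):
    """Return True iff sealed predictions match or are endorsed often enough."""
    # Phase 1: predictions are generated before any subject sees them.
    sealed = {(s.name, d): model.predict(s, d)
              for s in subjects for d in dilemmas}
    # Phase 2: subjects decide independently; then compare.
    hits = 0
    for s in subjects:
        for d in dilemmas:
            predicted = sealed[(s.name, d)]
            if s.decide(d) == predicted or s.endorses(predicted, d):
                hits += 1
    return hits / (len(subjects) * len(dilemmas)) >= agreement_threshold
```

The point of the sealed-prediction structure is falsifiability: the model commits to its output before the subjects act, so agreement can't be retrofitted.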
So, if I've understood your proposal, we could summarize it as:
Step 1: we run the value-extractor (seed AI, whatever) on group G and get V.
Step 2: we run a simulation of using V as the target for our optimizer.
Step 3: we show the detailed log of that simulation to G, and/or we ask G various questions about their preferences and see whether their answers match the simulation.
Step 4: based on the results of step 3, we decide whether to actually run our optimizer on V.
Have I basically understood you?
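As I understand it, the four steps would chain together roughly like this (a sketch with made-up interface names, just to pin down the shape of the proposal):

```python
# Hypothetical sketch of the four-step procedure: extract V from group G,
# simulate optimizing for V, show G the simulation log, then decide whether
# to actually deploy the optimizer on V.

def run_pipeline(extractor, simulator, group, endorsement_threshold=0.9):
    values = extractor.extract(group)          # Step 1: run value-extractor on G, get V
    log = simulator.simulate(values)           # Step 2: simulate an optimizer targeting V
    verdicts = [m.endorses(log) for m in group]  # Step 3: show G the detailed log
    # Step 4: deploy only if enough of G endorses what the simulation shows
    deploy = sum(verdicts) / len(verdicts) >= endorsement_threshold
    return deploy, values
```

Note that Step 4 is doing all the philosophical work here: the threshold and the meaning of "endorses" are exactly the contested parts.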
If so, I have two points, one simple and boring, one more complicated and interesting.
The simple one is that this process depends critically on our simulation mechanism being reliable. If there's a design flaw in the simulator such that the simulation is wonderful but the actual results of running our optimizer are awful, the result of this process is that we endorse a wonderful world, create a completely different awful world, and say "oops."
So I still don't see how this avoids the possibility of unfortunate implications. More generally, I don't think anything we can do will avoid that possibility. We simply have to accept that we might get it wrong, and do it anyway, because the probability of disaster if we don't do it is even higher.
The more interesting one... well, let's assume that we do steps 1-3.
Step 4 is where I get lost. I've been stuck on this point for years.
I see step 4 going like this:
* Some members of G (G1) say "Hey, awesome, sign me up!"
* Other members of G (G2) say "I guess? I mean, I kind of thought there would be more $currently_held_sacred_value, but if your computer says this is what I actually want, well, who am I to argue with a computer?"
* G3 says "You know, that's not bad, but what would make it even better is if the bikeshed were painted yellow."
* G4 says "Wait, what? You're telling me that my values, extrapolated and integrated with everyone else's and implemented in the actual world, look like that?!? But... but... that's awful! I mean, that world doesn't have any $currently_held_sacred_value! No, I can't accept that."
* G5 says "Yeah, whatever. When's lunch?" ...and so on.
Then we stare at all that and pull out our hair. Is that a successful test? Who knows? What were we expecting, anyway... that all of G would be in G1? Why would we expect that? Even if V is perfectly correct... why would we expect mere humans to reliably endorse it?
Similarly, if we ask G a bunch of questions to elicit their revealed preferences/decisions and compare those to the results of the simulation, I expect that we'll find conflicting answers. Some things match up, others don't, some things depend on who we ask or how we ask them or whether they've eaten lunch recently.
Actually, I think the real situation is even more extreme than that. This whole enterprise depends on the idea that we have actual values... the so-called "terminal" ones, which we mostly aren't aware of right now, but are what we would want if we "learned together and grew together and yadda yadda"... which are more mutually reconcilable than the surface values that we claim to want or think we want (e.g., "everyone in the world embraces Jesus Christ in their hearts," "everybody suffers as I've suffered," "I rule the world!").
If that's true, it seems to me we should expect that the result of a superhuman intelligence optimizing the world for our terminal values and ignoring our surface values to seem alien and incomprehensible and probably kind of disgusting to the people we are right now.
And at that point we have to ask, what do we trust more? Our own brains, which say "BOO!", or the value-extractor/optimizer/simulator we've built, which says "no, really, it turns out this is what you actually want; trust me."?
If the answer to that is not immediately "we trust the software far more than our own fallible brains" we have clearly done something wrong.
But... in that case, why bother with the simulator at all? Just implement V, never mind what we think about it. Our thoughts are merely the reports of an obsolete platform; we have better tools now.
This is a special case of a more general principle: when we build tools that are more reliable than our own brains, and our brains disagree with the tools, we should ignore our own brains and obey the tools. Once a self-driving car is good enough, allowing human drivers to override it is at best unnecessary and at worst stupidly dangerous.
Similarly... this whole enterprise depends on building a machine that's better at knowing what I really want than I am. Once we've built the machine, asking me what I want is at best unnecessary and at worst stupidly dangerous.