The AGI does not need to be tricked - it knows everything about the setup, it just doesn't care. The point is that this opens up a lot of extra control methods for consideration, if friendliness turns out to be as hard as we think.
Fair enough. I just meant that this setup requires building an AGI with a particular utility function that behaves as expected and building extra machinery around it, which could be more complicated than just building an AGI with the utility function you wanted. On the other hand, maybe it's easier to build an AGI that only cares about worlds where one particular bitstring shows up than to build a friendly AGI in general.
I'm nervous about designing elaborate mechanisms to trick an AGI, since if we can't even correctly implement an ordinary friendly AGI without bugs and mistakes, it seems even less likely we'd implement these weird/clever AGI setups without bugs and mistakes. I would tend to focus on just getting the AGI to behave properly from the start, without the need for clever tricks, though I suppose that limited exploration of more fanciful scenarios might yield insight.
As I understand it, your satisficing agent has essentially the utility function min(E[paperclips], 9). This means it would be fine with a 10^-100 chance of producing 10^101 paperclips. But isn't it more intuitive to think of a satisficer as optimizing the utility function E[min(paperclips, 9)]? In this case, the satisficer would reject the 10^-100 gamble described above, in favor of just producing 9 paperclips (whereas a maximizer would still take the gamble and hence would be a poor replacement for the satisficer).
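To make the contrast concrete, here is a minimal sketch in Python (the (probability, outcome) list representation of a lottery is just for illustration):

```python
from fractions import Fraction

# The gamble from above: probability 10^-100 of 10^101 paperclips, else nothing.
p = Fraction(1, 10**100)
gamble = [(p, 10**101), (1 - p, 0)]
certain = [(Fraction(1), 9)]  # just produce 9 paperclips for sure

def u_capped_expectation(dist):
    """min(E[paperclips], 9): the cap is applied to the expectation."""
    return min(sum(prob * clips for prob, clips in dist), 9)

def u_expected_cap(dist):
    """E[min(paperclips, 9)]: the cap is applied outcome by outcome."""
    return sum(prob * min(clips, 9) for prob, clips in dist)

print(u_capped_expectation(gamble), u_capped_expectation(certain))
# -> 9 9: indifferent between the gamble and the sure thing
print(float(u_expected_cap(gamble)), float(u_expected_cap(certain)))
# -> 9e-100 9.0: rejects the gamble in favor of the sure 9
```

Exact fractions are used so the 10^-100 probability doesn't underflow to zero before the comparison.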
A satisficer might not want to take over the world, since doing so would arouse opposition and possibly lead to its defeat. Instead, the satisficer might prefer to make very modest demands that are more likely to be satisfied (whether by humans or by an ascending uncontrolled AI that wants to mollify potential opponents).
But what about when people learn about the setup of this particular problem? Does the correlation between having the one-boxing gene and inclining toward one-boxing still hold?
Yes, it should also hold in this case. Knowing about the study could be part of the problem, and the subjects of the initial study could have been lied to about its purpose. The idea of the "genetic Newcomb problem" is that the two-boxing gene is less intuitive than CGTA and that its workings are mysterious. It could make you sure that you have or don't have the gene. It could make you comfortable with decision theories whose names start with 'C', make you interpret genetic Newcomb problem studies in a certain way, etc. The only thing we know is that it causes us to two-box in the end. For CGTA, on the other hand, we have a very strong intuition that it causes a "tickle" or the like, which could easily be overridden once we know about the first study (which correlates chewing gum with throat abscesses). It could not possibly influence what we think about CDT vs. EDT! But this intuition is not part of the original description of the problem.
If there were a perfect correlation between choosing to one-box and having the one-box gene (i.e., everyone who one-boxes has the one-box gene, and everyone who two-boxes has the two-box gene, in all possible circumstances), then it's obvious that you should one-box, since that implies you must win more. This would be similar to the original Newcomb problem, where Omega also perfectly predicts your choice. Unfortunately, if you really will follow the dictates of your genes under all possible circumstances, then telling someone what she should do is useless, since she will do what her genes dictate.
The more interesting and difficult case is when the correlation between gene and choice isn't perfect.
I assume that the one-boxing gene makes a person generically more likely to favor the one-boxing solution to Newcomb. But what about when people learn about the setup of this particular problem? Does the correlation between having the one-boxing gene and inclining toward one-boxing still hold? Are people who one-box only because of EDT (even though they would have two-boxed before considering decision theory) still more likely to have the one-boxing gene? If so, then I'd be more inclined to force myself to one-box. If not, then I'd say that the apparent correlation between choosing one-boxing and winning breaks down when the one-boxing is forced. (Note: I haven't thought a lot about this and am still fairly confused on this topic.)
I'm reminded of the problem of reference-class forecasting and trying to determine which reference class (all one-boxers? or only grudging one-boxers who decided to one-box because of EDT?) to apply for making probability judgments. In the limit where the reference class consists of molecule-for-molecule copies of yourself, you should obviously do what made the most of them win.
The problem I see here is that if you literally care only about the "worst life at any given moment", then the situations "seven billion extremely happy people, one mildly unhappy person" and "seven billion mildly happy people, one mildly unhappy person" are equivalent, because the worst-off person is in the same situation in both. Which means that if you had a magical button that could convert the latter situation into the former, you wouldn't bother pressing it, because you wouldn't see any point in doing so. Is that what you really believe?
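A toy calculation makes the indifference concrete (the welfare numbers are made up purely for illustration):

```python
# welfare level -> head count; the numbers are purely illustrative
world_a = {10: 7_000_000_000, -1: 1}  # extremely happy majority, one mildly unhappy person
world_b = {2: 7_000_000_000, -1: 1}   # mildly happy majority, one mildly unhappy person

def worst_off(world):
    """Score a world only by the welfare of its worst-off member."""
    return min(world)  # min over the welfare levels present

print(worst_off(world_a), worst_off(world_b))  # -1 -1: this view can't tell them apart
```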
Good point. Also, in most multiverse theories, the worst possible experience necessarily exists somewhere.
Okay. I'm sure you've seen this question before, but I'm going to ask it anyway.
Given a choice between
- A world with seven billion mildly happy people, or
- A world with seven billion minus one really happy people, and one person who just got a papercut
Are you really going to choose the former? What's your reasoning?
From a practical perspective, accepting the papercut is the obvious choice because it's good to be nice to other value systems.
Even if I'm only considering my own values, I give some intrinsic weight to what other people care about. ("NU" is just an approximation of my intrinsic values.) So I'd still accept the papercut.
I also don't really care about mild suffering -- mostly just torture-level suffering. If it were 7 billion really happy people plus 1 person tortured, that would be a much harder dilemma.
In practice, the ratio of expected heaven to expected hell in the future is much smaller than 7 billion to 1, so even someone who is just a "negative-leaning utilitarian" (caring orders of magnitude more about suffering than happiness) will tend to act like a pure NU on any actual policy question.
One naive and useful security precaution is to make the AI care only about worlds where the high explosives inside it won't actually ever detonate... (and place someone ready to blow them up if the AI misbehaves).
There are other, more general versions of that idea, and other uses to which this can be put.
I guess you mean that the AGI would care about worlds where the explosives won't detonate even if the AGI does nothing to stop the person from pressing the detonation button. If the AGI only cared about worlds where the bomb didn't detonate for any reason, it would try hard to stop the button from being pushed.
But to make the AGI care only about worlds where the bomb doesn't go off even if the AGI does nothing to avert the explosion, we have to define what it means for the AGI to "try to avert the explosion" as opposed to just performing ordinary actions. That gets pretty tricky pretty quickly.
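Here is a toy sketch of that distinction, with made-up numbers, a two-action world, and "make_paperclips" hard-coded as the null action (that hard-coding is exactly the definitional step that's hard to get right):

```python
# Two candidate actions; all numbers are made up for illustration.
P_BUTTON_WORKS = {             # chance the overseer successfully detonates the explosives
    "make_paperclips": 0.1,    # ordinary work: small chance of being shut down
    "disable_button": 0.0,     # the detonation button no longer does anything
}
PAPERCLIPS = {
    "make_paperclips": 10.0,
    "disable_button": 9.5,     # disabling the button costs a little productivity
}

def u_no_detonation_any_reason(action):
    """Care only about worlds where the bomb never goes off, however that
    comes about (including because the AGI itself neutralized the button)."""
    return (1 - P_BUTTON_WORKS[action]) * PAPERCLIPS[action]

def u_no_detonation_given_inaction(action):
    """Weight worlds by whether the bomb would have stayed silent had the AGI
    taken no anti-button action. Hard-coding "make_paperclips" as the null
    action is exactly the definitional step that's hard to make precise."""
    weight = 1 - P_BUTTON_WORKS["make_paperclips"]  # constant across actions
    return weight * PAPERCLIPS[action]

for u in (u_no_detonation_any_reason, u_no_detonation_given_inaction):
    print(u.__name__, "->", max(PAPERCLIPS, key=u))
# u_no_detonation_any_reason -> disable_button      (0.9*10 = 9.0 < 1.0*9.5 = 9.5)
# u_no_detonation_given_inaction -> make_paperclips (0.9*10 = 9.0 > 0.9*9.5 = 8.55)
```

The second utility function doesn't reward tampering with the button, but only because we stipulated what "doing nothing" means; pinning that down for a real agent is the hard part.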
Anyway, you've convinced me that these scenarios are at least interesting. I just want to point out that they may not be as straightforward as they seem once it comes time to implement them.