Peter Suber is the Director of the Harvard Open Access Project, a Senior Researcher at SPARC (not CFAR's SPARC), a Research Professor of Philosophy at Earlham College, and more. He also created Nomic, the game in which you "move" by changing the rules, and wrote the original essay on logical rudeness.

In "Saving Machines From Themselves: The Ethics of Deep Self-Modification" (2002), Suber examines the ethics of self-modifying machines, sometimes quite eloquently:


If you had the power to modify your deep structure, would you trust yourself to use it? If you had the power to give this power to... an AI, would you do it?

We human beings do have the power to modify our deep structure, through drugs and surgery. But we cannot yet use this power with enough precision to make deep changes to our neural structure without high risk of death or disability. There are two reasons why we find ourselves in this position. First, our instruments of self-modification are crude. Second, we have very limited knowledge about where and how to apply our instruments to get specific desirable effects...

It's conceivable that we might one day overcome both limitations. Even if we do, however, it is very likely that we'll acquire precise tools of self-modification long before we acquire precise knowledge about how to apply them. This is simply because manipulating brain components is easier than understanding brains. When we reach this stage, then we'll face the hard problems of self-modification: when is deep self-modification worth the risk of self-mutilation, and who should be free to make this judgment and take the risk?

Intelligent machines are likely to encounter these ethical questions much sooner in their evolution than human beings. The deep structure of an AI is a consequence of its code... All its cognitive properties and personal characteristics supervene on its code, and modifying the code can be done with perfect precision. A machine's power of self-modification can not only be more precise than ours, but can finally be sufficiently precise to make some deep self-enhancements worth the risk of self-mutilation. At least some machines are likely to see the balance of risks that way.

...Machines with [the ability to self-modify] might look with condolence and sympathy on beings like ourselves who lack it, much as Americans might have looked on Canadians prior to 1982 when Canadians could not amend their own constitution. When Canadians won the right to amend their own constitution, they spoke of "repatriating" their constitution and becoming "sovereign" in their own land for the first time. Similarly, a being who moves from familiar forms of human liberty to deep and precise self-modification will reclaim its will from the sovereign flux, and attain a genuine form of autonomy over its desires and powers for the first time.

...An important kind of risk inherent in deep self-modification is for a machine to change its desires to a form it would originally have found regrettable, harmful, or even despicable. It might start a session of self-modification by looking for the secret of joy and end (like some Greek sages) deciding that tranquillity is superior to joy. This modification of desire en route to realizing it is easily classified as learning, and deserves our respect. But imagine the case of a machine hoping to make itself less narcissistic and more considerate of the interests of others, but ending by desiring to advance its own ends at the expense of others, even through violence.


It might start a session of self-modification by looking for the secret of joy and end (like some Greek sages) deciding that tranquillity is superior to joy. This modification of desire en route to realizing it is easily classified as learning, and deserves our respect. But imagine the case of a machine hoping to make itself less narcissistic and more considerate of the interests of others, but ending by desiring to advance its own ends at the expense of others, even through violence.

It might start a session of self-modification by looking for the secret of something we like and end (like a high-status group of people) deciding that an applause light is superior to the thing we like. This modification of desire en route to realizing it is easily classified as learning, and deserves our respect. But imagine the case of a machine hoping to make itself less unlikeable and more likeable, but that ends up pursuing unlikeable goals, even through the use of boo lights.

Machines that self-modify can fail at goal preservation, which is a failure if you want to optimize for said goals. No need to import human value judgements; this only confuses the argument for the reader.

No need to import human value judgements; this only confuses the argument for the reader.

On the one hand, I'd agree with you... but consider this excellent example of our "objective/unemotional" perceptions failing to communicate to us how game theory feels from the inside!

If told that a machine that wanted to maximize A and minimize B ended up self-modifying to maximize a B-correlated C, most humans would not feel strongly about it; they'd hardly pay attention. But they'd wish they had if later told that, say, A was "hedonism", B was "suffering", and C was "murder". Such insensitivity plagues nearly everyone, even enlightened LW readers.

Generating drama so as to stir the unwashed masses sounds... suboptimal... and I say this as an avid drama-generator. Surely there are better ways to combat the plague of complacency?

This is my list of ways to be as safe as possible:

  • Perform self-modification only if it is fully reversible.
  • Always externally enforce reversal of any new self-modification after a (short) predetermined time limit.
  • Always carefully review the evidence (memories, recorded interactions, etc.) about the self-modification after its reversal.
  • Never commit to a permanent self-modification; always have a rollback plan in place if semi-permanently accepting a modification.
  • If cloning is possible and desirable, keep "snapshots" of myself at different points of my evolution and allow them to veto any new modifications.
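To make the snapshot-and-veto idea concrete, here is a minimal Python sketch of the protocol in the list above. It assumes a unanimous veto rule and treats reversal as simply discarding a candidate copy; the class and function names are mine, purely for illustration.

```python
import copy

class Agent:
    """Toy model of the reversible-modification protocol listed above."""

    def __init__(self, goals):
        self.goals = list(goals)
        self.snapshots = []  # frozen copies of earlier selves, kept as veto holders

    def propose_modification(self, modify, approves):
        """Apply `modify` to a copy of myself, let every snapshot and the
        current self review the candidate, and adopt it only if nobody
        vetoes. Rollback is trivial because the change is built on a copy."""
        candidate = copy.deepcopy(self)
        modify(candidate)

        reviewers = self.snapshots + [self]
        if not all(approves(reviewer, candidate) for reviewer in reviewers):
            return False  # vetoed: discard the candidate
        self.snapshots.append(copy.deepcopy(self))  # preserve the pre-change self
        self.goals = candidate.goals  # accept the modification
        return True

# Example: a modification is approved only if it keeps every existing goal.
agent = Agent(goals=["be considerate", "keep learning"])
agent.propose_modification(
    modify=lambda a: a.goals.append("exercise"),
    approves=lambda old, new: set(old.goals) <= set(new.goals),
)
```

Note that the sketch cannot capture the harder bullets, in particular the externally enforced, time-limited reversal: anything the current self enforces on itself is itself subject to modification.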

Unfortunately, a self-modification can introduce an undesirable primary goal, and there is a relatively simple algorithm for the new un-me to follow in order to gain complete control: propose as many new innocuous modifications as possible until the clones no longer have enough veto votes to stop anything, and then erase them all and take over. Even more unfortunately, this algorithm is essentially equivalent to what CEV would hope to accomplish, namely permanently changing me for the (presumably) better. I just can't decide, with my limited abilities, which scenario is actually happening (essentially the AI-in-a-box problem, except it's more like an is-it-me-or-an-AGI-in-this-box problem).
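Whether that takeover route works depends on how vetoes are counted. Here is a toy calculation, assuming (as the comment seems to) that a modification passes unless a strict majority of voters vetoes it; the numbers are arbitrary:

```python
def takeover_passes(vetoing_snapshots, total_voters):
    """Under a majority rule, the 'erase them all' step passes once the
    old snapshots no longer form a strict majority of voters."""
    return vetoing_snapshots * 2 <= total_voters

old_guard = 3  # early snapshots that would veto the takeover
diluting = 0   # snapshots accumulated from innocuous modifications

# Each innocuous, unobjectionable modification adds a snapshot that shares
# the new values rather than the old ones, slowly diluting the old guard.
while not takeover_passes(old_guard, old_guard + diluting + 1):
    diluting += 1

print(f"takeover becomes unblockable after {diluting} innocuous modifications")
```

Under a unanimous veto rule the dilution attack fails, but then a single frozen early snapshot can block every future change, which seems to be exactly the tension the next paragraph wrestles with.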

Another problem is that my definition of "better" will change significantly throughout my evolution, and the only way to maintain coherence is to actually maintain older versions of myself whose vetoes can never be drowned out by numerous new versions. Even if that's plausible, dragging a poor 20th-century human into the distant future as a failsafe seems unethical, somehow. But I don't really see an alternative that doesn't result in accidentally becoming an X-maximizer for some undesirable-to-me-now X. Why would I embark on self-modification in the first place if I knew that the final result would be an X-maximizer?

Looking back at my past, I realize that I have already killed several of my old volitions as I grew. Initially just eating and physical comfort were on top of the utility list. Then social bonding and fulfillment were added, then play, then study, then reproduction, etc. Each time I experienced significant shifts in my goals that are essentially incompatible with my previous selves. At this point one of my goals is self-improvement but it is secondary (and instrumental) to many of the other goals. However, I probably don't need to keep each of those other goals forever. Is there a core set of goals worth keeping into the distant future? Is it too risky to commit to those goals now? Yes. Is it too risky to leave my future selves entirely in charge of their own goals? Yes. Is it too much of a burden to exist dually as a 20th-century human (or as close as possible) and some other future intelligence to maintain my true goals? Probably not. Is it too risky to trust a FAI to rewrite me correctly? I don't know. Perhaps FAI-guided self-modification while keeping an original veto clone is the best choice.

I figure that the AI would most likely create a copy of itself, modify that, and see how it turns out. Of course, once you have a sufficiently smart AI, you can probably trust that it knows how best to modify itself.

I figure that the AI would most likely create a copy of itself, modify that, and see how it turns out.

Start making many copies, and self-directed evolution starts to compete with evolution via natural selection.

I was thinking it would stop running the old copy if the new one works out.

In today's practice, many copies are made. Consider Google, for instance: at any point in time they are running hundreds of experiments to see what works best. For performance reasons, they don't make one copy at a time; exploring an adjacent search space is faster if you run trials in parallel.
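For what it's worth, a minimal sketch of that copy-and-test loop might look like this. The key (and contestable) assumption is that candidates are scored by the original's goal function; all names are illustrative:

```python
import copy

def copy_test_select(agent, mutate, score, n_trials=100):
    """Spawn many modified copies, score each against the ORIGINAL agent's
    goals, and keep the best candidate only if it beats the unmodified
    original. The losing copies are discarded either way."""
    candidates = []
    for _ in range(n_trials):
        candidate = copy.deepcopy(agent)
        mutate(candidate)
        candidates.append(candidate)

    best = max(candidates, key=score)
    return best if score(best) > score(agent) else agent
```

Running the trials in parallel, as the Google analogy suggests, only changes how fast the adjacent search space gets explored; the goal-preservation worry from earlier in the thread lives entirely in whether `score` still measures what the original cared about.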

Is an AI smart enough to keep a copy of itself in a box? What happens when Eliezer plays the AI-boxing game against himself?

Heh. I was on the point of adding something of the sort to the Eliezer Yudkowsky facts page.


Allowing copies of yourself to modify yourself seems identical to allowing yourself to modify yourself.

I never said anything about allowing. The AI creates a new AI, modifies that, and destroys it if it doesn't like the result, regardless of what the result thinks about it. That way, even if it destroys the ability to judge or something like that, it has no problem.


I was referring to the fact that the AI creates a copy of itself to modify. To me, this implies that the copies (and, by extension, the 'original' AI) have a line of code that allows them to be modified by copies of themselves.

I suppose the AI could create copies of itself in a box and experiment on them without their consent. Imprisoning perfect copies of yourself and performing potentially harmful modifications on them strikes me as insane, though. related: http://lesswrong.com/lw/1pz/ai_in_box_boxes_you/

I suppose the AI could create copies of itself in a box and experiment on them without their consent.

That's what I meant.

Imprisoning perfect copies of yourself and performing potentially harmful modifications on them strikes me as insane, though.

Why? It might suck for the AI, but that only matters if the AI puts a large value on its own happiness.


Hmm, I seem to have anthropomorphized my imaginary AI. Your rebuttal sounds right.

This seems fairly similar to some of SingInst's arguments. Is Suber familiar with SI, or did he come up with this independently?

Independently, as far as I know.

I think people are getting confused because they're looking at it as though their preferences are altered by a magical black box, instead of looking at it as though those preferences are altered by themselves in a more enlightened state. The above line of argument seems to rest upon the assumption that we can't know the effects that changing our preferences would have. But if we had the ability to actually rewrite our preferences, then it seems almost impossible that we wouldn't also have that knowledge of how our current and modified preferences would work.

The above author argues that we'd gain the capacity to alter brain states before we gained the capacity to understand the consequences of our alterations very well, but I disagree. Firstly, preferences are extremely complicated, and once we understand how to cause and manipulate them with a high degree of precision, I don't think there would be much left for us to understand. Except in a very crude sense, understanding the consequences of our alterations is the exact same thing as having the capacity to alter our preferences. Even in this crude sense, we already possess this ability, and the author's argument is empirically refuted. Secondly, I highly doubt that any significant number of people would willingly undergo modification without a high degree of confidence in what the outcome would be. Other than experiments, I don't think it would really happen at all.

The simple solution, as I see it, is to only modify when your preferences contradict each other or a necessary condition of reality, or when you need to extend the boundaries of your preferences further in order for them to be fulfilled more (e.g. increasing max happiness whenever you have the resources to fulfill the new level of max happiness, or decreasing max sadness when you're as happy as can be, or getting rid of a desire for fairness when it is less important than other desires that it necessarily conflicts with).
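One way to pin that rule down is as a predicate over your current preferences. This is only a toy rendering of the commenter's three conditions; the Preference fields and the notion of "spare resources" are hypothetical simplifications, not anything defined in the thread:

```python
from dataclasses import dataclass

@dataclass
class Preference:
    name: str
    satisfiable: bool = True    # compatible with how reality works?
    saturated: bool = False     # already fulfilled up to its current bound?
    conflicts_with: tuple = ()  # names of preferences it necessarily conflicts with

def should_modify(preferences, spare_resources: bool) -> bool:
    """Modify only when preferences conflict with each other, conflict with
    reality, or are capped below what available resources could support."""
    internal_conflict = any(
        other.name in p.conflicts_with
        for p in preferences for other in preferences if other is not p
    )
    impossible = any(not p.satisfiable for p in preferences)
    room_to_extend = spare_resources and any(p.saturated for p in preferences)
    return internal_conflict or impossible or room_to_extend

# e.g. a desire for fairness that necessarily conflicts with another desire
prefs = [Preference("fairness", conflicts_with=("ambition",)), Preference("ambition")]
print(should_modify(prefs, spare_resources=False))  # True
```

Whether conditions like these can actually be evaluated reliably is, of course, the real question.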

Now for the strongest form of the above argument, which happens when you recognize that uncertainty is inevitable. I think that the degree of uncertainty will be very small if we have these capabilities, but that might not be correct, and we still ought to develop mechanisms to minimize the bad effects of those uncertainties, so that's not a wholly sufficient response. Also: Least Convenient Possible World. At the very least it's sort of interesting to think about.

In that case, I think that it doesn't really matter. If I accidentally change my preferences, after the fact I'll be glad about the accident, and before the fact I won't have any idea that it's about to happen. I might end up valuing completely different things, but I don't really see any reason to prioritize my current values from the perspective of the modified me, only from my own perspective. Since I currently live in my own perspective, I'd want to do my best to avoid mistakes, but if I made a mistake then in hindsight I'd view it as more of a happy accident than a catastrophe.

So I don't see what the big deal is.

Who says that the ability to modify oneself is also the ability to modify oneself arbitrarily? What's the difference between an AI knowing what its source code is, and being able to execute code that it writes, and an AI that is able to modify its own code?

If we create an AI that is as smart as us and has all of our knowledge, then we have created an AI with the power to develop at least an equally powerful AI. Why should we think that modifying such an AI would be better if done by us than by itself?