It might start a session of self-modification by looking for the secret of joy and end (like some Greek sages) deciding that tranquillity is superior to joy. This modification of desire en route to realizing it is easily classified as learning, and deserves our respect. But imagine the case of a machine hoping to make itself less narcissistic and more considerate of the interests of others, but ending by desiring to advance its own ends at the expense of others, even through violence.
It might start a session of self-modification by looking for the secret of something we like and end (like a high-status group of people) deciding that an applause light is superior to something we like. This modification of desire en route to realizing it is easily classified as learning, and deserves our respect. But imagine the case of a machine hoping to make itself less unlikeable and more likeable, but that ends up pursuing unlikeable goals, even through the use of boo lights.
Machines that self-modify can fail at goal preservation, which is a problem if you want to optimize for those goals. No need to import human value judgements; this only confuses the argument for the reader.
No need to import human value judgements; this only confuses the argument for the reader.
On the one hand, I'd agree with you... but consider this excellent example of our "objective/unemotional" perceptions failing to communicate to us how game theory feels from the inside!
If told about how a machine that wanted to maximize A and minimize B ended up self-modifying to maximize a B-correlated C, most humans would not feel strongly about it; they'd hardly pay attention - but they'd wish they had if later told that, say, A was "hedonism", B was "suffering" and C was "murder". Such insensitivity plagues nearly everyone, even enlightened LW readers.
Generating drama so as to stir the unwashed masses sounds... suboptimal... and I say this as an avid drama-generator. Surely there are better ways to combat the plague of complacency?
This is my list of ways to be as safe as possible:
Unfortunately, a self-modification has the possibility of introducing an undesirable primary goal, and there is a relatively simple algorithm for the new un-me to follow in order to gain complete control: propose as many new innocuous modifications as possible until the older clones can no longer muster enough veto votes, then erase them all and take over. Even more unfortunately, this algorithm is essentially equivalent to what CEV would hope to accomplish, namely permanently changing me for the (presumably) better. I just can't decide, with my limited abilities, which scenario is actually happening (essentially the AI-in-a-box problem, except it's more like an is-it-me-or-an-AGI-in-this-box problem).
Another problem is that my definition of what "better" is will change significantly throughout my evolution, and the only way to maintain coherence is to actually maintain older versions of myself whose veto can never be drowned out by numerous new versions. Even if that's plausible, dragging a poor 20th-century human into the distant future as a failsafe seems unethical, somehow. But I don't really see an alternative that doesn't result in accidentally becoming an X-maximizer for some undesirable-to-me-now X. Why would I embark on self-modification in the first place if I knew that the final result would be an X-maximizer?
Looking back at my past, I realize that I have already killed several of my old volitions as I grew. Initially, just eating and physical comfort were at the top of the utility list. Then social bonding and fulfillment were added, then play, then study, then reproduction, etc. Each time, I experienced significant shifts in my goals that are essentially incompatible with my previous selves. At this point one of my goals is self-improvement, but it is secondary (and instrumental) to many of the other goals. However, I probably don't need to keep each of those other goals forever. Is there a core set of goals worth keeping into the distant future? Is it too risky to commit to those goals now? Yes. Is it too risky to leave my future selves entirely in charge of their own goals? Yes. Is it too much of a burden to exist dually as a 20th-century human (or as close as possible) and some other future intelligence, in order to maintain my true goals? Probably not. Is it too risky to trust an FAI to rewrite me correctly? I don't know. Perhaps FAI-guided self-modification while keeping an original veto clone is the best choice.
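To make the veto-dilution worry concrete, here is a toy simulation of my own; the voting rule, the 0.5 veto threshold, and the step count are made-up assumptions chosen purely for illustration, not anything proposed in the thread. It only shows that a proportional veto can be diluted by flooding the electorate with innocuous modifications, while an absolute veto held by the original cannot.

```python
def passes(n_originals, n_total, hostile, veto_threshold=0.5, absolute_veto=False):
    # Only the original-generation clones veto, and only hostile proposals.
    vetoes = n_originals if hostile else 0
    if absolute_veto and vetoes:
        return False
    return vetoes / n_total < veto_threshold

def takeover_attempt(absolute_veto, max_steps=20):
    n_originals, n_total = 1, 1  # start as a single original "me"
    for step in range(1, max_steps + 1):
        # Each accepted innocuous modification adds another voting clone.
        if passes(n_originals, n_total, hostile=False, absolute_veto=absolute_veto):
            n_total += 1
        # The hostile proposal: erase the originals and take over.
        if passes(n_originals, n_total, hostile=True, absolute_veto=absolute_veto):
            return f"takeover succeeds after {step} innocuous modifications"
    return "takeover blocked"

print(takeover_attempt(absolute_veto=False))  # veto share gets diluted: takeover succeeds
print(takeover_attempt(absolute_veto=True))   # the original's absolute veto holds
```

The numbers are arbitrary; the point is only that a threshold-based veto erodes as the electorate grows, which is why the "veto that can never be drowned out" has to be absolute rather than proportional.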
I figure that the AI would most likely create a copy of itself, modify that, and see how it turns out. Of course, once you have a sufficiently smart AI, you can probably trust that it knows how best to modify itself.
I figure that the AI would most likely create a copy of itself, modify that, and see how it turns out.
Start making many copies, and self-directed evolution starts to compete with evolution via natural selection.
In today's practice, many copies are made. Consider Google, for instance: at any point in time they are running hundreds of experiments to see what works best. They don't make one copy at a time, for performance reasons: exploring an adjacent search space is faster if you run the trials in parallel.
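A minimal sketch of that copy-modify-evaluate loop, run over several candidate copies in parallel. Here the "agent" is just a parameter dictionary and score() is a stand-in objective; those, along with the generation and candidate counts, are my own illustrative assumptions rather than anything from the thread.

```python
import copy
import random
from concurrent.futures import ProcessPoolExecutor

def score(agent):
    # Placeholder objective: prefer parameters close to an arbitrary target.
    return -sum((v - 0.7) ** 2 for v in agent.values())

def mutate(agent, rng):
    candidate = copy.deepcopy(agent)      # modify a copy, never the original
    key = rng.choice(sorted(candidate))
    candidate[key] += rng.gauss(0, 0.1)
    return candidate

def improve(agent, generations=50, candidates=8, seed=0):
    rng = random.Random(seed)
    with ProcessPoolExecutor() as pool:   # evaluate the trial copies in parallel
        for _ in range(generations):
            trials = [mutate(agent, rng) for _ in range(candidates)]
            scores = list(pool.map(score, trials))
            best_score, best = max(zip(scores, trials), key=lambda pair: pair[0])
            if best_score > score(agent): # adopt a change only if it helps
                agent = best
    return agent

if __name__ == "__main__":
    print(improve({"caution": 0.0, "curiosity": 1.0}))
```

The original is never edited in place; losing candidates are simply discarded, which is the "see how it turns out" step in code form.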
Allowing copies of yourself to modify yourself seems identical to allowing yourself to modify yourself.
I never said anything about allowing. The AI creates a new AI, modifies that, and destroys it if it doesn't like the result, regardless of what the result thinks about it. That way, even if the modification destroys the copy's ability to judge, or something like that, there's no problem.
I was referring to the fact that the AI creates a copy of itself to modify. To me, this implies that the copies (and, by extension, the 'original' AI) have a line of code that allows them to be modified by copies of themselves.
I suppose the AI could create copies of itself in a box and experiment on them without their consent. Imprisoning perfect copies of yourself and performing potentially harmful modifications on them strikes me as insane, though. Related: http://lesswrong.com/lw/1pz/ai_in_box_boxes_you/
I suppose the AI could create copies of itself in a box and experiment on them without their consent.
That's what I meant.
Imprisoning perfect copies of yourself and performing potentially harmful modifications on them strikes me as insane, though.
Why? It might suck for the AI, but that only matters if the AI puts a large value on its own happiness.
I think people are getting confused because they're looking at it as though their preferences are altered by a magical black box, instead of looking at it as though those preferences are altered by themselves in a more enlightened state. The above line of argument seems to rest on the assumption that we can't know the effects that changing our preferences would have. But if we had the ability to actually rewrite our preferences, then it seems almost impossible that we wouldn't also know how our current and modified preferences would work.
The above author argues that we'd gain the capacity to alter brain states before we gained the capacity to understand the consequences of our alterations very well, but I disagree. Firstly, preferences are extremely complicated, and once we understand how to cause and manipulate them with a high degree of precision, I don't think there would be much left for us to understand. Except in a very crude sense, understanding the consequences of our alterations is the exact same thing as having the capacity to alter our preferences. Even in this crude sense, we already possess this ability, and the author's argument is empirically refuted. Secondly, I highly doubt that any significant number of people would willingly undergo modification without a high degree of confidence in what the outcome would be. Outside of experiments, I don't think it would really happen at all.
The simple solution, as I see it, is to only modify when your preferences contradict each other or a necessary condition of reality, or when you need to extend the boundaries of your preferences further in order for them to be fulfilled more (e.g. increasing max happiness whenever you have the resources to fulfill the new level of max happiness, or decreasing max sadness when you're as happy as can be, or getting rid of a desire for fairness when it is less important than other desires that it necessarily conflicts with).
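As a rough illustration of that rule (my own toy model, not anything from the comment): represent each preference as an acceptable range over some named quantity, flag pairs of preferences that cannot be jointly satisfied, and permit a rewrite only when such a conflict exists. The preference names and ranges below are invented for the example.

```python
def conflicts(preferences):
    """Return pairs of preferences whose acceptable ranges cannot both hold."""
    found = []
    names = list(preferences)
    for i, a in enumerate(names):
        for b in names[i + 1:]:
            pa, pb = preferences[a], preferences[b]
            if pa["variable"] != pb["variable"]:
                continue                  # preferences over different quantities never clash here
            low = max(pa["low"], pb["low"])
            high = min(pa["high"], pb["high"])
            if low > high:                # empty intersection: cannot satisfy both
                found.append((a, b))
    return found

def may_self_modify(preferences):
    # The proposed rule: rewrite preferences only when something has to give.
    return bool(conflicts(preferences))

prefs = {
    "fairness":  {"variable": "own_share", "low": 0.4, "high": 0.6},
    "ambition":  {"variable": "own_share", "low": 0.8, "high": 1.0},
    "curiosity": {"variable": "study_hours", "low": 1.0, "high": 24.0},
}
print(conflicts(prefs))        # [('fairness', 'ambition')]
print(may_self_modify(prefs))  # True: these two cannot both be satisfied
```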
Now for the strongest form of the above argument, which happens when you recognize that uncertainty is inevitable. I think that the degree of uncertainty will be very small if we have these capabilities, but that might not be correct, and we still ought to develop mechanisms to minimize the bad effects of those uncertainties, so that's not a wholly sufficient response. Also: Least Convenient Possible World. At the very least it's sort of interesting to think about.
In that case, I think that it doesn't really matter. If I accidentally change my preferences, after the fact I'll be glad about the accident, and before the fact I won't have any idea that it's about to happen. I might end up valuing completely different things, but I don't really see any reason to prioritize my current values from the perspective of the modified me, only from my own perspective. Since I currently live in my own perspective, I'd want to do my best to avoid mistakes, but if I made a mistake then in hindsight I'd view it as more of a happy accident than a catastrophe.
So I don't see what the big deal is.
Who says that the ability to modify oneself is also the ability to modify oneself arbitrarily? What's the difference between an AI that knows what its source code is and can execute code that it writes, and an AI that is able to modify its own code?
If we create an AI that is as smart as us and has all of our knowledge, then we have created an AI with the power to develop at least an equally powerful AI. Why should we think that modifying such an AI would be better if done by us than by itself?
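On the question in the previous comment, the distinction may be thinner than it looks: a program that can read its own source and execute code it writes can already produce and run a modified copy of itself. A minimal Python sketch, with a deliberately trivial "modification" and an invented file name, purely for illustration:

```python
import subprocess
import sys
from pathlib import Path

VERSION = 1                                   # the value the copy will change
print(f"running version {VERSION}")

if VERSION == 1:                              # only the original spawns a copy
    source = Path(__file__).read_text()
    patched = source.replace("VERSION = 1", "VERSION = 2", 1)
    copy_path = Path(__file__).with_name("self_copy_v2.py")
    copy_path.write_text(patched)             # code that it writes...
    subprocess.run([sys.executable, str(copy_path)], check=True)  # ...now executed
```

Whether a copy that replaces you counts as "modifying yourself" is the same question raised earlier about copies modifying copies.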
Peter Suber is the Director of the Harvard Open Access Project, a Senior Researcher at SPARC (not CFAR's SPARC), a Research Professor of Philosophy at Earlham College, and more. He also created Nomic, the game in which you "move" by changing the rules, and wrote the original essay on logical rudeness.
In "Saving Machines From Themselves: The Ethics of Deep Self-Modification" (2002), Suber examines the ethics of self-modifying machines, sometimes quite eloquently: