"So how can it ensure that future self-modifications will accomplish its current objectives? For one thing, it has to make those objectives clear to itself. If its objectives are only implicit in the structure of a complex circuit or program, then future modifications are unlikely to preserve them. Systems will therefore be motivated to reflect on their goals and to make them explicit." -- Stephen M. Omohundro, The Basic AI Drives
This AI becomes able to improve itself in a haphazard way, makes various changes that are net improvements but may introduce value drift, and then gets smart enough to do guaranteed self-improvement, at which point its values freeze (forever). -- Eliezer Yudkowsky, What I Think, If Not Why
I have stopped understanding why these quotes are correct. Help!
More specifically, if you design an AI using "shallow insights" without an explicit goal-directed architecture - some program that "just happens" to make intelligent decisions that can be viewed by us as fulfilling certain goals - then it has no particular reason to stabilize its goals. Isn't that anthropomorphizing? We humans don't exhibit a lot of goal-directed behavior, but we do have a verbal concept of "goals", so the verbal phantom of "figuring out our true goals" sounds meaningful to us. But why would AIs behave the same way if they don't think verbally? It looks more likely to me that an AI that acts semi-haphazardly may well continue doing so even after amassing a lot of computing power. Or is there some more compelling argument that I'm missing?
Part of the problem, it appears to me, is that you're ascribing a verbal understanding to a mechanical process. Consider: for AIs to have values, those values must be 'stored' in a medium compatible with their calculations.
However, once an AI begins to 'improve' itself -- that is, once an AI has as an available "goal" the ability to form better goals -- then it's going to base its decisions about what counts as an improved goal on the goals and values it already has. This will cause it to 'stabilize' upon a specific set of higher-order values / goals.
Once the AI "decides" that becoming a better paperclip maker is something it values, it will also come to value making itself a better paperclip maker, and so on recursively: a positive feedback loop that anchors it to a specific position.
This can, quite easily, be expressed in mathematical / computational terms -- a rigorous treatment is beyond me, but a toy sketch follows.
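Here is a minimal sketch of the feedback loop, with everything (the activities, the "endorsement" scoring rule, the mutation step) invented purely for illustration -- not any real architecture. The agent's values are weights over activities; it proposes random revisions to its own values but accepts a revision only if its *current* values endorse the result, so mass flows toward whatever it already values most:

```python
import random

random.seed(0)

def endorsement(candidate, current):
    # How strongly the current values approve of the candidate values.
    return sum(current[k] * candidate[k] for k in current)

def normalised(weights):
    # The agent has finite effort, so its values always sum to 1.
    total = sum(weights.values())
    return {k: v / total for k, v in weights.items()}

# Haphazard starting values, with a slight lead for paperclips.
values = {"paperclips": 0.34, "staples": 0.33, "poetry": 0.33}

for step in range(2000):
    # Propose a random self-modification of one value...
    k = random.choice(list(values))
    candidate = dict(values)
    candidate[k] = max(0.0, candidate[k] + random.gauss(0, 0.02))
    candidate = normalised(candidate)
    # ...and adopt it only if the CURRENT values endorse it.
    if endorsement(candidate, values) > endorsement(values, values):
        values = candidate

print({k: round(v, 3) for k, v in values.items()})
# The slight initial lead has snowballed: the value system has locked
# onto (nearly) a single dominant goal and no longer drifts.
```

In this toy model a revision is endorsed exactly when it shifts weight toward activities the agent already weights above average, so the initially-favoured goal ratchets upward until nothing else is valued -- the "anchoring" described above.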
A different way of viewing it: once assigning value is itself an intentional act, the act of assigning value gets a value assigned to it. This recursion of goal-orientation then produces a kind of 'gravity' around whatever values exist at that moment.
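Put in symbols (my own notation, just to make the fixed-point structure visible): write $V_t$ for the value system at step $t$ and $U_{V}(V')$ for how much an agent holding values $V$ endorses adopting values $V'$. If each self-modification is chosen to maximise endorsement under the current values,

$$V_{t+1} = \operatorname*{arg\,max}_{V'} \; U_{V_t}(V'),$$

then a stabilized value system is a fixed point $V^{*} = \operatorname*{arg\,max}_{V'} U_{V^{*}}(V')$: once there, no further change to the values is endorsed by the values. That fixed point is the 'gravity'.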
EDIT: To those of you downvoting, would you care to explain what it is you disagree with?
I am skeptical of this claim. I'm not at all convinced that it's feasible to formalize "goal" or that if we could formalize it, the claim would be true in general. Software is awfully general, and I can easily imagine a system that has some sort of constraint on its self-modification, where that constraint can't be self-modified away. I can also imagine a system that doesn't have an explicit constraint on its evolution...
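A toy version of the first kind of system -- a constraint on self-modification that can't itself be self-modified away -- might look like the sketch below. All the names here are invented for illustration, and a real system would also have to enforce the constraint at a level the program can't reach (the substrate), which this model simply assumes:

```python
# Components the gatekeeper will never allow a patch to replace.
PROTECTED = {"gatekeeper"}

# The program is modelled as named components the system can rewrite.
program = {
    "gatekeeper": "reject patches that touch PROTECTED components",
    "planner":    "v1: choose actions greedily",
}

def gatekeeper(patch):
    # The one fixed rule: no patch may replace a protected component.
    return not (set(patch) & PROTECTED)

def self_modify(patch):
    # Every self-modification must pass through the gatekeeper.
    if gatekeeper(patch):
        program.update(patch)
        return "applied"
    return "rejected"

print(self_modify({"planner": "v2: plan two steps ahead"}))  # applied
print(self_modify({"gatekeeper": "accept everything"}))      # rejected
```

Within the model, no sequence of accepted patches can ever remove the constraint, even though the rest of the system is freely self-modifiable -- which is exactly the kind of architecture the general claim would have to rule out.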