Let's start with the template for an AGI, the seed for a generally intelligent expected-utility maximizer capable of recursive self-improvement.
As far as I can tell, the implementation of such a template would do nothing at all because its utility-function would be a "blank slate".
What happens if you now enclose the computation of Pi in its utility-function? Would it reflect on this goal and try to figure out its true goals? Why would it do so, and where would the incentive come from?
Would complex but implicit goals change its behavior? Why would it improve upon its goals, and why would it even try to preserve them in their current form, if it has no explicit incentive to do so? It seems that if it does have an incentive to make its goals explicit, given an implicit utility-function, then that incentive must be a presupposition inherent to the definition of a generally intelligent expected-utility maximizer capable of recursive self-improvement.
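To make the "blank slate" claim in the quote above concrete, here is a minimal Python sketch. It assumes one particular reading of "blank slate", namely a utility function that returns the same value for every outcome. The action names, outcome names, and probabilities are all hypothetical and chosen only for illustration; nothing here is taken from the quoted comment.

```python
def expected_utility(action, outcomes, utility):
    # outcomes[action] maps each possible outcome to its probability.
    return sum(p * utility(o) for o, p in outcomes[action].items())

def best_actions(outcomes, utility):
    # Return every action that ties for the highest expected utility.
    scores = {a: expected_utility(a, outcomes, utility) for a in outcomes}
    top = max(scores.values())
    return sorted(a for a, s in scores.items() if s == top)

# Hypothetical toy world with two actions.
outcomes = {
    "idle": {"nothing_happens": 1.0},
    "compute": {"more_digits_of_pi": 0.9, "nothing_happens": 0.1},
}

def blank_utility(outcome):
    # Assumption: a "blank slate" is modelled as a constant function.
    return 0.0

def pi_utility(outcome):
    # Assumption: the only valued outcome is having more digits of Pi.
    return 1.0 if outcome == "more_digits_of_pi" else 0.0

blank_choice = best_actions(outcomes, blank_utility)
pi_choice = best_actions(outcomes, pi_utility)
```

Under the constant utility every action ties, so the maximizer has no reason to prefer acting over idling. Once the Pi term is enclosed, `compute` wins. Note that neither version contains any term that rewards reflecting on the goal itself, which is exactly what the quote is asking about.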
So: the general story is that to be able to optimise, agents have to build a model of the world - in order to predict the consequences of their possible actions. That model of the world will necessarily include a model of the agent - since it is an important part of its own local environment. That model of itself is likely to include its own goals - and it will use Occam's razor to build a neat model of them. Thus goal reflection - Q.E.D.
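The argument in the quote above can be sketched as a toy program: an agent looks at a record of its own past choices and picks the simplest candidate goal that explains all of them. This is only an illustration under strong assumptions. The history, the candidate hypotheses, and the integer "complexity" scores standing in for Occam's razor are all hypothetical, and the sketch says nothing about whether a real system would build such a self-model.

```python
# Each record is (option_a, option_b, chosen): a past choice made by the agent.
history = [(3, 7, 7), (10, 3, 10), (5, 6, 6), (1, 0, 1)]

# Candidate goal models as (description, complexity, predictor).
# The complexity scores are made up; lower means simpler.
hypotheses = [
    ("prefer larger number", 1, lambda a, b: max(a, b)),
    ("prefer smaller number", 1, lambda a, b: min(a, b)),
    ("prefer larger unless both are even, then smaller", 3,
     lambda a, b: min(a, b) if a % 2 == 0 and b % 2 == 0 else max(a, b)),
]

def explains(predict, history):
    # A hypothesis explains the history if it reproduces every past choice.
    return all(predict(a, b) == chosen for a, b, chosen in history)

def infer_goal(history, hypotheses):
    # Keep the hypotheses consistent with the history, then take the simplest.
    consistent = [(cost, name) for name, cost, predict in hypotheses
                  if explains(predict, history)]
    if not consistent:
        return None  # no candidate goal fits the observed behavior
    return min(consistent)[1]

inferred = infer_goal(history, hypotheses)

# A history that contradicts itself admits no consistent goal model here.
haphazard = infer_goal([(1, 2, 1), (1, 2, 2)], hypotheses)
```

Two hypotheses fit the first history, and the simpler one is selected. The second call returns `None`, because no candidate in this small hypothesis set explains an agent that chooses differently in identical situations.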
I have stopped understanding why these quotes are correct. Help!
More specifically, if you design an AI using "shallow insights" without an explicit goal-directed architecture - some program that "just happens" to make intelligent decisions that can be viewed by us as fulfilling certain goals - then it has no particular reason to stabilize its goals. Isn't expecting it to do so a case of anthropomorphizing? We humans don't exhibit a lot of goal-directed behavior, but we do have a verbal concept of "goals", so the verbal phantom of "figuring out our true goals" sounds meaningful to us. But why would AIs behave the same way if they don't think verbally? It looks more likely to me that an AI that acts semi-haphazardly may well continue doing so even after amassing a lot of computing power. Or is there some more compelling argument that I'm missing?