Why only replace it if it causes more of what you currently care about? Why not just replace it if it causes you to have more of what you will care about?
Because I care about what I care about, and I don't care about what I don't care about.
Sure, this is loyalty in a sense... not loyalty to the sources of my utility function -- heck, I might not even know what those are -- but to the function itself. (It seems a little odd to talk about being loyal to my own preferences, but not intolerably odd.)
The fact that something I don't care about might be something I care about in the future is, admittedly, relevant. If I knew that a year from now my utility function would change such that I started really valuing people knowing Portuguese, I might start devoting some time and effort now to encouraging people to learn Portuguese (perhaps starting by learning it myself), in anticipation of appreciating having done so in a year. It wouldn't be a strong impulse, but it would be present.
But that depends a lot on my confidence in that actually happening.
If I knew instead that I could press a button in a year and start really valuing people learning Portuguese, I probably wouldn't devote resources to encouraging people to learn it, because I'd expect that I'd never press the button. Why should I? It gets me nothing I want.
In the scenario you are considering, I know I can press a button and start really valuing anything I choose. Or start valuing random things, for that matter, without having to choose them. Agreed.
But so what? Why should I press a button that makes me care about things that I don't consider worth caring about?
"But you would consider them worth caring about if you pressed the button!" Well, yes, that's true. I would speak French if I lived in France for the next few years, but the truth of that doesn't help me understand French sentences. I would want X if I edited my utility function to value X highly, but the truth of that doesn't help me want X. There's an important difference between actuals and hypotheticals.
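To make the actual-versus-hypothetical point concrete, here is a minimal toy sketch. Every name and number in it is an assumption made up for illustration (`current_utility`, `modified_utility`, the outcome values); it is not a model of any real agent. The point is only that the decision about pressing the button gets scored by the utility function doing the choosing, not by the one the button would install.

```python
# Hypothetical toy model: all names and numbers are illustrative assumptions.

def current_utility(world):
    # What I care about now (a stand-in quantity unrelated to Portuguese).
    return world["things_i_value_now"]

def modified_utility(world):
    # What I would care about after pressing the button.
    return world["portuguese_speakers"]

def outcome(press_button):
    # Assumed consequences: after pressing, my effort goes into promoting
    # Portuguese instead of into what I currently value.
    if press_button:
        return {"things_i_value_now": 2, "portuguese_speakers": 10}
    return {"things_i_value_now": 10, "portuguese_speakers": 2}

def choose(utility):
    # Pick the action whose outcome scores highest under the given utility.
    return max([False, True], key=lambda press: utility(outcome(press)))

# The actual me decides with the utility function I actually have.
decision_now = choose(current_utility)

# The hypothetical post-button me would endorse pressing, but that agent
# is not the one making the decision.
decision_hypothetical = choose(modified_utility)
```

Under these assumed numbers, `decision_now` is to leave the button alone, while `decision_hypothetical` is to press it. Both facts hold at once, and only the first one drives what I do.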
I realize I was making the assumption that the entity choosing which values to have would value 'maximally' satisfying those values in some sense, so that if it could freely choose it would choose values that were easy or best to satisfy. But this isn't necessarily so. It's humans that have lots of values about their values, and we would have a tough time, I think, choosing our values if we could choose. Perhaps there is a dynamic tension between our values (we want our values to have value, and we are constantly asking ourselves what our goals should be and...
Link: physicsandcake.wordpress.com/2011/01/22/pavlovs-ai-what-did-it-mean/
Suzanne Gildert basically argues that any AGI capable of considerable self-improvement would simply alter its reward function directly. I'm not sure how she arrives at the conclusion that such an AGI would likely switch itself off. Even if an abstract general intelligence would tend to alter its reward function, wouldn't it do so indefinitely rather than switching itself off?
If it wants to maximize its reward by increasing a numerical value, why wouldn't it consume the universe doing so? Maybe she had something in mind along the lines of an argument by Katja Grace:
Link: meteuphoric.wordpress.com/2010/02/06/cheap-goals-not-explosive/
I am not sure if that argument would apply here. I suppose the AI might hit diminishing returns but could again alter its reward function to prevent that, though what would be the incentive for doing so?
ETA:
I left a comment over there:
ETA #2:
What else I wrote: