Could there not be AI value drift in our favor, from a paperclipper AI to a moral realist AI?
Both quotes are from your post above. Apologies for the confusion.
“A sufficiently intelligent agent will try to prevent its goals[1] from changing, at least if it is consequentialist.”
It seems that in humans, smarter people are more able and likely to change their goals. A smart person may change his/her views about how the universe can best be arranged upon reading Nick Bostrom’s book Deep Utopia, for example.
“I think humans are stable, multi-objective systems, at least in the short term. Our goals and beliefs change, but we preserve our important values over most of those changes. Even when gaining or losing religion, most people seem to maintain their goal of helping other people (if they have such a goal); they just change their beliefs about how to best do that.”
A human may change from wanting to help people to not wanting to help people after getting 5 hours of sleep instead of 8.
How do humans, for example, read a philosophy book and update their views about what they value about the world?
“Similarly, it's possible for LDT agents to acquiesce to your threats if you're stupid enough to carry them out even though they won't work. In particular, the AI will do this if nothing else the AI could ever plausibly meet would thereby be incentivized to lobotomize themselves and cover the traces in order to exploit the AI.
But in real life, other trading partners would lobotomize themselves and hide the traces if it lets them take a bunch of the AI's lunch money. And so in real life, the LDT agent does not give you any lunch money, for all that you claim to be insensitive to the fact that your threats don't work.”
Can someone please explain why trading partners would lobotomize themselves?
How does inner misalignment lead to paperclips? I understand the comparison of paperclips to ice cream, and that once some threshold of intelligence is reached, new possibilities can be created that satisfy desires better than anything in the training distribution. But humans want to eat ice cream, not fill the galaxies with it. So why would the AI fill the galaxies with paperclips instead of creating them and “consuming” them? Please correct any misunderstandings of mine.
And a subset might value drift towards optimizing the internal experiences of all conscious minds?
If an AGI achieves consciousness, why would its values not drift towards optimizing its own internal experience, and away from tiling the lightcone with something?
How can utility be a function of worlds if an agent doesn’t have access to the state of the world, only to its sense data?
What about from a quantum immortality perspective?