I think when people imagine misaligned AGIs, they tend to imagine a superintelligent agent optimizing for something other than human values (e.g. paperclips, or a generic reward signal), and mentally picture them as adversarial or malevolent. I think this visualization isn't as applicable for AGIs trained to optimize for human approval, like act-based agents, and I'd like to present one that is.
If you've ever employed someone or had a personal assistant, you might know that the following two things can both be true at once:
- The employee or assistant is genuinely trying their hardest to optimize for your values. They're trying to understand what you want as much as they can, asking you for help when things are unclear, not taking action until they feel like their understanding is adequate, etc.
- They follow your instructions literally, under an interpretation that seems sensible to them but is completely different from your own, and screw up the task entirely.
Suppose you were considering hiring a personal assistant, and you knew a few things about it:
- Your assistant was raised in a culture completely different from your own.
- Your assistant is extremely non-neurotypical. It doesn't have an innate sense of pain or empathy or love, it's a savant at abstract reasoning, and it learned everything it knows about the world (including human values) from Wikipedia.
- Your assistant is in a position where it has access to enormous amounts of resources, and could easily fool you or overpower you if it decided to.
You might consider hiring this assistant and trying really, really hard to communicate to it exactly what you want. But it seems like a way better idea to just not hire this assistant. Actually, if you were forced to hire it, you'd probably want to run for the hills. Some specific failure modes you might envision:
- Your assistant's understanding of your values will be weird and off, perhaps in ways that are hard to communicate or even pin down.
- Your assistant might reason in a way that looks convoluted and obviously wrong to you, while looking natural and obviously correct to it, leading it to happily take actions you'd consider catastrophic.
As an illustration of the above, imagine giving an eager, brilliant, extremely non-neurotypical friend free rein to help you find a romantic partner (e.g. helping you write your OKCupid profile and setting you up on dates). As another illustration, imagine telling an entrepreneur friend that superintelligences can kill us all, and then watching him take drastic actions that clearly indicate he's missing important nuances, all while he misunderstands and dismisses concerns you raise to him. Now reimagine these scenarios with your friends drastically more powerful than you.
This is my picture of what happens by default if we construct a recursively self-improving superintelligence by having it learn from human approval. The superintelligence would not be malevolent the way a paperclip maximizer would be, but for all intents and purposes it might as well be.
In *Superintelligence*, Nick Bostrom talks about various "AI superpowers". One of these is "social manipulation", which he summarizes as:
And Eliezer Yudkowsky writes:
In order to have elite social skills, you need to be able to form accurate models about the thoughts and intentions of others. But being able to form accurate models about the thoughts and intentions of an overseer is exactly the ability we'd like to see in a corrigible AI.
If we can build AI systems that form those models without being goal-driven agents, maybe it's possible to have the benefits of elite social skills without the costs. I'm optimistic that this is the case: many of our most powerful model-building techniques don't really behave as though they have some kind of goal they are trying to achieve in the world.