I think this is the wrong way of looking at it. Because in this analogy, the PA is "genuinely trying their hardest to optimize for your values", it's just poor at understanding these values. That problem is basically ignorance, and so by making the PA smarter or more aware, we can solve the problem.
But an AGI that fully understood your values, would still not optimise for them if it had a bad goal. The AGI is not well-intentioned-but-weird-in-implementation; its intentions themselves are alien/weird to us.
I wish I had been clearer in my title that I'm not trying to reframe all misaligned AGIs, just a particular class of them. I agree that an AGI that fully understood your values would not optimize for them (and would not be "well-intentioned") if it had a bad goal.
That problem is basically ignorance, and so by making the PA smarter or more aware, we can solve the problem.
I think if we've correctly specified the values in an AGI, then I agree that when the AGI is smart enough it'll correctly optimize for our values. But it's not necessarily robust to scaling down, and I think it's likely to hit a weird place where it's trying and failing to optimize for our values. This post is about my intuitions for what that might look like.
I've curated this post for these reasons:
My biggest hesitation(s) with curating this post:
I was really excited that you wrote these posts, and learned a lot from them (plus the ensuing discussion in the comments).
I am somewhat hesitant to share simple intuition pumps about important topics, in case those intuition pumps are misleading.
This sounds wrong to me. Do you expect considering such things freely to be misleading on net? I expect some intuition pumps to be misleading, but for considering all of the intuitions that we can find about a situation to be better than avoiding them.
I feel like there are often big simplifications of complex ideas that just convey the wrong thing, and I was vaguely worried that in a field dominated by things that are hard to read, things that are easy to understand will dominate the conversation even if they're pretty misguided. It's not a big worry for me here, but it was the biggest hesitation I had.
Not sure what Ben meant, but my own take is "sharing is fine, but intuition pumps without rigor backing them are not something we should curate regularly as an exemplar of what LW is trying to be"
In Superintelligence, Nick Bostrom talks about various "AI superpowers". One of these is "Social manipulation", which he summarizes as:
Social and psychological modeling, manipulation, rhetoric persuasion
Strategic relevance:
- Leverage external resources by recruiting human support
- Enable a “boxed” AI to persuade its gatekeepers to let it out
- Persuade states and organizations to adopt some course of action
- AI can expropriate computational resources over the Internet
And Eliezer Yudkowsky writes:
There’s a popular concept of “intelligence” as book smarts, like calculus or chess, as opposed to say social skills. So people say that “it takes more than intelligence to succeed in human society”. But social skills reside in the brain, not the kidneys. When you think of intelligence, don’t think of a college professor, think of human beings; as opposed to chimpanzees. If you don’t have human intelligence, you’re not even in the game.
In order to have elite social skills, you need to be able to form accurate models about the thoughts & intentions of others. But being able to form accurate models about the thoughts & intentions of an overseer is exactly the ability we'd like to see in a corrigible AI.
If we can build AI systems that form those models without being goal-driven agents, maybe it's possible to have the benefits of elite social skills without the costs. I'm optimistic that this is the case--many of our most powerful model-building techniques don't really behave as though they have some kind of goal they are trying to achieve in the world.
I've honestly forgotten the exact original wording, but I like this one more than the thing I complained about. (The post is super short and sweet and I liked having the title be a clear handle to the idea - an "AGI reframing" is not as good a pointer as "a well-intentioned non-neurotypical super-powerful assistant".)
I think this is a clever new way of phrasing the problem.
When you said 'friend that is more powerful than you', that also made me think of a parenting relationship. We can look at whether this well-intentioned personification of AGI would be a good parent to a human child. They might be able to give the child a lot of attention, an expensive education, and a lot of material resources, but they might take unorthodox actions in the course of pursuing human goals.
I think when people imagine misaligned AGIs, they tend to imagine a superintelligent agent optimizing for something other than human values (e.g. paperclips, or a generic reward signal), and mentally picture them as adversarial or malevolent. I think this visualization isn't as applicable for AGIs trained to optimize for human approval, like act-based agents, and I'd like to present one that is.
If you've ever employed someone or had a personal assistant, you might know that the following things are consistent:
Suppose you were considering hiring a personal assistant, and you knew a few things about it:
You might consider hiring this assistant and trying really, really hard to communicate to it exactly what you want. It seems like a way better idea to just not hire this assistant. Actually, you’d probably want to run for the hills if you were forced to. Some specific failure modes you might envision:
As an illustration of the above, imagine giving an eager, brilliant, extremely non-neurotypical friend free rein to help you find a romantic partner (e.g. helping you write your OKCupid profile and setting you up on dates). As another illustration, imagine telling an entrepreneur friend that superintelligences can kill us all, and then watching him take drastic actions that clearly indicate he's missing important nuances, all while he misunderstands and dismisses concerns you raise to him. Now reimagine these scenarios with your friends drastically more powerful than you.
This is my picture of what happens by default if we construct a recursively self-improving superintelligence by having it learn from human approval. The superintelligence would not be malevolent the way a paperclip maximizer would be, but for all intents and purposes it might as well be.