I see a lot of energy and interest being devoted to detecting deception in AIs, making AIs less deceptive, making AIs honest, etc. But I keep trying to figure out why so many people think this is so important. For less-than-human intelligence, deceptive tactics will likely be caught by smarter humans (when a 5-year-old tries to lie to you, it's just sort of sad or even cute, not alarming). If an AI has greater-than-human intelligence, deception seems like just one avenue of goal-seeking, and not even a particularly lucrative or efficient one.
Take the now-overused humans-to-chimpanzee analogy. If humans want to bulldoze a jungle that has chimpanzees in it, they just bulldoze the forest and kill or sell any chimps that get in their way. They don't say, "Okay, we're going to take these sticks of dynamite and conceal them in these bundles of bananas, then we'll give the bananas to the chimps to earn their trust, and then, when the time is right, we'll detonate them." You just bulldoze the forest and kill the chimps. Anything else is needlessly convoluted.[1]
If an AI is smart enough to deceive humans and it wants access to the grid, I don't see why it wouldn't just hack into the grid. Or the internet. Or server farms. Or whatever it's trying to get.
What am I missing? What situation in the future would make detecting deception in models important?
[1] Ironically, deceptive tactics in this case would likely correlate with niceness. If you want to peacefully relocate the chimps without disturbing or scaring them, then you might use deception and manipulation. But only if you actually care about their well-being.
The ELK document outlines a definition of "honesty" and argues that if you achieve that level of honesty, you can scale it into a full alignment solution.
I agree that it doesn't necessarily matter much whether an AI lies to us versus simply taking our resources. But insofar as you are trying to use AI systems to assist you in supervising other systems, or in providing an enhanced training signal, honesty seems like one of the most central attributes that lets you get that kind of work out of the AI, and that also allows you to take countermeasures against things the AI is plotting.
If the AI always answers honestly to questions like "Are you planning to disempower me?", "What are your plans for disempowering me?", and "How could your plans to disempower me be thwarted?", then that sure makes it pretty hard for the AI to disempower you.
These questions are only relevant if the AI "plans" to disempower me. Disempowerment can still be a side effect, even something the humans at that point might endorse. Though one could conceivably work around that by asking: "Could your interaction with me lead to my disempowerment in some wider sense?"