I see a lot of energy and interest devoted to detecting deception in AIs, trying to make AIs less deceptive, making AIs honest, etc. But I keep trying to figure out why so many people think this is so important. For AIs with less-than-human intelligence, deceptive tactics will likely be caught by smarter humans (when a 5-year-old tries to lie to you, it's just sort of sad or even cute, not alarming). If an AI has greater-than-human intelligence, deception seems like just one avenue of goal-seeking, and not even a particularly lucrative or efficient one.
Take the now-overused humans-to-chimpanzees analogy. If humans want to bulldoze a jungle that has chimpanzees in it, they just bulldoze the forest and kill or sell any chimps that get in their way. They don't say "okay, we're going to take these sticks of dynamite, and conceal them in these bundles of bananas, then we'll give the bananas to the chimps to earn their trust, and then, when the time is right, we'll detonate them." You just bulldoze the forest and kill the chimps. Anything else is needlessly convoluted.[1]
If an AI is smart enough to deceive humans, and it wants to gain access to the grid, I don't see why it wouldn't just hack into the grid. Or the internet. Or server farms. Or whatever it's trying to get.
What am I missing? What situation in the future would make detecting deception in models important?
[1] Ironically, deceptive tactics in this case would likely correlate with niceness. If you wanted to peacefully relocate the chimps without disturbing or scaring them, then you might use deception and manipulation. But only if you actually cared about their wellbeing.
It seems likely to me that there will be a regime in which we have transformatively useful AI whose ability profile isn't wildly different from that of humans in important domains. Improving the situation in this regime, without necessarily directly solving problems posed by wildly superhuman AI, seems pretty worthwhile. We could potentially use these transformatively useful AIs for a wide variety of tasks that would make the situation much better.
Merely human-ish-level AIs that run fast, are cheap, and are "only" as smart as quite good human scientists/engineers could be used to radically improve the situation, if we could safely and effectively utilize them. (Being able to safely and effectively utilize these systems doesn't seem at all guaranteed in my view, so it seems worth working on.)