I see a lot of energy and interest being devoted to detecting deception in AIs, trying to make AIs less deceptive, making AIs honest, etc. But I keep trying to figure out why so many people think this is very important. For less-than-human intelligence, deceptive tactics will likely be caught by smarter humans (when a 5-year-old tries to lie to you, it's just sort of sad or even cute, not alarming). If an AI has greater-than-human intelligence, deception seems to be just one avenue of goal-seeking, and not even a very lucrative or efficient one.

Take the now-overused humans-to-chimpanzees analogy. If humans want to bulldoze a jungle that has chimpanzees in it, they will just bulldoze the forest, and kill or sell any chimps that get in their way. They don't say "okay, we're going to take these sticks of dynamite, and conceal them in these bundles of bananas, then we'll give the bananas to the chimps to earn their trust, and then, when the time is right, we'll detonate them." You just bulldoze the forest and kill the chimps. Anything else is just needlessly convoluted.[1]

If an AI is smart enough to deceive humans, and it wants to gain access to the grid, I don't see why it wouldn't just hack into the grid. Or the internet. Or server farms. Or whatever it's trying to get.

What am I missing? What situation in the future would make detecting deception in models important?

  1. ^

    Ironically, deceptive tactics in this case would likely correlate with niceness. If you want to peacefully relocate the chimps without disturbing or scaring them, then you might use deception and manipulation. But only if you actually care about their wellbeing.


3 Answers

habryka


The ELK document outlines a definition of "honesty" and argues that if you achieve that level of honesty, you can scale that to a whole alignment solution. 

I agree that it doesn't necessarily matter that much whether an AI lies to us versus just taking our resources. But inasmuch as you are trying to use AI systems to assist you in supervising other systems, or in providing an enhanced training signal, honesty seems like one of the most central attributes that helps you get that kind of work out of the AI, and that also allows you to take countermeasures against things the AI is plotting.

If the AI always answers honestly to the questions "are you planning to disempower me?", "what are your plans for disempowering me?", and "how would you thwart your plans to disempower me?", then that sure makes it pretty hard for the AI to disempower you.

"are you planning to disempower me?" and "what are your plans for disempowering me?" and "how would you thwart your plans to disempower me?"

These questions are only relevant if the AI "plans" to disempower me. Disempowerment could still be a side effect, perhaps even something the humans at that point might endorse. Though one could conceivably work around that by asking: "Could your interaction with me lead to my disempowerment in some wider sense?"

Carl Feynman


You’re missing the steps whereby the AI gets to a position of power.  AI presumably goes from a position of no power and moderate intelligence (where it is now) to a position of great power and superhuman intelligence, whereupon it can do what it wants.  But we’re not going to deliberately allow such a position unless we can trust it. We can’t let it get superintelligent or powerful-in-the-world unless we can prove it will use its powers wisely.  Part of that is being non-deceptive.  Indeed, if we can build it provably non-deceptive, we can simply ask it what it intends to do before we release it.  So deception is a lot of what we worry about.

Of course we work on other problems too, like making it helpful and obedient.  

a position of no power and moderate intelligence (where it is now)

Most people are quite happy to give current AIs relatively unrestricted access to sensitive data, APIs, and other powerful levers for effecting far-reaching change in the world. So far, this has actually worked out totally fine! But that's mostly because the AIs aren't (yet) smart enough to make effective use of those levers (for good or ill), let alone be deceptive about it.

To the degree that people don't trust AIs with access to even more powerful levers, it's usually because they fear the AI getting tricked by adversarial humans into misusing those levers (e.g. through prompt injection), not fear that the AI itself will be deliberately tricky.

But we’re not going to deliberately allow such a position unless we can trust it.

One can hope, sure. But what I actually expect is that people will generally give AIs more power and trust as they get more capable, not less.

quetzal_rainbow


For the AI to hack into the grid, it matters that humans don't notice and shut it down. Disguising the hacking as normal, non-suspicious activity is deception.

Why would it matter if they notice or not? What are they gonna do? EMP the whole world?

quetzal_rainbow
I think they shut down the computer on which the unaligned AI is running? Downloading yourself onto the internet is not a one-second process.
quila
It's only bottlenecked on connection speeds, which are likely to be fast at the server where this AI would be, if it were developed by a large lab. So in my view 1-5 seconds is feasible for 'escapes the datacenter as first step' (by which point the process is distributed and hard to stop without centralized control). ('Distributed across most internet-connected servers' would take longer, of course.)

Downloading yourself onto the internet is not a one-second process

which are likely to be fast at the server where this AI would be, if it were developed by a large lab.

Yes, no one is developing cutting-edge AIs like GPT-5 off your local dinky Ethernet, and your crummy home cable modem choked by your ISP is highly misleading if that's what you think of as 'Internet'. The real Internet is way faster, particularly in the cloud. Stuff in the datacenter can do things like access another server's RAM using RDMA in a fraction of a millisecond, vastly faster than your PC can even talk to its hard drive. This is because datacenter networking is serious business: it's always high-end Ethernet or better yet, Infiniband. And because interconnect is one of the most binding constraints on scaling GPU clusters, any serious GPU cluster is using the best Infiniband they can get.

WP cites the latest Infiniband at 100-1200 gigabits/second or 12-150 gigabyte/s for point to point; with Chinchilla scaling yielding models on the order of 100 gigabytes, compression & quantization cutting model size by a factor, and the ability to transmit from multiple storage devices and also send only shards to indiv...
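
To make that arithmetic concrete, here is a rough back-of-the-envelope sketch in Python using the figures quoted above (~100 GB of weights, 12-150 gigabyte/s point-to-point Infiniband). The 4x compression/quantization factor and the home-broadband comparison line are illustrative assumptions, not figures from the comment:

```python
# Back-of-the-envelope: how long would it take to copy model weights
# out of a datacenter at the bandwidths quoted above?
# The 4x compression/quantization factor and the home-broadband row
# are illustrative assumptions.

MODEL_SIZE_GB = 100        # rough Chinchilla-scale model size (figure from the comment)
COMPRESSION_FACTOR = 4     # assumed combined effect of compression + quantization

LINK_SPEEDS_GB_PER_S = {   # point-to-point bandwidth, gigabytes/second
    "Infiniband, low end": 12,
    "Infiniband, high end": 150,
    "Home broadband (~1 Gbit/s)": 0.125,
}

effective_size_gb = MODEL_SIZE_GB / COMPRESSION_FACTOR
for link, gb_per_s in LINK_SPEEDS_GB_PER_S.items():
    seconds = effective_size_gb / gb_per_s
    print(f"{link}: {seconds:6.1f} s to move {effective_size_gb:.0f} GB")
```

Even at the low end of that Infiniband range, the compressed weights move in roughly two seconds, while the same transfer over ordinary home broadband takes minutes; that gap is roughly what the 'escapes the datacenter in seconds' estimate above rests on.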

Imo, provably boxing a blackbox program so it can't escape through digital-logic-based routes (which are easier to prove things about) is feasible; deception is relevant to the much harder route to provably wall off: human psychology.

2 comments

I agree, but I think your example of bulldozing the jungle is not the best one, as it evokes a picture of the humans already being in sufficient control to start bulldozing. I think you need an analogy where the humans are initially held in a prison by chimpanzees (or small children or something). Getting out of prison by lying about how harmless you are is one plausible avenue.

Here is a more realistic example of how it might go. It's actually a real story I heard.

Two children locked their father in a room by closing the door, locking it with the key, and taking the key, and then made fun of him while he was stuck inside, confident that he wouldn't get out (the room being on the third floor). They were thoroughly surprised when, a minute later, he appeared behind them, having opened a window, found a way down on the outside (I don't know how, maybe over the balcony to a neighbor or down a pipe), and come back in through the main entrance with his key.

No need to deceive here either.

For less-than-human intelligence, deceptive tactics will likely be caught by smarter humans (when a 5-year-old tries to lie to you, it's just sort of sad or even cute, not alarming). If an AI has greater-than-human intelligence, deception seems to be just one avenue of goal-seeking, and not even a very lucrative or efficient one.

It seems likely to me that there will be a regime where we have transformatively useful AI which has an ability profile that isn't wildly different than that of humans in important domains. Improving the situation in this regime, without necessarily directly solving problems due to wildly superhuman AI, seems pretty worthwhile. We could potentially use these transformatively useful AIs for a wide variety of tasks which could make the situation much better.

Merely human-ish level AIs which run fast, are cheap, and are "only" as smart as quite good human scientists/engineers could be used to radically improve the situation if we could safely and effectively utilize these systems. (Being able to safely and effectively utilize these systems doesn't seem at all guaranteed in my view, so it seems worth working on.)