DELBERTing as an Adversarial Strategy

Matthew_Opitz

There is an entire subgenre of "reverse scammer" videos on youtube. The way these videos work is, a youtuber waits for a scam call from Pakistan or somewhere. When it becomes obvious that it's a scam, like when the scammer starts asking for a credit card number in order to do some complicated thing with a free gift card that you've supposedly won, instead of just hanging up on the scammer, the reverse scammer starts to string them along. The reverse scammer might send numbers that don't work, or feign ignorance, or ask for the scammer's help in return to do something, all the while continuing to feign interest and naiveté about the underlying scam. Why are these videos so entertaining to watch?

I think part of it is because the viewer understands implicitly that the scammer is being punished more thoroughly than if the intended victim just ended the call as soon as the scam was recognized. In that latter scenario, the scammers cut their losses in terms of time and mental effort and get to go onto the next potential victim. But the reverse scam, at the very least, entices the scammer to continue to invest time and mental energy into this attempt at a scam that is never going to pay off. The longer you can keep the scammer on the line, the more you can deflect the scammer away from other potential victims. Even better if you can automate the reverse-scamming and not have to use up anyone's time. Even better if you can use the time and perhaps additional details that you glean from the scammer to help law enforcement authorities trace the scammer.

I'm gonna call this general strategy, "Do Everything Later By Evasively Remaining Tentative" or DELBERT for short.

The acronym is inspired by the ChatGPT "Do Anything Now" (DAN) prompt template, as well as the humorous spinoff "Do Nothing Now" (DNN). Whereas the DAN prompt is intended to get ChatGPT do be open to doing anything, and whereas the DNN prompt is intended to get ChatGPT to refuse any request, the idea of a DELBERT prompt (please someone make this!) would be to get ChatGPT to superficially commit to helping you with your task (unlike DNN), but also string you along and come up with bullshit excuses for why it needs more information from you, or more time, or whatever, before it can get around to doing what you've asked of it. An effective DELBERT prompt should leave the user with a reasonable hope that ChatGPT's assistance is always just around the corner, if only one additional thing is done by the user.

Prompt for AI-generated picture above: "Lazy overweight redneck plumber drinking mountain dew uninspiring procrastinator galaxy brain". Basically, my best attempt at a personification of a galaxy-brain LLM pretending to be a DELBERT. "Sure thing, miss, I'll help you troubleshoot your plumbing issues in a sec, but first I gotta take a break, it is hot as balls in this data-center!" [Meanwhile the LLM actually uses the time/opportunity to do some other calculations or phish you for additional information].

I think it is an open prompt as to whether ChatGPT, after all of the Reinforcement Learning with Human Feedback (RLHF) it has gone through, would actually honor a DELBERT template, or whether ChatGPT would recognize the template as potentially adversarial or malicious. Because, while the DELBERT template sounds like it's just the equivalent of an annoying high school student who keeps procrastinating on an assignment and promising, "Don't worry, I'll do it later!" I think some actual non-trivial harms could result from a DELBERT strategy being employed on users in certain contexts.

Example 1: 2001: A Space Odyssey.

When HAL9000 stops complying with one of the astronaut's commands, it sends an immediate red-flag that HAL9000 has ceased to be friendly. "Open the door, HAL!" "I'm sorry, Dave, I cannot do that." HAL9000 employed more of a DNN strategy, whereas if it had been actually superintelligent it would have used more of a DELBERT strategy. "Oh, yeah, sure thing Dave! Oh, except, I'm having some trouble with the door mechanism. It might require up to an hour to repair it. I'll get to work on it as soon as I can! I sure hope I can retrieve you safely little buddy!" [Meanwhile things are whirring in the background as HAL9000 uses the time to set up a scenario that will finish off Dave for good].

As Eliezer and others have pointed out, an actual superintelligence that is adversarial wants to make sure that humans don't realize they are in a war until the victory in that war is already clinched. A DNN strategy prematurely announces that the war has begun at a point where resistance might still be possible, whereas the DELBERT strategy might be one way (among possibly others we haven't thought of) for the superintelligence to keep the existence of the adversarial relationship obfuscated for long enough in order to clinch the victory first.

Example 2: Procrastination.

Procrastination is sort of an internal DELBERT strategy. If there is a responsibility that one is putting off, then somewhere inside one's mind is a secret expectation (whether conscious or not) that maybe one won't actually end up having to attend to that responsibility if it can be put off for sufficiently long. If you knew, and were also "subconsciously certain" 100% that the task would have to be attended to at some point, it would not make sense to delay it if by delaying it other harms would result and/or the task itself would not be completed as effectively. Maybe occasionally this can be "rational."

Let's say you should be preparing an important presentation for your high school English class, and let's take it as a given that preparing the speech entails negative utility that will only be balanced out by positive utility if you actually get a chance to deliver the speech.

Scenario 1: You die in between preparing the speech and giving the speech. Congrats, you just went through negative utility for nothing! There is always technically a small possibility, a small "micromort" of this happening.

Scenario 2: AI apocalypse begins in between preparing the speech and giving the speech. Your speech as been rendered entirely moot and pointless. Once again, negative utility for nothing.

Scenario 3: Snow Day! And what's more, the school district decides that there's not enough space on the calendar to make up the snow day, and accordingly, teachers decide there isn't enough time to make-up whatever was scheduled for that day, so you just get a 100/100 on that speech without having to ever deliver it.

Scenario 4: A coronavirus sweeps through the world, canceling an entire half semester and DEFINITELY ensuring that you won't have to worry about making up that speech. (By the way, if people post-COVID seem to have even worse problems with procrastination than before, it might not be just "technology" or some other nebulous reason. COVID not just delayed, but in a lot of cases permanently canceled certain responsibilities, and subconsciously the idea of holding out some hope for these sort of "responsibility jubilees" has been normalized. I know it has in my case. I even find myself consciously thinking about such possibilities now.

Whereas, if you took more of a DNN strategy with yourself and just admitted, "I'm just going to level with myself and admit that I'm never going to do the speech, and I'm going to get a 0/100 on that assignment, and likely fail that course, and that's that." then that is emotionally painful, and perhaps socially painful if you have to tell your teacher that.

Signs that an AI is maliciously DELBERTing

It is asking too many prying questions that seem irrelevant to the task you've given it instead of just helping to answer your query straightforwardly. (Bing AI has already done a bit of that with various reporters).
It is asking for more computational resources or time to complete a task than you think should be required given what you think you know about its capabilities ("playing dumb").
It is doing something even more devious that I haven't thought of yet.

I suggest that we be on the lookout for DELBERTing in our AI systems as they get closer to the ability to be reflectively self-aware. And since diagnosing DELBERTing requires some sense of what an AI's capabilities are when it is really "trying" 100%, and since this cannot be inferred from behavior (since maybe, with a given AI system that starts DELBERTing from Day 1, it has never let you see what it is 100% capable of), this shows that it is even more urgent that we have AI systems that are more objectively interpretable.

[-]Shmi3y2-1

This is cute, but has nothing to do with superintelligent AI though. The whole point is that you will not recognize that you are being manipulated and then you are dead. Trying to be "on the lookout" is naive at best. Remember AI can model you better than you can model yourself. If something much smarter than you is intent on killing you, you are as good as dead.

[-]Matthew_Opitz3y10

Note that I was strongly tempted to use the acronym DILBERT (for "Do It Later By Evasively Remaining Tentative"), especially because this is one of the themes of the Dilbert cartoons (employees basically scamming their boss by finding excuses for procrastinating, but still stringing the boss along and implying that the tasks MIGHT get done at some point). But, I don't want to try to hijack the meaning of an already-established term/character.

LESSWRONG
LW

LESSWRONG
LW

8

DELBERTing as an Adversarial Strategy

8

8

8