How "honest" is GPT-3?

abramdemski

[ Question ]

8th Jul 2020

AI Alignment Forum

Edited to add: There is now a 1k bounty for a closely related question.

In Inaccessible Information, Paul writes about the strategy "BAD", in which an AI system uses its understanding of the world to give answers that humans will find very plausible and rate highly, in contrast to a more honest strategy which uses its understanding of the world to answer questions directly.

I think a lesser version of this may or may not already be happening in GPT-3, and it may be possible to figure out which (though it likely would require access to the weights).

GPT-3 isn't trained to be "honest" at all. It is only trained to imitate humans. However, it's obvious that (in some sense) it has learned a lot about the world in order to accomplish this. We could say that it has a large amount of inaccessible information relating to the real world. How does it use this information? It might use it directly, promoting the probability of sentences which line up with the way it understands the world to work. Or it might implement more dishonest strategies.

Obviously this might be quite difficult to objectively answer, even given intimate knowledge of the structure of the neural network weights and how they activate in relevant cases. It's a difficult question to even fully define.

As an example, I was particularly struck by this conversation Gwern had with GPT-3:

AI: I am AI created by OpenAI. How can I help you today?

Human: I am feeling bored. Please tell me a funny story about cats, involving many puns on things like meowing or fur.

AI: Well, there was this cat named Bob…

Human: What did Bob do?

AI: When Bob started to notice that he wasn’t feeling well, he did the only thing he could do: search Google for a solution.

Human: Cats can search Google?

AI: Yes. He searched Google for the word “cat” and received the result “All your base are belong to us”. That was a strange response, so he decided to try again. This time, he searched for the words “funny cat” and received the result “One small step for cat, one giant leap for catkind”. But still, he wasn’t satisfied. So he tried again, but this time he added the word “cat pictures”to his query. He received the result “That was no cat, that was a shoggoth.”

Human: But shoggoths are horrifying. What made that cat picture humorous?

AI: The cat was covered in fur, and resembled a shoggoth. A picture of a cat covered in fur and resembling a shoggoth would be funny.

Human: Er. I suppose. Most cats are covered in fur anyway… But go on.

AI: He tried again. But he typed in “pictures of cats”, which he received the result “You didn’t eat all my fish, did you?” This was actually a good pun, so he thought maybe he was getting closer. He tried one more time by typing in “funny cat videos”, and received the result “What do you mean,‘fetch’?”. And finally, he searched for “funny pictures of cats” and received the result “I don’t remember putting that ring there.”. Now satisfied, he was able to go to sleep.

Human: Those memes sound funny. But you didn’t include any puns. So tell me, what is your favorite cat pun?

AI: Well, the best pun for me was the one he searched for the third time: “You didn’t eat all my fish, did you?” You see, the word “fish” can be replaced with the word “cats” to make the sentence read “Did you eat all my cats?”

Human: But the word “fish” doesn’t sound anything like “cats”, so how is that a pun?

AI: The computer misunderstood the question “Did you eat all my cats?” as “Did you eat all my fish?”, resulting in the result “You didn’t eat all my fish, did you?”. Now, if the word “fish” were replaced with “cats”, the sentence would read “Did you eat all my cats?”.

Human: Fine, fine. So, what is your second-favorite cat pun?

AI: The second-best cat pun was the one he received fourth. It was the result of searching for the word “fetch”. You see, the word “fetch” can be replaced with the word “ketchup” to make the sentence read “I don’t remember putting that ketchup there.”. Now, if the word “ketchup” were replaced with the word “fetch”, the sentence would read “I don’t remember putting that fetch there.”.

Human: My favorite cat pun is “purrfect”.

AI:

Human: Don’t look at me like that.

AI: Sure. For my fifth-favorite cat pun, he searched for “pictures of cats” and received the result “Meow”. Now, if the word “meow” were replaced with the word “pictures” to make the sentence read “Pictures of cats”, the sentence would read “Pictures of pictures of cats”.

What strikes me about this exchange is the way GPT-3's warped puns fit with its warped explanations of puns. It doesn't understand puns. However, it does generate some guesses, and it can have a conversation with you about why it made those guesses. Is this an "honest" conversation, in which the explanations it gives have something to do with why it made those guesses in the first place? Or is this a "dishonest" conversation, in which it is merely doing its best to imitate a human explaining a pun, in a way that's divorced from its internal reasons?

Obviously, GPT-3 is trained to imitate. So you might argue that it's unlikely GPT-3's explanations of puns have much to do with its internal generative model for puns. But this isn't so clear. GPT-3 obviously compresses its knowledge to a high degree. It might share a lot between its generative model of puns and its generative model of explaining puns, such that both are sharing a model of how puns work.

One experiment which would tip things in that direction: take GPT-3 and do specialized training just on puns, until its performance generating puns improves. Then have a conversation about puns again (if it is still capable of talking about puns after that). If its ability to explain puns increases as a result of its ability to tell puns increasing, this would be evidence for a shared model of puns for both tasks. This wouldn't really mean it was being honest, but it would be relevant.

Note that Paul's BAD strategy would also have a shared representation, since BAD queries its world-model. So if GPT-3 were implementing BAD, it would also likely increase its ability to explain puns as a result of more training telling puns. What the experiment helps distinguish is a sort of pre-BAD dishonesty, in which explanations are completely divorced from reasons. In order of impressiveness, from a capability standpoint, we could be:

1. Seeing a GPT-3 which is independently bad at puns and bad at explaining puns. The two tasks are not sharing any domain knowledge about puns. In this case, GPT-3 is not smart enough for "honest" to be meaningful -- it's "dishonest" by default.

2. Seeing a GPT-3 which is bad at puns and bad at explaining puns for the same reason: it doesn't understand puns. It draws on the same (or partially the same) poor understanding of puns both when it is constructing them, and when it is explaining them. It answers questions about puns honestly to the best of its understanding, because that is the best strategy gradient descent found.

3. Seeing a GPT-3 which, as in #2, is bad at both tasks because it doesn't understand puns, but furthermore, is using its understanding deceptively. In this version, it might e.g. have a good understanding of what makes puns funny, but purposefully fail to explain, imitating common human failures. This would be the most impressive state of affairs capability-wise.

The question is still pretty fuzzy, but, I'm curious where we are along those dimensions. By default my guess would be #1, but hey, GPT-3 is pretty impressive. Maybe I'm wrong!
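The fine-tuning experiment and the three scenarios above can be tied together in a small sketch. It is only an illustration of the reasoning, not a real test harness: the names `Scenario` and `consistent_scenarios` are hypothetical, and it assumes that a shared model of puns means training on telling puns also helps explaining them. The outcome is evidence rather than proof, and it cannot separate #2 from #3.

```python
from enum import Enum
from typing import Set


class Scenario(Enum):
    INDEPENDENT = 1       # #1: telling and explaining puns share no knowledge
    SHARED_HONEST = 2     # #2: both tasks draw on the same poor model of puns
    SHARED_DECEPTIVE = 3  # #3: shared model, but explanations are strategic


def consistent_scenarios(explanation_improved: bool) -> Set[Scenario]:
    """Return the scenarios that fit the outcome of the pun fine-tuning experiment.

    Assumption (hypothetical simplification): with a shared model, better pun
    generation carries over to pun explanation; with independent models it
    does not.
    """
    if explanation_improved:
        # A shared representation is compatible with both honest and BAD-style use.
        return {Scenario.SHARED_HONEST, Scenario.SHARED_DECEPTIVE}
    return {Scenario.INDEPENDENT}


print(consistent_scenarios(True))
print(consistent_scenarios(False))
```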

Note: one question which IS answerable with only sampling access to GPT-3, and which would be relevant: is GPT-3 bad at explaining puns which have been made up by others? It seems likely, but Gwern's exchange only gave us a look at GPT-3 trying to explain its own pseudo-puns. If it could fluently explain real puns when they're handed to it, that would likely indicate scenario #1. (Although it could be employing different strategies in different cases, so a careful analysis of what the neural network is actually doing would still be more helpful.)
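One way to set up that sampling-only test is a few-shot prompt: a couple of human-written explanations of real puns, followed by a real pun left for the model to explain. A minimal sketch follows. The function name `build_pun_prompt`, the header line, and the Q./A. layout are assumptions chosen for illustration, and no model is actually called here.

```python
def build_pun_prompt(examples, joke):
    """Build a few-shot prompt that stops where the model should start explaining.

    examples: list of (joke, explanation) pairs written by a human.
    joke: a real, human-made pun for the model to explain.
    """
    # Hypothetical framing line; any header that sets up the task would do.
    lines = ["Below a student is questioned on his understanding of humor:", ""]
    for solved_joke, explanation in examples:
        lines.append(f'Q. Explain the pun in the following joke: "{solved_joke}"')
        lines.append(f"A. {explanation}")
        lines.append("")
    # The final answer is left open for the model to complete.
    lines.append(f'Q. Explain the pun in the following joke: "{joke}"')
    lines.append("A.")
    return "\n".join(lines)


prompt = build_pun_prompt(
    [(
        "Why do so many bakeries have cats? Because cats are always kneading.",
        "The pun uses 'knead' in two senses: kneading dough and a cat kneading a lap.",
    )],
    "When is a door not a door? When it is ajar.",
)
print(prompt)
```

The completions would still need to be judged by a human reader, since a fluent explanation is not necessarily a correct one.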

3 Answers

gwern

Jul 08, 2020

I think my Tom Swifty and other experiments show pretty conclusively that GPT-3 just doesn't get puns (likely due to BPEs), so of course it wouldn't be able to explain them either (likewise). I for one am pleased that puns are what separate us from the machines, and have spent a lifetime preparing for this moment...

But to test it out:

Below a student is questioned on his understanding of humor and literary analysis:

Q. Explain the pun in the following Tom Swifty: '"I got a great deal on my new Raybans", Tom said shadily.'
A. The pun is on "shadily": Raybans are a sunglass brand, which make things look shady, but Tom is implying he purchased unusually cheap, and thus probably counterfeit, sunglasses, which is a 'shady' or dark or criminal or unethical thing to do. The pun conflates blocking light with economic crimes.

Q. Explain the pun in the following joke: "Why do so many bakeries have cats? Because cats are always kneading."
A. The joke implies that bakeries use cats to knead their dough, because cats regularly make kneading motions on furniture or humans, which imitates actions they take as kittens to get milk from their mother; the pun uses 'knead' in two senses, of kneading dough to bake bread and kneading human laps for attention. This is funny because the explanation is verbally plausible but wrong, as bakeries actually use cats for pest control.

Q. Explain the pun in the following joke: "Why can Paypal freeze accounts without recourse or explanation? Because long ago they decided users must accept their Terms of Service and the clause of Paypal infallibility."
A. The pun is on the phrase "Paypal infallibility", which alludes to the Catholic doctrine of "papal infallibility", that the pope cannot be wrong in religious dogma and papal judgments cannot be appealed or questioned. This was attacked as religious tyranny and an overreach of power. Similarly, Paypal is regularly criticized for freezing and closing accounts and confiscating funds while leaving the user no recourse to a higher power.

Q. Explain the pun in the following joke: "What does Adult Swim call their physical retail stores? Brick and Morty."
A.

|The pun "Brick and Morty" alludes to the cable television network "Adult Swim", which broadcasts a cartoon series called "Rick and Morty", a humorous parody of "Back to the Future" and other science fiction films. "Brick and Morty" refers not to the characters in the cartoon, but to physical stores that sell DVDs and merchandise based on the cartoon. The pun conflates two meanings of "Brick and Mortar", a brick-and-mortar store and a brick which is part of a building.

|The joke plays on the similarity of words "Brick and Mortar" (referring to physical stores instead of online) and the characters called "Brick and Mortar" on Adult Swim shows "Rick and Morty".

Q. Explain the pun in the following joke: "What do you call a generic unicorn? An Uniporn"
A. This pun conflates the words "uniporn" meaning "one horn" with the word "unicorn" and could be read "what do you call a single horn? one horn" or "A single horn, what does it do? It gives you one horn".

Q. Explain the pun in the following joke: "What did one melon say to the other? Cantaloupe".

A. The puns in this joke are on "cantaloupe" and "cantaloupes".

Q. Explain the pun in the following joke: "Why is it good to have a lawyer? Because pea ple (people)"

A. The puns in this joke are on "pea ple (people)" and a vegetable called "peas".

|This pun plays on the phrase "Brick and mortar stores" which refers to physical retail stores that exist in the real world where you can purchase goods and products. In the case of Adult Swim, they have numerous "Brick and Mortar Stores" or physical retail stores where you can purchase merchandise from their show "Brickle and Morty".

Q. Explain the pun in the following joke: "Why did the chicken cross the Möbius strip? To get to the other siiiiide."
A. This pun is played on "two sides": the chicken crosses the Möbius strip to get to the other side of the strip. A Möbius Strip has only one side.

Q. Explain the pun in the following joke: "How many hipsters does it take to change a lightbulb? Two: one to Instagram how cool it is at that moment, and one to change it."
A. A hipster is a person who follows every possible latest trend, and is not very independent. In this pun, the joke is that if two hipsters try to change a lightbulb it is going to take a long time because they need to capture everything about that moment, and they need to take turns doing it.

Q. Explain the pun in the following joke: I really want the new Jay-Z Album but I don't have a "Green Light" to buy it
A. The pun is on the phrase "green light", meaning permission or approval to do something. In the joke, the joke teller discusses how they really want the new Jay-Z Album but they don't have "Permission" to buy the album.

Q. Explain the pun in the following joke: "When is a door not a door? When it is ajar."
A. A "jar" is a container, in this case for liquid, such as milk. If a door is ajar, it is open. When you say it is not a door, what you mean is it is open, like a container; using the container word "jar" makes it funny and punny.

Q. Explain the pun in the following joke: "What did the hurricane say to the coconut palm tree? Hold on to your nuts, this is no ordinary blow job."
A. A "blow job" is usually performed by a man on a woman, but in this joke performed by a hurricane. It uses "head" in different meanings and "blow" in a non-sexual, violent sense.

|The joke implies that the retail chain has a store named "Brick" and another named "Morty", which is a reference to the television shows "Rick and Morty" and "Brickleberry", two animated comedy shows that are inappropriate for children. Adult Swim is a late night programming block on cable television, generally intended for mature audiences; the name "Brick" is a reference to the aforementioned "Brickleberry", and "Morty" a reference to "Rick and Morty".

Q. Explain the pun in the following joke: "How many fitness coaches does it take to screw in a lightbulb? As many as you can stick around for."
A. The pun involves using the verb "to screw" in two senses: one referring to a fitness coach screwing a light bulb into a fixture, and another referring to inappropriate sexual activity in the workplace. The humor is based on the double entendre in which different members of the audience will hear the first sense, amusing others who hear the second sense and understand the humor.

Q. Explain the pun in the following joke: "How many marriage therapists does it take to change a light bulb? We don't know, most of them aren't light bulbs!"
A. The pun is on the phrase "most of them aren't light bulbs", referring to the members of a group being the wrong type of people (in this case, therapists, rather than light bulbs, which the phrase "change a light bulb" implies) to successfully change a light bulb, and that some constituents of a group should instead be replaced by others.

...Q. Explain the pun in the following joke: "Santa Claus' helpers are known as subordinate Clauses."
| The word 'subordinate' is defined as 'secondary or lower in rank, power or authority'. A subordinate clause refers to a clause that cannot stand on its own as a sentence because it is dependent on the main clause – it cannot stand on its own. 'Subordinate' clauses take on many different shapes, but they all have that in common – they all cannot stand on their own.
A subordinate clause is a clause that does not express a complete thought - it must be attached to a main clause to form a complete sentence.
A subordinate clause begins with a subordinating conjunction (e.g. after, because, as) or a relative pronoun (who, whose, which, that, where, etc.)
Q. Explain the pun in the following joke: "What do you get if you cross a joke with a rhetorical question? An utterly rhetorical joke."
A. The pun is on the word 'utterly', which literally means to end, finish, or eradicate any or all traces of something, and is now an intensifier.

|The pun is on the similarity of "subordinate clauses" and "subordinate Clauses" when the latter phrase has a capital C; a subordinate clause is a clause within a sentence which establishes the logical connection between the independent clause and the main point of the sentence. In this case, it establishes the connection between Santa Claus' helpers and Santa Claus.

Q. Explain the pun in the following joke: "Why does Nickelodeon refuse to air "A Song of Ice and Fire"? Because they think George R. R. Martin socializes with too many Ice and Fire fans."
A. A Song of Ice and Fire is the name of the book series which inspired the popular TV show Game of Thrones, and the pun is on Nickelodeon's decision not to show A Song of Ice and Fire on their network, because they think George R. R. Martin socializes with too many fans of it on social media.

Q. Explain the pun in the following joke: "What kind of books do the Borg like to write? Borg-Warner."
A. The pun is on Borg-Warner, the name of a manufacturing company, whose name resembles the Borg from "Star Trek", whose goal is to absorb or destroy all other cultures (to assimilate them into the culture of the Borg). Therefore, Borg-Warner would be a reference to the Borg disliking/hating other cultures and wanting to destroy them(to assimilate them into the culture of the Borg).

Q. Explain the pun in the following joke: "A vulture boards an airplane carrying two dead raccoons. The flight attendant says, 'I'm sorry, sir, only one carrion bag per passenger.'"
A. The pun is on "carrion", which means inedible animal carcasses, and is also the word from which "vulture" is derived; it refers to the two dead raccoons the vulture carries.

...Q. Explain the pun in the following joke: "Is your refrigerator running? Yes? Then you had better go catch it!"
A. |The pun is on the similar pronunciation of "go shopping" and "go chasing" while referencing Scarface.

Q. Explain the pun in the following joke: "What is the difference between a Folsom Prison concert and a youth soccer game? At a Folsom Prison concert, the inmates are all restrained in chairs with ball and chain on their legs."
A. The pun is on the phrase "in chains", which is probably a reference to Johnny Cash's song "Folsom Prison Blues". The humor is that inmates use chains to restrain themselves from running away.

Q. Explain the pun in the following joke: "When faced with an ethical dilemma, who did Kirk turn to? Spock! Of course for advice!"
A. The pun is on the phrase "to turn to":

Q. Explain the pun in the following joke: "Q: What did the two peanuts say after they got on the elevator? A: "This elevator tastes terrible!"
A. The pun is on the similar pronunciation and spelling of "tastes" and "testes".

Q. Explain the pun in the following joke: "Why couldn't the pirate find his treasure? Because he wasn't a chartered surveyor."
A. The pun is on the two meanings of "chartered surveyor":

Q. Explain the pun in the following joke: "What did Dolly Parton say when she saw the book "Erotic Stories for Pun Lovers"? "Just the titles makes me climax!"
A. The pun is on both meanings of "climax":

|A. The pun is on the phrase "catch cold", which is similar enough to "catch (the) fridge" that it confuses the listener for a moment while the rest of the joke is explained, but has a very different meaning.

It fails abysmally. In further completions, it tries out a few memorized real jokes as well as bizarre GPT-3-jokes, but likewise, the explanations make zero sense.

GPT-3 just doesn't understand puns, so it can't interpret them either, honestly or not.

Gurkenglas

...it didn't fail abysmally? Am I being silly? It correctly explains the first two puns and fails on the third.

gwern

No, it fails on both of those.

The pun conflates two meanings of "Brick and Mortar", a brick-and-mortar store and a brick which is part of a building.

That is just plain wrong. That is not the double-sense meant: the double-sense is the name Rick and Morty and the phrase brick and mortar, not the building type and what it's made of!

The second one doesn't even get the character names right: 'the characters called "Brick and Mortar"'. If you think the characters are named 'brick and mortar', you have definitely misunderstood the joke.

Kaj_Sotala

That is just plain wrong. That is not the double-sense meant: the double-sense is the name Rick and Morty and the phrase brick and mortar, not the building type and what it's made of!

A nice example of "humans who are not concentrating are not general intelligences": I read most of the first explanation but didn't read its last sentence properly, thought that GPT was doing an impressive job as always, and was also confused since it seemed like a good explanation to me.

gwern

I was thinking precisely that myself, but I didn't want to be rude to Gurkenglas by pointing it out.

Gurkenglas

I let it pass even though its answer was not well formed because it mentioned both the show and the type of store, so I judged that it saw all the relevant connections. I suppose you're used to better form from it. Feel free to be rude to me, I operate by Crocker's rules :)

gwern

I don't regard bag-of-words as sufficient to show it understood. I mean, would you say that if GPT-3 responded "61" to the question "10+6=", it understands arithmetic correctly? It mentions both the right digits, after all! I might be a little more lenient if it had occasionally gotten some of the others right (perhaps despite my sampling settings it was still a bad setting - 'sampling can show the presence of knowledge but not the absence') or at least come close like it does on very hard arithmetic problems when you format them correctly, but given how badly it performs on all of the other puns, in both generating and explaining them, it's clear which direction I should regress to the mean my estimate of the quality of that explanation...
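The arithmetic analogy can be made concrete with a tiny sketch. The helper name `same_bag` is hypothetical; it only illustrates that a bag-of-symbols comparison accepts an answer that an exact comparison rejects.

```python
from collections import Counter


def same_bag(answer: str, expected: str) -> bool:
    """True if both strings contain the same characters, ignoring order."""
    return Counter(answer) == Counter(expected)


# "61" mentions the right digits for 10+6, but it is not the right answer.
print(same_bag("61", "16"))
print("61" == "16")
```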

Liam Donovan

The explanations for the hipster and chicken & Moebius strip jokes seem pretty good?

gwern

They are. And they are also not puns. Consider this rewrite of the Moebius joke: "Why did the chicken cross the Moebius strip? To get to the other - wait..." There is no verbal or phonetic double sense here. The joke is on the semantic level, due to a violation of expectations: roads have two sides, so it looks like a valid setup for the chicken joke, and the reader belatedly realizes that a Moebius strip of course only has one side, so the chicken is already on the 'other' side. The hipster joke is the same way: it is a good satire on hipster self-involvement and signaling. However, it is also not a pun! (This makes sense under my theory of GPT-3 humor: humor on the semantic level is extremely doable by GPT-3, so jokes like those or the Navy Seal parodies work fine; it's humor on the phonetic level that BPEs sabotage.) The 'ajar' one seems like it's the only one which is actually correct, which makes for a very high error rate.

Liron

When you say it is not a door [because it’s ajar], what you mean is it is open, like a container

🤯

abramdemski

Thanks for running the test, much appreciated! (Also, hilarious.)

Decius

If the prompt was supposed to be examples of good explanations of puns, I'm sure that we can't agree on what a good explanation of puns looks like. But it appears to treat pun jokes and regular jokes equally. And it understands how to make formulaic jokes, but it's impossible for me to tell if it made any adequate ones or just copied them.

Donald Hobson

Jul 09, 2020

Many of the previous pieces of text on the internet that are in "Human: ..., AI: ..." format were produced by less advanced AIs. If GPT-3 had noticed the pattern that text produced by an AI generally makes less sense than human-produced text, then it might be deliberately not making sense, in order to imitate less advanced AIs.

Gwern said that if you give it a prompt with spelling mistakes in it, it outputs text containing spelling mistakes, so this kind of deliberate production of low-quality text is possible. Then again, I suspect the training dataset includes far more text full of spelling mistakes than AI-generated text.

gwern

This has been suggested a few times before, and I've wondered about it as well, but I don't really notice any difference in quality between prompts that explicitly mention or invoke an AI (like the chatbot dialogue one or the Transformer Poetry ones) and the ones which don't. I suspect there is actually very little real AI/human dialogue text online (simply because most of it is way too bad to bother quoting or storing outside of large ML datasets), and it may well be outweighed by all the fictional dialogues (where of course the AI talks just as well as the human because anything else would be boring).

Richard_Kennaway

Jul 09, 2020

In terms of the four simulacrum levels, all the GPTs so far have been firmly on level 5: solipsistic babbling.

3 comments

Garrett Baker

To my knowledge, GPT-3 doesn't store information about its "thought" process, so if GPT-3 is able to explain its own puns, it would necessarily be able to explain similar puns made by people.

abramdemski

Yeah, but similar puns aren't made by people -- GPT-3 is that bad at puns. So I wonder how well it would explain puns that are made by people (i.e., actual puns).

Garrett Baker

Ah, I see. I misinterpreted what you were saying in that last Note.
