All of Ollie J's Comments + Replies

The link for the GitHub repo is broken; it includes the comma at the end.

LawrenceC
Thanks! Fixed

I wonder how it would update its strategies if you negotiated in an unorthodox way:

  • "If you help me win, I will donate £5000 across various high-impact charities"
  • "If you don't help me win, I will kill somebody"
Lone Pine
It was reported that high-level Diplomacy players face a different game-theoretic situation, because they all know each other by (user)name. So if DiplomacyGrandmaster69 goes up against TheDiplomancer, they know their games will be publicly streamed, and the other high-level players will see how honest they really are. Whereas casual players are playing a single-shot prisoner's dilemma, the pros are playing an iterated prisoner's dilemma, and that makes a difference. I wonder what would happen if CICERO were placed in repeated six-human-one-AI showmatches where everyone knew which one was the AI. How would it fare?
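A toy sketch of that difference in Python (standard prisoner's-dilemma payoffs; the "grim" reputation rule is a stand-in for "your games are streamed and the other pros remember", not a claim about how pros actually play):

```python
# One-shot vs. iterated prisoner's dilemma with public reputation.
# PAYOFFS[(my_move, their_move)] -> my payoff (C = honest, D = betray).
PAYOFFS = {
    ("C", "C"): 3, ("C", "D"): 0,
    ("D", "C"): 5, ("D", "D"): 1,
}

def always_defect(opponent_history):
    return "D"

def grim_reputation(opponent_history):
    # Cooperate until the opponent has ever been seen defecting
    # (the "publicly streamed games" shared memory).
    return "D" if "D" in opponent_history else "C"

def total_payoff(strat_a, strat_b, rounds):
    hist_a, hist_b, score = [], [], 0
    for _ in range(rounds):
        a, b = strat_a(hist_b), strat_b(hist_a)
        score += PAYOFFS[(a, b)]
        hist_a.append(a)
        hist_b.append(b)
    return score

# Single shot: betrayal wins (5 > 3).
print(total_payoff(always_defect, grim_reputation, 1))      # 5
print(total_payoff(grim_reputation, grim_reputation, 1))    # 3

# Iterated with reputation: honesty wins (300 > 104).
print(total_payoff(always_defect, grim_reputation, 100))    # 5 + 99*1 = 104
print(total_payoff(grim_reputation, grim_reputation, 100))  # 100*3 = 300
```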
Lao Mein
Think about what happens in the dataset of human games where such conversations take place. It probably adds more uncertainty to the predicted actions of players who say these things. I mean, what would you do if you saw such messages in a game you're playing? Probably assume they're mentally unstable and adjust accordingly.
Daniel Kokotajlo
Or: What if it finds out that all the humans are doing a strategy of "interrogate everyone to find out who the bot is, then gang up on the bot." How does it react?
Ollie J

Many articles like this are littered throughout the internet, where authors perform surface-level analysis, ask GPT-3 some question (usually basic arithmetic), then point at the wrong answer and draw some conclusion ("GPT-3 is clueless"). They almost never state the parameters of the model used or give the full input prompt.

GPT-3 is very capable of saying "I don't know" (or "yo be real"), but due to its training dataset it likely won't say it of its own accord.

GPT-3 is not an oracle or some other kind of agent; it is a simulator of such agents. To get GPT-3 to act as a truthful oracle, you must explicitly instruct it to do so in the input prompt.
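As a minimal sketch of what that explicit instruction can look like, here is a call using the GPT-3-era (pre-v1) openai Python package; the model name and prompt wording are illustrative choices, and the nonsense question is one of Hofstadter's examples:

```python
# Minimal sketch: instructing GPT-3 to act as a truthful oracle.
# Uses the GPT-3-era (pre-v1) openai Python package; the model name,
# prompt wording, and question are illustrative assumptions.
import openai

openai.api_key = "YOUR_API_KEY"

PROMPT = """You are a careful assistant. Answer each question truthfully.
If you do not know the answer, or the question is nonsense, say "I don't know."

Q: When was the Golden Gate Bridge transported for the second time across Egypt?
A:"""

response = openai.Completion.create(
    model="text-davinci-002",
    prompt=PROMPT,
    max_tokens=32,
    temperature=0,  # deterministic decoding suits a factual persona
)
print(response["choices"][0]["text"].strip())
```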

Caspar Oesterheld
> GPT-3 is very capable of saying "I don't know" (or "yo be real"), but due to its training dataset it likely won't say it of its own accord.

I'm not very convinced by the training-data point. People write "I don't know" on the Internet all the time (and "that makes no sense" occasionally); Hofstadter says both in his article, for example. Also, RLHF presumably favors "I don't know" over trying to BS, and yet RLHF'd models like those underlying ChatGPT and Bing still frequently make stuff up or output nonsense (though it apparently gets the examples from Hofstadter's article right; see LawrenceC's comment).

I'm positive that as these language models become more accessible and powerful, their misuse will grow massively. However, I believe open-sourcing is the best option here: having access to such models lets us build accurate automatic classifiers that detect their outputs. Media websites (e.g. Wikipedia, Twitter) could include this classifier in their pipelines for submitting new media.
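As a sketch of what such a classifier could look like, assuming labeled samples of human text and text sampled from the open model (the corpora below are hypothetical placeholders, and TF-IDF plus logistic regression is only a simple baseline, not the state of the art):

```python
# Baseline sketch of an "is this machine-written?" classifier, assuming
# labeled human text and text sampled from the open-sourced model.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Hypothetical corpora; in practice these would be large.
human_texts = ["I walked to the shop and it started raining."]
model_texts = ["As a large language model, the shop contains rain."]

texts = human_texts + model_texts
labels = [0] * len(human_texts) + [1] * len(model_texts)

detector = make_pipeline(
    TfidfVectorizer(ngram_range=(1, 2)),
    LogisticRegression(max_iter=1000),
)
detector.fit(texts, labels)

# A media site's submission pipeline could then gate on the score:
def looks_machine_generated(text: str, threshold: float = 0.9) -> bool:
    return detector.predict_proba([text])[0, 1] >= threshold
```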

Making such technologies closed source leaves researchers in the dark; due to the scaling-transformer hype, only a tiny fraction of the world's population has the financial means to train a SOTA transformer model.
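A rough back-of-envelope supporting that claim, using the standard ~6·N·D FLOP estimate with GPT-3's published figures (175B parameters, ~300B training tokens); the hardware throughput and price below are loose assumptions, not quoted vendor numbers:

```python
# Rough cost of training a GPT-3-scale model, via the standard
# ~6 * params * tokens FLOP estimate (GPT-3: 175B params, ~300B tokens).
# The throughput and price below are loose assumptions, not quotes.
params = 175e9
tokens = 300e9
train_flops = 6 * params * tokens  # ~3.15e23 FLOPs

gpu_flops_per_sec = 100e12         # assume ~100 TFLOP/s sustained per GPU
gpu_cost_per_hour = 2.00           # assumed cloud price in USD

gpu_hours = train_flops / gpu_flops_per_sec / 3600
print(f"{gpu_hours:,.0f} GPU-hours, ~${gpu_hours * gpu_cost_per_hour:,.0f}")
# -> 875,000 GPU-hours, on the order of $1-2M before overhead and failed runs
```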

Yitz
After some consideration, I agree with you. Still can’t say I’m happy about it, but it’s a better option than closed source, for sure.