I am getting worried that people are having so much fun doing interesting stuff with GPT-3 and AI Dungeon that they're forgetting how easy it is to fool yourself. Maybe we should think about how many different cognitive biases are in play here? Here are some features that make fooling yourself particularly easy during casual exploration.
First, it works much like autocomplete, which makes it the most natural thing in the world to "correct" the transcript into something more interesting. You can undo and retry, or trim off extra text if the model generates more than you want.
Randomness is turned on by default, so if you retry you get different replies, and you can keep going until you hit a good one. It would be better science, though less fun, to keep the entire distribution of replies rather than stopping at a good one. Randomness also makes gambler's-fallacy-style reasoning more likely.
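For what it's worth, "keeping the entire distribution" is easy to do against the underlying API rather than through the AI Dungeon interface. Here's a minimal sketch in Python, assuming the legacy `openai` client (pre-1.0) and a made-up prompt; the point is just to save every sample drawn at fixed settings instead of retrying until one looks good:

```python
import json
import openai

openai.api_key = "sk-..."  # set your API key

def sample_all(prompt, n=20, temperature=1.0, max_tokens=60):
    """Draw n completions at fixed settings and return every one of them."""
    resp = openai.Completion.create(
        engine="davinci",
        prompt=prompt,
        n=n,                      # ask for the whole batch up front...
        temperature=temperature,  # ...at the same randomness setting
        max_tokens=max_tokens,
    )
    return [choice.text for choice in resp.choices]

# Save the full set, not just the reply you happened to like.
completions = sample_all("The last question was asked for the first time")
with open("all_samples.json", "w") as f:
    json.dump(completions, f, indent=2)
```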
Suppose you don't do that. Then you have to decide whether to share the transcript. You will probably share the interesting transcripts and not the boring failures, resulting in a "file drawer" bias.
And even if you don't do that, "interesting" transcripts will be linked to and upvoted and reshared, for another kind of survivor bias.
What other biases do you think will be a problem?
The biggest bias is bad prompting and sampling. Cherry-picking and selection can promote good samples from levels like 1:100 to 1:1; but bad prompting and sampling can sabotage performance such that good samples plummet from 1:1 to 1:billions or worse*. People like to share good samples, but critics like to share bad samples even more. (How many times have you seen Kevin Lacker's post linked as 'proof' of GPT-3's inherent stupidity in places like Marginal Revolution, Technology Review, Wired, and so on - despite being wrong?)
* Set temperature to 0, so GPT-3 is deterministic, and use a prompt where greedy sampling yields the wrong answer, and GPT-3 will always return the wrong answer - even though temp=1 plus ranking would return the right answer ~100% of the time! This has in fact happened in some of the error cases being retailed around.
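To make the footnote concrete, here is a hedged sketch of the two settings side by side, again assuming the legacy OpenAI completions API and a stand-in prompt (not one of the actual error cases): a temperature=0 call is greedy and deterministic, so a prompt that trips up greedy decoding fails the same way every time, while temperature=1 with `best_of` ranking draws many candidates and returns the highest-scoring one.

```python
import openai

openai.api_key = "sk-..."  # set your API key

PROMPT = "Q: How many eyes does a horse have?\nA:"  # stand-in prompt

# Greedy, deterministic decoding: if this prompt leads greedy sampling astray,
# every call returns the same wrong completion.
greedy = openai.Completion.create(
    engine="davinci",
    prompt=PROMPT,
    temperature=0,
    max_tokens=20,
)

# Stochastic sampling plus ranking: generate many candidates server-side and
# return the one the model scores highest - a different algorithm, which can
# succeed where greedy decoding reliably fails.
ranked = openai.Completion.create(
    engine="davinci",
    prompt=PROMPT,
    temperature=1,
    best_of=20,   # sample 20 completions, return the best-scoring one
    n=1,
    max_tokens=20,
)

print("greedy :", greedy.choices[0].text.strip())
print("ranked :", ranked.choices[0].text.strip())
```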
I was making a different point, which is that if you use "best of" ranking then you are testing a different algorithm than if you don't. Similarly for other settings. It shouldn't be surprising that we see different results when we're doing different things.
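In case it helps to see why it's a different algorithm: "best of" ranking is roughly sample-many-then-score, which can be sketched client-side by requesting token logprobs and keeping the candidate with the highest average score. This is an illustrative approximation, not OpenAI's exact server-side ranking rule:

```python
import openai

openai.api_key = "sk-..."  # set your API key

def best_of_ranking(prompt, n=20, temperature=1.0, max_tokens=20):
    """Sample n candidates, then keep the one with the highest mean token logprob."""
    resp = openai.Completion.create(
        engine="davinci",
        prompt=prompt,
        n=n,
        temperature=temperature,
        max_tokens=max_tokens,
        logprobs=1,  # also return logprobs for the tokens that were sampled
    )

    def score(choice):
        lps = choice.logprobs.token_logprobs
        return sum(lps) / len(lps)  # mean logprob, to avoid favoring short outputs

    return max(resp.choices, key=score).text
```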
It seems like a better UI would help us casual explorers share results in a way that makes it easier to retry the same settings; one could hit a "share" button to create a linkable output page with all relevant settings.
It could also save the alternate responses that either the u...
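Something like the following is all the "share" page would need to capture; the field names here are hypothetical, just to show that the settings and the rejected alternates are small enough to store alongside the chosen output:

```python
import json

# Hypothetical shareable record for one exploration session: the prompt, every
# setting needed to rerun it, the reply the user kept, and the alternates that
# were generated but not chosen.
share_record = {
    "prompt": "You are a wizard entering the dungeon...",
    "settings": {
        "model": "davinci",
        "temperature": 1.0,
        "top_p": 1.0,
        "best_of": 1,
        "max_tokens": 60,
    },
    "chosen_output": "The door creaks open and a cold wind greets you.",
    "rejected_alternates": [
        "You trip over the threshold.",
        "Nothing happens.",
    ],
}

print(json.dumps(share_record, indent=2))
```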