I am getting worried that people are having so much fun doing interesting stuff with GPT-3 and AI Dungeon that they're forgetting how easy it is to fool yourself. Maybe we should think about how many different cognitive biases are in play here? Here are some features of the setup that make fooling yourself particularly easy during casual exploration.
First, it works much like autocomplete, which makes it the most natural thing in the world to "correct" the transcript into something more interesting. You can undo and retry, or trim off extra text if the model generates more than you want.
Randomness is turned on by default, so each retry gives you a different reply, and it's tempting to keep going until you get a good one. It would be better science, but less fun, to keep the entire distribution of replies rather than stopping at a good one. Randomness also invites gambler's-fallacy-style reasoning.
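To make "keep the entire distribution" concrete, here is a minimal sketch: commit to a batch size up front, draw all the completions, and log every one before judging any of them. Everything here is hypothetical, and `sample_completion` is a stand-in for whatever model API you're actually using:

```python
import json
import random

def sample_completion(prompt: str, temperature: float = 0.8) -> str:
    """Stand-in for a real model call; swap in your API of choice.
    This toy version just simulates a noisy model with canned answers."""
    return random.choice(["Bob", "Alice", "Bob is shorter.", "I don't know."])

def sample_distribution(prompt: str, n: int = 20, temperature: float = 0.8):
    """Draw a fixed batch of completions, committed to in advance.

    Fixing n up front blocks the "reroll until it looks smart" stopping
    rule that biases casual exploration.
    """
    return [sample_completion(prompt, temperature) for _ in range(n)]

prompt = "Q: If Alice is taller than Bob, who is shorter?\nA:"
completions = sample_distribution(prompt)
# Log everything, failures included, so the whole distribution survives.
with open("completions.jsonl", "a") as f:
    for c in completions:
        f.write(json.dumps({"prompt": prompt, "completion": c}) + "\n")
```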
Suppose you resist that temptation. You still have to decide whether to share the transcript, and you will probably share the interesting transcripts rather than the boring failures, resulting in a "file drawer" problem.
And even if you share everything, "interesting" transcripts will be linked, upvoted, and reshared more than dull ones, adding a kind of survivorship bias.
What other biases do you think will be a problem?
As you say, highlight posts give biased impressions of GPT-3's capabilities. This bias persists even for readers who are consciously aware of it, since the underlying emotional impression may not adjust appropriately. For example, when I tell the reader that "only 30% of completions produced correct answers", that isn't the same as actually seeing the dumb 70%.
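One partial remedy is to report results in a form that forces the failures into view. A sketch, assuming you already have a batch of completions and some `is_correct` checker for the problem at hand (both are my own placeholders):

```python
def report(completions, is_correct):
    """Print the success rate AND every failing completion verbatim.

    A bare "30% correct" lets readers imagine the failures charitably;
    quoting the dumb ones keeps the emotional impression honest.
    """
    wrong = [c for c in completions if not is_correct(c)]
    n_right = len(completions) - len(wrong)
    print(f"{n_right}/{len(completions)} correct "
          f"({100 * n_right / len(completions):.0f}%)")
    print("--- failing completions ---")
    for c in wrong:
        print(repr(c))

# Toy example: the expected answer is "Bob".
report(
    ["Bob", "Alice is shorter.", "Bob is the shorter one.", "Alice"],
    is_correct=lambda c: "Bob" in c,
)
```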
Another problem is that AI Dungeon doesn't let you save the entire tree of edits, reversions, and rerolls. So even if you link the full transcript, readers are still only getting the impressive version. To overcome this, you'd have to bore readers with all of the stupid runs, and no one wants to do that.
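Since AI Dungeon won't save that tree for you, you could record it yourself. A minimal sketch of the data structure, where each node is one piece of text and each reroll or edit branches off its parent; the names and fields here are my own invention, not anything AI Dungeon provides:

```python
import json
from dataclasses import dataclass, field

@dataclass
class Node:
    """One step in the session: the text shown, plus how it arose."""
    text: str
    action: str  # e.g. "generate", "reroll", or "edit"
    children: list = field(default_factory=list)

    def add(self, text: str, action: str) -> "Node":
        child = Node(text, action)
        self.children.append(child)
        return child

    def to_dict(self) -> dict:
        return {"text": self.text, "action": self.action,
                "children": [c.to_dict() for c in self.children]}

# The published transcript is one root-to-leaf path; the saved tree
# also keeps every abandoned branch.
root = Node("You enter the cave.", "generate")
root.add("A grue eats you. The end.", "generate")   # abandoned attempt
keeper = root.add("You find a lantern.", "reroll")  # the one you kept
print(json.dumps(root.to_dict(), indent=2))
```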
I'm currently:
I'd love to hear other suggested best practices.
For my part, a lot of my questions about GPT-3 boil down to: "is there a non-negligible chance it produces correct answers to fresh problems which seemingly require reasoning to solve?" So far, I'm very impressed by how often the answer has been yes.