I am getting worried that people are having so much fun doing interesting stuff with GPT-3 and AI Dungeon that they're forgetting how easy it is to fool yourself. Maybe we should think about how many different cognitive biases are in play here? Here are some features that make it particularly easy to fool yourself during casual exploration.
First, it works much like autocomplete, which makes it the most natural thing in the world to "correct" the transcript to be more interesting. You can undo and retry, or trim off extra text if it generates more than you want.
Randomness is turned on by default, so if you retry you will get a different reply each time, and it is tempting to keep going until you get a good one. It would be better science but less fun to keep the entire distribution rather than stopping at a good one. Randomness also makes a lot of gamblers' fallacies more likely.
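To make the "stopping at a good one" effect concrete, here is a minimal simulation sketch. It assumes each reply is independently "good" with some fixed probability `p_true`, which is a simplification, and the function names are made up for illustration.

```python
import random

def apparent_success_rate(p_true, max_retries, trials=10_000, seed=0):
    """Fraction of sessions that end with a 'good' reply when the user
    retries up to max_retries times and stops at the first good one."""
    rng = random.Random(seed)
    kept_good = 0
    for _ in range(trials):
        # any() stops at the first good sample, like a user who stops retrying
        if any(rng.random() < p_true for _ in range(max_retries)):
            kept_good += 1
    return kept_good / trials

def expected_rate(p_true, max_retries):
    """Closed form for the same quantity: 1 - (1 - p)^k."""
    return 1 - (1 - p_true) ** max_retries
```

Under these assumptions, a model that gives a good reply 20% of the time looks like it succeeds about 67% of the time if you allow yourself five tries and only keep the winner.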
Suppose you don't do that. Then you have to decide whether to share the transcript. You will probably share the interesting transcripts and not the boring failures, resulting in a "file drawer" bias.
And even if you don't do that, "interesting" transcripts will be linked to and upvoted and reshared, for another kind of survivor bias.
What other biases do you think will be a problem?
The main thing I've noticed about most of the posts discussing its capabilities (or even what theoretical future entities might be capable of, based on a biased assumption of this version's capabilities) is that people are trying to figure out how to get it to succeed, rather than trying to get it to fail in interesting and informative ways.
For example, one of the evaluations I've seen had it do multi-digit addition and discussed various tricks to improve its success rate, going off the assumption that if it can fairly regularly do 1-3 digit addition, that's evidence of it learning arithmetic. One null hypothesis against this would be "in its 350-700GB model, it has stored lookup tables for 1-3 digit addition, which it will semi-frequently end up engaging."
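As a rough back-of-the-envelope check on that null hypothesis, the sketch below counts how many operand pairs an explicit lookup table would need. The 4 bytes per entry is an assumption for illustration only; a network would not store a table in this literal form.

```python
def lookup_table_entries(max_digits):
    """Number of ordered (a, b) pairs with 0 <= a, b < 10**max_digits."""
    return (10 ** max_digits) ** 2

def table_bytes(max_digits, bytes_per_entry=4):
    """Size of a naive explicit table; bytes_per_entry is an assumed figure."""
    return lookup_table_entries(max_digits) * bytes_per_entry

# Print the naive table size for a few operand lengths.
for digits in (1, 2, 3, 5):
    print(digits, lookup_table_entries(digits), table_bytes(digits))
```

With these assumptions, all sums of operands up to 3 digits come to a million entries (about 4 MB), which is tiny next to a 350-700GB model, while operands up to 5 digits would need ten billion entries (about 40 GB).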
The evaluation against the lookup-table hypothesis compared its success rate on 5+ digit numbers, showed that storing a lookup table for those numbers would take up an increasingly large portion of the model, and then suggested that this implies it must be capable, sometimes, of doing math (and thus that the real trick is convincing it to actually do that). However, this ignores significantly more probable explanations, and it also doesn't look terribly closely at what the incorrect outputs for large-digit addition are, to try to evaluate *what* exactly the model did wrong (because the outputs obviously aren't random).
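One way to look at what the wrong answers actually are is to bucket them by kind of error. This is only a sketch: the categories and the `classify` helper are my own invention, not something from the evaluation being discussed, and real outputs would likely need more categories.

```python
def add_without_carry(a, b):
    """Digit-wise sum mod 10, i.e. the result if every carry is dropped."""
    result, place = 0, 1
    while a > 0 or b > 0:
        result += ((a % 10 + b % 10) % 10) * place
        a //= 10
        b //= 10
        place *= 10
    return result

def classify(a, b, answer):
    """Label a model's answer to a + b with a coarse error category."""
    truth = a + b
    if answer == truth:
        return "correct"
    if answer == add_without_carry(a, b):
        return "dropped all carries"
    t, s = str(truth), str(answer)
    if len(t) != len(s):
        return "wrong length"
    wrong = sum(1 for x, y in zip(t, s) if x != y)
    return f"{wrong} digit(s) wrong"
```

Tallying these labels over a few hundred prompts (with `collections.Counter`, say) would show whether the failures cluster around carries, length, or a single digit, which says a lot more than a bare success rate.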
I've also seen very little in the way of discussion of what the architectural limitations of its capabilities are, despite them being publicly known. For example, any problem requiring deep symbolic recursion is almost certainly not possible simply due to the architecture of the model: it's doing a fixed number of matrix multiplications, and can't, as the result of any of those, step backwards through the transformer and reapply a particular set of steps again. On the plus side, this also means you can't get it stuck in an infinite loop before receiving the output.
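A toy contrast, just to illustrate the point about fixed depth. The "layers" here are placeholder functions, not anything resembling the real model, and this only describes a single forward pass (one output token).

```python
def fixed_depth_forward(x, layers):
    """Apply each layer exactly once, in order. The step count depends only
    on len(layers), never on the input, so it always terminates."""
    steps = 0
    for layer in layers:
        x = layer(x)
        steps += 1
    return x, steps

def nesting_depth(expr):
    """Depth of a nested list. The amount of recursion needed grows with the
    input, which a fixed stack of layers cannot match in a single pass."""
    if not isinstance(expr, list):
        return 0
    return 1 + max((nesting_depth(e) for e in expr), default=0)

# Hypothetical 4-layer stack: 4 steps no matter what goes in.
toy_layers = [lambda v: v * 2] * 4
print(fixed_depth_forward(1.0, toy_layers))
print(nesting_depth([[[]]]), nesting_depth([[[[[[]]]]]]))
```

The first function does the same amount of work for every input, while the second needs as many recursive calls as the input is deep.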