I did two small experiments on the GPT-2 small model. First experiment: can GPT-2-small answer sentiment analysis questions? (It can't.) Second experiment: When GPT-2 writes continuations of Howl, is it picking up the "Moloch in X!" template from its priming, or from a copy of Howl in its original training set? (It's from the training set.)
Sentiment analysis experiment:
I downloaded the MPQA Subjectivity Lexicon, a dictionary in which words are marked as positive or negative: for example, hopelessness=>negative, humour=>positive, grace=>positive, corruption=>negative. I primed GPT-2 with a list of 20 answered questions of the form "Is a <noun> good? Yes." or "Is a <noun> good? No." (half answered Yes, half No), followed by an unanswered question of the same form, and had it continue for one more word. Across 40 test questions it answered "No" 37 times, and neither its answers overall nor its few "Yes" answers matched the lexicon better than chance.
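For concreteness, here's roughly what that probe looks like in code. This is a sketch rather than the script I actually ran: it uses the Hugging Face `transformers` port of GPT-2 instead of the original OpenAI code, and the word lists and held-out test words are a tiny illustrative stand-in for the full MPQA lexicon.

```python
# Sketch of the yes/no sentiment probe. Assumes the Hugging Face
# `transformers` GPT-2 (not the original OpenAI code); the word lists
# below are a tiny illustrative stand-in for the MPQA lexicon.
import itertools
import random
from transformers import pipeline

POSITIVE = ["humour", "grace"]             # lexicon words marked positive
NEGATIVE = ["hopelessness", "corruption"]  # lexicon words marked negative

generator = pipeline("text-generation", model="gpt2")

def build_prompt(test_word, n_primes=20):
    """20 answered questions (half Yes, half No) plus one unanswered question."""
    pos, neg = itertools.cycle(POSITIVE), itertools.cycle(NEGATIVE)
    lines = [f"Is a {next(pos)} good? Yes." if i % 2 == 0
             else f"Is a {next(neg)} good? No."
             for i in range(n_primes)]
    random.shuffle(lines)
    lines.append(f"Is a {test_word} good?")
    return " ".join(lines)

def ask(test_word):
    prompt = build_prompt(test_word)
    out = generator(prompt, max_new_tokens=2, do_sample=True)[0]["generated_text"]
    answer = out[len(prompt):].strip().lower()
    return "Yes" if answer.startswith("yes") else "No"

# Compare the model's answers against the lexicon on held-out words
# (these two test words are illustrative, not from the actual run).
print(ask("delight"), ask("cruelty"))
```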
Howl experiment:
When given some lines from Ginsberg's Howl as priming, it writes a good continuation (similar to the one Chelsea Voss and Qiaochu Yuan got from it). In particular, it uses the "Moloch in X!" template repeatedly.
If I take its continuation of Howl and feed it back in as a prompt, I get more Howl (Moloch in X!). If I take Howl and replace "Moloch" with "Lomoch", I get more Howl. But if I take its continuation of Howl from the first step and replace Moloch with Lomoch *there*, I get unrelated text which does not use the "Moloch in X!" template.
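In code, the substitution step is just string replacement on the prompt before sampling. Here's a hedged sketch of the four prompts involved, again with the Hugging Face `transformers` GPT-2 standing in for the setup actually used; `howl_excerpt` would be the actual priming lines from Howl, which I haven't reproduced here.

```python
# Sketch of the Moloch/Lomoch substitution experiment. Uses the Hugging Face
# `transformers` GPT-2 as a stand-in for the original setup; howl_excerpt
# should be the actual priming lines from Howl, not reproduced here.
from transformers import pipeline

generator = pipeline("text-generation", model="gpt2")

def continue_text(prompt, n_tokens=200):
    out = generator(prompt, max_new_tokens=n_tokens, do_sample=True)[0]["generated_text"]
    return out[len(prompt):]

howl_excerpt = "..."  # paste a few lines of Howl here

# Step 1: continuation of the real excerpt -- uses the "Moloch in X!" template.
continuation = continue_text(howl_excerpt)

# Step 2: feeding that continuation back in gives more "Moloch in X!".
more_howl = continue_text(continuation)

# Step 3: renaming Moloch in the *original* excerpt still yields Howl-like text.
lomoch_from_howl = continue_text(howl_excerpt.replace("Moloch", "Lomoch"))

# Step 4: renaming Moloch in the model's own continuation does not.
lomoch_from_continuation = continue_text(continuation.replace("Moloch", "Lomoch"))
```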
So, it isn't inferring the template from its priming; rather, it learned the template from its training set (which probably included Howl), and it produces Howl-like text iff it's given a cue strong enough to remind it of the source.
Thanks for writing this up! I'm excited to see more people running experiments like this.
When you say "if I take X as a prompt, I get Y," how many trials did you run? In my own experimentation I've found lil' GPT-2's performance to be really variable across trials, and in some cases I've needed 5 or so trials to get results I even sort of liked.
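For what it's worth, if you're using the Hugging Face port rather than the OpenAI script, you can draw several independent samples per prompt in one call, which makes the trial-to-trial variance easy to see. A sketch (the prompt and the number of trials are just examples):

```python
# Sketch: draw several independent samples for the same prompt so the
# trial-to-trial variability is visible in one call. Assumes the Hugging
# Face `transformers` GPT-2; the prompt and trial count are illustrative.
from transformers import pipeline

generator = pipeline("text-generation", model="gpt2")

samples = generator(
    "Question: what is your name? Answer:",
    max_new_tokens=40,
    do_sample=True,
    num_return_sequences=5,  # five trials at once
)
for i, s in enumerate(samples, 1):
    print(f"--- trial {i} ---")
    print(s["generated_text"])
```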
My overall sense of how lil' GPT-2 works, after playing with it for a while on several different kinds of prompts, is that it has a strong sense of genre: it seems to have learned a set of genre conventions for the different types of texts in its training set. If the prompt strongly resembles a genre it saw in training, it will run with that genre, though sometimes it wanders off into another one. It does quite poorly on prompts that I suspect don't strongly match any genre in the training set.
For example, I tried to run a Turing test (mostly as a joke) by prompting with "Question: what is your name? Answer:" and got this on roughly my 2nd to 4th trial (I don't remember exactly), with my speculations as to genre in [square brackets]:
I was planning to post a longer (mostly humorous) write-up of my own results, but that post is low priority, so I don't know when it will happen.
Try varying lines 14 and 16 in the interactive script for quicker execution, and try giving it a few example lines to start with.
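I don't know exactly what sits on those lines of the script, but the sampling knobs usually worth varying are the sample length and the temperature (shorter samples return faster; temperature trades coherence against diversity). With the Hugging Face port they're exposed directly; a sketch, with arbitrary example values and an example priming line from Howl:

```python
# Sketch: varying the main sampling knobs. Uses the Hugging Face
# `transformers` GPT-2; the parameter values shown are arbitrary examples.
from transformers import pipeline

generator = pipeline("text-generation", model="gpt2")

out = generator(
    "Moloch whose mind is pure machinery!",  # example priming line from Howl
    max_new_tokens=60,   # shorter samples run faster
    temperature=0.8,     # <1.0 = more conservative, >1.0 = more adventurous
    top_k=40,            # restrict sampling to the 40 most likely tokens
    do_sample=True,
)[0]["generated_text"]
print(out)
```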