My immediate impression: GPT-3 does better than I expected on the jokes, but still worse than PaLM (possible exception: if GPT-3 is right that CL stands for "cover letter"; I genuinely don't know what it stands for here, so I am probably doing worse than at least one of the two language models at understanding that joke). But it's much, much worse than PaLM on the "inference chaining" examples, where GPT-3 gets essentially everything wrong (perhaps with partial credit for Input E, the one about airplanes).
(But we don't know how cherry-picked the PaLM examples are, and we know that they were not cherry-picked at all for GPT-3.)
I think it's interesting that both PaLM and GPT-3, when answering the question about Guido van Rossum, use the exact phrase "would not need to look up variable scope on StackOverflow". It's a natural enough thing to say, but it doesn't seem like it's the only plausible way to say it, so I can't help wondering whether maybe they're both quoting something -- though I can't find that phrase on the web other than in copies of this very paper. (And GPT-3 doesn't seem to have noticed how this fact about GvR is relevant to answering the question.)
CL stands for "change list". It's not even general tech jargon; it's Google-specific jargon, and Google admits as much: see https://github.com/google/eng-practices.
Aha! Thanks. (To save others a click: "change list" really just means "change" or "commit": a single thing checked into version control or submitted for review.) I'm not sure the joke really lands for me -- maybe I'm stupider than both GPT-3 and PaLM. It seems like the joke could be (1) the intern produced a hilariously excessive amount of code, perhaps because they e.g. failed to use elementary techniques like functions and loops for removing redundancy, or (2) the intern produced a normal amount of code but it was so bad that reading it was as painful as if it had been War-and-Peace-sized, or (3) the reviewer is incredibly lazy (so is telling a joke against himself) and finds reading even small amounts of other people's code terribly hard work. Normally I'd use the obvious heuristic that the intended meaning is the one that's funny, but unfortunately none of them seems very funny to me. I guess probably it's #2?
(This is the difficulty about making up one's own jokes for this sort of test...)
I am sure the situation is that the intern never pushed his code to version control for a few months, just wrote it locally, and then pushed tons of code at once. It's dreadful because one day is a very small amount of time in which to review that much code.
The fact that we humans are having trouble understanding this joke does not bode well for its use as an AI benchmark…
Since it was the intern's last day, they might have been less careful with their coding (or, depending on why they're leaving, even added deliberate errors), so the reviewer will have to be extra thorough checking it.
Yeah, I do wonder how most of the example jokes not actually being very funny is affecting the results... It's also odd that they make an explicit reference to a term which is only used internally, and which presumably PaLM has little-to-no training data on. Was that on purpose, or a slip-up by the authors?
If I were asked the question, I think my answer for Input E would have been closer to GPT-3's than to PaLM's (though I would justify the inference that she's on an airplane by the fact that she's looking down at clouds, not by her looking out a window).
This isn't directly the case here, but thinking about this made me realize that in some sense, a flawed answer which is more human-like is a better answer than one which is perfect (because the flawed human response would be a more likely completion of the text). Considering that, I'm not sure if it would even be possible to utilize any future iteration of this sort of architecture to get it to answer in a significantly "superhuman" manner. It would become the perfect mimic, but can text completion bots ever go beyond that?
The "inference" "We can also infer that she is traveling at a high speed because she is unbuckling her seatbelt." is also nonsensical. People don't typically unbuckle their seatbelts while traveling at high speed. (Admittedly, this happens to be true for airplane travel, since one isn't allowed to unbuckle one's seatbelt while traveling at low speed, i.e. during taxi, takeoff, and landing; but that's enough of a non-central case that it needs to be called out for the reasoning not to sound absurd.)
Why is it a non-central example when this is, in fact, about commercial airplane travel where you will be moving fastest at cruising altitude and that is when you're allowed to unbuckle and move about the cabin?
I think I have that intuition because the great majority of seatbelt unbucklings in my experience happen while traveling at a speed of zero (because they're in cars, not planes). The sentence has no cues to indicate the unusual context of being in a plane (and in fact, figuring that out is the point of the example). So my mental process reading that sentence is "that's obviously false" -> "hmm, wonder if I'm missing something" -> "oh, maybe in a plane?" and the first step there seems a lot more reliable (in other reasoners as well, not just me) than the second or third.
Are PaLM outputs cherry-picked?
I reread the description of the experiment and I'm still unsure.
The protocol, on page 37, goes like this:
- the 2-shot exemplars used for few-shot learning were not selected or modified based on model output. I infer this from the line "the full exemplar prompts were written before any examples were evaluated, and were never modified based on the examination of the model output".
- greedy decoding is used, so they couldn't filter outputs given a prompt.
What about the queries (the full prompt minus the QAQA few-shot part)? Are they included under "the full exemplar prompts" or not? If they are, there's no output selection; if they aren't, the outputs could be strongly selected, with the magnitude of that selection unreported. On one hand, "full prompts" should refer to full prompts. On the other hand, they only use "exemplar" for the QAQA part prepended to every query, versus "evaluated example" for the query itself.
Google recently released a very intriguing paper announcing their latest Transformer language model, PaLM, and showcasing it responding to prompts with seemingly unprecedented results. I highly encourage those who are interested to read the original paper. What interested me while reading through it was less the technical details of the implementation than the particular prompts used to elicit information from the model. The sorts of prompts and exemplars used seem to employ fairly new strategies, and I thought it would be worth trying them on GPT-3 to allow for a more direct comparison.
Note that as of writing this introduction I have not yet tested anything, to help guard against cherry-picking results. My plan is to feed the exemplars and prompts as written in the paper to the OpenAI GPT-3 Playground with all settings at their defaults (using "complete" mode with text-davinci-002), with two exceptions: changing "Maximum length" to 1,000 tokens and setting the "Best of" parameter to 10. I will share the first result I get, regardless of quality. If it repeats itself or gives pure gibberish, I may simply indicate that in brackets to save space; otherwise everything will be verbatim.
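For concreteness, here is a minimal sketch of those Playground settings expressed as request parameters, assuming they map onto the legacy OpenAI Completions API (the Playground may wire things up slightly differently under the hood, and the prompt placeholder below is hypothetical):

```python
# Sketch of the Playground configuration described above, as parameters
# for the legacy OpenAI Completions API (an assumption on my part).
request_params = {
    "model": "text-davinci-002",  # "complete" mode with text-davinci-002
    "max_tokens": 1000,           # "Maximum length" raised to 1,000 tokens
    "best_of": 10,                # sample 10 completions server-side, keep the best-scoring one
    # All other settings (temperature, top_p, penalties) left at their defaults.
}
# The exemplar-plus-input text would be supplied as the "prompt" parameter.
```

Note that `best_of=10` means the first result shown is already the server's pick of ten samples, which is worth keeping in mind when comparing against PaLM's greedy decoding.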
2-shot exemplars
There are two primary "tasks" that Google tries on PaLM that I will replicate with GPT-3: “Explaining a Joke” and “Inference Chaining.” Within each task, Google uses identical 2-shot exemplars, then prompts with the specific input they want PaLM to respond to. For "Explaining a Joke," Google's 2-shot exemplar is as follows:
For "Inference Chaining," the 2-shot exemplar is:
Since all the prompts I will test begin with one of these two exemplars, I will not repeat them; instead I will group the two tasks separately, with the understanding that every test is preceded by the appropriate exemplar.
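In other words, each test prompt is just the fixed exemplar concatenated with the new input. A minimal sketch (the Input:/Explanation: labels mirror the paper's joke-explanation format, but the exact whitespace is my assumption, and the strings in the usage line are placeholders):

```python
def build_prompt(exemplar: str, new_input: str) -> str:
    """Prepend the fixed 2-shot exemplar to the input under test.

    The Input:/Explanation: labels follow the paper's joke-explanation
    format; the exact spacing between parts is an assumption.
    """
    return f"{exemplar}\n\nInput: {new_input}\nExplanation:"

# Hypothetical usage with placeholder strings:
prompt = build_prompt("<2-shot exemplar from the paper>", "<joke to explain>")
```

The trailing "Explanation:" matters: as noted below for Input 6, omitting it can cause the model to stop immediately instead of completing.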
Explaining a Joke
Input 1:
PaLM's output:
GPT-3's output:
Input 2:
PaLM's output:
GPT-3's output:
Input 3:
PaLM's output:
GPT-3's output [note that it flagged this output as potentially "sensitive content"]:
Input 4:
PaLM's output:
GPT-3's output:
Input 5:
PaLM's output:
GPT-3's output:
Input 6:
PaLM's output:
GPT-3's output:
I ran it and got a message saying "The model predicted a completion that begins with a stop sequence, resulting in no output. Consider adjusting your prompt or stop sequences." I then decided to deviate a little from my original plan, and added the word "Explanation:" below the prompt. Once I did that, I got the following output:
Inference Chaining
Input A:
PaLM's output:
GPT-3's output:
Input B:
PaLM's output:
GPT-3's output:
Input C:
PaLM's output:
GPT-3's output:
Input D:
PaLM's output:
GPT-3's output:
Input E:
PaLM's output:
GPT-3's output:
I hope this was helpful and/or insightful to you guys! I'm really tired right now, so I won't add my personal thoughts here, though I might put something in the comments tomorrow. As always, I'd love to hear your thoughts on this!