Artificial eXpert Intelligence and Artificial Super Intelligence. Sorry for not being clear; I've edited the title to be more obvious.
What I'm going for here is that the few-shot examples show GPT-3 is a sort of DWIM AI (Do What I Mean): something it has implicitly learned from examples of humans responding to other humans' requests. It is able to understand simple requests (like unscrambling a word, using a word in a sentence, etc.) with about as much data as a human would need: it grasps the underlying motive of the request and attempts to fulfill it.
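To make the DWIM point concrete, here is a minimal sketch of the kind of few-shot prompt involved, using the unscramble task as the example; the word pairs and the `complete` function are illustrative placeholders, not the paper's exact format or any particular API.

```python
# A few request/response pairs followed by a fresh request: the model is
# expected to infer the underlying intent ("unscramble this") from the
# examples alone and continue the pattern.
FEW_SHOT_PROMPT = """\
Please unscramble the letters into a word, and write that word:
skicts = sticks

Please unscramble the letters into a word, and write that word:
pplae = apple

Please unscramble the letters into a word, and write that word:
nelir ="""

def complete(prompt: str) -> str:
    """Placeholder for however you query the model (e.g. an API call)."""
    raise NotImplementedError

# complete(FEW_SHOT_PROMPT) should ideally come back with "liner".
```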
On firefigh...
The big jump in performance between the zero-shot and few-shot settings on arithmetic and other non-linguistic reasoning tasks [esp. 3D- & 3D+] is why I think it is almost certain #2 is true. Few-shot inference involves no further training [unlike fine-tuning], so the improvement in 'pattern recognition', so to speak, is happening entirely at inference time. It follows that the underlying model has general reasoning abilities, i.e. the ability to detect and repeat arbitrary patterns of ever-increasing complexity that occur in its input (conditioning).
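For anyone who hasn't looked at the paper's setup, the zero-shot/few-shot distinction is purely a matter of what sits in the context window; a sketch with made-up numbers for the 3-digit addition (3D+) task:

```python
# Zero-shot: the bare question. Few-shot: the same question preceded by a
# handful of solved examples. No weights change in either case, so any gap
# in accuracy reflects pattern recognition done at inference time.
ZERO_SHOT = "Q: What is 387 plus 264?\nA:"

FEW_SHOT = """\
Q: What is 521 plus 138?
A: 659

Q: What is 294 plus 607?
A: 901

Q: What is 387 plus 264?
A:"""
```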
I feel the few-shot arithmetic results are some of the most revolutionary results we've seen this decade. Learning math from nothing but verbal examples shows that we have an agent that can reason like us. The progression to complex algebra seems inevitable, with nothing but more parameters.
Assumption #2 is entirely correct, which is why the few-shot examples matter. The system is literally smart enough to figure out what addition is from those 50 examples. I bet most of us took longer than 50 examples.
" But #2 is wild: it would represent ...
The GPT-2 training data has been replicated by the open-source OpenWebText project. GPT-3's training mix is broader (filtered Common Crawl, books, Wikipedia, and an expanded WebText), but a WebText-style corpus is still a core component, so getting hold of comparable data is not a problem.
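If anyone wants to poke at it, a minimal sketch for pulling the replicated corpus with the Hugging Face `datasets` library, assuming the `openwebtext` dataset id is still current (the full download is on the order of tens of GB):

```python
from datasets import load_dataset

# Single "train" split with one "text" field per document.
owt = load_dataset("openwebtext", split="train")
print(owt[0]["text"][:500])  # peek at the first document
```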
The parallelizability of GPT-3 is something I've been looking into. The current implementation of ZeRO stage 2 (ZeRO-2) looks like the most memory-efficient way to train a 175B-parameter transformer model.
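For concreteness, a rough sketch of what enabling ZeRO stage 2 looks like through a DeepSpeed config; the toy model, optimizer settings, and bucket sizes here are placeholders, not a recipe that would actually fit a 175B-parameter model on any particular cluster.

```python
import torch
import deepspeed

model = torch.nn.Linear(1024, 1024)  # stand-in for the real transformer

ds_config = {
    "train_micro_batch_size_per_gpu": 1,
    "optimizer": {"type": "Adam", "params": {"lr": 1e-4}},
    "fp16": {"enabled": True},
    "zero_optimization": {
        "stage": 2,  # partition optimizer state and gradients across data-parallel ranks
        "allgather_bucket_size": 5e8,
        "reduce_bucket_size": 5e8,
    },
}

# deepspeed.initialize wraps the model in an engine that handles the
# partitioned optimizer state, gradient reduction, and fp16 loss scaling.
engine, optimizer, _, _ = deepspeed.initialize(
    model=model,
    model_parameters=model.parameters(),
    config=ds_config,
)
```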