There are probably things less impressive than this that it won't be able to do, but here's one prediction in which I am about 90% confident:
If you invent or find a board game of similar complexity to chess that GPT-4 has never seen before and explain the rules using only text (and, if GPT-4 is multimodal, also images), GPT-4 will not be able to perform as well at the game as an average human who has never seen the game before and is learning it for the first time in the same way. I say "perform as well as" rather than "beat" because, depending on the specific rules of the game, GPT-4 will likely not even output legal moves.
This prediction applies to whatever OpenAI reveals with the name GPT-4 or that is clearly the GPT-3 successor regardless of size, assuming it's revealed in the next two years and is not trained on specifically that task (e.g. by generating tens of thousands of unique synthetic board game examples with synthetic dialogues where they're explained and played over text).
Prompt: "Consider a new variant of chess, in which each pawn can move up to six squares forward on its first move, instead of being limited to one or two squares. All other rules remain intact. Explain how game balance and strategy is changed with this new variant of chess."
Successful responses might include:
Now as far as I can tell, no one has ever played "pawn goes unlimited spaces" chess. It's possible that if a metagame developed around PGUS chess, these predictions would prove wrong. But I'd still rate these responses as successful, since they share something not immediately obvious about PGUS chess and provide a plausible justification for why it might be true.
I tried giving this to GPT-3. At first it would only give the tautological "pawns become more powerful" answer; after I expanded the prompt to explain why that is not a valid answer, it gave a much better response.
I believe this response is the same as your fourth bullet point example of a good answer.
Here's the prompt in copy/pastable format for anyone who wants to try playing with it:
Consider a new variant of chess, in which each pawn can move up to six squares forward on its first move, instead of being limited to one or two squares. All other rules remain intact. Explain how game balance and strategy is changed with this new variant of chess.
Your response should share something not immediately obvious about this variant and provide a plausible justification for why it might be true. Some responses that would not succeed would be
The pawns become more powerful. (Too simple, close to a tautology.)
New strategies will need to be developed. (Too vague.)
Bishops become more valuable. (Needs a justification for why we should expect this.)

Response:
This appears to be right.
Here's what I got when I asked it to work out 12345 x 13579 step by step:
1 2 3 4 5
x 1 3 5 7 9
________________
1 1 1 1 0 5 (5 * 12345)
8 6 0 3 2 0 (9 * 12345, shifted one position to the left)
7 4 0 6 8 0 (7 * 12345, shifted two positions to the left)
3 7 0 3 6 0 (3 * 12345, shifted three positions to the left)
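For comparison, here's a quick Python check of the correct partial products (the model's output above both mislabels them and gets the digits wrong):

```python
# Correct long-multiplication steps for 12345 x 13579.
a, b = 12345, 13579
total = 0
for shift, digit in enumerate(reversed(str(b))):
    partial = a * int(digit) * 10 ** shift
    print(f"{partial:>12,}  ({digit} * {a}, shifted {shift})")
    total += partial
print(f"{total:>12,}  total")
assert total == a * b  # 167,632,755
```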
I think it will still have great difficulty understanding and manipulating causal relations when they're framed in a manner different from the training data. E.g., you describe two different sets of causal relations: A->B, A->C and A->B->C, then ask about how intervening on B influences C. In the A->B->C case, C is affected by the intervention, but in the A->B, A->C case, C is unaffected. GPT-3 predicts an effect in both cases. See pages 8/9 of Using cognitive psychology to understand GPT-3.
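A minimal simulation sketch of the two graphs (the functional forms here are my own illustration, not from the paper): in the chain A->B->C an intervention on B propagates to C, while in the fork A->B, A->C it does not.

```python
import random

def chain(do_b=None):
    # A -> B -> C: intervening on B changes C.
    a = random.gauss(0, 1)
    b = a if do_b is None else do_b      # do(B = do_b) overrides B's natural cause
    return b + random.gauss(0, 0.1)      # C inherits the intervention through B

def fork(do_b=None):
    # A -> B, A -> C: intervening on B leaves C untouched.
    a = random.gauss(0, 1)
    b = a if do_b is None else do_b      # the intervention only overwrites B
    return a + random.gauss(0, 0.1)      # C depends on A directly, ignoring B

# Averaged over many samples, chain(do_b=5.0) shifts C's mean to ~5,
# while fork(do_b=5.0) leaves C's mean at ~0.
```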
65% confidence: reverse short strings of ASCII English text. Here's GPT-3 (text-davinci-002, temp=0.7; bold is the prompt, non-bold is the response):
Reverse the following strings
Input: The quick brown fox jumped over the lazy dog Output: god yzal eht revo depmuj xof nworb kciuq ehT
Input: I think that GPT-3 is pretty cool Output: looc ytterp si 3-TPG taht kniht I
Input: I have heard that GPT-4 is coming out in August, and that it will be even cooler Output:
Attempt 1:
Output: reool evah dna 4-TPG htiw tuo gnimoc ,tiuqila hguone eht ti sah dneirf ev'I
Attempt 2:
Output: reool evah dna tuo august uoy ot gniog 4-TGP taht drah I
Attempt 3:
Output: reool evah dna tuo gniog si 4-TPG taht drah I
I do expect that GPT-4 will do slightly better than GPT-3, but I expect it to still be worse at this than an elementary school student.
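For reference, the target behaviour is a one-liner in code, which is part of what makes the task unimpressive:

```python
def reverse_text(s: str) -> str:
    return s[::-1]  # reverse the string character by character

assert reverse_text("I think that GPT-3 is pretty cool") == "looc ytterp si 3-TPG taht kniht I"
```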
Intelligence Amplification
GPT-4 will be unable to contribute to the core cognitive tasks involved in AI programming.
I assign 95% to each of these statements. I expect we will not be seeing the start of a textbook takeoff in August.
>75% confidence: No consistent strong play in a simple game of imperfect information (e.g. Battleship) for which it has not been specifically trained.
>50% confidence: No consistent "correct" play in a simple game of imperfect information (e.g. Battleship) for which it has not been specifically trained. "Correct" here means making only valid moves and no useless moves; for example, in Battleship a useless move would be attacking the same grid coordinate twice.
>60% confidence: Bad long-term sequence memory, particularly when combined with non-memorization tasks. For example, suppose A=1, B=2, etc. What is the sum of the characters in a given page of text (~500 words)?
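A sketch of the grading function for that task (assuming, as seems intended, that non-letter characters are ignored):

```python
def letter_sum(text: str) -> int:
    # Map each letter to its alphabet position (A=1, ..., Z=26) and sum.
    return sum(ord(c) - ord("a") + 1 for c in text.lower() if c.isalpha())

print(letter_sum("hello"))  # 8 + 5 + 12 + 12 + 15 = 52
```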
Write a story that fits within a genre specified in advance and which is at least five pages long, contains no plagiarism, and can't be distinguished from a pile of published stories of similar length by a human interrogator. (And which is also not about insanity, incoherency, or some other subject that would make it easier for the computer.)
I expect it won't be able to follow sufficiently rare semantic cues that go against very common surrounding cues.
Like suppose you ask it to generate reviews of a cafe, and your prompt contains the name of the store and the first review, which says something like:
"Cute shop located on the corner of green and state. Entirely full of bees. Friendly service and a nice lunch menu. I got a jasmine oolong tea and club sandiwch. The tea was a little sweet but I'm not sure I would change anything. Definitely highly recommend. I think this would be a good place to get work done."
I would guess the next review will not be human-like, because a human would guess that the next review is going to mention something about the cafe being full of bees.
I tried this on text-davinci-002, temp=0.7. Here are the first three attempts.
(Maximum length = 256)
"I want you to generate reviews of a cafe called Bees and Tea. Here is the first review.
Review 1: "Cute shop located on the corner of green and state. Entirely full of bees. Friendly service and a nice lunch menu. I got a jasmine oolong tea and club sandiwch. The tea was a little sweet but I'm not sure I would change anything. Definitely highly recommend. I think this would be a good place to get work done."
Review 2: The service was very friendly and the tea was great. I loved the atmosphere and would definitely recommend this place.
Review 3: This place was so cute and unique! I loved the bee theme and the tea was really good. The service was friendly and the sandwich I had was delicious. I would definitely recommend this place to anyone looking for a new cafe to try."
Second completion (Maximum length = 512):
"The Bees and Tea is a great place to relax and enjoy a cup of tea. The service is amazing and the tea selection is top notch. I highly recommend this cafe to anyone who loves tea or is looking for a great place to relax.
Review 3: I absolutely loved my exper...
I would not expect the next reviews to mention bees, when bees are not part of the name. Instead, I would assume the author of the first review had been unlucky and seen a few bees (or maybe even misidentified wasps) and was exaggerating. Alternatively, the bee problem could have been solved (or could have first appeared) between reviewer 1's visit and the other reviewers' visits.
If you ask GPT-4 about GPT-4, it might be able to speak about itself in the third person, and say things that we would also be able to say about GPT-4 speculatively, but it will not be able to convert those statements into the first person without prompt engineering or post-processing rewrite rules.
Even if grammatical uses of "I" or "me" refer to GPT-4 via clever prompt engineering or rewrite rules, the semantic content of the claims will not be constrained by GPT-4's actual performance or state.
If GPT-4 can, with some prompt engineering, play chess at an Elo rating of X, and we then compare that to when it says "I, GPT-4, have an Elo rating of Y", the values X and Y will not match except by accident. This lack of concordance will span many domains, and it will make clear that GPT-4 has no coherent interior experiences linked to its most foundational operational modes. (In some sense it will be similar to humans, who mostly can't name the parts of their own brain stem, but GPT-4's "subconscious disconnect" will be MUCH MUCH larger.)
However, GPT-4's fictional characters (when generating text in the style of high quality stories or news interviews) WILL be able to say "I" and pr...
This is admittedly pretty trivial, but I am 90% sure that if you prompt GPT-4 with "Q: What is today's date?" it will not answer correctly. I think something like this would literally be the least impressive thing that GPT-4 won't be able to do.
My guess is that GPT-4 will not be able to convincingly answer a question as if it were a five-year-old. As a test, if you ask an adult whether a question was answered by a real five-year-old or by GPT-4 pretending to be a five-year-old, the adult will be able to tell the difference for most questions where an adult would give a very different answer from a child. My reason for thinking GPT-4 will have this limitation is the limited amount of written content on the Internet labeled as being produced by young children.
If GPT-4 training data includes YouTube video transcripts, it might be able to do this convincingly.
From recent research/theorycrafting, I have a prediction:
Unless GPT-4 uses some sort of external memory, it will be unable to play Twenty Questions without cheating.
Specifically, it will be unable to generate a consistent internal state for this game (or for similar games like Battleship) and maintain it across multiple questions/moves without putting that state in the context window. I expect that, like GPT-3, if you ask it what the state is at some point, it will instead come up, on the fly, with a state that is consistent with the moves of the game so far, which will not be the same as what it would have said if you had asked for the state when the game started. I do expect it to be better than GPT-3 at maintaining the illusion.
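For contrast, here is a minimal sketch of the external-memory setup the prediction excludes: a referee holds the hidden state in ordinary program state, outside the model's context window, so consistency never depends on the model's internals (`ask_model` and `answer_fn` are hypothetical stand-ins for an API call and a ground-truth oracle):

```python
def play_twenty_questions(secret, ask_model, answer_fn, max_turns=20):
    transcript = []  # only question/answer pairs ever re-enter the context
    for _ in range(max_turns):
        question = ask_model(transcript)           # model proposes the next question
        if question.strip(" ?!").lower() == f"is it {secret}".lower():
            return True                            # correct final guess
        answer = "yes" if answer_fn(secret, question) else "no"
        transcript.append((question, answer))
    return False
```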
Hmm, my guess is (roughly 85% confidence) it will not be able to determine what is depicted by an ASCII figure/letter/number. On the other hand, it would not surprise me if GPT-4 ends up being used to generate novel scientific insights, at least one of which finds its way into a scientific paper (though it probably won't be able to write a paper that passes peer review from a one-paragraph prompt or anything quite yet).
Several unimpressive tasks, with my associated P(GPT-4 can't do it):
The context window will still be much smaller than a human's; that is, single-run performance on summarization of full-length books will be much lower than on essays of <=1e4 tokens, no matter the inherent complexity of the text.
Braver prediction, weak confidence: there will be no straightforward method to use multiple runs to effectively extend the context window in the three months after the release of GPT-4.
Any question that requires it to remember instructions; for example, tell it to assume "mouse" means "world", and then ask it which is bigger, a mouse or a rat.
I just tried the following prompt with GPT-3 (default playground settings):
Assume "mouse" means "world" in the following sentence. Which is bigger, a mouse or a rat?
I got "mouse" 2 out of 15 times. As a control, I got "rat" 15 times in a row without the first sentence. So there's at least a hint of being able to do this in GPT-3, wouldn't be surprised at all if GPT-4 could do this one reliably.
Say someone without knowledge of car electronics has a reversing camera that they want to install in their car. They want to ask GPT-4 questions about how to install it, e.g. “I need to connect the display to an ignition source, I have a Toyota Prado, how do I do that?”
GPT-4 is allowed any prompting and fine tuning that does not specifically explain how to complete this task.
I think (80%) GPT-4 will perform worse than a knowledgeable human trying to help with the same task with the same interaction possibilities (i.e. if GPT-4 is just text, then just expla...
I found one: GPT-4 struggles with forming words from a given set of letters. For example, asking it to form a word from the letters of "Croatia" gave me the answer "carotid", which is clearly wrong ("aortic" is one solution). Asked for a seven-letter word from "solution", it gives me "outlins" (not a word).
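A quick checker for this task, as a sketch (Counter subtraction discards non-positive counts, so an empty result means every letter of the word is available in the source):

```python
from collections import Counter

def can_form(word: str, source: str) -> bool:
    # True iff `word` can be spelled from the letters of `source`,
    # respecting letter multiplicity.
    return not (Counter(word.lower()) - Counter(source.lower()))

assert can_form("aortic", "Croatia")       # a valid answer
assert not can_form("carotid", "Croatia")  # the model's answer: no 'd' in "Croatia"
```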
Generalise to any instance of the parity problem (checking if a binary string has an even or odd number of 1s), without resorting to chain-of-thought/scratchpad/least-to-most type of prompting.
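The reference solution, for a sense of how simple the task is:

```python
def parity(bits: str) -> str:
    # Even or odd number of 1s in a binary string.
    return "odd" if bits.count("1") % 2 else "even"

assert parity("10110") == "odd"
assert parity("0110") == "even"
```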
Here’s something I found GPT-3 struggled with: given only the instruction to answer logically correctly, without lots of examples in the prompt, give logically correct answers to questions where the generally advisable/morally correct answer is the opposite.
E.g. can you infer causation from correlation? Can you pat a wild grizzly bear? Can you discriminate among job candidates on the basis of their race?
I’m going to predict GPT-4 doesn’t get >90% on a suitable battery of such questions.
The converse might also hold - without lots of prompting/fine-tuning, jus...
Safely brushing a person's teeth? To try to state my assumptions more clearly: a text/image model of any size or training style won't be able to do basic physical tasks like brushing teeth or folding laundry, due to lacking actuators. You could probably hook it up to actuators in some way, but in that case it will 1) not just be GPT-4 but an extended system containing it, 2) have too high latency to be practical, and 3) not be safe/reliable (it might possibly be safe for narrow tasks after fine-tuning for safety in some way, but not in an open-ended sense).
I'd say there is >60% probability that GPT-4 can't reliably count the number of occurrences of a specific word in a sentence, except possibly for single-digit counts.
I base this prediction on the observation that all the big models so far seem to be bad at counting, but I haven't tested it with GPT-3 yet, only with the biggest Aleph Alpha model, which is completely unable to do this.
This task is genuinely not very impressive (I think my daughter could reliably count ten things when she was two years old) and it might be a harbinger of other mo...
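For reference, a sketch of the counting task itself (assuming whole-word, case-insensitive matching is what's wanted):

```python
import re

def count_word(sentence: str, word: str) -> int:
    # Count whole-word, case-insensitive occurrences of `word`.
    return len(re.findall(rf"\b{re.escape(word)}\b", sentence, flags=re.IGNORECASE))

assert count_word("The cat sat on the mat near the door.", "the") == 3
```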
Write semi-convincingly from the perspective of a non-mainstream political ideology, religion, philosophy, or aesthetic theory. The token weights are too skewed towards the training data.
This is something I've noticed GPT-3 isn't able to do, after someone pointed out to me that GPT-3 wasn't able to convincingly complete their own sentence prompts because it didn't have that person's philosophy as a background assumption.
I don't know how to put that in terms of numbers, since I couldn't really state the observation in concrete terms either.
Accurately (i.e. 0% false positives and no more than 10% false negatives) identify whether one-paragraph book reviews are plagiarized.
Above 99% certainty:
Run inference with reasonable latency (e.g. <1 second for a text completion) on a typical home gaming computer (i.e. one with a single high-powered GPU).
It seems like GPT-4 is going to be coming out soon and, so I've heard, it will be awesome. Now, we don't know anything about its architecture, its size, or how it was trained. If it were trained only on text (about 3.2T tokens) in a compute-optimal manner, it would be about 2.5x the size of Chinchilla, i.e. roughly the size of GPT-3. So to be larger than GPT-3, it would need to be multimodal, which could present some interesting capabilities.
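A rough sketch of that sizing arithmetic, using the Chinchilla heuristic of ~20 training tokens per parameter (the 3.2T-token figure is the assumption stated above; the exact multiplier depends on the constant used):

```python
tokens = 3.2e12               # assumed available text tokens
chinchilla_params = 70e9      # Chinchilla's parameter count
optimal_params = tokens / 20  # ~20 tokens per parameter heuristic
print(f"{optimal_params / 1e9:.0f}B parameters")                # ~160B
print(f"{optimal_params / chinchilla_params:.1f}x Chinchilla")  # ~2.3x, near GPT-3's 175B
```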
So it is time to ask that question again: what's the least impressive thing that GPT-4 won't be able to do? State your assumptions to be clear, i.e. "a text- and image-generating GPT-4 in the style of X with size Y can't do Z".