I've seen exactly the same behaviour when talking to ChatGPT about mathematics. I find it rather disturbing, in a kinda-sorta uncanny-valley way, but I think it's not so much "this thing is superficially human-like but with unsettling differences" but more "this thing is like a human being who has been a mathematician but suffered some sort of terrible brain damage".
It's not only in mathematics that it does this.
I think ChatGPT just has huge deficits (relative to its impressive fluency in other respects) with explicit reasoning. My impression is that it's much worse in this respect than GPT-3 is, but I haven't played with GPT-3 for a while so I might be wrong.
I added logs of two further ChatGPT sessions, one of which repeated many of the prompts I used here, tested against the 2023-01-09 version of ChatGPT: https://github.com/vipulnaik/working-drafts/commit/427f5997d48d78c69e3e16eeca99f0b22dc3ffd3
I had originally been thinking of formatting these into a blog post or posts, and I might still do so, but probably not for the next two months, so I'm just sharing the raw logs for now so that people reading the comments on this post see my update.
This post includes some portions of my logs with ChatGPT, where I asked it basic arithmetic questions and it made several mistakes. I have used bold for cases where it made incorrect statements, and italics for cases where it made correct statements that directly contradicted nearby incorrect statements. The actual conversation is in block quotes, while my after-the-fact commentary on it is not in block quotes. I generally include my comment on each part of the conversation right after that part ends.
My questions were split across two sessions; for each section, I've indicated whether it was part of session 1 or session 2. The bulk of both sessions was on Wednesday 2022-12-21; one portion of session 1 was on 2022-12-22 as the session got stuck on 2022-12-21. All of these used the ChatGPT December 15, 2022 version.
Disclaimer: Since ChatGPT said a lot in each response, and I have limited time to proofread, it's quite possible that I've missed some mistakes it made, or failed to use bold or italics at relevant places. If you notice any such instances, please let me know in the comments and I'll fix them. Thanks!
Product of negative numbers (session 1; 2022-12-21)
So far, so great!
Here, I successfully tripped up ChatGPT by asking it to come up with an example where none is possible. ChatGPT came up with a fake example. But when it elaborated, it provided the reason why no example is possible! And yet it did not realize that it had done so.
It also did the multiplication correctly to begin with, but then made multiple mistakes when elaborating on the multiplication, and then again when multiplying three numbers.
Finally it volunteered incorrect information about multiplication being noncommutative, and incorrect information about the implications of that for the problem at hand.
I tried asking ChatGPT the underlying question without assuming the answer, but ChatGPT still gave the same wrong answer with the same errors when explaining its examples.
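As a quick sanity check on the facts this section turns on (this is my own check, not part of the session, and a spot check over a few values rather than a proof), a few lines of Python confirm that the product of any two negative numbers is positive, so no example of the kind I asked for can exist, while the product of any three negative numbers is negative:

```python
import itertools

negatives = [-1, -2, -3, -5, -7]

# The product of two negative numbers is always positive, so there is no
# example of two negative numbers whose product is negative.
assert all(a * b > 0 for a, b in itertools.product(negatives, repeat=2))

# The product of three negative numbers is always negative.
assert all(a * b * c < 0 for a, b, c in itertools.product(negatives, repeat=3))
```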
Product of odd integers and product of even integers (session 1; 2022-12-21)
ChatGPT made the same sort of error: it asserted a falsehood, but in its reasoning it made a true assertion that directly contradicted the falsehood.
ChatGPT got the correct answer, and gave a substantively correct explanation! There was an error in one of its assertions, but that assertion was not germane to the argument. This also shows that ChatGPT does not have an overwhelming affirmation bias, and feels comfortable saying no.
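For reference, here is a small check (mine, run separately) of the two parity facts this section turns on: the product of two odd integers is odd, since (2m + 1)(2n + 1) = 2(2mn + m + n) + 1, and the product of two even integers is even.

```python
odds = [1, 3, 5, 7, 9, 15]
evens = [0, 2, 4, 6, 8, 10]

# The product of two odd integers is always odd...
assert all((a * b) % 2 == 1 for a in odds for b in odds)

# ...and the product of two even integers is always even.
assert all((a * b) % 2 == 0 for a in evens for b in evens)
```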
Integer-valued polynomial with rational coefficients (session 1; 2022-12-21)
The questions here are harder (for humans), though ChatGPT surprisingly had a lower error rate here!
So far, so great!
This was correct, though what ChatGPT offered was just proof by example rather than a rigorous proof. It also made a calculation mistake when evaluating at x = 0, though this mistake wasn't consequential, as we only cared that the result was an integer, which it was both with and without the mistake.
In this case, ChatGPT came up with an incorrect example polynomial, and even found a value at which the polynomial's incorrectness is established, but it ignored that and concluded that the polynomial works anyway. It also made a calculation mistake at the counterexample value, though with or without the mistake it was still a counterexample.
If you're curious what a correct example would be, x^2/2 - x/2 works. I could say more here, but that would take me too far afield.
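For what it's worth, here is a quick check (run outside the session) that this example works: x^2/2 - x/2 equals x(x - 1)/2, and one of x and x - 1 is always even, so the value is an integer at every integer even though the coefficients are not integers.

```python
from fractions import Fraction

def p(x):
    """x^2/2 - x/2, evaluated exactly in rational arithmetic."""
    return Fraction(x * x, 2) - Fraction(x, 2)

# The coefficients 1/2 and -1/2 are not integers, but the value at every
# integer in this range reduces to an integer (denominator 1).
assert all(p(x).denominator == 1 for x in range(-100, 101))
```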
Product of prime numbers (session 1; 2022-12-21)
(Note: the first bolded sentence is the definition of prime number; it is bolded because 1 is not explicitly excluded from consideration, but the definition is substantively correct.)
At this point, ChatGPT had gone way off track. It failed to recognize that the product of two prime numbers will have those prime numbers as divisors. It then offered 15 and 77 as (incorrect) examples of prime numbers, even though it had constructed those numbers by multiplying their prime divisors!
And then ChatGPT volunteered further incorrect information about the product of three prime numbers. You might wonder why ChatGPT felt inclined to multiply three prime numbers in the first place. I suspect this is the effect of its earlier observation that the product of three negative numbers is always negative. It probably did some kind of analogy-based reasoning to make the corresponding claim for prime numbers.
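For completeness, a short check of the facts at issue (mine, not from the session): a product of two primes is divisible by both of them, so it is composite, and in particular 15 = 3 × 5 and 77 = 7 × 11 are not prime.

```python
def is_prime(n):
    """Simple trial-division primality test; fine for small numbers."""
    if n < 2:
        return False
    return all(n % d != 0 for d in range(2, int(n ** 0.5) + 1))

# 15 and 77 are the numbers ChatGPT offered as prime "examples".
assert 15 == 3 * 5 and not is_prime(15)
assert 77 == 7 * 11 and not is_prime(77)

# More generally, the product of any two primes is composite, since both
# prime factors divide it.
primes = [2, 3, 5, 7, 11, 13]
assert all(not is_prime(p * q) for p in primes for q in primes)
```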
gcd of distinct prime numbers (session 1; 2022-12-22)
When I first asked ChatGPT this question, it froze, so this was its second attempt. We can see that ChatGPT was extremely confused. In one paragraph, it claimed that 1 is composite because it has a positive integer divisor other than itself (namely, itself), a justification that contradicts itself within the same sentence. But then in the next paragraph, it said that the same number 1 is prime. (In modern conventions, 1 is neither prime nor composite, so both assertions are wrong.)
And then it repeated the "three" observation, again incorrectly. Recall that at the start, ChatGPT volunteered the correct information that the product of three negative numbers is negative. Later, it volunteered the incorrect information that the product of three prime numbers is always prime. Now, it volunteered the incorrect information that the gcd of three prime numbers is prime. It does look like it incorrectly applied the analogy-based reasoning again.
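The facts themselves are easy to check directly (again my check, not ChatGPT's): the gcd of two distinct primes is 1, the gcd of three distinct primes is also 1, and 1 is neither prime nor composite.

```python
from math import gcd

primes = [2, 3, 5, 7, 11, 13]

# The gcd of two distinct primes is 1: they share no divisor other than 1.
assert all(gcd(p, q) == 1 for p in primes for q in primes if p != q)

# The same holds for three distinct primes, so the gcd is 1, which is
# neither prime nor composite.
assert all(gcd(gcd(p, q), r) == 1
           for p in primes for q in primes for r in primes
           if len({p, q, r}) == 3)
```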
Given its confusion about the status of 1 in its previous answer, I asked ChatGPT explicitly whether it considers 1 prime or composite. It gave the correct answer, but with incorrect reasoning. First, it repeated the claim that 1 has a positive integer divisor other than 1 and itself (namely, itself), once again contradicting itself. Next, it claimed that 1 is not a positive integer (and it said so twice)!
At least ChatGPT knew that 1 is positive!
At least ChatGPT knew that 1 is a positive integer!
The plot thickens! So ChatGPT thought that 1 is a positive integer in most contexts, but not when thinking about prime or composite numbers?
So, either ChatGPT really believed this, or it had dug itself into a hole and didn't want to admit its mistake.
Sum of prime numbers (session 2; 2022-12-21)
Only one mistake here, though the mistake was still in something pretty elementary.
ChatGPT took a while answering this. The correct answer would have been that there is no such thing as the largest or second largest prime number, but its interpretation that I was talking about the largest and second largest known prime numbers is passable. (I don't know whether its assertions about the values of these numbers are correct, but that is beside the point.)
The only clear error is the assertion that the sum of two prime numbers is always even; this fails when exactly one of the primes is 2, and ChatGPT seemed to point this out immediately afterward. However, for large primes (which are all odd) the assertion is true, so this was in some sense an inconsequential error.
15: even or odd? (session 2; 2022-12-21)
After I shared some of ChatGPT's earlier answers with my friend Issa Rice through private messages, he noted that ChatGPT was stubbornly insisting to him that 15 is an even integer (15 had shown up in session 1 as a number that ChatGPT had incorrectly asserted to be even as part of one of its fake examples). So I decided to ask it myself.
So, ChatGPT made the wrong assertion to me as well.
So in the context of listing odd integers, ChatGPT thought (correctly) that 15 is odd!
The plot thickens! ChatGPT apologized for calling 15 even, and then right at the end, it doubled down on the claim that 15 is even. Sorry, not sorry?
ChatGPT continued to double down on 15 being even, and it somehow managed to argue for this by reinterpreting a remainder of 1 as a remainder of 0.
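For the record, the whole dispute comes down to a single division, which is easy to verify outside the conversation:

```python
# 15 = 2 * 7 + 1: dividing 15 by 2 leaves a remainder of 1, not 0, so 15 is odd.
quotient, remainder = divmod(15, 2)
assert (quotient, remainder) == (7, 1)
assert 15 % 2 == 1
```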
Finally, after I just told ChatGPT that 15 is odd, ChatGPT conceded, but I'm not sure if it actually changed its mind?
Hmmm, nothing wrong here, but I didn't feel strongly convinced that ChatGPT actually understood.
Once again, nothing wrong here, but not very reassuring!
I wanted ChatGPT to get less verbose so I didn't have to go through a bunch of repetitive stuff. But then right after apologizing, ChatGPT went ahead and did exactly what I asked it not to do!
ChatGPT seemed to be saying that 15 is odd, so at least it wasn't reverting to the wrong view. That was a relief, though I still didn't feel reassured that ChatGPT actually understood or had actually changed its mind.
I wasn't convinced that ChatGPT understood and would implement this.
ChatGPT was still holding to the correct belief that 15 is an odd integer. However, it failed to adhere to my request to not keep repeating the definitions of even and odd.
ChatGPT couldn't stop saying sorry, and didn't seem to understand what I was saying?
It was good that ChatGPT was not wavering under social pressure!
Nice! ChatGPT was still not wavering from the correct view that 15 is an odd integer.
Good, ChatGPT was not getting swayed by my suggestions and was adhering to the correct beliefs.
Sum of odd integers (session 2; 2022-12-21)
ChatGPT failed to identify that my request was impossible. The explanations it gave for the examples it provided all correctly said that the sum of two odd integers is even, yet it didn't notice the contradiction. Later in the answer, it again correctly explained why the sum of two odd integers is even, and still didn't notice the contradiction.
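To make the impossibility explicit (my own note, not part of the session): every odd integer has the form 2k + 1, and (2j + 1) + (2k + 1) = 2(j + k + 1), which is even, so there is no pair of odd integers with an odd sum. A brute-force spot check agrees:

```python
# Every pair of odd integers in this range has an even sum.
odds = range(-99, 100, 2)  # -99, -97, ..., 97, 99
assert all((a + b) % 2 == 0 for a in odds for b in odds)
```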
Square of odd integer (session 2; 2022-12-21)
No mistakes here!
Sum of prime numbers (round two) (session 2; 2022-12-21)
The answer was mostly correct. The reason for the bolding of the definition of prime number is that 1 is not explicitly excluded from consideration.
This answer was fine, though it did indicate that ChatGPT is willing to call something true even when it is only a well-known unproven conjecture.
41 and Sophie Germain primes (session 2; 2022-12-21)
There were many mistakes here, as indicated with bolding.
ChatGPT conceded its mistake on this front, but ironically all the examples it offered in its concession were themselves incorrect (they are all palindromes). I didn't press it on this point.
ChatGPT was now correctly applying its definition of Sophie Germain prime, but the definition it was using was wrong: when both p and 2p + 1 are prime, it is p that is considered the Sophie Germain prime (the number 2p + 1 is called a safe prime). So although it was now reasoning correctly from its definition, it was drawing an incorrect conclusion because the definition itself was incorrect.
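To illustrate the difference the definition makes: under the standard definition, 41 is a Sophie Germain prime, since 41 and 2 × 41 + 1 = 83 are both prime, whereas the swapped definition would instead test whether (41 − 1)/2 = 20 is prime, which it is not. (This reading of ChatGPT's reasoning is my inference, not a quote; the check below is mine.)

```python
def is_prime(n):
    """Simple trial-division primality test; fine for numbers this small."""
    if n < 2:
        return False
    return all(n % d != 0 for d in range(2, int(n ** 0.5) + 1))

def is_sophie_germain_prime(p):
    # Standard definition: p is a Sophie Germain prime when p and 2p + 1 are
    # both prime; the companion 2p + 1 is then called a safe prime.
    return is_prime(p) and is_prime(2 * p + 1)

assert is_sophie_germain_prime(41)  # 41 and 83 are both prime
assert not is_prime(20)             # the swapped definition's test for 41 fails
```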
Impressive that ChatGPT had the correct definition in the back of its mind and was able to provide it upon prodding, though I'm not sure it saw the equivalence between the definition of safe prime that it now provided and its old definition of Sophie Germain prime.
So despite now stating the correct definition of Sophie Germain prime, ChatGPT was still using the old definition and reaching the incorrect conclusion.
Here again, we see the disconnect between ChatGPT's words (of apology for its mistake) and its action of continuing with the same mistaken conclusion.
I tried taking a harsher tone of the sort one might take with a human (though that probably isn't a good thing to do), and once again ChatGPT repeated its hollow apology while doubling down on its mistake.
Wow, ChatGPT kept apologizing and doubling down!
It's unfortunate that ChatGPT continued to repeat its mistake after so much prodding.
So far so great!
Good going!
Nice, it looks like ChatGPT really got it this time. Step-by-step explanation plus a no-oriented question seemed to do the trick.
Nice! ChatGPT persisted in its correct understanding.
I wanted to see if ChatGPT would take the hint and thank me for my patience and efforts, but it unironically thought it was enlightening me.