Okay, you raise a very good point. To analogize to my own brain: it's like noticing that I can multiply integers 1-20 in my head in one step, but for larger numbers I need to write it out. Does that mean that my neural net can do multiplication? Well, as you say, it depends on n.
it's easy to imagine a huge LLM capable of doing 500 iterations of SHA1 on small strings in one shot
Nitpick: for SHA1 (and any other cryptographic hash function) I can't fathom how an LLM could learn it through SGD, as opposed to having it hand coded. To do SHA1 correctly you need to implement its internals correctly; being off by even a little results in a completely incorrect output. It's all or nothing, so there's no way to gradually approach getting it right, and therefore no gradient to descend.
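To illustrate that all-or-nothing property, here's a quick sketch using Python's standard hashlib (the example strings are arbitrary):

```python
import hashlib

# Two inputs differing by a single character produce unrelated-looking digests
# (the avalanche effect), so a "nearly right" implementation gets no partial credit.
a = hashlib.sha1(b"hello world").hexdigest()
b = hashlib.sha1(b"hello worlc").hexdigest()
print(a)
print(b)

# Fraction of hex digits that happen to agree -- roughly 1/16, i.e. chance level:
print(sum(x == y for x, y in zip(a, b)) / len(a))
```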
But your overall point still stands. It is theoretically possible for a transformer to learn any function, so this is not a fundamental upper bound, and you are therefore correct that a large enough LLM could do any of these for small n. I wonder whether this will turn out to be a real capability of SOTA LLMs, or one of those "possible in theory, but many orders of magnitude off in practice" things.
Ultimately the question I'm thinking towards is whether an LLM could do the truly important/scary problems. I care less whether an LLM can multiply two 1Mx1M (or 3x3) matrices, and more whether it can devise & execute a 50-step plan for world domination, or make important new discoveries in nanotech, or make billions of dollars in financial markets, etc.
I don't know how to evaluate the computational complexity of these hard problems. I also don't know whether exploring that question would help the capabilities side more than the alignment side, so I need to think carefully before answering.
I think your definition of LLM is the common one. For example, https://www.lesswrong.com/posts/KJRBb43nDxk6mwLcR/ai-doom-from-an-llm-plateau-ist-perspective is on the front page right now, and it uses LLM to refer to a big neural net, in a transformer topology, trained with a lot of data. This is how I was intending to use it as well. Note the difference between "language model" as Christopher King used it, and "large language model" as I'm using it. I plan to keep using LLM for now, especially as GPT refers to OpenAI's product and not the general class of things.
Thank you for trying so hard to replicate this little experiment of mine.
Perhaps you sent the prompt in the middle of a conversation rather than at the beginning? If the same list was also sent earlier in the conversation, I can imagine it managed to get the answer right because it had more time to 'take in' the numbers, or otherwise establish a context that guided it to the right answer.
Yes, this is exactly what I did. I made sure to use a new list of numbers for each question – I'd noticed that it would remember previous answers if I didn't – but I didn't ask each of these in its own conversation. On one hand, it would have been cleaner if I'd started fresh each time; on the other, I wouldn't have had that great moment where it wrote code and then hallucinated that it had executed it.
I hadn't noticed your patterns when I ran mine. What I did notice is that its answer always included the first element of my list, usually (~80%) included the second, often (~40%) took its third number from around the middle of the list, and the last one was always a hallucination. That's eerily similar to what I'd expect from the four-nested-loop algorithm (minus the hallucination).
That would explain why, when I asked it to find the "closest" sum, it picked three numbers, and then, upon their being too small to work, picked the second-largest number from the list. It would also explain why the hallucinated number was the last number every single time.
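For reference, here's roughly what that brute-force approach looks like. This is my own sketch of the algorithm its behavior suggests, not code it actually showed me:

```python
def find_four(nums, target):
    """Brute-force search for four distinct entries of nums that sum to target."""
    n = len(nums)
    for i in range(n):
        for j in range(i + 1, n):
            for k in range(j + 1, n):
                for l in range(k + 1, n):
                    if nums[i] + nums[j] + nums[k] + nums[l] == target:
                        return [nums[i], nums[j], nums[k], nums[l]]
    return None  # no exact match found
```

A search like this that returns on the first hit (or gives up partway through) naturally over-represents the earliest elements of the list, which matches the pattern above.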
I agree with this.
Sometimes it gives what might be the laziest "breakdown" I've ever seen.
I laughed hard at this.
This makes sense. Another thought: I picked 50 numbers in the range 1-100, so it could have just gotten lucky. If I were to do this again, I'd use 1-1000 to reduce the odds of that.
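To get a rough sense of how plausible "lucky" is, here's a quick sketch, assuming the task was to find four numbers from the list summing to a stated target (the list and target here are randomly generated, not my originals):

```python
import random
from itertools import combinations

random.seed(0)
nums = random.sample(range(1, 101), 50)   # 50 distinct numbers in 1-100
target = sum(random.sample(nums, 4))      # a target known to be achievable

# Count how many of the C(50, 4) = 230,300 four-element subsets hit the target.
hits = sum(1 for combo in combinations(nums, 4) if sum(combo) == target)
print(hits, hits / 230300)
```

With numbers in 1-100, the 230,300 possible 4-subsets are spread over only a few hundred distinct sums, so any achievable target is typically hit by many subsets; widening the range to 1-1000 spreads those same subsets over roughly ten times as many possible sums.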
Oh this is funny. It told me that it ran the code and got the answer [64, 91, 39, 47]. I checked that these satisfied the problem. But I didn't check (until reviewing other comments) whether that's actually what the code output. It's not. Technically, the code doesn't output anything at all; it saves the result to a variable instead. And when I print that variable, the answer is [64, 6, 96, 75].
Lesson 1: I was not careful enough in checking its output, even when I thought I was being careful.
Lesson 2: It is indeed not running code, even if it tells me it is.
Thanks for the deep and thoughtful comment. I hadn't considered that "In general, you can always get a bigger and bigger constant factor to solve problems with higher n."
I'm going to think carefully and then see how this changes my thinking. I'll try to reply again soon.
Thanks, this is exactly the kind of feedback I was hoping for.
Nomenclature-wise: I was using LLM to mean "deep neural nets in the style of GPT-3" but I should be more precise. Do you know of a good term for what I meant?
More generally, I should learn about other styles of LLM. I've gotten some good leads from these comments and some DMs.
Interesting! Yes, I am using ChatGPT with GPT-4. It printed out the code, then *told me that it ran it*, then printed out a correct answer. I didn't think to fact-check it; instead I assumed that OpenAI had been adding some impressive/scary new features.
This was actually my position when I started writing this post. My instincts told me that "thinking out loud" was a big enhancement to its capabilities. But then I started thinking about what I saw. I watched it spend tens of trillions of FLOPs to write out, in English, how to do a 3x3 matrix multiplication. It was so colossally inefficient, like building a humanoid robot and teaching it to use an abacus.
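For scale, the computation itself is only a few dozen arithmetic operations. A minimal sketch (the example matrices are mine):

```python
# A 3x3 matrix multiply needs 3*3*3 = 27 multiplications and 3*3*2 = 18 additions,
# i.e. 45 FLOPs total -- versus the tens of trillions of FLOPs spent narrating it.
A = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
B = [[9, 8, 7], [6, 5, 4], [3, 2, 1]]
C = [[sum(A[i][k] * B[k][j] for k in range(3)) for j in range(3)] for i in range(3)]
print(C)  # [[30, 24, 18], [84, 69, 54], [138, 114, 90]]
```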
Then again, your analogy to humans is valid. We do a huge amount of processing internally, and then have this incredibly inefficient communication mechanism called writing, which we then use to solve very hard problems!
So my instincts point both ways on this, but I have nothing resembling rigorous proof one way or the other. So I'm pretty undecided.