I am an engineer and entrepreneur trying to make sure AI is developed without killing everybody. I founded and was CEO of the startup Ripe Robotics from 2019 to 2024. | hunterjay.com
One year and 3 months on, I'm reviewing my predictions! Overall, I mark 13 predictions as true or mostly true, 6 as false or mostly false, and 3 as debatable.
Rest of 2023
Since those ‘Rest of 2023’ predictions were only for three months in the future, most of them were very trivial to get right -- of course models would get better! Let’s see how predictions further out did:
2024:
Metric | GPT-3.5 | GPT-4 | DeepSeek-V3 (Open Weights) | Sonnet 3.5 (New) | o1 | o3 |
Context Length | 16k | 8k | 128k | 200k | 128k | / |
HumanEval | 48.1% | 67% | / | 93.7% | / | / |
ARC-AGI | <5% [1] | <5% [1] | / | 20.3% | 32% | 88% |
SWE-bench Verified | 0.4% [2] | 2.8%, 22.4% [3] | 42.0% | 49.0%, 53.0% [4] | 48.9% | 71.7% |
Codeforces [5] | 260 (~1.5%) | 392 (4.0%) | ~1550 (51.6%) | ~1150 (20.3%) | 1891 (~91.0%) | 2727 (~99.3%) |
GPQA Diamond | / | 33.0% | 59.1% | 58.0%, 65.0% [6] | 78.0% | 87.7% |
MATH | / | 52.9% | 90.2% | 78.3% | 94.8% | / |
MMLU | 70.0% | 86.4% | 88.5% | 88.3% | 92.3% | / |
DROP | 64.9 | 80.9 | 91.6 | 87.1 | / | / |
GSM8K | 57.1% | 92.0% | / | 96.4% | / | / |
[1] From ARC Prize “In this method, contestants use a traditional LLM (like GPT-4) and rely on prompting techniques to solve ARC-AGI tasks. This was found to perform poorly, scoring <5%.”
[2] 0.4% with RAG, tested October 2023
[3] 2.8% with RAG, 22.4% with ‘SWE-agent’ structure, tested April 2024.
[4] 49.0% in launch paper, 53.0% on SWE-Verified’s leaderboard with OpenHands + CodeAct v2.1
[5] Sometimes scores were given as a rating, and sometimes as a percentile. They have been converted to match.
[6] 58% published on Epoch AI, 65% claimed in release paper. Likely different assessment (CoT, best of N, etc).
https://x.com/OpenAI/status/1870186518230511844
https://openai.com/index/learning-to-reason-with-llms/
https://www.anthropic.com/news/3-5-models-and-computer-use
https://arxiv.org/pdf/2303.08774v5
True | DeepSeek-V3 is open weights* and has performance exceeding GPT-4 on most benchmarks. So do Mistral Large 2 and Llama 3.1 405B.
* It’s not entirely open source, as in, the code and data needed to train a copy are not available. But that’s not how ‘open source’ is being used regarding model weights, although I am personally trying to use clearer language now.
You might be right -- and whether the per-dollar gains were higher or lower than expected would be interesting to know -- but I just don't have any good information on this! If I'd thought of the possibility, I would have added it in Footnote 23 as another speculation, but I don't think what I said is misleading or wrong.
For what it's worth, in a one year review from Jacob Steinhardt, increased investment isn't mentioned as an explanation for why the forecasts undershot.
10x per year for compute seems high to me. Naïvely I would expect the price/performance of compute to double every 1-2 years as it has been forever, with overall compute available for training big models being a function of that + increasing investment in the space, which could look more like one-time jumps. (I.e. a 10x jump in compute in 2024 may happen because of increased investment, but a 100x increase by 2025 seems unlikely.) But I am somewhat uncertain of this.
For parameters, I definitely think the largest models will keep getting bigger, and that compute will be the big driver of that -- but I would also expect improvements like mixture of experts models to continue, which effectively allow more parameters with less compute (because not all of the parameters are used at all times). Other techniques, like RLHF, also improve the subjective performance of models without increasing their size (i.e. getting them to do useful things rather than only predict which next word is most likely).
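As a rough illustration of the mixture of experts point, here is a toy sketch that counts total versus active parameters when each token is routed to only a few experts. The function name and every number are made up for illustration, not taken from any real model.

```python
def moe_param_counts(shared_params, expert_params, num_experts, top_k):
    """Return (total, active) parameter counts for a toy mixture-of-experts model.

    Assumption: the model has some shared parameters used by every token,
    plus `num_experts` equally sized experts, of which `top_k` run per token.
    """
    if not 1 <= top_k <= num_experts:
        raise ValueError("top_k must be between 1 and num_experts")
    total = shared_params + num_experts * expert_params
    active = shared_params + top_k * expert_params
    return total, active


# Hypothetical sizes, chosen only to make the arithmetic easy to follow.
total, active = moe_param_counts(
    shared_params=2_000_000_000,
    expert_params=1_000_000_000,
    num_experts=8,
    top_k=2,
)
print(f"total parameters: {total:,}")
print(f"active per token: {active:,}")
print(f"fraction active:  {active / total:.0%}")
```

The compute per token scales roughly with the active count, while the model's capacity scales with the total, which is the sense in which you get more parameters for less compute.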
I guess my prediction here would be simply that things like this continue, so that with X compute you could get a better model in 2025 than you could in 2023. But you could also have 5x to 50x more compute in 2025, so the two improvements stack on top of each other!
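To make the compute arithmetic a bit more concrete, here is a back-of-the-envelope sketch. It assumes hardware price/performance doubles at a fixed rate and that increased investment acts as a one-time multiplier on top; the function name and the scenario numbers are hypothetical, not forecasts.

```python
def effective_compute_multiplier(years, doubling_time_years, investment_multiplier):
    """Growth in training compute over `years`, relative to the start.

    Assumption: price/performance doubles every `doubling_time_years`,
    and spending grows by a one-time factor of `investment_multiplier`.
    """
    hardware_gain = 2 ** (years / doubling_time_years)
    return hardware_gain * investment_multiplier


# 2023 -> 2025 is two years. Doubling times of 1 and 2 years bracket the
# "double every 1-2 years" guess; the investment factors are illustrative.
for doubling_time in (1, 2):
    for investment in (1, 3, 10):
        gain = effective_compute_multiplier(2, doubling_time, investment)
        print(f"doubling every {doubling_time}y, {investment}x spend: {gain:.1f}x compute")
```

Under these assumptions hardware alone gives only 2x to 4x over two years, so most of a 5x to 50x jump would have to come from increased spending.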
It's obviously far cheaper to play with smaller models, so I expect lots of improvements will initially appear in models small-for-their-time.
Just my thoughts!
I wrote this late at night, so to clarify and expand a little bit:
- "Work on more than one time scale" I think is actually an interesting idea to dwell on for a second. Like, when a person is trying to solve a problem, they will often pace back and forth, or talk, etc. They don't have to do everything in one pass; somehow the complex computation which lets them see and move around can work on a very fast time scale, while other problem solving is going on simultaneously, and only starts to affect motor outputs later on. That's interesting. The spinal cord processing independently of the brain, which I mentioned, is evident in this older series of (rather horrible) experiments with cats: https://www.jstor.org/stable/24945006
- On the 'smaller models with lower latency', we already see models like Mistral-7B outperforming 30B parameter models because of improvements in data, architecture, and training. I expect this trend to continue. If the largest models are capable of operating a robot out of the box, I think you could take those outputs, and use them to train (or otherwise distill down) the larger model to a more manageable size, more specialised for the task.
- On the 'LLMs could do the parts with higher latency', just yesterday I saw somebody do something like this with GPT-4V, where they periodically uploaded a photograph of what was in front of them, and got GPT-4V to output instructions on how to find the supermarket (walk further forward, turn right, etc). It kind of worked, and that's the sort of thing I was picturing here, leaving much more responsive systems to handle the low latency work, like balance, gripping, etc.
I'm somewhat skeptical that running out of text data will meaningfully slow progress. Today's models are so sample inefficient compared with human brains that I suspect there are significant jumps possible there.
Also, as you say:
- Synthetic text data might well be possible (especially for domains where you can test the quality of the produced text externally, e.g. programming).
- Reinforcement-learning-style virtual environments can also generate data (and not necessarily only physics based environments either -- it could be more like playing games or using a computer).
- And multimodal inputs give us a lot more data too, and I think we've only really scratched the surface of multimodal transformers today.
I am honestly very surprised it became a front page post too! It totally is just speculation.
I tried to be super clear that these were just babbled guesses, and I was mainly just telling people to try to do the same, rather than trusting my starting point here.
The other thing that surprised me is that there haven't been too many comments saying "this part is off", or "you missed trend X!". I was kind of hoping for that!
Agree on lower depth models being possible, a few other possibilities:
Smaller models with lower latency could be used, possibly distilled down from larger ones.
Compute improvements might make it practical onboard (like with Tesla's self-driving hardware inside the chest of their android).
New architectures could work on more than one time scale -- kind of like humans do. E.g. when we walk, not all of the processing is done in the brain. Your spinal cord can handle a tonne of it autonomously. (Will find source tomorrow).
LLM-type models could do the parts that can accept higher latency, leaving lower level processes to handle themselves. Imagine, for a household cleaning robot, that an LLM-based agent puts out high level thoughts like "Scan the room for dirty clothes. ... Fold them. ... Put them in the third drawer", and existing low level stuff actually carries out the instructions. That's an exaggerated example, but you get the idea: it doesn't have to replace the PID controller!
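A minimal sketch of that split, assuming a toy one-dimensional "robot" whose position simply integrates the commanded velocity. The `slow_planner` function is a hypothetical stand-in for a high-latency LLM-based agent, and the gains, tick rates, and plant model are all invented for illustration.

```python
from dataclasses import dataclass


@dataclass
class PID:
    """Textbook discrete PID controller."""
    kp: float
    ki: float
    kd: float
    integral: float = 0.0
    prev_error: float = 0.0

    def step(self, setpoint, measured, dt):
        error = setpoint - measured
        self.integral += error * dt
        derivative = (error - self.prev_error) / dt
        self.prev_error = error
        return self.kp * error + self.ki * self.integral + self.kd * derivative


def slow_planner(tick):
    """Hypothetical high-latency planner: go to 1.0 first, then return to 0.0."""
    return 1.0 if tick < 500 else 0.0


def run(total_ticks=1000, planner_period=100, dt=0.01):
    # Integral gain is left at zero to keep this toy plant easy to reason about.
    pid = PID(kp=2.0, ki=0.0, kd=0.01)
    position = 0.0
    setpoint = 0.0
    planner_calls = 0
    for tick in range(total_ticks):
        # The slow loop only updates the goal once every `planner_period` ticks.
        if tick % planner_period == 0:
            setpoint = slow_planner(tick)
            planner_calls += 1
        # The fast loop runs every tick and never waits on the planner.
        velocity = pid.step(setpoint, position, dt)
        position += velocity * dt
    return position, planner_calls


final_position, planner_calls = run()
print(f"planner was consulted {planner_calls} times over 1000 control ticks")
print(f"final position: {final_position:.4f}")
```

The point of the design is that the controller keeps tracking the last goal it was given, so a planner that answers a hundred times less often than the control loop runs is still usable.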
I am extremely worried about safety, but I don't know as much about it as I do about what's on the edge of consumer / engineering trends, so I think my predictions here would not be useful to share right now! The main way it relates to my guesses here is if regulation successfully slows down frontier development within a few years (which I would support).
I'm doing the ARENA course async online at the moment, and possibly moving into alignment research in the next year or two, so hoping to be able to chat more intelligently on alignment soonish.
I broadly agree. I think AI tools are already speeding up development today, and on reflection I don't think AI being more capable than humans at modeling the natural world would be a discontinuous point on the ramp up to superintelligence.
It would be a point where AI gets much harder to predict, though, which is probably why it was on my mind when I was trying to come up with predictions.
I agree, I definitely underestimated video. Before publishing, I had a friend review my predictions and they called out video as being too low, and I adjusted upward in response and still underestimated it.
I'd now agree with 2026 or 2027 for coherent feature film length video, though I'm not sure if it would be at feature film artistic quality (including plot). I also agree with Her-like products in the next year or two!
Personally I would still expect cloud compute to be used for robotics, but only in ways where latency doesn't matter (like a planning and reasoning system on top of a smaller local model, doing deeper analysis like "There's a bag on the floor by the door. Ordinarily it should be put away, but given that it wasn't there 5 minutes ago, it might be actively used right now, so I should leave it..."). I'm not sure the privacy concerns will trump convenience, like with phones.
I also now think virtual agents will start to become a big thing in 2025 and 2026, doing some kinds of remote work, or sizable chunks of existing jobs autonomously (while still not being able to automate most jobs end to end)!