My AI Predictions 2023 - 2026

HunterJay

My AI Predictions 2023 - 2026 — LessWrong

16th Oct 2023

Epistemic status: My, mostly intuitive, guesses - with only a few days dwelling on it, and no serious research beyond what I already knew.

I work in the startup sphere, in field robotics, and I am about to have an opportunity to majorly shift what I am working on. To work out what projects might make sense on a multi-year time frame, I wrote up what I thought might happen in AI in the next couple of years as specifically as I could.

I found the exercise surprisingly useful. It turned a whole bunch of vague "X will get better over time" into actionable "X will be practical in around Y years". I don't think my guesses will end up being very accurate, but having something solid forced me to actually think about the future and turn my gibberish-internal-intuitions into more-consistent-guesses. I was really surprised at how much it helped.

So, here's that list of predictions. I'm sharing it here more as a "here's how you can do something similar" than as a "here's a well researched report on the future trends." (which it definitely is not). I didn't go to the trouble of putting my %'s on the guesses "X in year Y", but it's about 40%-70% for any given guess.

________________________________

[Written 6th October 2023]

Rest of 2023:

Small improvements to LLMs
- Google releases something competitive to ChatGPT.
- Anthropic and OpenAI slightly improve GPT-4 and Claude2
- Meta or another group releases better open source models, up to around GPT-3.5 level.
Small improvements to Image Generation
- Dalle3 gets small improvements.
- Google or Meta releases something similar to Dalle3, but not as good.
Slight improvements to AI generated videos.
- Basic hooking up of Dalle3 to video generation with tagged on software, not really good consumer stuff yet. Works in an interesting way, like Dalle1, but not useful for much yet.
Further experiments hooking LLMs up to robotics/cars, but nothing commercial released.
Small improvements in training efficiency and data usage, particularly obviously in smaller models becoming more capable than older, larger ones.

2024

GPT-5 or equivalent is released.
- It’s as big a jump on GPT-4 as GPT-4 was on GPT-3.5.
- Can do pretty much any task when guided by a person, but still gets things wrong sometimes.
- Multimodal inputs, browsing, and agents based on it are all significantly better.
Agents can do basic tasks on computers -- like filling in forms, working in excel, pulling up information on the web, and basic robotics control. This reaches the point where it is actually useful for some of these things.
Robotics and long-horizon agents still don’t work well enough for production. Things fall apart if the agent has to do something with too many branching possibilities or on time horizons beyond half an hour or so. This time period / complexity quickly improves as low-hanging workarounds are added.
Context windows are no longer an issue for text generation tasks.
- Algorithmic improvements, or summarisation and workarounds, better attention on infinite context windows, or something like that solves the problem pretty much completely from a user’s perspective for the best models.
- GPT-5 has the context of all previous chats, Copilot has the entire codebase as context, etc.
- This is later applied to agent usage, and agents quickly improve to become useful, in the same way that LLMs weren’t useful for everyday work until ChatGPT.
Online learning begins -- GPT-5 or equivalent improves itself slowly, autonomously, but not noticeably faster than current models are improved with human effort and a training step. It does something like select its own data to train on from all of the inputs and outputs it has received, and is trained on this data autonomously and regularly (daily or more often).
AI selection of what data to train on is used to improve datasets in general - training for one epoch on all data becomes less common, as some high quality or relevant parts of giant sets are repeated more often or allowed larger step size.
Autonomous generation of data is used more extensively, especially for aligning base models, or for training models smaller than the best ones (by using data generated by larger models).
Code writing is much better, and tie-ins to Visual Studio are better than GPT-4 is today, as well as having much better context.
Open source models as capable as GPT-4 become available.
Training and runtime efficiency improves by at least a factor of two, while hardware continues improvements on trend.
- This is because of a combination of -- datasets improved by AI curation and generation, improved model architecture, and improvements in hyperparameter selection, including work similar to the optimisations gained from discovering Chinchilla scaling laws.
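
The data-selection prediction in this section (high quality parts of a giant dataset being repeated more often, instead of one epoch over everything) can be sketched in a few lines. This is a minimal illustration, not a description of how any lab actually does it: the `build_epoch` helper, the quality scores, and the example texts are all hypothetical, and in the scenario described above the scores would come from an AI grader.

```python
from collections import Counter

def build_epoch(examples, max_repeats=3):
    """Repeat each example between 1 and max_repeats times, scaled by quality in [0, 1]."""
    epoch = []
    for text, quality in examples:
        # Lowest-quality data is still seen once; highest-quality data max_repeats times.
        repeats = 1 + round(quality * (max_repeats - 1))
        epoch.extend([text] * repeats)
    return epoch

# Hypothetical (text, quality) pairs, for illustration only.
examples = [("textbook proof", 0.95), ("forum reply", 0.5), ("spam page", 0.05)]
epoch = build_epoch(examples)
counts = Counter(epoch)
print(counts)
```

A per-example learning rate ("larger step size") would be the other variant mentioned; repetition is shown here only because it is the simpler of the two to write down.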

2025

AI agents are used in basic robotics -- like LLM driven delivery robots and (in demos of) household and factory robots, like the Tesla Bot. Multimodal models basically let them work out of the box, although not 100% reliably yet.
Trends continue from the previous year -- the time horizons agents can work on increase, LLMs improve on traditional LLM tasks, smaller models get more capable, and the best models get bigger.
AI curated and generated data becomes far more common than previously, especially for aligning models.
Virtual environments become more common for training general purpose models, combined with traditional LLM training.
Code writing AI (just LLMs with context and finetuning) are capable of completely producing basic apps, solving most basic bugs, and working with human programmers very well -- it’s pair programming with an AI, with the AI knowing all of the low level details (a savant who has memorised the docs and can use them perfectly, and can see the entire codebase at once), and the human keeping track of the higher level plan and goals. The AI can also be used to recommend architectures and approaches, of course, and gradually does more and more between human inputs.
If there ever feels like a lull in progress, it will be in this period leading up to models capable enough for robotics control, long time frame agents, and full form video generation, which I don’t expect to happen in a large scale way in 2025.
Possibly GPT-6 or equivalent is released, but more likely continuous improvements to GPT-5 carry forward. There’s not a super meaningful difference at this point, with online learning continually improving existing models.

2026

GPT-6 or equivalent capabilities are reached (i.e. as big a jump as GPT-3.5 to 4, to 5, to 6).
- Multimodal works great out of the box. The same model can do video, image, text, audio, and other analysis and generation, including outputting commands to control digital agents and robots via API calls.
- Simulated environments are used in training -- online learning inside a video game, inside a virtual machine, etc. This could be training on long sequences of pre-generated actions like with traditional LLMs learning from existing text, as well as training on sequential actions chosen by the LLM as it trains, like with reinforcement learning.
Whether from OpenAI or others, this level of LLM enables general purpose household, warehouse, and factory robots to start actually being useful for some tasks, like cleaning and sorting. They are expensive, rare, and not particularly reliable, but are being manufactured at scale by Tesla and others.
Realistic fully automated video generation is better than Dalle3 image generation, but limited to reasonably short snippets (<60s) without human intervention before it looks strange. This length quickly increases, and workarounds and human input allow long length high quality videos to be produced.
Progress appears to accelerate again, as online learning in virtual environments, generated data, and robotics systems and digital agents enter common usage.

2027 & Beyond

I struggle to imagine what the world looks like beyond this point. The above trends may continue for some time, with robotics and digital agents taking over a larger and larger share of the world.
At some point, a major step change will also happen when AI is capable of generating new major scientific breakthroughs on its own -- more akin to Einstein coming up with relativity to explain known data than akin to predicting the shape of proteins.
A massive change will come as the share of AI improvement caused by AI's own work surpasses the share caused by human work, possibly later this decade.
It seems likely to me that superintelligence -- and all of the sci-fi seeming technologies and X-risk that come with it -- will appear soon after this period. I assign a significant probability to it happening this decade. (I have previously said a 50% chance of AGI by 2029, and superintelligence very shortly afterwards, and that still feels right to me). If it doesn’t appear by then, I would expect one of the following to be true:
- Regulation significantly slows development.
- Zero algorithmic advances on the scale of transformers are developed.
- There is something unexpectedly limiting about the transition from oracle to agentic AI, and we have a huge “oracle overhang” -- where a new architecture that works well as an agent will suddenly be as capable as a million humans with all of the knowledge and skills of GPT-6-to-10, once that theoretical breakthrough happens.

Tags: Forecasting & Prediction, Forecasts (Specific Predictions), AI

pathos_bot (3y)

The major shift in the next 3 years will be that, as a rule, top level AI labs will not release their best models. I'm certain this has somewhat been the case for OpenAI, Anthropic and Google for the past year. At some point full utilization of a SOTA model will be a strategic advantage for companies themselves to use for their own tactical purposes. The moment any $X of value can be netted from an output/inference run of a model for less than $(X-Y) in costs, where Y represents the marginal labor/maintenance/averaged risk costs for each run's output, no company would ever be advantaged by releasing the model to be used by anyone other than themselves. This closed-source event horizon I imagine will occur sometime in late 2024.
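
The threshold in this comment can be written as a single inequality. A minimal sketch, where `value` stands for X, `overhead` for Y, and `inference_cost` for the cost of the run; the function name and the dollar figures are hypothetical and chosen only for illustration.

```python
def internal_use_is_profitable(value: float, overhead: float, inference_cost: float) -> bool:
    """True when a run worth `value` (X) costs less than X - Y to produce."""
    return inference_cost < value - overhead

# Hypothetical numbers: a run worth $10 with $3 of labor/maintenance/risk overhead.
cheap_run = internal_use_is_profitable(value=10.0, overhead=3.0, inference_cost=5.0)
costly_run = internal_use_is_profitable(value=10.0, overhead=3.0, inference_cost=8.0)
print(cheap_run, costly_run)
```

The sketch only encodes the cost condition. Whether crossing it really removes the incentive to release a model is the commenter's claim, and it is questioned in the replies.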

Daniel Kokotajlo (3y)

Tomás B. (3y)

This is a very good, and very scary point - another thing that could provide, at least the appearance of, a discontinuity. One symptom of this scenario would be a widespread, false belief that "open source" models are SOTA.

Might be good to brainstorm other symptoms to prime ourselves to recognize when we are in this scenario. Complete hiring-freezes/massive layoffs at the firms in question, aggressive expansion into previously-unrelated markets, etc.

Insub (3y)

Not sure I understand; if model runs generate value for the creator company, surely they'd also create value that lots of customers would be willing to pay for. If every model run generates value, and there's ability to scale, then why not maximize revenue by maximizing the number of people using the model? The creator company can just charge the customers, no? Sure, competitors can use it too, but does that really override losing an enormous market of customers?

pathos_bot (3y)

That's very true, but there are two reasons why a company may not be inclined to release an extremely capable model:
1. Safety risk: if someone uses a model and jailbreaks it in some unexpected way, the risk of misuse is much higher with a more capable model. OpenAI had GPT-4 for 9-10 months before releasing it, using that time to apply RLHF and even lobotomize it to make it safer. The Summer 2022 internal version of GPT-4 was, according to Microsoft researchers, more generally capable than the released version (as evidenced by the draw a unicorn test). This needed delay and the assumed risks will naturally be much greater with a larger model: larger models, so far, seem harder to simply RLHF into unjailbreakability, and because they are more capable, any jailbreak carries more risk. The general business-level margin of safety will therefore be higher.

2. Sharing/exposing capabilities: Any business wants to maintain a strategic advantage. Releasing a SOTA model will allow a company's competitors to use it, test its capabilities and train models on its outputs. This reality has become more apparent in the past 12 months.

Tomás B. (3y)

It does seem to me a little silly to give competitors API access to your brain. If one has enough of a lead, one can just capture one's competitors' markets.

Vladimir_Nesov (3y*)

Text data is running out: the 2+ billion dollar scale training runs due in 2-4 years are going to devour the rest of it. This might be sufficient to reach AGI, in the sense of capability for mostly autonomous research, in particular development of compute multipliers for training runs and plucking the rest of the low hanging fruit of the unsupervised learning revolution, overcoming scarcity of hardware.

If AGI is not in range of those runs, and if there is no synthetic data generation process useful at that scale, the bulk of compute goes to multimodality (though latency will still cripple many use cases), and the rate of competence improvement may slow for years. This is the main scenario where I see significant hope for regulation to take hold. Doing better and countering the risk of AGI in the initial rush to billion dollar scale runs requires a nebulously defined pause right now.

avturchin (3y)

One way to get more data is to pay humans to create the specific types of data we need. For example, if a billion people each write 100 pages on a unique topic of their expertise - and the needed data generation is controlled by AI - maybe that will be enough.

HunterJay (3y)

I'm somewhat skeptical that running out of text data will meaningfully slow progress. Today's models are so sample inefficient compared with human brains that I suspect there are significant jumps possible there.

Also, as you say:
- Synthetic text data might well be possible (especially for domains where you can test the quality of the produced text externally, e.g. programming).
- Reinforcement-learning-style virtual environments can also generate data (and not necessarily only physics based environments either -- it could be more like playing games or using a computer).
- And multimodal inputs give us a lot more data too, and I think we've only really scratched the surface of multimodal transformers today.

Vladimir_Nesov (3y*)

New untested ideas take unpredictable time to develop. Given the current timeline of pure compute/investment scaling, there is no particular reason for all bottlenecks to be cleared just in time for scaling to continue without slowing down. Hence the possibility of it slowing down at the upcoming possible bottlenecks of natural text data and available-on-short-notice hardware, which are somewhat close together.

Sample efficiency (with respect to natural data) can in principle be improved, humans and some RL systems show it's possible, and synthetic data is a particular form this improvement might take. But it's not something that's readily available, known to subsume capabilities of LLMs and scale past them. Also, straying further from the LLM recipe of simulating human text might make alignment even more intractable. In a universe where alignment of LLMs is feasible within the current breakneck regime, the source of doom I worry about is an RL system that either didn't train on human culture or did too much reflection to remain within its frame.

Compared to natural text, multimodal data and many recipes for synthetic data give something less valuable for improving model competence, reducing return on further scaling. When competence improvement slows down, and if AGI in the sense of human-level autonomous work remains sufficiently far away at that point, investment scaling is going to slow down as well. Future frontier models cost too much if there is no commensurate competence improvement.

Archimedes (3y)

My hunch is that there's sufficient text already if an AI processes it more reflectively. For example, each chunk of text can be fed through a series of LLM prompts intended to enrich it, and then the model trains on the enriched/expanded text.

Gordon Seidoh Worley (3y)

I dislike this type of post. Predictions are nice and all, but I can take no updates from these predictions because I know nothing about the author or any reason to give their predictions credence and no real evidence is given to justify the predictions (evidence could include something like making real money bets on prediction markets, not just data and arguments!). So for me this post is nothing but speculation, and I'm disappointed that LessWrong readers voted it up.

Note: This is nothing against the post's author! I think it's totally fine for a person to speculate. I'm just both surprised this became a frontpage post and that it was voted up!

HunterJay (3y)

I am honestly very surprised it became a front page post too! It totally is just speculation.

I tried to be super clear that these were just babbled guesses, and I was mainly just telling people to try to do the same, rather than trusting my starting point here.

The other thing that surprised me is that there haven't been too many comments saying "this part is off", or "you missed trend X!". I was kind of hoping for that!

Logan Riggs (3y)

I really like this post, but more for:

- Babbling ideas I might not have thought of previously (e.g. the focus here on long-time horizon tasks)
- Good exercise to do as a group to then dig into cruxes

than updating my own credences on specifics.

Matthew Barnett (3y)

I agree the author should attach credences. I'd also appreciate a little more specificity, for example with the prediction, "Google releases something competitive to ChatGPT." I'm not sure whether that refers to ChatGPT-3.5 or ChatGPT-4 and the meaning here is actually quite important.

Daniel Kokotajlo (3y)

“At some point, a major step change will also happen when AI is capable of generating new major scientific breakthroughs on its own -- more akin to Einstein coming up with relativity to explain known data than akin to predicting the shape of proteins.”

The acceleration of AI R&D will begin sooner than that, I think. We could get a 10x speedup just by automating the typical OpenAI engineer.

HunterJay (3y)

I broadly agree. I think AI tools are already speeding up development today, and on reflection I don't think AI becoming more capable than humans at modeling the natural world would be a discontinuous point on the ramp up to superintelligence.

It would be a point where AI gets much harder to predict, though, which is probably why it was on my mind when I was trying to come up with predictions.

Tomás B. (3y)

And OpenAI has explicitly said this is what they want to do! Their Superalignment strat looks suspiciously like "gunning for RSI".

HunterJay (1y)

One year and 3 months on, I'm reviewing my predictions! Overall, I mark 13 predictions as true or mostly true, 6 as false or mostly false, and 3 as debatable.
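
As a rough calibration check, the tally above can be set against the 40%-70% confidence the post attached to any given guess. This is a sketch under one assumption that is not the author's: that it makes sense to compare an overall hit rate with that per-guess range. The counts themselves are the ones stated in this comment.

```python
# Counts as stated in the review comment.
tally = {"true": 13, "false": 6, "debatable": 3}
stated_range = (0.40, 0.70)  # per-guess confidence given in the original post

total = sum(tally.values())
hit_rate_all = tally["true"] / total                                # debatable counted as misses
hit_rate_decided = tally["true"] / (tally["true"] + tally["false"])  # debatable excluded

for label, rate in [("all 22 predictions", hit_rate_all), ("decided only", hit_rate_decided)]:
    in_range = stated_range[0] <= rate <= stated_range[1]
    print(f"{label}: {rate:.0%} (within stated range: {in_range})")
```

Either way of treating the debatable items lands inside the stated range, at about 59% and 68% respectively.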

Rest of 2023

Small improvements to LLMs
- Google releases something competitive to ChatGPT.
  - Mostly True | Google had already released Bard at the time, which sucked, but this was upgraded to Gemini and relaunched in December 2023. Gemini Ultra wasn’t released until February 2024 though, so points off for that.
- Anthropic and OpenAI slightly improve GPT-4 and Claude2
  - True | GPT-4 Turbo and Claude 2.1 were both released in November 2023.
- Meta or another group releases better open source models, up to around GPT-3.5 level.
  - False | Llama 2 had already been released at this time, and was nearly as good as GPT-3.5, but no other GPT-3.5-or-better open source models came out in 2023.
Small improvements to Image Generation
- Dalle3 gets small improvements.
  - Debatable | This is a really lukewarm prediction. Small changes were made to Dalle3 in the rest of 2023, integrating with GPT-4 prompting, for example, though there were complaints they made it worse in an attempt to avoid copyright issues when it was integrated with Bing.
- Google or Meta releases something similar to Dalle3, but not as good.
  - Mostly True | Google released Imagen 2 in December 2023, which was about as good as DALL-E 3. I don’t know how much I should penalise myself for it being about as good, rather than ‘not as good’.
Slight improvements to AI generated videos.
- Basic hooking up of Dalle3 to video generation with tagged on software, not really good consumer stuff yet. Works in an interesting way, like Dalle1, but not useful for much yet.
  - True | Lots of people played around with making videos by stepping through frames made in DALL-E 3, and they mostly weren’t very good! Pika 1.0 came out in December 2023, but it also wasn’t that great.
Further experiments hooking LLMs up to robotics/cars, but nothing commercial released.
- True | Figure AI is the most notable example of hooking up LLMs to robotics, and they did some experiments in late 2023 with GPT-4. As far as I know there wasn’t any commercial release of an LLM-enabled robot anywhere.
Small improvements in training efficiency and data usage, particularly obviously in smaller models becoming more capable than older, larger ones.
- True | Mistral 7B was notable here, being smaller and more capable than some of the earlier, much larger models like BLOOM 176B (as far as I can tell).

Since those ‘Rest of 2023’ predictions were only for three months in the future, most of them were very trivial to get right -- of course models would get better! Let’s see how predictions further out did:

2024:

GPT-5 or equivalent is released.
- It’s as big a jump on GPT-4 as GPT-4 was on GPT-3.5.
  - Mostly True | While they aren’t named GPT-5, the best released models today are as big an improvement over GPT-4 as GPT-4 was over GPT-3.5 as far as benchmarks can tell. Here’s a comparison table of GPT-3.5 and GPT-4, compared with the best released open weights model (DeepSeek V3), the best released closed weights models (Claude Sonnet 3.5 (New) and o1), and the best known unreleased model (o3).

| Benchmark | GPT-3.5 | GPT-4 | DeepSeek-V3 (Open Weights) | Sonnet 3.5 (New) | o1 | o3 |
|---|---|---|---|---|---|---|
| Context Length | 16k | 8k | 128k | 200k | 128k | / |
| HumanEval | 48.1% | 67% | / | 93.7% | / | / |
| ARC-AGI | <5% [1] | <5% [1] | / | 20.3% | 32% | 88% |
| SWE-bench Verified | 0.4% [2] | 2.8%, 22.4% [3] | 42.0% | 49.0%, 53.0% [4] | 48.9% | 71.7% |
| Codeforces rating (percentile) [5] | 260 (~1.5%) | 392 (4.0%) | ~1550 (51.6%) | ~1150 (20.3%) | 1891 (~91.0%) | 2727 (~99.3%) |
| GPQA Diamond | / | 33.0% | 59.1% | 58.0%, 65.0% [6] | 78.0% | 87.7% |
| MATH | / | 52.9% | 90.2% | 78.3% | 94.8% | / |
| MMLU | 70.0% | 86.4% | 88.5% | 88.3% | 92.3% | / |
| DROP | 64.9 | 80.9 | 91.6 | 87.1 | / | / |
| GSM8K | 57.1% | 92.0% | / | 96.4% | / | / |

[1] From ARC Prize “In this method, contestants use a traditional LLM (like GPT-4) and rely on prompting techniques to solve ARC-AGI tasks. This was found to perform poorly, scoring <5%.”

[2] 0.4% with RAG, tested October 2023

[3] 2.8% with RAG, 22.4% with ‘SWE-agent’ structure, tested April 2024.

[4] 49.0% in launch paper, 53.0% on SWE-Verified’s leaderboard with OpenHands + CodeAct v2.1

[5] Sometimes scores were given as a rating, and sometimes as a percentile. They have been converted to match.

[6] 58% published on Epoch AI, 65% claimed in release paper. Likely different assessment (CoT, best of N, etc).

https://x.com/OpenAI/status/1870186518230511844

https://openai.com/index/learning-to-reason-with-llms/

https://www.anthropic.com/news/3-5-models-and-computer-use

https://arxiv.org/pdf/2303.08774v5

https://www.reddit.com/media?url=https%3A%2F%2Fpreview.redd.it%2Fw8kkutnp6mwd1.png%3Fwidth%3D580%26format%3Dpng%26auto%3Dwebp%26s%3Dfea8540a697f1d6f27e9a32f31eda4378fde611e

https://arcprize.org/guide

https://www.deepseek.com/

https://www.researchgate.net/figure/Averaged-performance-on-the-tasks-from-the-Big-Bench-Hard-benchmark-Here-AO-CoT-and-ZS_tbl1_371163052

--
- Can do pretty much any task when guided by a person, but still gets things wrong sometimes.
  - Debatable | It’s too vague to measure (“pretty much” and “wrong sometimes” -- seriously, what was I thinking). It doesn’t feel like the models can do “any task” in a way that GPT-4 couldn’t, but at the same time “pretty much” every benchmark for LLMs has been saturated, and I ask Claude for help with nearly everything. Agentic tasks can’t be done, but that’s covered by other predictions, and this prediction is about being “guided by a person”.
- Multimodal inputs, browsing, and agents based on it are all significantly better.
  - Mostly True | The agent structures as well as the models have improved significantly, as you can see by the same models doing much better on SWE-Bench under newer structures, and by newer models still beating older ones.

Agents can do basic tasks on computers -- like filling in forms, working in excel, pulling up information on the web, and basic robotics control. This reaches the point where it is actually useful for some of these things.
- Debatable | I could see this being graded either way, depending on specific metrics. Claude with Computer Use can do all of these things (sans robotics control) but isn’t really useful. The individual tasks are usefully done by a mix of Gemini, ChatGPT, and Figure’s (GPT-4o?) robot control, but they aren’t really agents.
Robotics and long-horizon agents still don’t work well enough for production. Things fall apart if the agent has to do something with too many branching possibilities or on time horizons beyond half an hour or so. This time period / complexity quickly improves as low-hanging workarounds are added.
- Mostly True | There are some production uses for Figure and Tesla’s robots, but these are more similar to traditional industrial robots doing a narrow task than to an agent.
Context windows are no longer an issue for text generation tasks.
- Algorithmic improvements, or summarisation and workarounds, better attention on infinite context windows, or something like that solves the problem pretty much completely from a user’s perspective for the best models.
  - Mostly False | Context windows aren’t nearly as limiting as they were in October 2023, having grown from ~8k to ~128k, with RAG and other techniques helping models intelligently search files and add them to their own context, but it’s definitely not solved. Long outputs like novels still suck, and long inputs like giant codebases or regulations still lead to models missing key details a lot of the time.
- GPT-5 has the context of all previous chats, Copilot has the entire codebase as context, etc.
  - Mostly False | Although it is close -- Cursor has coding agents that can intelligently search the codebase for the files they need based on a provided task and add them to their own context, and ChatGPT has a memory feature (which doesn’t work super well). Neither of these is the same thing as just having the previous chats and codebase in context though.
- This is later applied to agent usage, and agents quickly improve to become useful, in the same way that LLMs weren’t useful for everyday work until ChatGPT.
  - Mostly False | Agents are not yet useful, outside of some narrow coding agents.
Online learning begins -- GPT-5 or equivalent improves itself slowly, autonomously, but not noticeably faster than current models are improved with human effort and a training step. It does something like select its own data to train on from all of the inputs and outputs it has received, and is trained on this data autonomously and regularly (daily or more often).
- False | As far as I’m aware, nothing like this is happening.
AI selection of what data to train on is used to improve datasets in general - training for one epoch on all data becomes less common, as some high quality or relevant parts of giant sets are repeated more often or allowed larger step size.
- Mostly False | The trend has continued to move towards quality over quantity for training data, but I’m not aware of anybody specifically using existing LLMs to select / rank / weight training data automatically. I’m also now aware high quality data was already being repeated more often in the training sets. I don’t think anything is happening with a dynamic learning rate based on anything other than the loss.
Autonomous generation of data is used more extensively, especially for aligning base models, or for training models smaller than the best ones (by using data generated by larger models).
- True | But also fairly trivial, it’s super well known that people are training models off the filtered outputs of earlier ones, and in general synthetic data is working really well, especially for instruction tuning and for ground-truth’d domains like maths and coding.
Code writing is much better, and tie-ins to Visual Studio are better than GPT-4 is today, as well as having much better context.
- True | Cursor, a fork of Visual Studio Code, has pretty capable agents built in that use any model available via API that you like, and they work a lot better than manually pasting problems into ChatGPT did in October of 2023.
Open source models as capable as GPT-4 become available.
- True | Deepseek V3 is open weights* and has performance exceeding GPT-4 on most benchmarks, as do Mistral Large 2 and Llama 3.1 405B.
  * It’s not entirely open source, as in, the code and data needed to train a copy is not available. But that’s not how ‘open source’ is being used regarding model weights, although I am personally trying to use clearer language now.
Training and runtime efficiency improves by at least a factor of two, while hardware continues improvements on trend. This is because of a combination of -- datasets improved by AI curation and generation, improved model architecture, and improvements in hyperparameter selection, including work similar to the optimisations gained from discovering Chinchilla scaling laws.
- True | Deepseek V3 stands out here -- using only 37B active parameters (in a MoE architecture with 671B total), it achieves performance better than GPT-4’s, which is estimated to have more than 1700B. Deepseek V3 was also trained with only 2048 H800 GPUs for 2 months, compared with GPT-4’s estimated 15000 A100 GPUs for 3 months, roughly 11 times as many GPU-months.
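
The efficiency comparison in the last item can be worked through directly from the figures quoted there. This is back-of-the-envelope arithmetic only: the GPT-4 numbers are estimates, as noted above, and an H800 and an A100 do not have the same throughput, so GPU-months is not a like-for-like measure of compute. The helper name `gpu_months` is just for illustration.

```python
def gpu_months(gpus: int, months: float) -> float:
    """Crude training budget: number of GPUs times months of use."""
    return gpus * months

deepseek_v3 = gpu_months(2048, 2)     # H800 GPUs, figures quoted above
gpt4_estimate = gpu_months(15000, 3)  # A100 GPUs, estimated figures quoted above

compute_ratio = gpt4_estimate / deepseek_v3
param_ratio = 1700 / 37               # estimated GPT-4 parameters vs. DeepSeek V3 active parameters

print(f"GPU-months: {deepseek_v3:.0f} vs {gpt4_estimate:.0f} ({compute_ratio:.1f}x)")
print(f"Parameters used per token: about {param_ratio:.0f}x fewer")
```

On these figures the gap is about 11x in GPU-months and about 46x in parameters used per token, comfortably above the predicted factor of two.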

Moritz Maibaum (1y)

Gemini 1206 Exp has a 2 million token context window. Even if that isn't the effective context, it probably performs much better in that regard than GPT-4o and such. I haven't tested it yet because I don't want to get rate-limited from AI Studio in case they monitor that.

Frankly, the "shorter" conversations I had at a few tens of thousands of tokens were already noticeably more consistent than before, e.g. it referenced previous responses significantly later.

HunterJay (5mo*)

Another year has passed, 27 months total. Time for another review!

2023 Predictions for 2024, reassessed:

First, some predictions for 2024 were wrong because they hadn’t happened yet. Of those, let’s see how wrong I was -- did they happen in 2025 instead?

“Agents can do basic tasks on computers -- like filling in forms, working in excel, pulling up information on the web, and basic robotics control. This reaches the point where it is actually useful for some of these things”
- <1 year off | I rated this as ‘Debatable’ last year, based on Claude with Computer Use and Figure’s robot control. Today, Claude for Chrome and Claude Code, Codex, etc can clearly do these tasks sans robotics control. The robotics control piece has remained elusive and harder to claim; however, Tesla and Figure robots were deployed in some factory use, and use some general purpose transformers for part of their stack, so I think it can be claimed as “actually useful for some of these things” now.
“Context windows are no longer an issue for text generation tasks. Algorithmic improvements, or summarisation and workarounds, better attention on infinite context windows, or something like that solves the problem pretty much completely from a user’s perspective for the best models.”
- <1 year off | I rated this as ‘Mostly False’ for 2024, because although context windows had increased from ~8k to ~128k, I felt there were still limitations. They are now at ~1m, so if you are still having trouble with your text generation tasks, I would say it’s not because of the context window. Gemini 1.5 Pro also had a 1m context window in 2024, though I didn’t think it was effectively usable context at the time.
“GPT-5 has the context of all previous chats, Copilot has the entire codebase as context, etc.”
- >1 year off, debatable framing | I still rate this as mostly false, although searching chats and the codebase works perfectly, and in effect is the same thing. The capability is there, just different to how I framed it in 2023.
“Online learning begins -- GPT-5 or equivalent improves itself slowly, autonomously, but not noticeably faster than current models are improved with human effort and a training step. It does something like select its own data to train on from all of the inputs and outputs it has received, and is trained on this data autonomously and regularly (daily or more often).”
- >1 year off | I still rate this as false. Although current frontier models do help train their successors in several ways, we definitely do not run daily updates to the model.
“AI selection of what data to train on is used to improve datasets in general - training for one epoch on all data becomes less common, as some high quality or relevant parts of giant sets are repeated more often or allowed larger step size.”
- <1 year off | This is true today! AIs curate and filter their own data, and curriculum design is a larger part of efficient training. It is all done with AI assistance.

Overall Score for 2023 predictions about 2024:

- Leans True: 7
- Too Vague: 1
- Leans False but <1 Year Late: 3
- Leans False and >1 Year Late: 2

2023 Predictions for 2025:

“AI agents are used in basic robotics -- like LLM driven delivery robots and (in demos of) household and factory robots, like the Tesla Bot. Multimodal models basically let them work out of the box, although not 100% reliably yet.”
- Leans False | Demos of household and factory robots using LLMs have certainly happened, though the vibes of this prediction are overestimating progress in robotics. Tesla Optimus and Figure AI had small deployments in car factories, but not commercially, matching the prediction. Delivery robots did not use LLMs in any way, as far as I am aware, even for conversing with the sender or receiver, or for reasoning through high level path planning. Real-time multimodal AIs like Gemini Flash can live stream and respond to video and audio out of the box, but not reliably enough to be used as part of a robotics stack directly.
“Trends continue from the previous year:
- The time horizons agents can work on increase
  - True | 100% true, and now well tracked by the famous METR chart.
- LLMs improve on traditional LLM tasks
  - True | Trivially true, and an extremely obvious prediction. Comparing o1 to Opus 4.5, SWE-bench went from ~49% to ~81%, AIME maths went from ~74% to ~100%, and so on.
- Smaller models get more capable
  - True | Claude Haiku 4.5 matches or exceeds Claude Opus 3 on many benchmarks, despite (apparently) being far smaller.
- The best models get bigger.”
  - Debatable | Estimates of the most capable models’ parameter counts are higher than in previous years, but not by much, and the scaling up of parameters has not been a major source of performance improvements, against my broader expectations. It’s a technically correct prediction that missed the actual trend.
“AI curated and generated data becomes far more common than previously, especially for aligning models.”
- True | Synthetic data became the default approach for LLMs, and is used heavily for alignment training through Constitutional AI. DeepSeek’s R1 famously used pure RL for reasoning, and was distilled to smaller models using the generated reasoning traces.
“Virtual environments become more common for training general purpose models, combined with traditional LLM training.”
- Debatable | Robotics training in simulation is booming, as are world models as a research area, like Genie 3. However, they aren’t commonly used for training frontier LLMs. Those are trained in and for environments like browsers and terminals, with other tools which take actions for them, but that stretches the definition of ‘virtual environment’ a little.
“Code writing AI (just LLMs with context and finetuning) are capable of completely producing basic apps, solving most basic bugs, and working with human programmers very well -- it’s pair programming with an AI, with the AI knowing all of the low level details (a savant who has memorised the docs and can use them perfectly, and can see the entire codebase at once), and the human keeping track of the higher level plan and goals. The AI can also be used to recommend architectures and approaches, of course, and gradually does more and more between human inputs.”
- True | I barely directly write code myself anymore, as Opus 4.5 in Claude Code did nearly all of it for me by December 2025. I do still need to track the higher level plan, architecture, and goals, as predicted.
“If there ever feels like a lull in progress, it will be in this period leading up to models capable enough for robotics control, long time frame agents, and full form video generation, which I don’t expect to happen in a large scale way in 2025.”
- Leans True | There was talk of feeling like there was a lull in early-mid 2025, around the time leading up to and including GPT-5’s launch, and it is correct that robotics control, long time horizon agents (more than a few hours), and full form video generation haven’t taken off yet.
“Possibly GPT-6 or equivalent is released, but more likely continuous improvements to GPT-5 carry forward. There’s not a super meaningful difference at this point, with online learning continually improving existing models.”
- Leans False | No online learning in the sense I meant it, though we do have a focus on better post-training leading to many model releases sharing the same pre-trained base. I also think it is surprisingly debatable whether the jumps in capability in the chain GPT-3 --> GPT-4 --> o1 --> Opus 4.5 were roughly equally sized (and hence whether Opus 4.5 is GPT-6 equivalent from 2023’s perspective), though I would still assess it as ‘probably not / leans false’.

Overall Score for 2023 predictions about 2025:

- Leans True || True: 6 (3 non-trivial)
- Debatable: 2
- Leans False || False: 2

Overall, I think my predictions matched my calibration about them, and potentially did even slightly better than my 40% - 70% claim. The biggest mistake was predicting some form of continual learning, and by far the biggest omission was to say nothing about reasoning models, which would become the dominant paradigm a year after my original write up. I did talk about runtime search in July 2024, several months before o1-preview was announced, but completely missed it in 2023.

I was pretty accurately calibrated on AI software capabilities, but too bullish on robotics. In hindsight, the ‘lull’ prediction for 2025 was probably the most unlikely one to get mostly correct, and I think compared to (my memory of) other predictions at the time I correctly downweighted video generation and upweighted coding automation.

See you all next year!

[-]Daniel Kokotajlo3y30

“(i.e. as big a jump as GPT-3 to 4, to 5, to 6).”

Earlier you said the jump between 4 and 5 was as big as the jump between 3.5 to 4.

[-]HunterJay3y10

Thanks, fixed. I did mean 3.5 to 4, not 3 to 4.

[-]Moritz Maibaum1y20

Great calls for 2024, I'd say most are at least partially accurate.

However, looking at 2026, you definitely underestimated the pace of txt2video development, as did I. Given that Veo 2 can already make sequences with cuts and show the same subject across both clips, the 60s consistency will probably be reached in 2025. As for being above DALL-E 3's quality, that has already been surpassed.

I'd say that in late 2026 at the earliest, or more realistically late 2027 because of compute constraints, we'll see a product that can generate a coherent feature film length video, optionally photorealistically.

As for humanoid robots, I'd say they will be market ready at a reasonable price in 2027 or 2028. It doesn't make sense to look at cloud SOTA here, because these robots will likely have to use edge compute due to privacy concerns when going mainstream in households, and that in turn draws a lot of energy when running in real time across modalities. So there are multiple hardware issues to solve.

I'm also betting that there will be a Her-like product in the next 12 to 16 months that is indistinguishable from the movie version.

[-]HunterJay1y20

I agree, I definitely underestimated video. Before publishing, I had a friend review my predictions and they called out video as being too low, and I adjusted upward in response and still underestimated it.

I'd now agree with 2026 or 2027 for coherent feature film length video, though I'm not sure if it would be at feature film artistic quality (including plot). I also agree with Her-like products in the next year or two!

Personally, I would still expect cloud compute to be used for robotics, but only in ways where latency doesn't matter (like a planning and reasoning system on top of a smaller local model, doing deeper analysis like "There's a bag on the floor by the door. Ordinarily it should be put away, but given that it wasn't there 5 minutes ago, it might be actively used right now, so I should leave it..."). I'm not sure the privacy concerns will trump convenience, like with phones.

I also now think virtual agents will start to become a big thing in 2025 and 2026, doing some kinds of remote work, or sizable chunks of existing jobs autonomously (while still not being able to automate most jobs end to end)!

[-]Moritz Maibaum1y41

2 days ago and I might already have to adjust the timelines.

Nvidia's new Digits costs $3K and is the size of a Mac mini. Two of them can supposedly run a 400B parameter language model, which is crazy. So maybe the hardware issues aren't as persistent for robotics.

Also, Hailuo has a single-image reference mode now that works like a LoRA. It's super consistent for faces, even if the rest is a bit quirky.

[-]Mike Capuano2y10

I see some discussion here and in the associated Reddit thread about more efficient and smaller models. I think ChatGPT4 is at about one trillion parameters. I was under the impression that model sizes were increasing at about 10x/year so that could mean GPT5 is 10 trillion and GPT6 (or equivalent) is 100 trillion parameters by 2026. Does that sound about right or is there some sort of algorithmic change likely to happen that will allow LLMs to improve without the number of parameters growing 10x/year?

On a related note, I've heard backend cluster sizes are supposedly growing at similar rates. 32K nodes with 8 GPUs per node today growing at 10x per year. To me that seems improbable as it would be 320K nodes in 2025 and then 3.2M nodes in 2026.

Thoughts or info that you might have here?

[-]HunterJay2y10

10x per year for compute seems high to me. Naïvely I would expect the price/performance of compute to double every 1-2 years as it has been forever, with overall compute available for training big models being a function of that + increasing investment in the space, which could look more like one-time jumps. (I.e. a 10x jump in compute in 2024 may happen because of increased investment, but a 100x increase by 2025 seems unlikely.) But I am somewhat uncertain of this.
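To put rough numbers on that, here is a minimal sketch using only figures from this thread (32K nodes today, 10x per year, price/performance doubling every 1-2 years). The helper name `growth` is made up for illustration, and treating "whatever hardware gains don't cover" as an investment multiplier is a simplifying assumption.

```python
def growth(factor_per_period: float, period_years: float, years: float) -> float:
    """Total multiplier after `years` if capacity grows by
    `factor_per_period` once every `period_years`."""
    return factor_per_period ** (years / period_years)

years = 2
claimed = growth(10, 1, years)   # 10x per year
fast_hw = growth(2, 1, years)    # price/performance doubling every year
slow_hw = growth(2, 2, years)    # price/performance doubling every 2 years

start_nodes = 32_000  # cluster size quoted in the parent comment
print(f"10x/year for {years} years: {start_nodes * claimed:,.0f} nodes")
print(f"Hardware gains alone: {slow_hw:.0f}x to {fast_hw:.0f}x")
print(f"Extra spending needed to close the gap: "
      f"{claimed / fast_hw:.0f}x to {claimed / slow_hw:.0f}x")
```

Under these assumptions, hardware improvements cover only a small part of a 100x two-year jump, so most of it would have to come from increased spending.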

For parameters, I definitely think the largest models will keep getting bigger, and that compute will be the big driver of that -- but I would also expect improvements like mixture of experts models to continue, which effectively allow more parameters with less compute (because not all of the parameters are used at all times). Other techniques, like RLHF, also improve the subjective performance of models without increasing their size (i.e. getting them to do useful things rather than only predict which next word is most likely).

I guess my prediction here would be simply that things like this continue, so that with X compute you could get a better model in 2025 than you could in 2023. But you could also have 5x to 50x more compute in 2025, so you get both improvements combined!

It's obviously far cheaper to play with smaller models, so I expect lots of improvements will initially appear in models small-for-their-time.

Just my thoughts!

[-]Sergii3y10

I have a similar background (working at a robotics startup), would agree with many points.

“GPT-5 or equivalent is released. It’s as big a jump on GPT-4 as GPT-4 was on GPT-3.5.”

GPT-4 has (possibly) 10x the parameters of GPT-3.5. A similar jump to GPT-5 might require 10x the parameters again. Wouldn't that make it impractical (slow, expensive) to run?

“AI agents are used in basic robotics -- like LLM driven delivery robots and (in demos of) household and factory robots”

GPT-4 level models are too slow and expensive for real-time applications, so how do you imagine this could work? Even in Google's recent robotics demos that are based on "small" transformers, inference speed is one of the bottlenecks.

[-]p.b.3y10

If you scale width more than depth and data more than parameters you can probably go some ways before latency becomes a real problem.

Additionally, it would also make sense to take more time (i.e. larger models) for harder tasks. The user probably doesn't need code or mathematical solutions instantly, as long as it's still 100X faster than a human.

In robotics you probably need something hierarchical, where low-level movements are controlled by small nets.

[-]HunterJay3y10

Agree on lower depth models being possible. A few other possibilities:

- Smaller models with lower latency could be used, possibly distilled down from larger ones.
- Compute improvements might make it practical onboard (like with Tesla's self-driving hardware inside the chest of their android).
- New architectures could work on more than one time scale -- kind of like humans do. E.g. when we walk, not all of the processing is done in the brain. Your spinal cord can handle a tonne of it autonomously. (Will find source tomorrow).
- LLM-type models could do the parts that can accept higher latency, leaving lower level processes to handle themselves. Imagine for a household cleaning robot that an LLM based agent puts out high level thoughts like "Scan the room for dirty clothes. ... Fold them. ... Put them in the third drawer", and existing low level stuff actually carries out the instructions. That's an exaggerated example, but you get the idea, it doesn't have to replace the PID controller!
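A toy sketch of that split, assuming a one-dimensional "robot" whose velocity is commanded directly: a slow planner (standing in for the high-latency LLM agent) only hands out a new target occasionally, while a fast PID loop runs every tick. Everything here is hypothetical, including the names `slow_planner` and `run`, the waypoints, and the gains. The integral and derivative gains are set to zero to keep the toy loop easy to reason about.

```python
from dataclasses import dataclass

@dataclass
class PID:
    kp: float
    ki: float = 0.0
    kd: float = 0.0
    integral: float = 0.0
    prev_error: float = 0.0

    def step(self, setpoint: float, measured: float, dt: float) -> float:
        """Return a control output from the current tracking error."""
        error = setpoint - measured
        self.integral += error * dt
        derivative = (error - self.prev_error) / dt
        self.prev_error = error
        return self.kp * error + self.ki * self.integral + self.kd * derivative

def slow_planner(plan_index: int) -> float:
    """Stand-in for a high-latency LLM agent: returns the next target position."""
    targets = [1.0, 3.0, 2.0]  # hypothetical waypoints
    return targets[plan_index % len(targets)]

def run(ticks: int = 3000, dt: float = 0.01, plan_every: int = 1000) -> list[float]:
    """Run the fast loop every tick and consult the planner every `plan_every` ticks.
    Returns the position reached at the end of each planning interval."""
    pid = PID(kp=4.0)  # arbitrary gain, P-only for simplicity
    position, target = 0.0, 0.0
    reached = []
    for tick in range(ticks):
        if tick % plan_every == 0:
            if tick > 0:
                reached.append(position)
            target = slow_planner(tick // plan_every)  # slow, infrequent decision
        velocity = pid.step(target, position, dt)      # fast, every tick
        position += velocity * dt
    reached.append(position)
    return reached

if __name__ == "__main__":
    print(run())
```

The point of the design is that the planner's latency only limits how often the target changes, not how quickly the low-level loop reacts.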

[-]HunterJay3y10

I wrote this late at night, so to clarify and expand a little bit:

- "Work on more than one time scale" I think is actually an interesting idea to dwell on for a second. Like, when a person is trying to solve a problem, they will often pace back and forth, or talk, etc. They don't have to do everything in one pass. Somehow the complex computation which lets them see and move around can work on a very fast time scale, while other problem solving is going on simultaneously, and only starts to affect motor outputs later on. That's interesting. The spinal cord doing processing independent of the brain, which I mentioned, is evident in this older series of (rather horrible) experiments with cats: https://www.jstor.org/stable/24945006

- On the 'smaller models with lower latency', we already see models like Mistral-7B outperforming 30B parameter models because of improvements in data, architecture, and training. I expect this trend to continue. If the largest models are capable of operating a robot out of the box, I think you could take those outputs and use them to train a smaller model (or otherwise distill the larger model down) to a more manageable size, more specialised for the task.

- On the 'LLMs could do the parts with higher latency', just yesterday I saw somebody do something like this with GPT-4V, where they periodically uploaded a photograph of what was in front of them, and got GPT-4V to output instructions on how to find the supermarket (walk further forward, turn right, etc.). It kind of worked. That's the sort of thing I was picturing here, leaving much more responsive systems to handle the low latency work, like balance, gripping, etc.

[-]Ziyue Wang3y10

Interesting to read! Curious about your prediction about AI safety related progress. Not sure how much impact it will have on your current prediction.

[-]HunterJay3y50

I am extremely worried about safety, but I don't know as much about it as I do about what's on the edge of consumer / engineering trends, so I think my predictions here would be not useful to share right now! The main way it relates to my guesses here is if regulation successfully slows down frontier development within a few years (which I would support).

I'm doing the ARENA course async online at the moment, and possibly moving into alignment research in the next year or two, so hoping to be able to chat more intelligently on alignment soonish.
