The major shift in the next 3 years will be that, as a rule, top level AI labs will not release their best models. I'm certain this has somewhat been the case for OpenAI, Anthropic and Google for the past year. At some point full utilization of a SOTA model will be a strategic advantage for companies themselves to use for their own tactical purposes. The moment any $X of value can be netted from an output/inference run of a model for less than $(X-Y) in costs, where Y represents the marginal labor/maintenance/averaged risk costs for each run's output, no company would ever be advantaged by releasing the model to be used by anyone other than themselves. This closed-source event horizon I imagine will occur sometime in late 2024.
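The break-even condition in the comment above can be written as a one-line check. This is only a sketch: the function name and the dollar figures are made up for illustration, and Y is treated as a single lump overhead per run.

```python
def internal_use_is_profitable(value: float, inference_cost: float, overhead: float) -> bool:
    """Return True when a run nets `value` (X) for less than X - Y in inference cost.

    `overhead` is Y: marginal labor, maintenance and averaged risk per run.
    """
    return inference_cost < value - overhead


# Hypothetical numbers, purely illustrative.
print(internal_use_is_profitable(value=10.0, inference_cost=2.0, overhead=5.0))  # True
print(internal_use_is_profitable(value=10.0, inference_cost=6.0, overhead=5.0))  # False
```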
This is a very good, and very scary point - another thing that could provide at least the appearance of a discontinuity. One symptom of this scenario would be a widespread, false belief that "open source" models are SOTA.
Might be good to brainstorm other symptoms to prime ourselves to recognize when we are in this scenario. Complete hiring-freezes/massive layoffs at the firms in question, aggressive expansion into previously-unrelated markets, etc.
Not sure I understand; if model runs generate value for the creator company, surely they'd also create value that lots of customers would be willing to pay for. If every model run generates value, and there's ability to scale, then why not maximize revenue by maximizing the number of people using the model? The creator company can just charge the customers, no? Sure, competitors can use it too, but does that really override losing an enormous market of customers?
That's very true, but there are two reasons why a company may not be inclined to release an extremely capable model:
1. Safety risk: someone uses a model and jailbreaks it in some unexpected way; the risk of misuse is much higher with a more capable model. OpenAI had GPT-4 for 9-10 months before releasing it, trying to RLHF it and even lobotomizing it to make it safer. The Summer 2022 internal version of GPT-4 was, according to Microsoft researchers, more generally capable than the released version (as evidenced by the draw a unicorn test). This needed delay and the assumed risks will naturally be much greater with a larger model, both because larger models, so far, seem harder to simply RLHF into unjailbreakability, and because, being more capable, any jailbreak carries more risk; thus the general business-level margin of safety will be higher.
2. Sharing/exposing capabilities: Any business wants to maintain a strategic advantage. Releasing a SOTA model will allow a company's competitors to use it, test its capabilities and train models on its outputs. This reality has become more apparent in the past 12 months.
It does seem to me a little silly to give competitors API access to your brain. If one has enough of a lead, one can just capture one's competitors' markets.
Text data is running out, the 2+ billion dollar scale training runs due in 2-4 years are going to devour the rest of it. This might be sufficient to reach AGI, in the sense of capability for mostly autonomous research, in particular development of compute multipliers for training runs and plucking the rest of low hanging fruit of the unsupervised learning revolution, overcoming scarcity of hardware.
If AGI is not in range of those runs, and if there is no synthetic data generation process useful at that scale, the bulk of compute goes to multimodality (though latency will still cripple many use cases), and the rate of competence improvement may slow for years. This is the main scenario where I see significant hope for regulation to take hold. Doing better and countering the risk of AGI in the initial rush to billion dollar scale runs requires a nebulously defined pause right now.
One way to get more data is to pay humans to create the specific types of data we need. For example, if a billion people each write 100 pages on the unique topic of their expertise - with the needed data generation controlled by AI - maybe that will be enough.
I'm somewhat skeptical that running out of text data will meaningfully slow progress. Today's models are so sample inefficient compared with human brains that I suspect there are significant jumps possible there.
Also, as you say:
- Synthetic text data might well be possible (especially for domains where you can test the quality of the produced text externally, e.g. programming).
- Reinforcement-learning-style virtual environments can also generate data (and not necessarily only physics based environments either -- it could be more like playing games or using a computer).
- And multimodal inputs give us a lot more data too, and I think we've only really scratched the surface of multimodal transformers today.
New untested ideas take unpredictable time to develop. Given the current timeline of pure compute/investment scaling, there is no particular reason for all bottlenecks to be cleared just in time for scaling to continue without slowing down. Hence the possibility of it slowing down at the upcoming possible bottlenecks of natural text data and available-on-short-notice hardware, which are somewhat close together.
Sample efficiency (with respect to natural data) can in principle be improved, humans and some RL systems show it's possible, and synthetic data is a particular form this improvement might take. But it's not something that's readily available, known to subsume capabilities of LLMs and scale past them. Also, straying further from the LLM recipe of simulating human text might make alignment even more intractable. In a universe where alignment of LLMs is feasible within the current breakneck regime, the source of doom I worry about is an RL system that either didn't train on human culture or did too much reflection to remain within its frame.
Compared to natural text, multimodal data and many recipes for synthetic data give something less valuable for improving model competence, reducing return on further scaling. When competence improvement slows down, and if AGI in the sense of human-level autonomous work remains sufficiently far away at that point, investment scaling is going to slow down as well. Future frontier models cost too much if there is no commensurate competence improvement.
My hunch is that there's sufficient text already if an AI processes it more reflectively. For example, each chunk of text can be fed through a series of LLM prompts intended to enrich it, and then the model trains on the enriched/expanded text.
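A minimal sketch of what such an enrichment pass might look like. Everything here is an assumption for illustration: the prompt templates are invented, and `llm` is any callable from prompt string to response string, not a specific library's API.

```python
from typing import Callable, Iterable

# Hypothetical prompt templates; the comment above does not specify any.
ENRICH_PROMPTS = [
    "Summarize the key claims in this passage:\n{text}",
    "List questions a careful reader would ask about this passage:\n{text}",
]


def enrich(chunks: Iterable[str], llm: Callable[[str], str]) -> list[str]:
    """Return each chunk followed by the model's response to every prompt."""
    enriched = []
    for chunk in chunks:
        parts = [chunk]  # keep the original text first
        for template in ENRICH_PROMPTS:
            parts.append(llm(template.format(text=chunk)))
        enriched.append("\n\n".join(parts))
    return enriched
```

The output of `enrich` would then be the training corpus in place of the raw chunks; whether that actually improves competence is the open question in this thread.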
I dislike this type of post. Predictions are nice and all, but I can take no updates from these predictions because I know nothing about the author or any reason to give their predictions credence and no real evidence is given to justify the predictions (evidence could include something like making real money bets on prediction markets, not just data and arguments!). So for me this post is nothing but speculation, and I'm disappointed that LessWrong readers voted it up.
Note: This is nothing against the post's author! I think it's totally fine for a person to speculate. I'm just both surprised this became a frontpage post and that it was voted up!
I am honestly very surprised it became a front page post too! It totally is just speculation.
I tried to be super clear that these were just babbled guesses, and I was mainly just telling people to try to do the same, rather than trusting my starting point here.
The other thing that surprised me is that there haven't been too many comments saying "this part is off", or "you missed trend X!". I was kind of hoping for that!
I really like this post, but more for:
than updating my own credences on specifics.
I agree the author should attach credences. I'd also appreciate a little more specificity, for example with the prediction, "Google releases something competitive to ChatGPT." I'm not sure whether that refers to ChatGPT-3.5 or ChatGPT-4 and the meaning here is actually quite important.
oint, a major step change will also happen when AI is capable of generating new major scientific breakthroughs on its own -- more akin to Einstein coming up with relativity to explain known data than
The acceleration of AI R&D will begin sooner than that, I think. We could get a 10x speedup just by automating the typical OpenAI engineer.
I broadly agree. I think AI tools are already speeding up development today, and on reflection I don't actually think AI being more capable than humans at modeling the natural world would be a discontinuous point on the ramp up to superintelligence.
It would be a point where AI gets much harder to predict, though, which is probably why it was on my mind when I was trying to come up with predictions.
And OpenAI has explicitly said this is what they want to do! Their Superalignment strat looks suspiciously like "gunning for RSI".
I see some discussion here and in the associated Reddit thread about more efficient and smaller models. I think GPT-4 is at about one trillion parameters. I was under the impression that model sizes were increasing at about 10x/year, so that could mean GPT-5 is 10 trillion and GPT-6 (or equivalent) is 100 trillion parameters by 2026. Does that sound about right, or is there some sort of algorithmic change likely to happen that will allow LLMs to improve without the number of parameters growing 10x/year?
On a related note, I've heard backend cluster sizes are supposedly growing at similar rates. 32K nodes with 8 GPUs per node today growing at 10x per year. To me that seems improbable as it would be 320K nodes in 2025 and then 3.2M nodes in 2026.
Thoughts or info that you might have here?
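For reference, the arithmetic in the comment above can be reproduced with a short script. It is only a sketch: the starting figures (about 1 trillion parameters, 32K nodes with 8 GPUs each) and the 10x/year rate are the commenter's, and taking 2024 as the starting year is an assumption implied by the 2025 and 2026 node counts.

```python
def project(start: float, factor: float, years: int) -> list[float]:
    """Compound `start` by `factor` once per year, returning each year's value."""
    return [start * factor**n for n in range(years + 1)]


# Figures from the comment above; the 2024 base year is an assumption.
params = project(1e12, 10, 2)
nodes = project(32_000, 10, 2)
for year, p, n in zip((2024, 2025, 2026), params, nodes):
    print(f"{year}: {p:.0e} parameters, {n:,.0f} nodes, {n * 8:,.0f} GPUs")
```

Two years of 10x growth is a 100x increase, which is why the node count lands in the millions by 2026.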
10x per year for compute seems high to me. Naïvely I would expect the price/performance of compute to double every 1-2 years as it has been forever, with overall compute available for training big models being a function of that + increasing investment in the space, which could look more like one-time jumps. (I.e. a 10x jump in compute in 2024 may happen because of increased investment, but a 100x increase by 2025 seems unlikely.) But I am somewhat uncertain of this.
For parameters, I definitely think the largest models will keep getting bigger, with compute being the big driver of that -- but I would also expect improvements like mixture of experts models to continue, which effectively allow more parameters with less compute (because not all of the parameters are used at all times). Other techniques, like RLHF, also improve the subjective performance of models without increasing their size (i.e. getting them to do useful things rather than only predict what next word is most likely).
I guess my prediction here would be simply that things like this continue, so that if you have X compute in 2025, you could get a better model than you could with the same compute in 2023. But you also could have 5x to 50x more compute in 2025, so you get both of those improvements together!
It's obviously far cheaper to play with smaller models, so I expect lots of improvements will initially appear in models small-for-their-time.
Just my thoughts!
I have a similar background (working at a robotics startup), would agree with many points.
GPT-5 or equivalent is released. It’s as big a jump on GPT-4 as GPT-4 was on GPT-3.5.
GPT-4 has (possibly) 10x the parameters of GPT-3.5. A similar jump in GPT-5 might require 10x the parameters again; wouldn't that make it impractical (slow, expensive) to run?
AI agents are used in basic robotics -- like LLM driven delivery robots and (in demos of) household and factory robots
GPT-4 level models are too slow and expensive for real-time applications, so how do you imagine this could work? Even in Google's recent robotics demos that are based on "small" transformers, inference speed is one of the bottlenecks.
If you scale width more than depth and data more than parameters you can probably go some ways before latency becomes a real problem.
Additionally, it would also make sense to take more time (i.e. larger models) for harder tasks. The user probably doesn't need code or mathematical solutions instantly, as long as it's still 100X faster than a human.
In robotics you probably need something hierarchical, where low-level movements are controlled by small nets.
Agree on lower depth models being possible, a few other possibilities:
Smaller models with lower latency could be used, possibly distilled down from larger ones.
Compute improvements might make it practical onboard (like with Tesla's self-driving hardware inside the chest of their android).
New architectures could work on more than one time scale -- kind of like humans do. E.g. when we walk, not all of the processing is done in the brain. Your spinal cord can handle a tonne of it autonomously. (Will find source tomorrow).
LLM-type models could do the parts that can accept higher latency, leaving lower level processes to handle themselves. Imagine for a household cleaning robot that an LLM-based agent puts out high level thoughts like "Scan the room for dirty clothes. ... Fold them. ... Put them in the third drawer", and existing low level stuff actually carries out the instructions. That's an exaggerated example, but you get the idea: it doesn't have to replace the PID controller!
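The split between a slow, high-level agent and a fast, low-level controller can be sketched roughly as follows. This is a toy under stated assumptions: `slow_planner` is a hypothetical stand-in for the LLM agent (no model is called), the plant is a pure integrator, and the PID gains are arbitrary.

```python
from dataclasses import dataclass


@dataclass
class PID:
    """Textbook PID controller; the gains used below are arbitrary."""
    kp: float
    ki: float
    kd: float
    integral: float = 0.0
    prev_error: float = 0.0

    def step(self, target: float, measured: float, dt: float) -> float:
        error = target - measured
        self.integral += error * dt
        derivative = (error - self.prev_error) / dt
        self.prev_error = error
        return self.kp * error + self.ki * self.integral + self.kd * derivative


def slow_planner(tick: int) -> float:
    """Hypothetical stub for the high-latency agent: returns a target position."""
    return 1.0 if tick < 500 else 0.0


def run(ticks: int = 1000, dt: float = 0.01, plan_every: int = 100) -> float:
    pid = PID(kp=2.0, ki=0.0, kd=0.0)
    position, target = 0.0, 0.0
    for tick in range(ticks):
        if tick % plan_every == 0:  # slow loop: replan only occasionally
            target = slow_planner(tick)
        velocity = pid.step(target, position, dt)  # fast loop: runs every tick
        position += velocity * dt  # toy plant: position integrates velocity
    return position
```

The planner is consulted once every `plan_every` ticks while the controller runs on every tick, so the planner's latency only limits how often the goal changes, not how quickly the robot tracks it.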
I wrote this late at night, so to clarify and expand a little bit;
- "Work on more than one time scale" I think is actually an interesting idea to dwell on for a second. Like, when a person is trying to solve a problem, they will often pace back and forth, or talk, etc. They don't have to do everything in one pass; somehow the complex computation which lets them see and move around can work on a very fast time scale, while other problem solving is going on simultaneously, and only starts to affect motor outputs later on. That's interesting. The spinal cord doing processing independent of the brain thing I mentioned is evident in this older series of (rather horrible) experiments with cats: https://www.jstor.org/stable/24945006
- On the 'smaller models with lower latency', we already now see models like Mistral-7B outperforming 30b parameter models because of improvements in data, architecture, and training. I expect this trend to continue. If the largest models are capable of operating a robot out of the box, I think you could take those outputs, and use them to train (or otherwise distill down) the larger model to a more manageable size, more specialised for the task.
- On the 'LLMs could do the parts with higher latency', just yesterday I saw somebody do something like this with GPT-4V, where they periodically uploaded a photograph of what was in front of them, and got GPT-4V to output instructions on how to find the supermarket (walk further forward, turn right, etc). It kind of worked, and that's the sort of thing I was picturing here, leaving much more responsive systems to handle the low latency work, like balance, gripping, etc.
Interesting to read! Curious about your prediction about AI safety related progress. Not sure how much impact it will have on your current prediction.
I am extremely worried about safety, but I don't know as much about it as I do about what's on the edge of consumer / engineering trends, so I think my predictions here would be not useful to share right now! The main way it relates to my guesses here is if regulation successfully slows down frontier development within a few years (which I would support).
I'm doing the ARENA course async online at the moment, and possibly moving into alignment research in the next year or two, so hoping to be able to chat more intelligently on alignment soonish.
Epistemic status: My, mostly intuitive, guesses - with only a few days dwelling on it, and no serious research beyond what I already knew.
I work in the startup sphere, in field robotics, and I am about to have an opportunity to majorly shift what I am working on. To work out what projects might make sense on a multi-year time frame, I wrote up what I thought might happen in AI in the next couple of years as specifically as I could.
I found the exercise surprisingly useful. It turned a whole bunch of vague "X will get better over time" to actionable "X will be practical in around Y years". I don't think my guesses will end up actually being very accurate, but having something solid forced me to actually think about the future and make my gibberish-internal-intuitions into more-consistent-guesses. I was really surprised at how much it helped, actually.
So, here's that list of predictions. I'm sharing it here more as a "here's how you can do something similar" than as a "here's a well researched report on the future trends." (which it definitely is not). I didn't go to the trouble of putting my %'s on the guesses "X in year Y", but it's about 40%-70% for any given guess.
________________________________
[Written 6th October 2023]
Rest of 2023:
2024
2025
2026
2027 & Beyond