This is a fantastic post. Big upvote.
I couldn't agree more with your opening and ending thesis, which you put ever so gently:
the current portfolio is over-indexed on work which treats “transformative AI” as a black box
It seems obvious to me that trying to figure out alignment without talking about AGI designs is going to be highly confusing. It also seems likely to stop short of a decent estimate of the difficulty. It's hard to judge whether a plan is likely to fail when there's no actual plan to judge. And it seems like any actual plan for alignment would reference a way AGI might use knowledge and make decisions.
WRT the language model agent route, you've probably seen my posts, which are broadly in agreement with your take:
Capabilities and alignment of LLM cognitive architectures
Internal independent review for language model agent alignment
The second focuses more on the range of alignment techniques applicable to LMAs/LMCAs. I wind up rather optimistic, particularly when the target of alignment is corrigibility or DWIM-and-check.
It seems like even if LMAs achieve AGI, they might progress slowly beyond the human-level source of the LLM training. That could be a really good thing. I want to think about this more.
I'm unsure how much to publish on possible routes. Right now it seems to me that advancing progress on LMAs is actually a good thing, since they're more transparent and directable than any other AGI approach I can think of. But I don't trust my own judgment when there's been so little discussion from the hardcore alignment-is-hard crowd.
It boggles my mind that posts like this, forecasting real routes to AGI and alignment, don't get more attention and discussion. What exactly are people hoping for as alignment solutions if not work like this?
Again, great post, keep it up.
Good post!
In their most straightforward form (“foundation models”), language models are a technology which naturally scales to something in the vicinity of human-level (because it’s about emulating human outputs), not one that naturally shoots way past human-level performance
You address this to some extent later on in the post, but I think it's worth emphasizing the extent to which this specifically holds in the context of language models trained on human outputs. If you take a transformer with the same architecture but train it on a bunch of tokenized output streams of a specific model of weather station, it will learn to predict the next token of the output stream of weather stations, at a level of accuracy that does not particularly have to do with how good humans are at that task.
And in fact for tasks like "produce plausible continuations of weather sensor data, or apache access logs, or stack traces, or nucleotide sequences" the performance of LLMs does not particularly resemble the performance of humans on those tasks.
I’m not at all confident what people who are concerned about navigating AI well should be doing. But I feel that the current portfolio is over-indexed on work which treats “transformative AI” as a black box and tries to plan around that. I think that we can and should be peering inside that box.
I’d like to better understand the plausibility of the kind of technological trajectory I’m outlining. I’d like to develop a better sense of how the different risks relate to this. And I’d like to see some plans which step through how we might successfully navigate the different phases of this technological development. I think that this is a kind of zoomed-in prioritization which could help us to keep our eyes on the most important balls, and which we haven’t been doing a great deal of.
Agree. I think there are pretty strong reasons to believe that with a concerted effort, we can very likely (> 90% probability) build safe scaffolded LM agents capable of automating ~all human-level alignment research while also being incapable of doing non-trivial consequentialist reasoning in a single forward pass. Also (still) looking for collaborators for this related research agenda on evaluating the promise of automated alignment research.
In their most straightforward form (“foundation models”), language models are a technology which naturally scales to something in the vicinity of human-level (because it’s about emulating human outputs), not one that naturally shoots way past human-level performance
For a more detailed analysis of how this problem could be overcome but why doing so is unlikely to be a fast process, see my post LLMs May Find it Hard to FOOM. (Later parts of your post have some overlap with that, but there are some specifics such as conditioning and extrapolation that you don't discuss, so readers will find some more useful content there.)
I think there are two really important applications, which have the potential to radically reshape the world:
- Research
- The ability to develop and test out new ideas, adding to the body of knowledge we have accumulated
- Automating this would be a massive deal for the usual reasons about feeding back into growth rates, facilitating something like a singularity
- In particular the automation of further AI development is likely to be important
- There are many types of possible research, and automation may look quite different for e.g. empirical medical research vs fundamental physics vs political philosophy
- The sequence in which we get the ability to automate different types of research could be pretty important for determining what trajectory the world is on
- Executive capacity
- The ability to look at the world, form views about how it should be different, and form and enact plans to make it different
- (People sometimes use “agency” to describe a property in this vicinity)
- This is the central thing that leads to new things getting done in the world. If this were fully automated we might have large fully autonomous companies building more and more complex things towards effective purposes.
- This is also the thing which, (if/)when automated, creates concerns about AI takeover risk.
I agree. I tentatively think (and have been arguing in private for a while) that these are 'basically the same thing'. They're both ultimately about
They differ (just as research disciplines differ from other disciplines, and executing in one domain differs from other domains) in the specifics, especially on what existing models are useful and the 'research taste' required to generate experiment ideas and estimate value-of-information. But the high level loop is kinda the same.
Unclear to me what these are bottlenecked by, but I think the latent 'research taste' may be basically it (potentially explains why some orgs are far more effective than others, why talented humans take a while to transfer between domains, why mentorship is so valuable, why the scientific revolution took so long to get started...?)
In particular, the 'big two' are both characterised by driving beyond the frontier of the well-understood, which means that by necessity they're about efficiently and deliberately setting up informative/serendipitous scenarios to get novel informative data. When you're by necessity navigating beyond the well-understood, you have to bottom out your plans with heuristic guesses about VOI, and you have to make plans which (at least sometimes) have good VOI. Those have to ground out somewhere, and that's the 'research taste' at the system-1-ish level.
I think it’s most likely that for a while centaurs will significantly outperform fully automated systems
Agree, and a lot of my justification comes from this feeling that 'research taste' is quite latent, somewhat expensive to transfer, and a bottleneck for the big 2.
Very high-effort, comprehensive post. Any interest in making some of your predictions into markets on Manifold or some other prediction market website? Might help get some quantifications.
At tasks like “give a winning chess move”, we can generate high quality synthetic data so that it’s likely that we can finetune model performance to exceed top human intuitive play.
With some more effort, this also applies to "prove this mathematical conjecture" (using automated proof checkers like Lean) and (with suitably large and well-designed automated test suites) also to "write code to solve this problem". These seem like areas broad enough that scaling them up to far superhuman levels, as well as being inherently useful, might also carry over towards other tasks requiring rational and logical thinking. Also, this would probably be an ideal forum in which to work on solutions to the 'drunkenness' issue.
1) seems like mostly a sideshow — while we could get agency from this, unless people are trying hard I don’t think it would tend to find especially competent agents to emulate, and may not have a good handle on what’s going on in the world.
I'm very puzzled by this opinion. If we can reduce the 'drunkenness' issue, this type of agency scales to at least the competence level of the most competent humans (or indeed, fictional characters) in existence, and probably at least some distance beyond by extrapolation (and can be run cheaply in faster than realtime). These agents are not safe: humans are not fully aligned to human values, power corrupts, and Joseph Stalin was not well aligned with the needs of the citizenry of Russia. This seems like plenty to be concerned about, rather than a sideshow. Now, the ways in which they're not aligned are at least ones we have a good intuitive and practical understanding of, and some partial solutions for controlling (things like love, guilt, salaries, and law enforcement).
I’m hazier on the details of how this would play out (and a bit sceptical that it would enable a truly runaway feedback loop), but more sophisticated systems could help to gather the real-world data to make subsequent finetuning efforts more effective.
On the contrary, I think proactive gathering of data is very plausibly the bottleneck, and (smarts) -> (better data gathering) -> (more smarts) is high on my list of candidates for the critical feedback loop.
In a world where the 'big two' (R&D and executive capacity) are characterised by driving beyond the frontier of the well-understood it's all about data gathering and sample-efficient incorporation of the data.
FWIW I don't think vanilla 'fine tuning' necessarily achieves this, but coupled with retrieval augmented generation and similar scaffolding, incorporation of new data becomes more fluent.
Notable techniques for getting value out of language models that are not mentioned:
Also, I would say, retrieval-augmented generation (RAG) is not just a mundane way to industrialise language models, but an important concept whose properties should be studied separately from scaffolding or fine-tuning or other techniques that I listed in the comment above.
Thanks. At a first look at what you're saying I'm understanding these to be subcategories of using finetuning or scaffolding (in the case of leveraging semantic knowledge graphs) in order to get useful tools. But I don't understand the sense in which you think finetuning in this context has completely different properties. Do you mean different properties from the point where I discuss agency entering via finetuning? If so I agree.
(Apologies for not having thought this through in greater depth.)
I think you tied yourself too much to the strict binary classification that you invented (finetuning/scaffolding). You overgeneralise and your classification blocks the truth more than clarifies things.
All the different things that can be done by LLMs: tool use, scaffolded reasoning aka LM agents, RAG, fine-tuning, semantic knowledge graph mining, reasoning with semantic knowledge graph, finetuning for following "virtue" (persona, character, role, style, etc.), finetuning for model checking, finetuning for heuristics for theorem proving, finetuning for generating causal models, (what else?), just don't easily fit into two simple categories with the properties that are consistent within the category.
But I don't understand the sense in which you think finetuning in this context has completely different properties.
In the summary (note: I actually didn't read the rest of the post, I've read only the summary), you write something that implies that finetuning is obscure or un-interpretable:
From a safety perspective, language model agents whose agency comes from scaffolding look greatly superior than ones whose agency comes from finetuning
- Because you can get an extremely high degree of transparency by construction
But this totally doesn't apply to these other variants of finetuning that I mentioned. If the LLM is a heuristic engine that generates mathematical proofs which are later verified with Lean, it just stops making any sense to discuss how interpretable or transparent these theorem-proving or model-checking LLM-based heuristic engines are.
Strong upvote. We are definitely not talking enough about what Scaffolded Language Model Agents mean for AI alignment. They are the light of hope: interpretable-by-design systems with tractable alignment and slow takeoff potential.
One possibility that arises as part of a mixed takeoff is using machine learning to optimize for the most effective scaffolding.
This should be forbidden. Turning scaffolding that is explicitly written in code into another black box will not only greatly damage interpretability but also pose huge risks of accidentally creating a sentient entity without noticing it. Scaffolding for LMAs serves a very similar role to consciousness for humans, so we should be very careful in this regard.
1. Introduction
1.1 Summary of key claims
1.2 Meta
We know that AI is likely to be a very transformative technology. But a lot of the analysis of this point treats something like “AGI” as a black box, without thinking too much about the underlying tech which gets there. I think that’s a useful mode, but it’s also helpful to look at specific forms of AI technology and ask where they’re going and what the implications are.
This doc does that for language models. It’s a guide for thinking about them from various angles with an eye to what the strategic implications might be. Basically I’ve tried to write the thing I wish I’d read a couple of years ago; I’m sharing now in case it’s helpful for others.
The epistemic status of this is “I thought pretty hard about this and these are my takes”; I’m sure there are still holes in my thinking (NB I don’t actually do direct work with language models), and I’d appreciate pushback; but I’m also pretty sure I’m capturing some important dynamics which aren’t as broadly appreciated as they should be. Many of the particular insights here are due to other people. I want to say thanks to Adam Bales, Anna Wang, Buck Shlegeris, Carl Shulman, Daniel Dewey, Eric Drexler, Max Dalton, Nate Soares, Rebecca Cotton-Barratt, Rohin Shah, Rose Hadshar, Tom Davidson, and especially Beth Barnes, David Manheim, Lukas Finnveden, and Toby Ord, for helpful comments and/or conversations.
2. What type of thing are language models?
2.1 Emulating civilization, not individual people
The field of AI was originally about reproducing human intelligence. Humans are good at finding patterns and learning things. If we could automate the type of thinking they do, that would be a big deal. If we could build automated systems which were better general learners and thinkers than humans, it would transform the world.
Language models aren’t really trying to do the same thing. This may be a surprising claim; they’re a type of machine learning, which is doing exactly this. However, I think it’s clearer to think of language models as a specialized application of machine learning. Sure, they make use of machine learning techniques, but their game isn’t really “be better than humans at learning from a certain amount of language” (indeed they’re fed with so much data that they can be much more inefficient than humans, and I don’t think this is a crux for how important they will be). It’s “replicate the kind of things humans say”.
This is powerful because humans, collectively, know a bunch of stuff, both implicitly and explicitly. There’s a lot of knowledge and intelligence which is crystallized in our writing. If the language models of today seem to know a lot of things, this isn’t because they’ve gone out and understood the world directly, but because they’re leveraging knowledge which is represented in human text.
Moreover language is the medium via which we construct concepts and make explicit arguments — powerful tools for understanding and acting in the world. The ability to approximate human writing — even if not based on the same underlying learning abilities — might reproduce a lot of that intelligence.
All of this matters for thinking about the impacts language models are likely to have, and where they might be going. In slogan form, perhaps:
Note that language models could be used to emulate the written output of individual people, if a prompt was specific enough that it tightly specified the author. But this isn’t their default mode — mostly predicting text will depend on averages across a lot of different (possible) people (weighted by how likely those people were to be writing about the topic).
2.2 An extremely crude picture of how language models work
For the purposes of this document, what I think is important:
2.3 What are foundation models approximating?
We can think of foundation models as a series of approximations. A given foundation model Wi approximates the limit WText of what we could achieve with ideal machine learning and all extant text. This in turn approximates WOmega, which is the true distribution human writing is drawn from. Foundation models can never actually achieve “the true distribution”, but understanding that this is what they’re approximating may help us to understand their scope as a technology.
Here’s a digression digging a bit deeper on these concepts:
3. Techniques for getting value from language models
A major focus of research on language models has been on improving the foundation models, i.e. getting better approximations to WText. But there is important complementary research on the question: for a fixed foundation model Wi, how can you do useful things? There are a few different techniques:
3.1 Prompt engineering
The output of foundation models depends on the prompts they are given. This would be true of WOmega — the value of being able to sample from all possible human documents would be importantly dependent on the ability to steer towards the most useful parts of document space. For the weaker foundation models we have, there may be other helpful tricks in designing prompts.
Over the last couple of years, as people have played around with language models, there has been a lot of parallelized labour into finding the style of prompts that is most likely to lead to good things. To the extent that people are finding knowledge about how to get value out of WOmega, this will generalize to future language models; to the extent that they’re learning tricks peculiar to the current generation of foundation models, it may not.
3.2 Scaffolding
Scaffolding is the general category of designing environments around language models which feed them prompts and process their outputs. Scaffolding is a broad category of which the most straightforward case is just prompt engineering, but in general it allows for complex procedures where the output in response to earlier prompts is fed into other software tools, and these determine what is put into later prompts.
For example, scaffolding could allow for a model to make multi-stage plans and then call separate instances to execute each of those stages without losing track of where it is, and to make use of tools such as browsing the internet and writing and executing software.
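To make the plan-then-execute pattern concrete, here is a minimal sketch. Everything in it is an assumption for illustration: `toy_model` is a canned stub standing in for a real language model call, and the `PLAN:` / `EXECUTE:` prompt formats and the `run_scaffold` helper are hypothetical names, not anything from an actual system. The point is only that the ordinary code around the model, rather than the model itself, tracks which stage is being run and what has been done so far.

```python
from typing import Callable, List

# Hypothetical stand-in for a language model call: prompt in, text out.
Model = Callable[[str], str]


def toy_model(prompt: str) -> str:
    """Canned stub so the sketch runs without any real model."""
    if prompt.startswith("PLAN:"):
        return "gather sources\nsummarise sources\ndraft report"
    if prompt.startswith("EXECUTE:"):
        # Pull out the step named in the prompt and pretend to do it.
        step = prompt.split("Step: ", 1)[1].split("\n", 1)[0]
        return f"done: {step}"
    return ""


def run_scaffold(task: str, model: Model) -> List[str]:
    """Ask for a multi-stage plan, then run each stage in a separate call."""
    plan = model(f"PLAN: {task}")
    steps = [line.strip() for line in plan.splitlines() if line.strip()]
    results: List[str] = []
    for i, step in enumerate(steps, start=1):
        # The scaffold keeps track of progress and passes it into the prompt.
        so_far = "\n".join(results)
        prompt = (
            f"EXECUTE: stage {i} of {len(steps)}\n"
            f"Step: {step}\n"
            f"So far:\n{so_far}"
        )
        results.append(model(prompt))
    return results


results = run_scaffold("write a report", toy_model)
for line in results:
    print(line)
```

Because the control flow lives in plain code, every intermediate prompt and output can be logged and inspected, which is one way of reading the transparency claim made for scaffolded agents.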
Limits of what might be achievable via scaffolding are discussed in Section 6.3.2.
3.3 Finetuning
Finetuning takes a foundation model and runs more machine learning to adjust just some of the weights, using the foundation model to give an inductive bias in its search for more refined models. The idea is that it’s much easier to find models which are smart in arbitrary ways if you’re restricted to a much smaller-dimensional search space. For small amounts of finetuning, we might think of the inductive bias as being roughly “only consider saying things that humans might say”. For larger amounts of finetuning the bias might be more structural, making use (in opaque ways) of implicit knowledge the language model has to restrict the search space.
Finetuning relies on having some metric, or feedback loop, to train things towards. This could be given by some body of text it’s trying to emulate, or by some other function of text output.
3.4 Combining these
Scaffolding and finetuning can be combined. Generically I think they will be. For it not to make sense to use scaffolding, it would have to be the case that the trivial scaffold performed (roughly) as well as anything else. I think this is implausible at least in the short term. And it would be even more surprising if foundation models (which were selected for their ability to emulate human outputs) happened to be optimized among nearby systems for their performance when used in an effective scaffold. I therefore think it’s implausible that it won’t be optimal to make use of finetuning.
We might think of finetuning as analogous to on-the-job training for the use-case at hand, and scaffolding as analogous to setting up a good management structure and organizational protocols. The analogy supports the idea that a combination of the two may be most effective.
4. Natural limits of language models
In Section 5 we’ll start to look at the impacts language models will have in the world as they are further developed and deployed. In order to facilitate that, in this section we’ll look at some natural limits on the kind of things language models are doing. We’ll be concerned with “what kind of outputs can they produce?”; questions of how fast they can produce those, or how they are integrated into society are of central importance for how much impact they end up having, but out of scope for what I want to explore in this section.
4.1 Approximating human capabilities, not superhuman capabilities
There’s a common argument about AI that goes roughly:
Foundation models have been rapidly advancing towards giving human-level responses to many different types of questions: they are rapidly approaching human-level at writing poetry, or explaining physics, or concocting recipes, in the sense that they are far closer to human level now than they were three years ago. Foundation models, however, are emulating human outputs. To the extent that they have human capabilities, they have these via emulation. So the argument doesn’t apply (at least in the straightforward way); rather, we should expect progress to slow down when the quality of their outputs is somewhere in the vicinity of (peak) human performance.
There are a couple of important caveats here:
4.2 Limited cognition per forward pass
To produce a single token, a language model makes a single forward pass over the neural net. To produce longer pieces of text, it repeatedly produces single tokens, with everything it’s produced so far added to the context.
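The token-by-token loop described here can be sketched in a few lines. This is a toy under stated assumptions: `toy_forward_pass` is a hypothetical lookup table standing in for a neural network, and the `<eos>` stop token and `generate` helper are illustrative names. What it shows is the structure: one call per token, with each output appended to the context before the next call.

```python
from typing import Callable, List

# Hypothetical stand-in for one forward pass: context tokens in, next token out.
ForwardPass = Callable[[List[str]], str]


def toy_forward_pass(context: List[str]) -> str:
    """Toy 'model': a fixed lookup on the last token only (not a real network)."""
    table = {"the": "cat", "cat": "sat", "sat": "down", "down": "<eos>"}
    last = context[-1] if context else ""
    return table.get(last, "<eos>")


def generate(prompt: List[str], forward: ForwardPass, max_tokens: int = 10) -> List[str]:
    """Produce text by repeating single forward passes."""
    context = list(prompt)
    for _ in range(max_tokens):
        token = forward(context)  # exactly one forward pass per new token
        if token == "<eos>":
            break
        context.append(token)  # the new token becomes part of the context
    return context


output = generate(["the"], toy_forward_pass)
print(" ".join(output))
```

Any multi-step "process" has to be spread across iterations of this loop, since each individual call does a fixed amount of work.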
Each forward pass amounts to something of similar complexity to multiplying together some large matrices. This gives lots of room for something like consulting an index and accessing stored knowledge, but relatively limited space for something like “thinking new thoughts”.
By analogy, when humans learn arithmetic they do it by a mix of rote memorization — many of us see “3x7” and instinctively know that the answer is “21” without calculating anything — and processes for calculating things (e.g. long division). Language models are structured in a way that can make them good at the rote memorization part, but they cannot in a single forward pass do a large amount of following a process.
This means that we can construct tasks that even very strong foundation models will predictably be weak at. e.g. —
WOmega probably gets this right most of the time. But WText probably gets it wrong almost all of the time. (Unless there are some heuristic tricks I’m unaware of. I’d be more confident in my example if it asked for prime factorizations.)
There are three important caveats here:
4.3 Missing cognitive moves?
Language models are capable of reproducing some types of ~atomic cognitive move that humans use. There may be others — at least at any given moment in time — that they cannot reproduce.
Reasons that they might not be able to reproduce a given cognitive move:
It’s worth being aware that there could be constraints from these on what language models can do, but that this might change as architectures improve or models become bigger. (Furthermore, it might be that at some stage — if not already — language models can make useful cognitive moves that humans are incapable of.)
Multimodal models
One concern might be that language models are only equipped to deal with things in language. How do multimodal models affect this picture? Multimodal language models are the same basic technology as language models, but they use encodings of non-text data into a kind of text to allow the models to interface with this non-text data. They can output non-text via the encoding if that’s the thing that the language model predicts will happen.
Multimodal language models are therefore able to interface with and think about non-text data. But they may (at least for now) be more likely to lack the correct architecture to reproduce the type of cognitive moves humans do with non-text data. However, language models could be augmented with various capacities by using scaffolding to give them access to interfaces which permit them to query other kinds of objects (e.g. image processing; running physics simulations).
5. Early major impacts of language models?
5.1 Principles for thinking about this
The main metaphor I use to think about this goes as follows:
Of course this metaphor isn’t perfect (and readers may want to think about its imperfections to critique the conclusions I draw from it), but I think it’s probably pretty good as a starting point. A major intuition that I have about that scenario — which I think is probably accurate about the actual situation with language models — is “wow, there’s a really big prize available here for whoever can figure out how to use these folks to do useful stuff”. And there will certainly be incentives to develop techniques to mitigate the obvious disadvantages of being drunk (e.g. via automated error checking).
A couple of people have mentioned to me another metaphor: a large force of interns. I think this is also good; it’s a little better in suggesting that by default they don’t know much about the task at hand, but a little worse in suggesting that they get their knowledge about the domain by looking things up rather than by half-remembering (or occasionally fabricating) facts, and in suggesting strategies like “identify the good interns” which don’t really translate over.
A quick note/aside on the economics:
OK, so that’s the groundwork. Now to think about what this could mean for where the transformative impacts come. Some observations:
5.2 Important early areas for automation
There are several categories of intellectual labour that I think might be automatable with language models and really important. Three of them together I think might change the world a lot — perhaps on a comparable scale to the industrial revolution, but probably not radically beyond that. In roughly increasing order of importance, they are:
5.3 The big two applications
More important than the preceding, I think there are two really important applications, which have the potential to radically reshape the world:
I think that these two categories are likely at least somewhat harder to get high quality automated work out of than expert advice or management. Why?
I’m not sure how big/thorny these obstructions are. The prizes from automating them are very high, so there will be a lot of pressure to find the paths of least resistance. e.g. even if the most efficient way for humans (and hypothetical ideal AI) to do work here is more like “stare into the void and then bring it back to the domain of language” rather than just doing all the reasoning at the verbal level, if there’s a way to get comparably good results by doing everything at the explicit verbal level and it’s just 100x slower, that could still be enough to get you something transformative.
High quality software engineering has some of the same obstructions, but because it’s so easy to get a high-quality success metric, we may expect self-play to help push model performance up to human-level and beyond relatively early. Research and executive capacity face issues with epistemic grounding: how can you be confident that one angle leads to better takes than another? We may ultimately need to rely on real-world feedback loops to help learn this, but they may be slow.
We should probably expect research and executive capacity to be partially automated (and so performed by centaurs, i.e. human–AI teams) before they’re fully automated. At minimum, many people in research and executive roles spend good fractions of their time on software or management tasks, so automating the latter would increase total capacity for the former.
6. Timelines and takeoffs
6.1 How quickly is all of this likely to happen?
My view is that for a lot of the pieces with significant societal impacts, the fundamental technology is already here. Over the next 5–10 years we might see people building and deploying systems which do a lot of stuff in the world, based on near-term-accessible language models. A lot of innovation will come from startups doing “X with AI”, for various applications X mostly providing expert advice or management services. They will often start by doing it in ways that have human oversight for quality control and training purposes, but reduce the degree of human oversight over time. By default the developers will make use of both finetuning and scaffolding — just hackily throwing stuff together to find out what works.
The vibe I’m imagining for this is something like the Industrial Revolution or the Wild West, not a nuclear arms race. This could be enough to create significant social unease, centred in the middle classes, as many people see their livelihoods threatened, and more feel uncomfortable with how fast everything is changing.
(If I’m wrong about them having big impacts over this timescale, it’s probably because of some important missing cognitive move which restricts their usability, perhaps something about reliability. But my guess is that these kinds of issues will turn out not to be a big problem, or will be surmountable given the scale of the prizes.)
We may see something more like a race for big-2 capabilities. Because if fully automated they can potentially be deployed at very large scale by a single actor (rather than quickly saturating demand), the incentives for a pure race could exist. However, I think it’s most likely that for a while centaurs will significantly outperform fully automated systems — if this is right then while there’s quite likely to be a race for full automation at some point, that would occur in a world which looks significantly transformed from the one we see today (where research has already been accelerated by centaur human-AI teams, and a lot of important planning in the world is done by humans aided by AI). The duration of this centaur period — especially how long we have in the “late centaur” period where efficiency of research is many times what it is today — could be important for determining how different that future world is.
I’m pretty unsure how far we are from ~full automation of big-2 capabilities. When I try to visualize future world trajectories and look for the most coherent ones, I think it’s most likely that this is somewhere in the range 5–15 years away; but I’m not confident in this. At the point where that process is really taking off I expect it will overtake the kind of broad societal impacts I’ve just been discussing, if it is not otherwise constrained.
6.2 Scaling language models towards superintelligence
Foundation models get their oomph from approximating human writing. They can approximate smart or knowledgeable humans (with the right prompts, or the right training corpus). But for getting significantly superhuman performance, they would need something else. What could that be?
Two techniques which might be helpful components:
6.2.1 Finetuning for superhuman task performance
For tasks with well-defined success metrics, simply training to do well on those tasks could produce superhuman performance. How quickly this will happen is likely to depend on the task. In the limit with a rich enough model space, enough training data, and enough training time, we might expect to end up approximating optimal performance (and hence exceeding human performance) at every task. But in practice performance on some tasks might be capped by what is achievable within the model space, and might face challenges in getting good enough data.
Still, finetuning for superhuman performance seems like an important part of the picture. At tasks like “write an argument which is persuasive to X audience”, where there is lots of data available on the reactions of that audience, we might expect language models to do pretty well pretty quickly (especially to the extent that persuasiveness is a function of local sentence choice and not larger-scale structures of how arguments fit together). At tasks like “give a winning chess move”, we can generate high quality synthetic data so that it’s likely that we can finetune model performance to exceed top human intuitive play. (Though note that within the confines of a single forward pass, the limit on cognition could prevent too much tree search through future game states, which could mean that performance still lags behind systems which are capable of tree search.)
For open-ended tasks like “build a company that will make a lot of money”, my guess is that for the near future we will be unable to provide enough data and train deep enough to get superhuman performance with finetuning alone.
6.2.2 Scaffolding for amplification via reflection
Humans are able to benefit from time to reflect. Our slow answers to questions are often better than our snap judgements. But often we don’t actually get the time to reflect, and do act on the basis of our snap judgements.
Since “thinking time” can be very cheap for language models, if they could similarly benefit from extra reflection time, this could help them to boost their task performance significantly above their non-reflective performance. And if their non-reflective performance is approximating human performance, their reflective performance could naturally be superhuman. (Albeit if this were the only mechanism for getting superhuman performance, it might be capped at “what groups of humans going slowly and carefully could do”.)
Scaffolding provides a toolset to help facilitate this reflection. The language models of today already benefit from extra thinking time — they perform better when prompted to think out loud, and scaffolding techniques like running things multiple times and taking a vote can improve performance.
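As a minimal sketch of the "run it multiple times and take a vote" idea, the snippet below wraps an arbitrary answer-sampling function in a majority vote. Everything named here is an assumption for illustration: `sample_answer` stands in for whatever call produces one model answer, and `make_stub_model` is a hypothetical stub that cycles through canned answers so the example runs without any real model.

```python
from collections import Counter
from typing import Callable, Iterable
import itertools


def majority_vote(sample_answer: Callable[[str], str], prompt: str, n_samples: int = 5) -> str:
    """Sample the model n_samples times and return the most common answer."""
    if n_samples < 1:
        raise ValueError("n_samples must be at least 1")
    answers = [sample_answer(prompt) for _ in range(n_samples)]
    # most_common(1) returns [(answer, count)]; ties go to the answer seen first.
    winner, _count = Counter(answers).most_common(1)[0]
    return winner


def make_stub_model(canned_answers: Iterable[str]) -> Callable[[str], str]:
    """Hypothetical stand-in for a language model: cycles through fixed answers."""
    cycle = itertools.cycle(canned_answers)
    return lambda prompt: next(cycle)


# A noisy "model" that is right three times out of five.
noisy_model = make_stub_model(["42", "41", "42", "42", "40"])
result = majority_vote(noisy_model, "What is 6 * 7? Think step by step.", n_samples=5)
print(result)
```

The vote only helps when errors are scattered across different wrong answers while correct samples agree with each other; a model that is consistently wrong in the same way gains nothing from it.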
6.3 Recursive improvement and takeoff
An intelligence explosion based on language models would need a mechanism for recursive improvement — something that could repeatedly ratchet towards better performance, where improved performance would help with the next round of improvements.
6.3.1 Reflection-based takeoff
If more thinking time leads to better takes in a relatively unbounded way, this could be a mechanism for takeoff. The key threshold here is not “does performance increase with extra thinking time?” (a bar that language models already clear), but “can performance scale ~arbitrarily far with extra thinking time?” (a bar that humanity as a whole probably crosses, but the language models of today probably don’t).
Even if this bar is crossed, improvement isn’t automatically recursive. But if we know how to use extra compute to produce superhuman performance, we can then use that to construct new data sets to be approximated. These could be used as part of finetuning, or even to build new text corpora, which represent (initially modestly) superhuman levels of intelligence.
This, then, could be iterated. The hope would be that reflection by systems which are approximating smarter answers will be more effective, and lead to yet smarter answers. The system could gradually bootstrap its way to strong superintelligence — essentially continuing the process whereby 21st Century humans are in many ways meaningfully smarter than 11th Century humans.
I say “gradually”, but with large enough amounts of compute this process could potentially play out quickly. Here’s some hacky first-pass analysis:
Still, overall I think this could be thought of as something like “the slow, boring path to superintelligence”. Perhaps it will be the first one that works. But I think there’s a good chance that some other things will help it to move faster.
6.3.2 Scaffolding-based takeoff
It’s unclear what the performance returns to better scaffolding will look like. At least right now, it seems like nobody has invested that much in building good scaffolding (compared to the investments in building good foundation models), so there might be low-hanging fruit remaining.
How good can scaffolding ever get? One thought is that perhaps a given foundation model has something like a level of “latent potential”, and ideal scaffolding unlocks that but never exceeds it. However, with the right scaffolding one could reimplement an arbitrary GOFAI; while wildly impractical, this is a thought experiment which demonstrates that there is no natural ceiling on capabilities imposed by the foundation model.
Scaffolding is a language-based construction, so language models could plausibly learn how to contribute to better scaffolding (which can then be experimented with, and could recursively feed into further improvements to scaffolding). We are therefore interested in a question like “what is the returns curve to investment on improving scaffolding?”, which is an empirical question. For some possible shapes of the curve, improvements to scaffolding could precipitate an intelligence explosion, gathering pace faster and faster as successive generations of scaffolding are more effective than the last at further improving the scaffolding. My guess is that the parameters don’t quite shake out that way, but this feels very guesswork-y for such an important parameter.
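To make the "shape of the returns curve" question concrete, here is a toy model. It is purely illustrative: the functional form (each round's gain equals `step * quality ** returns_exponent`) and all parameter values are assumptions chosen for simplicity, not estimates of anything real. With an exponent above 1, each generation of scaffolding improves things proportionally more than the last; at exactly 1, growth is steady; below 1, it fizzles.

```python
from typing import List


def simulate_scaffolding_rounds(returns_exponent: float, rounds: int = 10,
                                initial_quality: float = 1.0, step: float = 0.1) -> List[float]:
    """Toy model (assumed form): each round's gain is step * quality ** returns_exponent."""
    quality = initial_quality
    history = [quality]
    for _ in range(rounds):
        quality += step * quality ** returns_exponent
        history.append(quality)
    return history


def growth_rates(history: List[float]) -> List[float]:
    """Proportional improvement from each round to the next."""
    return [later / earlier - 1.0 for earlier, later in zip(history, history[1:])]


accelerating = growth_rates(simulate_scaffolding_rounds(returns_exponent=1.5))
steady = growth_rates(simulate_scaffolding_rounds(returns_exponent=1.0))
fizzling = growth_rates(simulate_scaffolding_rounds(returns_exponent=0.5))

for label, rates in [("r=1.5", accelerating), ("r=1.0", steady), ("r=0.5", fizzling)]:
    print(label, "first round:", round(rates[0], 3), "last round:", round(rates[-1], 3))
```

In this sketch the whole question of whether improvement gathers pace is carried by the single exponent, which is exactly the kind of parameter that would have to be measured empirically.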
6.3.3 Finetuning-based takeoff
I’m hazier on the details of how this would play out (and a bit sceptical that it would enable a truly runaway feedback loop), but more sophisticated systems could help to gather the real-world data to make subsequent finetuning efforts more effective.
6.3.4 Mixed takeoffs
Perhaps most likely is that there is no single silver bullet, but takeoff contains elements of all of these processes, and others, blended together in a vortex of increasing speed. For example, as well as improved scaffolding feeding into improved reflection which can help with the next generation of scaffolding, improvements in AI performance could help to accelerate developments in chip fabrication, so that there are greater amounts of compute available to help this process run more quickly.
This should be faster than what we would get out of any single mechanism. The main reason we wouldn’t see such a mixed takeoff is if one of the components is individually so fast that it leaves everything else behind.
One possibility that arises as part of a mixed takeoff is using machine learning to optimize for the most effective scaffolding. I’ll discuss this further in a later section (on the bitter lesson).
6.3.5 Systems not built on language models
I’ve been considering recursive improvement for language models. But the general arguments for an intelligence explosion don’t assume anything like the particular form of language models. Whether or not an intelligence explosion based on language models is possible, it’s likely the case that an intelligence explosion based on other forms of AI technology will eventually be possible. (And the argument about things which exceed human level rapidly blowing past human level is more likely to directly apply to such technologies.)
Could this matter? Yes, in two possible worlds:
7. Language model agents and transparency
7.1 Where does agency come from?
Suppose we have an agent-like system built out of language models. The foundation models themselves weren’t agent-like. So where could the agency have “come from”?
I think the answer will be one, or a combination, of three possibilities:
I think we should have quite different attitudes towards these, from an AI safety perspective.
1) seems like mostly a sideshow. While we could get agency from this, unless people are trying hard I don’t think it would tend to find especially competent agents to emulate, and the resulting agents may not have a good handle on what’s going on in the world.
2) seems scary. This is the classic case of mesa-optimization. By default I’d think we should expect not to really understand the goals of agents that have been selected for this way. There may be clever work that could be done to ensure things are safe, but this is the kind of story that makes AI risk seem large and thorny.
3) seems promising. An agent built in this way would come with a massive amount of transparency-by-construction:
This is probably a vast volume of thoughts to handle, but everything is in a very legible form and we can probably take steps to automate oversight. In general: all the normal reasons people are keen on transparency make it seem like a great idea to try to use architecture which is extremely transparent. (This includes both wanting transparency to facilitate long-term AI safety, and wanting transparency to enable auditing of AI applications in the shorter term.)
In practice things may often use a combination of these. And a combination could be concerning: if we have top-level agency coming from 2), then we’re less able to trust the transparency from 3), since the system might have incentives to misrepresent its own thoughts.
7.2 Strategy: avoid selection pressure for agency
A lot of putative safety techniques are built around assuming that we have something potentially dangerous and catching it. I think these are well worth investing in (defence in depth seems valuable), but as a complementary strategy I’m pretty attracted to the idea that we should build systems where we have reason to believe that they shouldn’t have anything dangerous going on.
In the case of language model agents, this means: I think we should avoid any intensive search/selection processes towards high-level effectiveness of agents towards particular tasks. So far as possible we should aim for high-level agency to enter explicitly via scaffolding, and not via anything else.
Tentatively, I think this would mean:
Of course there’s a whole research agenda here. But I think that the basic point is straightforward and might be quite important to have broadly understood. I think this is an area where humanity by default makes systems which are selected to have agency (because we just try everything and see what works), but because the alternative of introducing agency via scaffolding is a pretty good substitute, it might be within political reach to build norms which exclude the problematic type of selection.
7.3 The bitter lesson?
Richard Sutton’s “bitter lesson” from 70 years of AI research is that building knowledge into AI agents may help in the short term but in the longer term is consistently overtaken by general-purpose methods that make use of more computation. This raises a couple of concerns about maintaining transparency:
Essentially, one might think that even if early scaffolded agents are more transparent, these will be obsoleted by more sophisticated AI which does end-to-end training for effectiveness over the entire system (including the scaffolding).
I take this concern to have some bite. I do think that a scaffolded agent which was purely optimized would be unlikely to have transparent internals. Nonetheless there are a few reasons why I don’t think the bitter lesson means that hopes for transparency are necessarily doomed:
8. Risks & strategies
8.1 A rough taxonomy of risks
There are several different points which might be dangerous. Here’s one way of slicing things up:
I could offer views about the relative degree of existential risk posed by these, or the degree to which we should be prioritizing them (where these come apart because we may have disproportionate leverage over some). But I’m really not very confident in my relative assessment, and I’m much more confident in a meta-level take, so I’ll restrict myself to that:
I think that all of these risks (and it’s quite possible I’m missing some) are potentially grave. I wouldn’t currently feel comfortable assigning less than 1% risk of existential catastrophe to any of them — easily enough that if correct it would justify massive attention to address.
I also think that the actions people should take to understand and mitigate the various risks are likely to differ significantly. I therefore think that it should be a significant priority to better characterize the various risks, to assess how large they are in absolute terms, and to produce plans which are targeted specifically at reducing that risk. This can then feed into better prioritization of actions across the space — it’s likely that we should have a portfolio which includes work well-targeted at a number of these different risks.
8.2 Example strategies for mitigating the different risks
Here are some brainstormed thoughts on strategies for the various things here, to start things off. Take them or leave them.
8.3 Thoughts on tactical implications
I’m not at all confident what people who are concerned about navigating AI well should be doing. But I feel that the current portfolio is over-indexed on work which treats “transformative AI” as a black box and tries to plan around that. I think that we can and should be peering inside that box.
I’d like to better understand the plausibility of the kind of technological trajectory I’m outlining. I’d like to develop a better sense of how the different risks relate to this. And I’d like to see some plans which step through how we might successfully navigate the different phases of this technological development. I think that this is a kind of zoomed-in prioritization which could help us to keep our eyes on the most important balls, and which we haven’t been doing a great deal of.