Previously, I argued that we should expect future ML systems to often exhibit "emergent" behavior, where they acquire new capabilities that were not explicitly designed or intended, simply as a result of scaling. This was a special case of a general phenomenon in the physical sciences called More Is Different.
I care about this because I think AI will have a huge impact on society, and I want to forecast what future systems will be like so that I can steer things to be better. To that end, I find More Is Different to be troubling and disorienting. I’m inclined to forecast the future by looking at existing trends and asking what will happen if they continue, but More Is Different says we should instead expect new qualitative behaviors to arise all the time that are not extrapolations of previous trends.
Given this, how can we predict what future systems will look like? For this, I find it helpful to think in terms of "anchors"---reference classes that are broadly analogous to future ML systems, which we can then use to make predictions.
The most obvious reference class for future ML systems is current ML systems---I'll call this the current ML anchor. I think this is indeed a pretty good starting point, but we’ve already seen that it fails to account for emergent capabilities.
What other anchors can we use? One intuitive approach would be to look for things that humans are good at but that current ML systems are bad at. This would include:
- Mastery of external tools (e.g. calculators, search engines, software, programming)
- Very efficient learning (e.g. reading a textbook once to learn a new subject)
- Long-term planning (e.g. being able to successfully achieve goals over months)
Models sufficiently far in the future will presumably have these sorts of capabilities. While this still leaves unknowns---for instance, we don't know how rapidly these capabilities will appear---it's still a useful complement to the current ML anchor. I'll call this the human anchor.
A problem with the human anchor is that it risks anthropomorphizing ML by over-analogizing with human behavior. Anthropomorphic reasoning correctly gets a bad rap in ML: it's intuitively persuasive, but its track record is mixed at best. This isn't a reason to abandon the human anchor, but it means we shouldn't be entirely satisfied with it.
This brings us to a third anchor, the optimization anchor, which I associate with the "Philosophy" or thought experiment approach that I've described previously. Here the idea is to think of ML systems as ideal optimizers and ask what a perfect optimizer would do in a given scenario. This is where Nick Bostrom's colorful description of a paperclip maximizer comes from, where an AI asked to make paperclips turns the entire planet into paperclip factories. To give some more prosaic examples:
- The optimization anchor would correctly predict imitative deception (Lin et al., 2021), since a system optimized to produce high-probability outputs has no intrinsic reason to be truthful.
- It also would observe that power-seeking is instrumentally useful for many different goals, and so predict that optimal policies (as well as sufficiently powerful neural networks) will tend to seek power (Turner et al., 2021); a toy sketch of this tendency follows the list.
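To make the power-seeking claim concrete, here is a toy Monte Carlo sketch. It is not the setup from Turner et al., just an illustration of the flavor of their result: in a small deterministic environment where reward lives only on terminal states, the first action that keeps more terminal states reachable is optimal for most randomly drawn reward functions.

```python
import random

# Toy deterministic environment: from the start state the agent either moves to
# "hub", which keeps three terminal states reachable, or to "dead_end", which
# leads to a single terminal state. Reward lives only on terminal states.
REACHABLE = {
    "hub": ["t1", "t2", "t3"],   # option-preserving ("power-seeking") action
    "dead_end": ["t4"],          # committal action: one option left
}
TERMINALS = ["t1", "t2", "t3", "t4"]

def optimal_first_action(reward):
    """First action of an optimal policy: pick whichever move leads to the
    best achievable terminal reward."""
    best = {a: max(reward[t] for t in ts) for a, ts in REACHABLE.items()}
    return max(best, key=best.get)

random.seed(0)
n_trials = 10_000
hub_wins = sum(
    optimal_first_action({t: random.random() for t in TERMINALS}) == "hub"
    for _ in range(n_trials)
)
# With 3 of the 4 terminal states behind "hub", the option-preserving action is
# optimal for roughly 75% of sampled reward functions.
print(f"optimal policy goes to hub in {hub_wins / n_trials:.1%} of reward draws")
```

The exact numbers don't matter; the point is that "keep your options open" falls out of optimization over a wide range of goals without being programmed in.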
Ideas produced by the optimization anchor are often met with skepticism: they contradict the familiar current ML anchor, and they don't benefit from the intuitive appeal of the human anchor. But the differences from these other two anchors are precisely what make the optimization anchor valuable. If you (like me) feel that both the current ML and human anchors paint an incomplete picture, then you should want a third independent perspective.
The optimization anchor does have limitations. Since it abstracts ML into an ideal optimizer, it ignores most on-the-ground facts about neural networks. This can lead to underconstrained predictions, and to ignoring properties that I think will be necessary for successfully aligning ML systems with humans. I'll say more about this later, but some particularly important properties are that neural networks often generalize in "natural" ways, that we can introspect on network representations, and that training dynamics are smooth and continuous. Researchers focused on the optimization anchor don't entirely ignore these facts, but I think they tend to underemphasize them and are overly pessimistic as a result.
The Value of Thought Experiments
The optimization anchor points to the value of thought experiments more generally. While it poses the thought experiment of "What if AI were a perfect optimizer?", there are many other thought experiments that can provide insights that'd be hard to obtain from the ML or human anchors. In this sense thought experiments are not a single anchor but a generator for anchors, which seems pretty valuable.
One thought experiment that I particularly like is: What happens if most of an agent's learning occurs not during gradient descent, but through in-context learning[1]? This is likely to happen eventually, as ML agents are rolled out over longer time horizons (think artificial digital assistants) and as ML improves at in-context learning. Once this does happen, it seems possible that agents' behavior will be controlled less by the "extrinsic" shaping of gradient descent and more by whatever "intrinsic" drives they happen to have[2]. This also seems like a change that could happen suddenly, since gradient descent is slow while in-context learning is fast.
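To make the contrast concrete, here is a minimal sketch using GPT-2 via Hugging Face's transformers library purely as a stand-in; the tiny arithmetic "task" and hyperparameters are illustrative assumptions, not a claim about how future agents will be built.

```python
# Two channels through which a model can "learn", sketched with GPT-2 as a
# stand-in (pip install torch transformers).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

examples = ["Q: 2+2? A: 4", "Q: 3+5? A: 8"]

# Channel 1: "extrinsic" shaping via gradient descent -- the weights change,
# one small step per batch of experience.
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)
batch = tokenizer("\n".join(examples), return_tensors="pt")
loss = model(**batch, labels=batch["input_ids"]).loss
optimizer.zero_grad()
loss.backward()
optimizer.step()

# Channel 2: in-context learning -- the weights stay fixed; the same examples
# are simply placed in the prompt, and any adaptation happens within the rollout.
prompt = "\n".join(examples) + "\nQ: 4+4? A:"
inputs = tokenizer(prompt, return_tensors="pt")
with torch.no_grad():
    out = model.generate(**inputs, max_new_tokens=3,
                         pad_token_id=tokenizer.eos_token_id)
print(tokenizer.decode(out[0]))
```

The asymmetry is the point: the first channel changes the model slowly and between rollouts, while the second acts immediately and entirely within a single rollout.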
It would be great if we had a community of researchers constructing thought experiments with clearly stated assumptions, explaining in detail the consequences of those assumptions, and ideally connecting them to present-day research.
Other Anchors
There are many other anchors that could be helpful for predicting future ML systems. Non-human animal behavior could provide a broader reference class than humans alone. Evolution and the economy are both examples of powerful, distributed optimization processes. I am most excited about better understanding complex systems, which include biological systems, brains, organizations, economies, and ecosystems, and thus subsume most of the reference classes discussed so far. It seems to me that complex systems have received little attention relative to their germaneness to ML. Indeed, emergence is itself a concept from complex systems theory that is useful for understanding recent ML developments.
Limitations of Thought Experiments
I've focused so far on predicting problems that we need to address. But at some point we actually have to solve the problems. In this regard thought experiments are weaker, since while they often point to important big-picture issues, in my view they fare poorly at getting the details right, which is needed for engineering progress. For instance, early thought experiments considered a single AI system that was much more powerful than any other contemporary technologies, while in reality there will likely be many ML systems with a continuous distribution of capabilities. More recent thought experiments impose discrete abstractions like "goals" and "objectives" that I don’t think will cleanly map onto real ML systems. Thus while thought experiments can point to general ideas for research, even mapping these ideas to the ontology of ML systems can be a difficult task.
As a result, while we can't blindly extrapolate empirical trends, we do need a concerted empirically-based effort to address future ML risks. I'll explain why I think this is possible in a later post, but first I'll take us through an example of "taking a thought experiment seriously", and what it implies about possible failure modes of ML systems.
[1] In-context learning refers to learning that occurs during a single "rollout" of a model. The most famous example is GPT-3's ability to learn new tasks after conditioning on a small number of examples.
[2] While this statement borders on anthropomorphizing, I think it is actually justified. For instance, depending on the training objective, many agents will likely have a "drive" towards information-gathering, among others.
Comments

Yeah, my take is that there are humans with the kinds of goals and motivations that we'd be OK with an AGI having, and there are also humans with the kinds of goals and motivations that we absolutely don't want our AGIs to have. Therefore "AGIs with human-like motivation systems" is a potentially promising area to explore, but OTOH "We'll build an AGI with a human-like motivation system, and then call it a day" is not sufficient; we need to do extra work beyond just that.
Well, yes, people will probably try, but that doesn't mean they'll succeed :-P
I think this claim is mixing up evolution versus within-lifetime learning, or else maybe I just disagree with you.
To grossly oversimplify, my model is: "motivations are mostly built by evolution, intelligence is mostly built by within-lifetime learning".
So the evolution-of-humans model would be: there's an outer-loop selection process not only selecting the within-lifetime neural architectures and hyperparameters and learning rules etc. but also selecting the within-lifetime reward function, and then the within-lifetime learning happens in an inner loop, or in ML terminology, as a single super-long episode with online learning. And over the course of that episode the agent gradually learns all about itself and the world, and winds up pursuing goals. The within-lifetime reward function is not exactly the same as those goals, but is upstream of those goals and closely related. (I'm oversimplifying in lots of ways, but I think this is spiritually correct.)
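To spell out that two-loop structure, here's a schematic sketch; everything in it is a toy stand-in (a two-action environment, a linear "innate" reward parameterization, and a crude mutation-based outer search), not a model of actual evolution:

```python
import random

def lifetime(reward_weights, n_steps=200):
    """Inner loop: one long online-learning episode (a 'lifetime').
    The agent learns action values under its innate reward function,
    which stays fixed for the whole lifetime."""
    q = [0.0, 0.0]                 # learned value estimates for two actions
    fitness = 0.0
    for _ in range(n_steps):
        explore = random.random() < 0.1
        a = random.randrange(2) if explore else max(range(2), key=lambda i: q[i])
        outcome = random.gauss(0.5 if a == 0 else 0.7, 0.1)  # environment feedback
        r = reward_weights[a] * outcome        # innate ("genetic") reward shaping
        q[a] += 0.1 * (r - q[a])               # within-lifetime online update
        fitness += outcome                     # what the outer loop selects on
    return fitness

def outer_loop(generations=30, pop_size=20):
    """Outer loop: blind evolutionary search over innate reward functions."""
    pop = [[random.random(), random.random()] for _ in range(pop_size)]
    for _ in range(generations):
        survivors = sorted(pop, key=lifetime, reverse=True)[: pop_size // 2]
        pop = [[w + random.gauss(0, 0.05) for w in random.choice(survivors)]
               for _ in range(pop_size)]
    return max(pop, key=lifetime)

random.seed(0)
print("evolved innate reward weights:", outer_loop())
```

The thing to notice is that the outer loop only ever touches the innate reward weights, while what the agent actually ends up doing is shaped within the lifetime by online learning under that reward function; the two are related but not identical.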
Example: Suppose a human is trying to stay alive. There are two ways that this might have happened. The first story is that the outer-loop (evolutionary) selection has endowed humans with a within-lifetime reward function that (by and large) tends to make humans have drives related to staying alive, as an end in itself. ("I dunno, I just love being alive", they say.) The second story involves the inner loop, instrumental convergence, and means-end reasoning: Imagine a person saying "I'm old and tired and sick, but dammit I refuse to die until I finish writing my novel!"
I think when people say "human values", they're mainly talking about the genetically-endowed drives that come from the within-lifetime reward function, not the instrumental goals that pop up via means-end reasoning. So the question is: would the AGI's "innate" reward function for within-lifetime learning resemble a human's?
Well, IF we copy evolution, and select our AGI's within-lifetime reward function by a blind outer-loop search, I think we're likely to wind up with AGIs whose "innate" drives include things like curiosity and self-preservation and power-seeking, which are also recognizable as human innate drives. I think this is the point you were trying to make. (Sorry if I'm misunderstanding.)
But there are two key points.
First, I think this is a terrible way to make a within-lifetime reward function, and we don't have to do it that way, and we probably won't. The within-lifetime reward function is the single most important ingredient in AGI safety and alignment. The last thing I would do is pick it by a blind search. And I don't think we need to. I propose that we should write the within-lifetime reward function ourselves, by hand. After all, ML researchers hand-code reward functions all the time, and the agents do just fine. Even when the reward function is picked based on the whims of some ancient Atari game developer, the agents still usually do fine. Putting curiosity into the reward function helps, but we already know how to add curiosity to a reward function; we can copy the formula from arXiv, and we don't need a blind search over within-lifetime reward functions.
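For concreteness, here is the sort of recipe I mean: a minimal sketch of a prediction-error curiosity bonus in the spirit of ICM (Pathak et al., 2017), with a toy linear forward model and a made-up rollout standing in for the real thing.

```python
import numpy as np

rng = np.random.default_rng(0)

# Tiny learned forward model: predicts the next observation from the current
# one with a linear map, trained online by gradient descent on squared error.
W = np.zeros((4, 4))

def curiosity_bonus(obs, next_obs, lr=0.1):
    """Intrinsic reward = the forward model's squared prediction error, so
    poorly-predicted (novel) transitions are rewarded."""
    global W
    error = next_obs - W @ obs
    W += lr * np.outer(error, obs)   # online forward-model update
    return float(error @ error)

def shaped_reward(extrinsic, obs, next_obs, beta=0.1):
    # Hand-written reward function: task reward plus a weighted curiosity term.
    return extrinsic + beta * curiosity_bonus(obs, next_obs)

# Toy rollout: the environment roughly flips the observation each step, so the
# bonus starts high and shrinks as the forward model learns the pattern.
obs = rng.normal(size=4)
for _ in range(5):
    next_obs = -obs + rng.normal(scale=0.01, size=4)
    print(round(shaped_reward(0.0, obs, next_obs), 4))
    obs = next_obs
```

The bonus is large for transitions the forward model can't yet predict and decays as the model improves, which is the standard way a hand-written reward function gets "curiosity" without any outer-loop search.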
Second, even if we did pick the AGI's within-lifetime reward function by an evolution-like outer-loop search, the AGI probably wouldn't wind up with human-like social instincts. That's important—I think human social instincts underlie pretty much all of human moral intuitions, and spill into practically every other part of human behavior. I suppose you could get some kind of AGI social instincts by doing your outer-loop search in a simulation where multiple AGIs can work together and have iterated prisoners' dilemmas etc. But social interactions per se are insufficient—human-like social instincts are different from wolf-like social instincts or bonobo-like social instincts etc. I think there are a lot of possible equilibria, and which one you get depends on idiosyncrasies of the agents' social environment and power dynamics (details). Also, even if you somehow nailed it, you would get AGIs with human-like social instincts towards the other AGIs in the simulation, and not necessarily towards humans.