Epistemic status: shower thoughts

This is crossposted from my personal blog.


The alignment problem is not new. Humanity and evolution have been grappling with the fundamental core of alignment -- making an agent optimize for the beliefs and values of another -- for the entirety of history. Any time anybody tries to get multiple people to work together in a coherent way to accomplish some specific goal, there is an alignment problem. People have slightly differing values, goals, thoughts, information, circumstances, and interests, and this heterogeneity prevents full cooperation and inevitably leads to the panoply of failure modes studied in economics. These include principal-agent problems, exploitation of information asymmetries, internal coalition politics, the 'organizational imperative', and so on. While the human alignment problem is very difficult and far from being solved, humanity has developed a suite of 'social technologies' which can increase effective alignment between individuals and reduce these coordination costs. Innovations such as states and governments, hierarchical organizations, property rights, standing armies, democracy, constitutions and voting schemes, laws and courts, money and contracts, capitalism and the joint-stock corporation, organized religion, ideology, propaganda, etc. can all be thought of as methods for tackling the human alignment problem. All of these innovations enable the behaviour and values of increasingly large numbers of people to be synchronized and pointed at some specific target. A vast amount of human progress and power comes not just from our raw IQ but from these social 'alignment' technologies, which provide the concentration of resources and slack needed for technological progress to take place.

In AI alignment, we face a very similar problem, one which is easier in some ways but harder in others. It is easier because, while with human alignment we have to build structures that can handle existing human minds solely through controlling (some of) their input, in AI alignment we have direct control over the construction of the mind itself: its architecture, the entirety of its training data, its training process, and, with perfect interpretability tools, the ability to monitor exactly what it is learning and how, and to directly edit and control its internal thoughts and knowledge. None of these abilities are (currently) possible with human minds [1]. In theory, this should mean that it is possible to align AI intelligences much better than we can align humans, and that our abilities should scale much further than current social technology. However, it is also harder because we have to align AI systems across much larger capability differentials than exist between humans. This means that if we expect AIs to reach extremely high levels of intelligence and capability, and we remain stuck with current human alignment methods such as markets and governments, then we are likely doomed [2], since these approaches often rely on competitive equilibria existing between agents and break down in the limit of one agent becoming extremely powerful.

Importantly, at the macro scale, these coordination costs caused by alignment failures determine the shape of civilization and history. Whenever there has been an improvement to alignment, such as the first creation of governments and laws, written language and then the printing press, rapid communication technology such as telegraphy and radio, or the development of centralized bureaucratic governments with large standing armies, we have seen periods of upheaval, with the better alignment technology spreading out and eventually 'conquering' regions without it (either through direct conquest or through those regions adopting the technology to compete). Moreover, our level of alignment ability determines the maximum size of large-scale entities that can be supported [3]. We see this all the time historically. With better alignment technology, larger and more powerful states and organizations can exist stably. Conversely, large and powerful empires almost always collapse due to internal politics (i.e. coordination costs) and not (directly) due to an external threat.

An identical story plays out in economics. We almost always see that the death or obsolescence of existing large firms is caused by internal decay, leading to slow stagnation and an eventual slide into irrelevancy, rather than a direct death due either to competitors or to an abrupt technological change. This is also analyzed explicitly in organizational economics, and was first introduced by Coase in his 'theory of the firm'. He asked: given that decentralized markets can coordinate economic activity through prices, why do we organize economic activity into firms, which are coherent economic entities whose internals are not determined by market mechanisms, at all? Why is the economy not organized with everyone acting as an independent economic agent contracting out their services on the market? The answer he famously proposed is 'transaction costs'. That is, there are fixed costs to contracting all work out to individual contractors, such as the administrative overhead of managing this, search costs, legal costs, and fundamental issues with information asymmetry. These costs mean it is more efficient to vertically integrate many functions into a combined economic unit which functions internally as a command economy.

But the converse question could instead be posed, and is in fact more fundamental: why do we have a decentralized market with many firms at all? Why is the whole economy not just organized into one gigantic 'firm'? The fundamental reason, again, is coordination costs. Theoretically, decentralized economic prices provide useful credit-assignment signals to firms as to whether what they are doing is positive-sum (profits) or negative-sum (losses). Without such corrective signals, huge misallocations of resources can appear and continue indefinitely until the surrounding structure collapses. Moreover, these misallocations are not random but created by standard coordination costs such as imperfect information and principal-agent problems. The largest costs are likely caused by principal-agent problems [4]. Essentially, the agents making up the larger 'entity', such as an organization or government, act in their own selfish interest rather than in the global interest of the higher-level entity. This is essentially a kind of organizational cancer. Without any corrective price signals, there is nothing stopping internal individuals or coalitions from siphoning resources away from productive uses and towards enriching themselves or building up their own internal power base. Since they siphon resources to themselves, they will typically out-resource, and hence outcompete, internal competitors who are aligned and therefore spend large amounts of their own resources on the actual organizational mission. In a competitive equilibrium, the growth of these internal cancers is constrained by the need to stay competitive with external adversaries (in actuality, due to delayed feedback, we typically see overshoot and then collapse), and this eventually leads to the stagnation and death of organizations, which are replaced by more internally aligned competitors with less-developed cancers, which in turn eventually succumb to the same malady. However, without any competitive pressure at all, these cancers can grow and grow until almost all resources are consumed by them and hardly anything is spent on the putative organizational mission.
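To make this dynamic concrete, here is a minimal toy simulation (my own sketch with made-up numbers, not a model from the economics literature): aligned members spend a fraction of their resources on the organizational mission, while misaligned members reinvest everything in their own faction, so in the absence of any corrective signal the misaligned share compounds and mission output decays towards zero.

```python
# Toy 'organizational cancer' model (hypothetical numbers): aligned members spend
# part of their resources on the mission and so grow more slowly than misaligned
# members, who keep everything for their own faction. With a fixed budget and no
# corrective signal, the aligned share and the mission output both decay.

def simulate(periods, mission_spend=0.3, budget=100.0):
    aligned, output = 0.9, 0.0  # aligned share of internal resources, mission output
    for _ in range(periods):
        output = mission_spend * aligned * budget      # work actually done on the mission this period
        aligned_kept = aligned * (1 - mission_spend)   # aligned members 'lose' what they spend on the mission
        misaligned_kept = 1 - aligned                  # misaligned members keep everything
        aligned = aligned_kept / (aligned_kept + misaligned_kept)
    return aligned, output

for t in (1, 10, 50, 200):
    share, out = simulate(t)
    print(f"after {t:>3} periods: aligned share {share:.2f}, mission output {out:.1f}")
```

In a competitive setting, the loop would be closed by letting the budget depend on mission output relative to rivals, which is what keeps the cancer partially in check; here the budget is fixed to show the uncorrected case.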

The fact that similar phenomena and effects show up time and time again in all kinds of different circumstances points to coordination costs being fundamental. Moreover, these costs appear to be the main limiting constraint on the scale of a single 'entity' such as a firm, an organization, or an empire. The purely 'physical' returns to scale are pretty much always positive. These give rise to positive feedback loops like more territory -> more resources -> larger armies -> can conquer more territory. These feedback loops could, if unhindered by internal coordination costs, continue indefinitely until the entire universe is conquered by a single unified entity. This is the archetypal singleton [5]. The negative returns to scale that prevent this from happening come in the form of coordination costs. However, alignment technology is fundamentally about reducing these coordination costs. If we assume that there is some unknown but fixed alignment 'tech tree', and that at equilibrium a superintelligence or a universe of superintelligences will have maxed out this tech tree, then we can quite straightforwardly see that the maximum level of alignment, or the minimum coordination cost possible relative to scale, will determine the maximum size of coherent 'entities' in the future, and hence the long-term equilibrium distribution of agents in the universe.
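As a crude illustration of this argument (my own toy numbers, purely hypothetical): if physical returns grow superlinearly with scale but coordination costs grow faster, net output peaks at some finite size, and better alignment technology, modelled here as a smaller coordination-cost coefficient, pushes that maximum viable size outward.

```python
# Toy model of maximum entity size: superlinear physical returns to scale minus
# faster-growing coordination costs. Lower cost coefficients (better 'alignment
# tech') shift the peak of net output to much larger scales. All numbers are
# made up for illustration.

import numpy as np

scale = np.logspace(0, 8, 400)   # candidate entity sizes (arbitrary units)

def net_output(scale, coord_cost_coeff, returns_exp=1.1, cost_exp=1.5):
    physical = scale ** returns_exp                       # positive physical returns to scale
    coordination = coord_cost_coeff * scale ** cost_exp   # coordination costs grow faster with scale
    return physical - coordination

for coeff in (0.1, 0.01, 0.001):   # each step = an order-of-magnitude improvement in alignment tech
    best_size = scale[np.argmax(net_output(scale, coeff))]
    print(f"cost coefficient {coeff}: net output peaks at size ~{best_size:,.0f}")
```

Nothing hinges on the particular exponents; the point is only that the crossover scale, and hence the maximum coherent entity, is set by how slowly coordination costs grow relative to physical returns.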

The fundamental constraint in the universe is communication time -- that is, distance. Assuming no FTL technology is possible, colonizing and exploiting the full volume of our lightcone will require keeping 'copies' of our civilization aligned despite communication lags of billions of years. This essentially requires perfect 'zero-shot' alignment, meaning that we can create agents which exactly and perfectly do what we want despite being completely independent superintelligences for an arbitrarily long amount of time, even as they accumulate vast amounts of resources such as entire galaxies' worth of matter and energy. Any entity, whether a paperclipper or some ensemble version of our civilization, faces this fundamental issue. Hence, whether the universe is ultimately controlled by one unified entity, or by an incredible number of entities with diverse values (even if originating from a single entity before alignment breakdown), depends on how successful alignment technology can ultimately be.

If complete zero-shot alignment is possible, then, given the positive physical returns to scale and the lack of any countervailing coordination costs, the universe will eventually come to be dominated by a single unified entity. Whether this entity pursues some arbitrary goal or is just a pure power-seeker depends on how competitive the equilibrium it arose from was. On the other hand, if zero-shot alignment is not possible, and divergence of values and goals is inevitable without correction, which cannot take place across astronomical distances, then we inevitably end up with a diverse universe of 'misaligned' agents, each pursuing separate goals in their own separate regions of the universe. The volume of the region controlled by each entity will then depend essentially on how rapidly divergence and misalignment set in among copies of the original agent as it expands.
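A toy way to picture that last point (my own sketch, with invented units): if each successive generation of copies drifts by a small random amount and no correction can reach them in time, the radius an entity can keep value-coherent is roughly the distance copies travel before their accumulated drift exceeds some tolerance.

```python
# Toy model of value drift among expanding copies: each 'hop' outward adds a small
# random perturbation to a copy's values, with no correction along the way. The
# coherent radius is the average distance at which accumulated drift first exceeds
# a tolerance threshold. All parameters are invented for illustration.

import random

def coherent_radius(drift_per_hop, tolerance=1.0, trials=500):
    total_hops = 0
    for _ in range(trials):
        drift, hops = 0.0, 0
        while abs(drift) < tolerance:
            drift += random.gauss(0.0, drift_per_hop)   # random value drift per generation of copies
            hops += 1
        total_hops += hops
    return total_hops / trials

for drift in (0.3, 0.1, 0.03):
    print(f"drift {drift} per hop -> coherent radius ~{coherent_radius(drift):.0f} hops")
```

In this simple random-walk picture, halving the per-hop drift roughly quadruples the coherent radius, which is why the divergence rate, rather than raw expansion speed, ends up setting the equilibrium size of each entity's region.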

A natural follow-up thought is the longtermist question of what kind of future is preferable in terms of human values. If alignment is fully solvable, so that coordination costs go to zero, then it seems almost inevitable that the lightcone will eventually become dominated by a single entity. If maintaining alignment and projecting power over astronomical distances is 'easy', and we start out with an equilibrium of many competing agents, then there will likely be very little slack, and evolution will drive everyone towards pure power-seeking. No matter which of these agents ultimately wins and conquers the lightcone, it seems very unlikely that it will create much of value according to our current values. On the other hand, if humanity is alone among the stars, and we manage to create a single aligned superintelligence with a decisive initial strategic advantage, then we can parlay that into total control over the lightcone, with the concomitant vast potential utility that entails. This future is thus very high-variance.

On the other hand, if alignment is fundamentally difficult, then human values are never going to be expressed in a large proportion of the lightcone before they mutate away into something inhuman and alien. However, since power projection is hard in this scenario, the values expressed in these alien regions are likely to be the product of plenty of slack, rather than pure power-seeking. The real question, then, is how valuable (to us) are the kinds of values that emerge from divergence from our original values?
 

  1. ^

    When these become possible -- for instance, with ubiquitous brain-computer interfaces allowing governments or corporations to directly monitor your thoughts and potentially 'edit' them -- there will be another big increase in human alignment, and hence another potential increase in the size and stability of centralized entities. This of course sounds very dystopian and bad, and it is, but unfortunately the competitive pressures and benefits of scale will likely tend towards this outcome.

  2. ^

    Perhaps a closer analogy is evolution's alignment problem: evolution often manages to evolve systems composed of many potentially independent units which are nevertheless mostly aligned. Examples of this include multicellularity (perhaps the first true alignment problem) and the evolution of eusociality, i.e. colony insects like bees. Here evolution also controls the fundamental makeup and developmental process of the individual units being aligned, just as we will control such things for AGI.

  3. ^

    Funnily enough, you can also see this phenomenon occur all the time in strategy games like Civilization or the Paradox games. In such games, 'playing wide', i.e. conquering/settling large territories, is almost always far and away the dominant strategy, so the game designers try to counteract this by adding direct maluses for growth, such as penalties to technology or stability. These maluses are essentially hacky ways to simulate the increasing coordination and alignment costs that must occur in larger organizations, without having to simulate the fine-grained details of e.g. the principal-agent problems that actually cause these costs.

  4. ^

    Really, all coordination costs are downstream of principal-agent -- i.e. misalignment -- problems. If everyone acted perfectly aligned to a goal, then no agents would exploit information asymmetries, transaction costs would be incredibly low, and there would be no frictions, since everybody would simply do whatever is best for the overall goal, and so on.

  5. ^

    Interestingly, the scenario of a paperclipper dominating the lightcone requires both that a) humanity cannot solve alignment, hence the creation of the paperclipper, and that b) the paperclipper *can* solve alignment, such that it can keep all of its copies, including those billions of lightyears away, perfectly aligned to the core paperclipping mission.
