In Thought Experiments Provide a Third Anchor, Jacob Steinhardt wrote about the relative merits of a few different reference classes when it comes to reasoning and making predictions about future machine learning systems. He refers to these reference classes as ‘anchors’; one of them is the reference class of complex systems.
I, too, recently became curious about this complex systems ‘anchor’ and this post is an attempt to get down some of my thoughts about it.
In starting to think about this, there are two questions that I would want to know the answer to:

1. Is there a robust, non-trivial characterization of what it is that makes a system a 'complex system'?
2. If so, is an artificial neural network undergoing training a complex system according to that characterization?
Perhaps the best thing that we could hope for would be that the answer to both questions were 'yes', and in such a way that there exist techniques, methods, insights etc. that have been applied to other complex systems and that we might be able to translate into a form directly applicable to the study of neural networks. But at the very least, a positive answer to both questions would presumably suggest that broader insights and observations about how to think about complex systems may help us direct our inquiries about future ML systems, by helping to guide us towards good questions, appropriate research directions or better predictions.
The answer to the first question might be 'no': it might be the case that, informally, the label 'complex system' is applied to too many different kinds of system, each with their own idiosyncrasies, for a single unified notion to be possible, or for such a notion to be anything other than trivial and useless. To further complicate matters, one might reasonably ask whether - during a certain period of recent history in the higher-education funding landscape - the phrase 'complex system' became associated with a 'new', sexy-sounding kind of science that studied 'modern' things like the internet as a network or the global financial markets as a single entity, and whether the possibility of access to large research grants incentivized the labelling of various things as complex systems or 'complexity science'. The question has actually received a fair amount of attention from within the complex systems community itself, and these discussions are consistent with the answer to question 1 being 'yes': it is entirely plausible to me that people who study complex systems have genuinely had to spend a long time zeroing in on exactly how to characterize the commonality in the systems that they study, and that a robust characterization does exist.
In What is a Complex System?, Ladyman, Lambert and Wiesner examined the question in considerable depth, and in Measuring Complexity and their subsequent book (also titled What is a Complex System?), Ladyman and Wiesner went on to develop what they call "a framework for understanding 'complexity'", which they claim "is applicable across the natural and social sciences". I don't claim that their framework is perfect, and I'm sure that there are people who would have substantial disagreements with it, but it is of the kind sought in question 1. In this post, we claim that an artificial neural network undergoing gradient-based training is a complex system according to their framework. So, assuming a certain positive answer to question 1, we give a positive answer to question 2.
Acknowledgements. Work completed while a SERI MATS scholar. In preparing this work I benefitted from multiple conversations with Evan Hubinger, Jennifer Lin, Paul Colognese, Bilal Chughtai, Joe Collman and presumably many others(!). I am also grateful to Jacob Steinhardt for sharing with me a set of his slides on the subject.
The Basic Structure of the System
We start by going over the basics of how neural networks undergo gradient-based training in a way that centres the 'actions' of each individual neuron. This may seem tedious to some readers, but we have made the decision to try to explain things as fully as possible and to lean heavily into the particular perspective that we want to introduce (probably more so than is currently useful). One of the main shifts in perspective that we want to make is away from the idea that 'we' are the ones who 'act' to feed the network its input, see what the result is, compute the loss, and update the weights. Instead, we want to view this process as the inherent dynamics of a system that is continuously interacting with inputs. Note in particular that, while the fact that the training is gradient-based is actually important for some of our arguments, the idea that the system is 'in training' is not, really. We could equally well (or perhaps even more aptly) have in mind, e.g., a fully 'deployed' system that is continuously doing on-line learning, or some neural-network-based agent that is continuously allowed to update and which receives its inputs by interacting with an environment.
The architecture of the system is a directed graph G=(V,E), the vertices of which we call neurons. In this setup, E⊂V×V and if (v,v′) is an edge, then it is directed from v to v′.
Neurons send and receive two different types of signal: Forwards signals and backwards signals. And there are certain distinguished types of neurons: There are input neurons, which do not send any backwards signals and which can only receive forward signals from outside of the system, and there are output neurons, which do not send any forwards signals and can only receive backwards signals from outside of the system. In general, whenever (v,v′) is an edge in the architecture, the weight w(v,v′) is a measure of how strong the interactions between v and v′ are, such that whenever either a forwards or backwards signal is sent between v and v′, if the sender sends the signal x, the 'amount' of signal that the receiver gets is w(v,v′)x.
A forward pass is initiated when each of the input neurons Ii, for i=1,…,d, receives an input signal xi from outside the system (the vector x∈Rd represents a data point). When this happens, Ii then sends the forward signal xi to every neuron v for which (Ii,v)∈E. Each such neuron v will receive w(Ii,v)xi. Generally, for a neuron v′ that is neither an input nor an output neuron, its involvement in a forward pass begins when it receives forwards signals from at least some of the neurons v for which (v,v′)∈E. When it does so, the neuron v′ aggregates the forwards signals it is receiving to get the quantity Fv′, then it adds on the bias bv′ to get the preactivation zv′:=Fv′+bv′, and then it applies the activation function φ to get its activation av′:=φ(zv′). Then v′ sends the forwards signal av′ to every v′′ for which (v′,v′′)∈E. This process takes place in what can be thought of as one time step of the forwards pass, from the point of view of v′. The neuron v′ must also 'store' the information φ′(zv′) for use in the backwards pass.
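To make this neuron-centric description concrete, here is a minimal sketch in Python of a forward pass on a small directed graph. The particular graph, weights, activation function and names (forward_pass, phi and so on) are illustrative choices of mine, and for simplicity the neurons are processed in a hard-coded order compatible with the edges.

```python
import math

# Directed edges (v, v') with weights w(v, v'). 'i1' and 'i2' are input
# neurons, 'o1' is an output neuron, 'h1' and 'h2' are ordinary neurons.
weights = {
    ("i1", "h1"): 0.5, ("i2", "h1"): -0.3,
    ("i1", "h2"): 0.8, ("i2", "h2"): 0.1,
    ("h1", "o1"): 1.2, ("h2", "o1"): -0.7,
}
bias = {"h1": 0.1, "h2": -0.2, "o1": 0.05}

def phi(z):
    """The activation function (a toy choice)."""
    return math.tanh(z)

def phi_prime(z):
    """Its derivative, which each neuron 'stores' for the backwards pass."""
    return 1.0 - math.tanh(z) ** 2

def forward_pass(x):
    """x maps each input neuron to the signal it receives from outside."""
    activation = dict(x)   # the forwards signal each neuron sends onwards
    stored = {}            # phi'(z_v), kept for the backwards pass
    # Here the output neuron is treated like any other neuron, for simplicity.
    for v_prime in ["h1", "h2", "o1"]:   # an order compatible with the edges
        # Aggregate the weighted forwards signals arriving at v', add the
        # bias to get the preactivation, then apply the activation function.
        F = sum(w * activation[v] for (v, u), w in weights.items() if u == v_prime)
        z = F + bias[v_prime]
        activation[v_prime] = phi(z)
        stored[v_prime] = phi_prime(z)
    return activation, stored

activations, stored = forward_pass({"i1": 1.0, "i2": -2.0})
print(activations["o1"])
```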
A backwards pass is initiated when each of the output neurons Oi, for i=1,…,d′, receives an error signal ei from outside the system. Precisely what this error signal is may depend on the exact training algorithm being used, but we imagine typically that the vector e∈Rd′ could be the gradient of the error, i.e. the gradient of the loss function on the most recent input, taken with respect to the outputted forwards signals. What then happens is that Oi sends the backwards signal ei to every neuron v for which (v,Oi)∈E (recall that edges are directed 'forwards' as written). Each such neuron v will receive w(v,Oi)ei. Generally, for a neuron v′ that is neither an input nor an output neuron, its involvement in a backward pass begins when it receives backwards signals from at least some of the neurons v′′ for which (v′,v′′)∈E. When this happens, the neuron v′ aggregates the backwards signals that it is receiving to get the quantity Bv′, then it multiplies this by φ′(zv′) to get the quantity δv′:=Bv′φ′(zv′). Then v′ sends the backwards signal δv′ to every v for which (v,v′)∈E.
Whenever a backwards signal is sent from v′′ to v′ during a backwards pass, the weight w(v′,v′′) changes, i.e. the backwards signal changes the strength of future interactions between the two neurons. In fact, the way in which the current strength of interaction affects the error - that is, the partial derivative of the error with respect to the weight - is given by the product of the forwards signal and the backwards signal sent between the two neurons:
∂(error)/∂w(v′,v′′) = δv′′ av′.   (†)

Also note that we are not restricting our discussion to layered, or even necessarily 'feedforward', architectures. Our discussion applies to general directed graph architectures, in particular those involving recurrence.
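As a sanity check on (†), here is a self-contained sketch on a tiny chain i → h → o with a squared-error loss. The architecture, the choice of a linear output neuron and the parameter values are all illustrative assumptions of mine; the point is only that the local products δv′′av′ agree with finite-difference estimates of how the error changes with each weight.

```python
import math

w_ih, w_ho = 0.7, -1.3       # weights w(i, h) and w(h, o)
b_h, b_o = 0.1, -0.2         # biases
x, target = 0.9, 0.5         # a single data point and its target

def phi(z):
    return math.tanh(z)

def phi_prime(z):
    return 1.0 - math.tanh(z) ** 2

def error(w_ih, w_ho):
    """Forward pass followed by a squared-error loss on this one data point."""
    z_h = w_ih * x + b_h
    a_h = phi(z_h)
    y = w_ho * a_h + b_o     # the output neuron is taken to be linear here
    return 0.5 * (y - target) ** 2, z_h, a_h, y

_, z_h, a_h, y = error(w_ih, w_ho)

# Backwards pass in the language of the text: the output neuron receives the
# error signal e = d(error)/dy from outside and sends it backwards; h then
# multiplies the weighted backwards signal it receives by phi'(z_h).
e = y - target
delta_o = e
delta_h = (w_ho * delta_o) * phi_prime(z_h)

# Equation (†): d(error)/dw(v', v'') = delta_{v''} * a_{v'}
grad_w_ho = delta_o * a_h
grad_w_ih = delta_h * x

# Central finite differences as an independent check.
eps = 1e-6
fd_w_ho = (error(w_ih, w_ho + eps)[0] - error(w_ih, w_ho - eps)[0]) / (2 * eps)
fd_w_ih = (error(w_ih + eps, w_ho)[0] - error(w_ih - eps, w_ho)[0]) / (2 * eps)
print(grad_w_ho, fd_w_ho)   # should agree to several decimal places
print(grad_w_ih, fd_w_ih)
```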
Necessary Conditions For Complexity
Ladyman and Wiesner's framework distinguishes between conditions for complexity and products of complexity, and they write that "a complex system is a system that exhibits all of the conditions for complexity and at least one of the products emerging from the conditions." In this subsection we focus on the conditions, which are: numerosity of elements and interactions; disorder; feedback; and non-equilibrium or openness.
So in order to demonstrate that an artificial neural network undergoing training is a complex system, it is necessary that we argue that it meets all of these conditions. In the course of doing so, we will also explain more about exactly what is meant by the less obvious terms.
Numerosity of elements and interactions
The artificial neurons in the architecture of a neural network are of course the individual elements of the complex system. At the time of writing, large models routinely have hundreds of thousands of artificial neurons. For comparison with a few other complex systems: a beehive contains only around 100,000 bees; an ant colony only a few hundred thousand ants; and recent estimates put the number of websites in the single-digit billions. The amount of possible interaction between the elements of the system is given by the number of edges in the neural architecture. One often refers to large models by how many parameters they have, and the largest models currently have up to around a trillion parameters. Since the vast majority of these parameters are weights (as opposed to biases), we can certainly say that the number of interactions in our system is also very large.
Disorder
The kind of disorder that is relevant to us is that of decentralised control. This may seem to be an esoteric interpretation of the word 'disorder' and I don't have much to say against that point of view, but it really is the case that the word is being used in a way that is not supposed to just be synonymous with 'randomness' or 'entropy'. It is supposed to get at a certain lack of correlation between the elements in the way that we now describe.
There is a relationship of decentralised control between the loss and the dynamics of the system at the level of its individual elements. This is best understood if we allow ourselves to be deliberately vague about what we mean by 'the loss', i.e. we will temporarily ignore the distinctions between the test loss, the training loss on the whole training set, the 'current' training loss etc. and just use L to denote 'the loss'. Often one thinks of L as a function on the parameter space, but we want to think of this as the state space of our system, i.e. we want to think of L as a function on the set S of possible states s∈S that the artificial neurons could be in (of course what we call the 'state' of a neuron really is just the information of its bias and the weights of the edges it belongs to, but again, we are trying to emphasize the shift in perspective as deliberately as possible). Since the parameters update during training, the state of the neurons depends on a time variable t, which we can think of as being discrete.
The behaviour of the loss curve t↦L(s(t)) is not the result of merely applying the fixed function L as a lens through which to view dynamics that are already inherent in the neurons, i.e. inherent in the map t↦s(t) itself. Instead, there is a control relationship at work: The way we often think about this in ML is that via whatever gradient-based training algorithm is being used, the loss function determines the parameter updates, and therefore in some sense it controls the dynamics of t↦s(t). However, we also want to simultaneously bear in mind the other point of view that the dynamics of t↦s(t) are the specific way that they are in order to cause L to decrease i.e. the neurons behave in a way that controls the value of L.
And this control relationship is decentralised, in the following sense: When we think of L as controlling s(t), the update that an individual neuron undergoes does not depend on or require knowledge of how the other neurons are going to be updated. As the displayed equation (†) in the previous subsection shows, the update to the weight w(v′,v′′) may depend only on the most recent signals sent between v′ and v′′. Similarly, note that when we think of s(t) as controlling L, we are imagining the individual neurons as somehow 'working together' to control the loss function, but not as the result of coordination, i.e. the neurons do not 'know' that the goal is to figure out how to collectively act to decrease the loss, and yet this is in some sense what is happening.
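As a deliberately simple illustration of decentralisation (and not of a neural network), consider the following toy, in which each element updates its own coordinate of the state using only information that is locally available to it, with no knowledge of how the other elements will update, and yet the global quantity L(s(t)) decreases. In this toy the loss happens to decompose across elements, which makes the locality trivial; in a real network the locality comes instead from the forwards and backwards signals, as in (†).

```python
import random

random.seed(0)
n = 50
state = [random.uniform(-1.0, 1.0) for _ in range(n)]     # s(t): one number per element
targets = [random.uniform(-1.0, 1.0) for _ in range(n)]   # fixed targets, for illustration

def L(s):
    """'The loss', viewed as a function on the state space."""
    return sum((si - ti) ** 2 for si, ti in zip(s, targets))

lr = 0.1
for t in range(21):
    if t % 5 == 0:
        print(t, round(L(state), 4))
    # Each element i updates using only its own local error signal (s_i - t_i);
    # it neither knows nor needs to know how any other element will be updated.
    state = [si - lr * 2.0 * (si - ti) for si, ti in zip(state, targets)]
```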
Feedback
The notion of feedback being referred to here is a specific kind of iteration of interactions. All it really means is that the way that an element of the system interacts with other elements at later times depends on how it interacted at earlier times. We want to point out that this is saying slightly more than just the fact that the weights change over time: Since both δv′′ and av′ appear in the equation (†), it means that the value of w(v′,v′′) at later times - i.e. the strength of interaction between v′ and v′′ at later times - doesn't just depend on what w(v′,v′′) was at earlier times, but on the actual signals av′ and δv′′ that were sent between the neurons.
Non-Equilibrium or Openness
The way that the term 'non-equilibrium' is being used here is by way of analogy to a thermodynamic system, which we say is in a state of non-equilibrium when there is a net influx of energy. In particular, such a system is not a closed system - it is 'open'. In our context, the role of energy is played by a combination of a) Training data that can form inputs to the network, and b) Error data that can enter the system via the output neurons.
This concludes the argument that a neural network in training exhibits each of Ladyman and Wiesner's necessary conditions for complexity.
Example. Themes of decentralised control and non-equilibrium can be seen in many systems, for example in beehives. The temperature of the beehive is affected by the temperature outside the beehive, so it is not a closed system in that sense. And when the beehive gets hot, each bee gets hot. And when any given bee gets hot, it starts behaving in a certain way. So, a net influx of heat that raises the temperature of the beehive affects - in some sense controls - the behaviour of the bees. But this is only true in a decentralised sense: The temperature controls the behaviour of each bee independently and irrespective of the other bees. The remarkable thing is that as the beehive gets too hot, the behaviour that the heat has engendered in the bees does itself help the beehive to cool down, i.e. the bees seem to act collectively to regulate - to control - the internal temperature of the beehive. But again, they are only 'working together' in a decentralised sense; they are not actually coordinating.
Products of Complexity
We now need to demonstrate that a neural network in training exhibits at least one of Ladyman and Wiesner's products of complexity. We will briefly discuss three: Robustness, self-organization, and nonlinearity.
Robustness
Robustness refers to the fact that the normal functioning of the system is robust to changes to a small fraction of the elements of the system. The system works in a distributed way, such that deleting some number of neurons or weights will tend not to ruin the essential features of the system, and certainly not those features that cause us to think of it as complex. This is a weaker criterion than the existence of so-called winning tickets in the context of the Lottery Ticket Hypothesis, but winning tickets do provide further evidence of the kind of robustness being referred to.
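One crude way to probe this kind of robustness is to delete (zero out) a small random fraction of the weights and measure how much the network's outputs move. The sketch below does this for a randomly initialized network on random inputs, purely so that it runs as written; in practice one would of course use a trained model and its actual loss or task metric.

```python
import numpy as np

rng = np.random.default_rng(0)
sizes = [100, 200, 200, 10]
weights = [rng.normal(0.0, 1.0 / np.sqrt(m), size=(m, n))
           for m, n in zip(sizes[:-1], sizes[1:])]

def forward(ws, x):
    # Pass a batch of inputs through the (toy, untrained) network.
    for w in ws[:-1]:
        x = np.tanh(x @ w)
    return x @ ws[-1]

x = rng.normal(size=(64, sizes[0]))
baseline = forward(weights, x)

frac = 0.01   # delete 1% of the weights, chosen uniformly at random
pruned = [w * (rng.random(w.shape) > frac) for w in weights]

perturbed = forward(pruned, x)
rel_change = np.linalg.norm(perturbed - baseline) / np.linalg.norm(baseline)
print(f"relative change in outputs after deleting {frac:.0%} of weights: {rel_change:.3f}")
```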
Self-Organization
This is the idea of relatively stable structures or behaviours arising out of the aggregate behaviour of the individual elements and their interactions. In some sense, the evidence we have for this is that gradient-based training of neural networks actually seems to work! When training on what we think of as a fixed distribution, you really can train a network until it finds a dynamic equilibrium, i.e. until the system reaches a stable state in which the loss curve has plateaued and the weights are barely changing at all. Moreover, interpretability work has increasingly given us reason to expect to be able to find certain structures or behaviours encoded in the state of a system that has reached this dynamic equilibrium, even across different training runs or architectures.
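One simple way to look for this kind of dynamic equilibrium is to track both the loss and the size of the parameter updates during training and watch them plateau. The sketch below does this for stochastic gradient descent on a toy linear model with synthetic data; the model and data are stand-ins of mine, but the same monitoring applies to any gradient-trained network.

```python
import numpy as np

rng = np.random.default_rng(1)
true_w = rng.normal(size=20)
X = rng.normal(size=(2000, 20))
y = X @ true_w + 0.1 * rng.normal(size=2000)   # a fixed, noisy data distribution

w = np.zeros(20)
lr, batch = 0.01, 32
for step in range(1, 3001):
    idx = rng.integers(0, len(X), size=batch)
    grad = 2.0 * X[idx].T @ (X[idx] @ w - y[idx]) / batch
    w_new = w - lr * grad
    if step % 500 == 0:
        # Loss on the whole dataset, and how far the parameters moved this step.
        loss = float(np.mean((X @ w_new - y) ** 2))
        print(step, round(loss, 4), round(float(np.linalg.norm(w_new - w)), 5))
    w = w_new
```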
Nonlinearity
The idea of a nonlinear equation, or system of equations (differential or otherwise), is a crisp one in mathematics, and indeed the equations that describe the evolution of the parameters of a neural network during gradient-based training form a nonlinear system. Is this all we need to say? On the one hand, Ladyman and Wiesner point out that "In the popular and philosophical literature on complex systems a lot of heat and very little light is liable to be generated by talk of linearity and non-linearity" and that "the discussion of complexity abounds with non-sequiturs involving nonlinearity". And yet on the other hand, they argue that "nonlinearity must be considered an important part of the theory of complexity if there is to be one" and that "non-linearity in some guise, usually of dynamics, is at least a necessary part of some set of conditions that are jointly sufficient". So it seems that the fact that the dynamics are nonlinear is enough to satisfy their criteria. This having been said, we will revisit the idea of nonlinearity in the next section, pointing out another (arguably more important) aspect of the system that acts as an example of its 'nonlinearity'.
Multiscale Dynamics
We want to spend the next few sections saying more about what the complex systems viewpoint means.
To get a handle on a complex system, we need to identify useful and understandable properties of the system. A property can be thought of as a function p on the state space S. For example, the average test loss of a neural network that we are training is a property of the system; it is a function p=Testloss:S→[0,∞) that takes as input a state s∈S and outputs the average test loss of the network whose parameters take the values specified by the state s. But we could equally be considering properties that are not mathematical properties of the network but have more to do with the AI system to which the network gives rise: for example, 'the action that an agent is about to take' or 'the goal of the system', or binary properties such as whether or not the system is deceptively misaligned.
Typically, we think of a property as existing at a certain scale, in the following sense: Given some property p, we consider its fibres, i.e. for each possible value π that the property can take, we consider the set p−1(π) of states that give rise to that value. So, to continue our example, for the test loss we would be talking about the level sets of

Testloss:S→[0,∞).

The fibres of a property p partition the state space S. The coarseness of this partition is what we call the scale of the property p. This is not a precise or rigorous notion but it is a very important one. At one extreme, the exact state of the system is itself a property of the system: it is the identity map p:S→S, and so its fibres form as fine a partition of S as can possibly exist. Considering the state of the system is the smallest scale at which the system can be viewed. We say that the exact state is a microscopic property of the system and, when we need to emphasize it, we will call S the set of microstates. On the other hand, for any given r∈(0,∞), there will in general be many different choices for the parameters of the network that result in the test loss being equal to r, i.e. Testloss−1(r) contains lots of different states, and so the partition made by the fibres is coarser. This means that the test loss is a larger scale - more macroscopic - property of the system, and the partition

Σ={Testloss−1(r):r∈(0,∞)}

is a collection of macrostates.
There are dynamics at different scales, too: For any time t, there is always going to be a unique σ(t)∈Σ for which s(t)∈σ(t). This defines a dynamical system t↦σ(t) on the larger scale state space Σ. Thus we start to see the sense in which a complex system is composed of many different dynamical systems at many different scales, all overlaid.
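Here is a toy illustration of this coarse-graining, with no neural network involved: the 'microscopic' system is just a small random walk in the plane, the property p is its distance from the origin binned to one decimal place, and the induced trajectory t↦σ(t) is a much coarser description of the same underlying dynamics. The dynamics and the choice of p are illustrative assumptions only.

```python
import numpy as np

rng = np.random.default_rng(2)

def p(s):
    """A macroscopic property: distance from the origin, binned to one decimal
    place. Its fibres p^{-1}(pi) are annuli containing many distinct microstates."""
    return round(float(np.linalg.norm(s)), 1)

s = np.zeros(2)                  # the microstate s(t)
macro_traj = []
for t in range(20):
    s = s + 0.05 * rng.normal(size=2)   # microscopic dynamics t -> s(t)
    macro_traj.append(p(s))             # the induced dynamics t -> sigma(t)

print(macro_traj)   # a much coarser description of the same trajectory
```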
Notice that a system does not 'come with' useful, natural or understandable macrostates and macroscopic properties already delineated; one of the issues in understanding a complex system is in actually identifying non-microscopic properties and states that we think are worthy of further study. But explaining a state or property of the system will require much more than identifying and studying the dynamics at its own scale. A key difficulty to appreciate is that in order to build an explanation of a particular state or an understanding of the relevant dynamics at a certain scale, we ought to expect to have to bring to bear properties across multiple interacting scales, both smaller and larger: There are dynamics at each given scale but there is in general no separation of scales; there are causal effects between different scales, directed both upwards and downwards in the hierarchy of scales.
Universality and Mean-Field Theories
At one extreme, it is sometimes possible to identify useful macroscopic properties that are independent of particular microscopic details, or which can be found to exist across many systems with different microscopic structures. This phenomenon is called universality. In ML, this might refer to a property of a trained model that arises across many different training runs, architectures, or even tasks. But for a relatively 'pure' example of universality, consider the central limit theorem: There, the distribution of √n(X̄n−μ)/σ converges to a standard normal N(0,1) as n→∞, regardless of the particular distribution (with mean μ and variance σ²) of the individual independent random variables X1,X2,… used to construct the empirical mean X̄n=(X1+⋯+Xn)/n. This is a particularly simple example because the random variables X1,X2,… do not interact with one another, but there are actually many variants of the theorem which allow the Xi to have different distributions and/or some degree of dependence and still yield a statement of the same flavour, i.e. the average of a large number of deviations from the mean follows a normal distribution. Another classic example of universality, and a slightly less trivial one, is in the theoretical treatment of the ideal gas laws: Classical macroscopic properties such as temperature and pressure arise from the dynamics of individual gas particles, but under the ideal gas assumptions, the equations governing these macroscopic properties do not depend on the microscopic details of the dynamics. For example, the momentum of an individual particle is presumably in reality a very complicated function of all of the other particles' trajectories, but almost all of that complexity is irrelevant from the point of view of the relationships between the macroscopic properties we are interested in.
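The universality in the central limit theorem is easy to glimpse numerically. In the sketch below (a toy of my own, not anything from Ladyman and Wiesner), the standardized sample mean is computed for three very different underlying distributions, and in each case its first two moments come out matching those of a standard normal.

```python
import numpy as np

rng = np.random.default_rng(3)
n, trials = 500, 5000   # n terms per average, over many independent trials

samplers = {
    "uniform":     lambda size: rng.uniform(0.0, 1.0, size),
    "exponential": lambda size: rng.exponential(1.0, size),
    "bernoulli":   lambda size: rng.integers(0, 2, size).astype(float),
}

for name, draw in samplers.items():
    xs = draw((trials, n))
    mu, sigma = xs.mean(), xs.std()                    # estimates of the true mean and sd
    z = np.sqrt(n) * (xs.mean(axis=1) - mu) / sigma    # standardized sample means
    # In every case the standardized mean looks like N(0, 1).
    print(name, round(float(z.mean()), 3), round(float(z.std()), 3))
```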
In general, while I expect universal properties to be part of the story of understanding large neural networks, it also seems likely to me that their relevance for alignment - for actually finding and producing safe AI systems - may be limited. This is partly because we typically think of safe behaviour as existing only in a small subset of the space of possible AI systems that we will build, and yet universal properties are of course those that in some sense 'always' arise. (As a throwaway remark, consider the 'security mindset': it's clearly not the case that software built by different teams, in many different languages using many different paradigms, just tends to be secure as a result of some universality phenomena.)
The ideal gas laws and the central limit theorem are both examples that come from mean-field theories. Like universality phenomena, mean-field theories are indicative of, or rely on, a kind of separation of scales that tends not to be applicable to, or true of, the kinds of systems we have in mind. But it may be instructive to consider what the failure of this way of thinking means.
When one is looking at a system made up of numerous, interacting elements, it sometimes makes sense to form something called a mean-field approximation, which is a certain type of simpler, non-interacting system (and one that is of course supposed to capture the essential features of the original system). Roughly speaking, in a mean-field approximation, the behaviour of each individual element is modelled via small fluctuations around its average behaviour. And consequently, what often happens when one works out the mathematics of the specific system being studied, is that the way a given element interacts with all of the other elements can be modelled by only keeping track of its 'average interaction' with other elements. To put it another way, we replace the interaction with all the other elements by a single interaction with a 'mean field'.
In some sense, it is a way of expanding the behaviour of the system about its average behaviour. A typical mean-field approach keeps track of only the first-order deviations from the average behaviour, and so it can be construed as a kind of linearization of the system around the average behaviour of its elements. So, the failure of approaches like this bears witness to a kind of essential nonlinearity in the system, in the sense that these approaches can be said to fail when the 'linear part' of the system does not approximate the full system, or when there is no neat 'linear part' of the system in the first place.
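To make this slightly more concrete, here is a toy system of my own construction in which a mean-field description is natural: N elements each couple only to the population average, and the mean-field model keeps track of a single number m(t) in place of the whole state. Even here, the heterogeneous inputs h_i open a gap between 'the average of the nonlinearity' and 'the nonlinearity of the average' - exactly the kind of higher-order effect that a mean-field treatment drops.

```python
import numpy as np

rng = np.random.default_rng(4)
N, beta, steps = 1000, 1.5, 30
h = 0.3 * rng.normal(size=N)          # heterogeneous external inputs

x = rng.uniform(-1.0, 1.0, size=N)    # the full microscopic state
m_mf = float(x.mean())                # the mean-field description: one number

for t in range(steps):
    m = x.mean()
    x = np.tanh(beta * m + h)                        # every element couples only to the mean
    m_mf = float(np.tanh(beta * m_mf + h.mean()))    # mean-field: the nonlinearity of the average

print("mean of the full system:", round(float(x.mean()), 4))
print("mean-field prediction:  ", round(m_mf, 4))
# The gap between the two is a higher-order effect: the average of tanh(...)
# over heterogeneous inputs is not tanh of the average input.
```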
One thing that these points suggest to me is that we are unlikely to find and understand useful macroscopic properties by using either linear macroscopic functions, such as averages or other kinds of simple aggregates or integrals, or methods that ignore higher-order interactions between the elements of the system.
Downward Causation and Explanatory Reduction
In general, when we cannot rely on a 'separation of scales', we need to think carefully about how we are building up our understanding of the system. This is because although larger scale states are collections of microscopic states in a literal, formal sense, this does not mean that there are easy answers to questions such as: Is a given (non-microscopic) state caused by its more microscopic constituents? And: Can a given (non-microscopic) state be explained only in terms of its more microscopic constituents? In fact, we will essentially argue that the answer to both questions is 'no', or rather 'not necessarily'.
It is usually thought of as clear that there are causal effects from smaller scales to larger scales. The first question is about whether or not there also exist non-negligible causal effects in the 'downwards' direction, i.e. from larger scales to smaller scales. This is the idea of downward causation. The second question is concerned with the fact that the existence of a constitutive reduction does not imply the existence of a satisfactory explanatory reduction.
A full discussion of these ideas would take us too far afield into philosophy, but we will give a few remarks. While causes are indeed usually thought of as different from explanations, like many interpretations of explanatory reduction, ours involves causality in some way (in particular, the two questions in the previous paragraph are not neatly separable). And we think of explanatory reduction as something trying to "capture the flavor of actual scientific reductions", as Sarkar's account does. Though since there is often something "messy" or "semi-empirical" about real scientific explanations, we ought not to expect especially crisp insights to come from discussing these things in the context of thinking practically about complex systems. However, we will briefly discuss some potentially concrete ways in which these sorts of ideas are relevant.
The constraints imposed by the superposition of multiple interacting dynamical systems across scales are enough to illustrate the general idea of downward causation: Let Σ be a collection of macrostates and denote the dynamics of the system at this scale by t↦σ(t)∈Σ. We let S and t↦s(t) denote the microstates and the microscopic dynamics as usual. Suppose we are measuring the property p corresponding to the partition Σ and we see it change from, say, σ(t)=σ1 to σ(t+1)=σ2≠σ1. We suppose that this change is something understandable at the scale of Σ, e.g. perhaps it is clear that the action given by p(σ1) ought to be followed by the action given by p(σ2) in some sequence of actions. Suppose that at the microscopic level, during the same time step, the state changed from s(t)=s1 to s(t+1)=s2. This microscopic change might in some literal sense be due to microscopic dynamics, but a more understandable answer to the question "Why did the system go from state s1 to state s2?" might have to include the way in which the understandable macroscopic change of state from σ1 to σ2 constrained the microscopic behaviour, by insisting that s(t+1)∈σ2.
And if part of the reason for a given state can be due to larger scale properties, then we do not expect it to be possible to give fully satisfactory explanations for that state only in terms of smaller scale properties.
As an example, suppose I have a tiny box that is full of tiny balls, and suppose that the balls have a little bit of room to jiggle around and rearrange. You can imagine that really the box contains atoms of an ideal gas and that the underlying state space of my system is given by the individual atoms of this world. The box starts at point A. But then I put the box on the back of my truck and drive 10 miles away to point B. Label one of the balls in the box and imagine its full spacetime trajectory. What explains this trajectory, and what explains the final position of the ball? A significant part of the explanation is surely that I moved the box from A to B. After all, this explains why the ball is anywhere near B at all. If one were trying to explain the position of the ball only in terms of microscopic dynamics, one would be in the absurd position of having to justify and explain one particular sequence of molecular states for the world, when all the while it just so happens to be one of an immeasurably large number of sequences of microstates that all map to me grabbing the box and driving it from A to B. So we see that in this case, the fact that part of the cause of a state is top-down means that it is more difficult to find a completely satisfactory bottom-up explanation.
One of the broader difficulties that these issues create is with respect to what scales or properties (if any) are ontologically foundational in a complex system. The system is defined in terms of 'individual elements' but we discover that the interactions between scales muddy the notion that the individual elements of the system should be the building blocks of our understanding. On that note, we will end this more philosophical section.
Remarks
I am not an expert on any of the topics that I've touched on in this essay so it is likely that I have made mistakes, or am wrong about something, or have misrepresented something. Please don't hesitate to comment or get in touch if you notice such things.