In his List of Lethalities, Eliezer writes that "matrices [in neural networks] are opaque". But I think our understanding of how neural networks actually work improved significantly last year, thanks to the deep learning theory book (The Principles of Deep Learning Theory). I will briefly outline the general idea of the book below, in case closing this knowledge gap inspires any viable path towards alignment.
Everything that follows is paraphrased from Roberts, Yaida, and Hanin's book, unless I indicate otherwise. To understand the book, you need decent linear algebra and analysis skills, with a sprinkle of information theory. Most of the difficulty comes from the length of the calculations, which requires some practice to get used to. Luckily, the book is exceptionally well written and guides the reader through the math step by step.
Effective theories
Artificial neural networks are usually described in terms of activations, weights, and biases. This is the "atomic" view. But just like describing a gas in terms of the individual atoms that it is made of, the atomic view of neural networks is too fine-grained for most practical purposes. To understand neural networks, we need to know the effective degrees of freedom of the network - akin to temperature, pressure, and volume of a gas. This is not just practical; it is essential for truly understanding any complex system. You haven't fully understood a gas until you have derived the ideal gas law - even if you know all about the quantum field theory that ultimately implies it.
A beautiful thing about nature is that effective theories work. While nature always operates on the lowest level, most low-level details don't matter for the high-level description. Physicists use perturbation theory and "renormalisation group flow" to successfully link the low-level description (e.g. atoms) with the high-level one (e.g. a gas). This not only gives you better concepts to work with at large scales, but also tells you when those concepts will break down and what to do in that case. The idea in the book is to apply the physicist's tools to find an effective theory description of neural networks.
The ensemble
The weights and biases of a neural network are typically initialised randomly before training. During training, all of these parameters are updated (e.g. step by step via gradient descent) until the network produces a desired output from a given input. The function that the trained network represents therefore usually depends on the particular initialisation that we started with[1]. To fully understand neural networks, it is essential to think in terms of an ensemble of networks over all possible initialisations, instead of a single trained network.
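To make the ensemble picture concrete, here is a minimal sketch (my own toy example, not code from the book). It draws many random initialisations of a tiny one-hidden-layer tanh network and looks at the spread of outputs for one fixed input. The function name `sample_output`, the width, the input vector, and the variance hyperparameters `c_w` and `c_b` are all illustrative assumptions.

```python
import math
import random
import statistics

def sample_output(x, width, rng, c_w=1.0, c_b=0.0):
    """Draw one random initialisation of a one-hidden-layer tanh network
    and return its scalar output for the input vector x."""
    n_in = len(x)
    hidden = []
    for _ in range(width):
        # First-layer weights have variance c_w / n_in, biases variance c_b.
        pre = rng.gauss(0.0, math.sqrt(c_b))
        pre += sum(rng.gauss(0.0, math.sqrt(c_w / n_in)) * xi for xi in x)
        hidden.append(math.tanh(pre))
    # Output weights have variance c_w / width, output bias variance c_b.
    out = rng.gauss(0.0, math.sqrt(c_b))
    out += sum(rng.gauss(0.0, math.sqrt(c_w / width)) * h for h in hidden)
    return out

rng = random.Random(0)
x = [1.0, -0.5, 2.0]

# Each call is a different member of the ensemble, evaluated on the same input.
outputs = [sample_output(x, width=20, rng=rng) for _ in range(2000)]
ensemble_mean = statistics.fmean(outputs)
ensemble_std = statistics.pstdev(outputs)
print(f"mean = {ensemble_mean:.3f}, std = {ensemble_std:.3f}")
```

The same input gives a whole distribution of outputs, centred near zero with a non-trivial spread. The effective theory describes this distribution (and how training changes it) rather than any single draw.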
So we are dealing with a probability distribution over network outputs (those of the final layer or any intermediate ones), given an input and an initialisation distribution. To build an effective theory of a feed-forward network, Roberts et al. mainly do three things:
- Start with the infinite layer width limit and then do perturbation theory on the width to get to large but finite width
- Use the recursion relation between layers
- Marginalise over the parameter initialisation distribution (whenever this is desirable)
All the math is pretty standard in physics, and as far as I can tell it all checks out (although I haven't done all the calculations myself yet). What we get in the end is an effective description of a feed-forward network that lets you derive all kinds of interesting things.
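The first two steps (infinite width plus a layer-to-layer recursion) can be illustrated with a small sketch, assuming ReLU activations and a single input. In the infinite-width limit each preactivation is Gaussian with variance K, and for ReLU the average of relu(z)^2 over z ~ N(0, K) is K/2, so the variance obeys K_next = c_b + c_w * K / 2 from one layer to the next. The choice c_w = 2, c_b = 0 is an assumption made here because it keeps K constant across layers. The function names are hypothetical, and the simulation is only a sanity check at small finite width, not the book's perturbative calculation.

```python
import math
import random

def relu_kernel(x, depth, c_w=2.0, c_b=0.0):
    """Infinite-width variance of a layer-`depth` preactivation for ReLU,
    using <relu(z)^2> = K / 2 for z ~ N(0, K)."""
    k = c_b + c_w * sum(xi * xi for xi in x) / len(x)
    for _ in range(depth - 1):
        k = c_b + c_w * k / 2.0
    return k

def sample_preactivation(x, depth, width, rng, c_w=2.0, c_b=0.0):
    """One random initialisation of a ReLU network; returns the first
    preactivation in layer `depth`."""
    acts = list(x)
    z = []
    for _ in range(depth):
        sigma_w = math.sqrt(c_w / len(acts))  # weight variance c_w / fan-in
        sigma_b = math.sqrt(c_b)
        z = [
            rng.gauss(0.0, sigma_b) + sum(rng.gauss(0.0, sigma_w) * a for a in acts)
            for _ in range(width)
        ]
        acts = [max(0.0, zi) for zi in z]
    return z[0]

rng = random.Random(0)
x = [1.0, -0.5, 2.0]
depth, width, n_samples = 3, 16, 2000

# Marginalise over initialisations by Monte Carlo sampling.
samples = [sample_preactivation(x, depth, width, rng) for _ in range(n_samples)]
empirical = sum(z * z for z in samples) / n_samples
predicted = relu_kernel(x, depth)
print(f"empirical variance = {empirical:.2f}, kernel recursion = {predicted:.2f}")
```

The two numbers should agree up to sampling noise. Finite-width effects show up in the shape of the distribution (for example its fourth moment), which is where the perturbation theory in 1/width comes in.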
Over-parametrisation and generalisation
Modern deep neural networks are usually trained to convergence, where the training error vanishes. This is the over-parametrised regime, where the network has many more parameters than are necessary to describe the training data. (See p. 391 of the book for why this is not in conflict with Occam's razor.) As a result, the loss function (what we try to minimise during training) does not have a single global minimum, but a high-dimensional sub-manifold of global minima. The art of building and training neural networks lies in finding not just any global minimum of the loss function, but the global minimum that also minimises the error on the unseen test data.
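A toy illustration of this, assuming the simplest possible over-parametrised model (my own example, not from the book): a linear model with two weights fitted to a single training point. Every weight vector with w1 + w2 = 2 is a global minimum, and which one gradient descent finds depends on the initialisation. The helper names `train` and `predict` are hypothetical.

```python
def predict(w, x):
    """Linear model f(x) = w . x."""
    return sum(wi * xi for wi, xi in zip(w, x))

def train(w_init, x_train, y_train, lr=0.1, steps=500):
    """Plain gradient descent on the squared error for one training point."""
    w = list(w_init)
    for _ in range(steps):
        residual = predict(w, x_train) - y_train
        w = [wi - lr * residual * xi for wi, xi in zip(w, x_train)]
    return w

x_train, y_train = [1.0, 1.0], 2.0   # one data point, two parameters
x_test = [1.0, 0.0]                  # an unseen input

w_a = train([0.0, 0.0], x_train, y_train)  # converges to about (1.0, 1.0)
w_b = train([3.0, 0.0], x_train, y_train)  # converges to about (2.5, -0.5)

for name, w in (("a", w_a), ("b", w_b)):
    print(name, "train error:", abs(predict(w, x_train) - y_train),
          "test prediction:", predict(w, x_test))
```

Both runs reach zero training error, but they disagree on the unseen input (about 1.0 versus 2.5). Zero training loss alone therefore says nothing about which minimum generalises.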
Roberts et al. express this error on the test set explicitly with the effective theory and show how it is affected by the choice of initialisation and learning algorithm. Specifically, they show how an object called the neural tangent kernel is the main driver of the function-approximation dynamics and how its components determine the generalisation behaviour of the model. They also show explicitly how and why representation learning works.
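To give a feeling for what the neural tangent kernel does, here is a minimal sketch in its simplest form, assuming plain gradient descent with one global learning rate and a one-hidden-layer tanh network with scalar input and output. The kernel is the dot product of parameter gradients at two inputs, and to first order in the learning rate it predicts how a training step on one input changes the output on another. All names (`init_params`, `forward`, `gradient`, `ntk`, `gd_step`) and all numbers are illustrative assumptions, not the book's notation.

```python
import math
import random

def init_params(width, rng):
    """Random one-hidden-layer tanh network, scalar input and output."""
    return {
        "w": [rng.gauss(0.0, 1.0) for _ in range(width)],
        "b": [rng.gauss(0.0, 1.0) for _ in range(width)],
        "v": [rng.gauss(0.0, 1.0) for _ in range(width)],
    }

def forward(p, x):
    n = len(p["v"])
    hidden = (math.tanh(w * x + b) for w, b in zip(p["w"], p["b"]))
    return sum(v * h for v, h in zip(p["v"], hidden)) / math.sqrt(n)

def gradient(p, x):
    """Flat list of d f(x) / d theta, ordered as (w, b, v)."""
    s = 1.0 / math.sqrt(len(p["v"]))
    t = [math.tanh(w * x + b) for w, b in zip(p["w"], p["b"])]
    dw = [s * v * (1.0 - ti * ti) * x for v, ti in zip(p["v"], t)]
    db = [s * v * (1.0 - ti * ti) for v, ti in zip(p["v"], t)]
    dv = [s * ti for ti in t]
    return dw + db + dv

def ntk(p, x1, x2):
    """Empirical tangent kernel: dot product of parameter gradients."""
    return sum(g1 * g2 for g1, g2 in zip(gradient(p, x1), gradient(p, x2)))

def gd_step(p, x, y, lr):
    """One gradient-descent step on the loss 0.5 * (f(x) - y)^2."""
    n = len(p["v"])
    err = forward(p, x) - y
    flat = p["w"] + p["b"] + p["v"]
    new = [theta - lr * err * g for theta, g in zip(flat, gradient(p, x))]
    return {"w": new[:n], "b": new[n:2 * n], "v": new[2 * n:]}

rng = random.Random(1)
params = init_params(width=50, rng=rng)
x_train, y_train, x_test, lr = 0.7, 3.0, -0.3, 1e-4

# First-order prediction for the change of the output at x_test.
predicted_change = -lr * ntk(params, x_test, x_train) * (forward(params, x_train) - y_train)
new_params = gd_step(params, x_train, y_train, lr)
actual_change = forward(new_params, x_test) - forward(params, x_test)
print(f"predicted = {predicted_change:.3e}, actual = {actual_change:.3e}")
```

For a small learning rate the two numbers should agree closely. The kernel entry between a training input and a test input is what couples learning on the former to predictions on the latter, which is why it governs generalisation.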
The theory described in the book is derived for feed-forward and residual networks, but the same techniques should apply to transformers and any other kind of network that has some recursive / modular structure. Doing this is a lot of hard work, but it can be fun if you like physics.
Dealing with the distributional shift
The question now is whether we can express how a network's behaviour changes under a distributional shift of the input data. If we can do this, maybe we can say something that might help with alignment. Having said this, such a research direction probably advances capabilities more than alignment, as usual.
[1] In the over-parametrised, non-convex regime that we are interested in.
Hmm, you may be right, sorry. I somehow read the opaqueness problem as a sub-problem of lie-detection. To do lie-detection we need to formulate mathematically what lying means, and for that we need theoretical understanding of what's going on in a neural net in the first place, so we have the right concepts to work with.
I think lie-detection in general is very hard, although it might be tractable in specific cases. The general problem seems hard because I find it difficult to define lying mathematically. Thinking about it for five minutes I hit several dead ends. The "best" one was this: If the agent (for lack of a better term) lies, it would not be surprised about a contrary outcome. That is, I think it would be a bad sign if the agent wasn't surprised to find me dead tomorrow, despite stating the contrary. And surprisal is something that we have an information-theoretical handle on. However, even if we could design the agent such that we can feed it with such input that it actually "believes" it is tomorrow and I am dead (even though it is today and I am still alive), we would still need to distinguish surprisal about the fact that I'm dead and surprisal about the way the operator has formulated the question or any other thing. (A clever agent might expect the operator to ask this question and deliberately forget that one can ask the question in this particular way, so it'd be surprised to hear this formulation, etc.) The latter issue might become more tractable now that we better understand how and why representations are forming, so we could potentially distinguish surprisal about form and surprisal about content. I still see this as a probable dead end because of the "make it believe" part. If a solution exists, I expect it to be specific to a particular agent architecture.
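For reference, the information-theoretic handle mentioned above is the standard definition of surprisal. Here p stands for the agent's predictive distribution over outcomes x; treating an agent as having such a distribution is itself an assumption of this sketch.

```latex
% Surprisal (self-information) of an outcome x under the distribution p
I(x) = -\log p(x)
% Its expectation over x ~ p is the Shannon entropy
H(p) = -\sum_x p(x) \log p(x)
```

An outcome the agent considers likely has low surprisal, so an agent that is not surprised by a contrary outcome has assigned it substantial probability.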