Great post! Agree with the points raised but would like to add that restricting the expressivity isn’t the only way that we can try to make the world model more interpretable by design. There are many ways that we can decompose a world model into components, and human concepts correspond to some of the components (under a particular decomposition) as opposed to the world model as a whole. We can backpropagate desiderata about ontology identification to the way that the world model is decomposed.
For instance, suppose that we’re trying to identify the concept of a strawberry inside a Solomonoff inductor. We know that once we identify the concept of a strawberry inside a Solomonoff inductor, it needs to continue to work even when the inductor updates to new potential hypotheses about the world (e.g. we want the concept of a strawberry to still be there even when the inductor learns about QFT). This means that we’re looking for redundant information that is present in a wide variety of likely hypotheses given our observations. So instead of working with all the individual TMs, we can try to capture the redundant information shared across a wide variety of TMs consistent with our existing observations (& we expect the concept of a strawberry to be part of that redundant information, as opposed to the information specific to any particular hypothesis).
This obviously doesn’t get us all the way there, but I think it’s an existence proof that we can cut down the search space for “human-like concepts” without sacrificing the expressivity of the world model, by reasoning about which parts of the world model could correspond to human-like concepts.
A desirable property of an AI’s world model is that you as its programmer have an idea what’s going on inside. It would be good if you could point to a part of the world model and say, “This here encodes the concept of a strawberry; here is how this is linked with other concepts; here you can see where the world model is aware of individual strawberries in the world.” This seems, for example, useful for directly specifying goals in the world – like “go and produce diamonds” – without having to do reinforcement learning or some other kind of indirect goal learning; if we knew how to find diamonds in the AI’s world model, we could directly write a goal function.
But you won’t get this kind of understandable world model by default, if you use something like Solomonoff induction over Turing machines or gradient descent on transformers. Both of these will produce a model for describing the world, but you cannot easily look inside. With sufficient effort, you can reverse-engineer what is happening in the Turing machine (keeping in mind a Turing machine’s unlimited ability to do meta-programming) or in the giant floating-point array, but you might think that there should be a better way.
You might think, can’t we just learn the world model as something that’s not a Turing machine or a floating-point array? Perhaps a data structure that we can impose some semantics on, and that we therefore know how to interpret?
Well, it’s not so easy, actually. Let’s go through an example.
Consider a simple Solomonoff induction over Turing machines:
We get observations from the world, and feed them to the Turing machine, which then predicts the next observation. We have a prior over all Turing machines, and then we restrict our posterior to those that make correct predictions given past observations. The TM handles both the state and the dynamics of the world. Understanding what is happening within the TM is very difficult.
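To make the moving parts concrete, here is a toy sketch of that induction step in Python. This is not actual Solomonoff induction (which is uncomputable); the finite hypothesis class, the description-length weights, and the example predictors below are illustrative stand-ins for “all Turing machines with a length prior”.

```python
# Toy stand-in for Solomonoff induction: a finite hypothesis class instead of
# all Turing machines (real Solomonoff induction is uncomputable).

from typing import Callable, List, Tuple

# A "hypothesis" maps the history of observations to a predicted next observation,
# and carries a description length used for the 2^-length prior.
Hypothesis = Tuple[Callable[[List[int]], int], int]

def posterior(hypotheses: List[Hypothesis], history: List[int]) -> List[Tuple[Hypothesis, float]]:
    """Keep hypotheses that retrodict the history correctly, weighted by a 2^-length prior."""
    weighted = []
    for predictor, length in hypotheses:
        # Check the hypothesis against every prefix of the history.
        if all(predictor(history[:t]) == history[t] for t in range(len(history))):
            weighted.append(((predictor, length), 2.0 ** -length))
    total = sum(w for _, w in weighted)
    return [(h, w / total) for h, w in weighted] if total else []

def predict_next(hypotheses: List[Hypothesis], history: List[int]) -> float:
    """Posterior-weighted prediction of the next observation."""
    return sum(w * predictor(history) for (predictor, _), w in posterior(hypotheses, history))

# Illustrative hypotheses: "always 0", "repeat the last observation", "alternate 0/1".
toy_hypotheses = [
    (lambda h: 0, 3),
    (lambda h: h[-1] if h else 0, 5),
    (lambda h: len(h) % 2, 4),
]

print(predict_next(toy_hypotheses, [0, 1, 0, 1]))  # only the alternating hypothesis survives; it predicts 0
```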
Here is one idea to get a more interpretable world model: instead of finding a Turing machine that has free rein over how to describe the state and the dynamics of the world, we’ll fix the world dynamics (i.e., the time evolution function) to a function we wrote ourselves based on our understanding of physics, and then we’ll let the induction algorithm search only for the initial state and some bridging laws. The hope is that because the dynamics have fixed semantics, the state ends up in a format that is understandable to us.
For example, we could take a quantum chemistry simulation and then search over initial states for this simulation software, such that the software is able to predict the observations we have collected:
“QC” is the quantum chemistry software, which is fixed. We search only over possible initial states and possible bridging laws (where the bridging laws are the same for each time step). We do not feed in any observations at runtime; all the information needed to predict the observations correctly must be contained within the initial state.
We went from "searching over Turing machines to predict the next observation from past observations" to "searching over initial states and bridging laws such that all observations can be predicted by running the QC software".
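Schematically, the intended setup looks something like the sketch below, where `qc_step` stands in for the fixed quantum chemistry time evolution and the brute-force search over candidate states and bridging laws is purely illustrative (and, as noted further down, not something you could actually run).

```python
# Sketch of the intended setup: fixed dynamics, search only over
# (initial state, bridging law). Everything here is an illustrative stand-in.

from itertools import product
from typing import Any, Callable, List, Tuple

State = Any                       # a QC state; in reality an enormous object
Bridge = Callable[[State], int]   # bridging law: reads an observation off a state

def qc_step(state: State) -> State:
    """Fixed time evolution: our hand-written quantum chemistry simulation."""
    raise NotImplementedError     # placeholder for the real simulator

def rollout(initial_state: State, bridge: Bridge, n_steps: int) -> List[int]:
    """Run the fixed dynamics forward and read predicted observations off each state."""
    state, predictions = initial_state, []
    for _ in range(n_steps):
        state = qc_step(state)
        predictions.append(bridge(state))
    return predictions

def search(candidate_states: List[State],
           candidate_bridges: List[Bridge],
           observations: List[int]) -> List[Tuple[State, Bridge]]:
    """Keep every (initial state, bridging law) pair that reproduces the observations.
    No observations are fed in at runtime; the initial state has to contain all the
    information needed to predict them."""
    return [(s, b) for s, b in product(candidate_states, candidate_bridges)
            if rollout(s, b, len(observations)) == observations]
```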
We should then be able to inspect the state at each time step because we know how to interpret the state because we wrote the QC software. We should, for example, be able to spot diamonds within that state.
There are some obvious limitations to this: a naïve search will be uncomputable; quantum chemistry has approximation errors and doesn’t take into account gravity (which doesn’t matter for chemistry but does matter if we want to describe the entire planet). But maybe this is at least a small step up from doing Solomonoff induction on Turing machines? Assuming we have the ungodly amount of compute needed to simulate (at least!) the entire planet on the quantum chemistry level?
Can you spot the problem with this?
…
…
…
A quantum chemistry simulation is surely Turing-complete.[1] So what makes you think that the search algorithm will use the QC software with the intended semantics?
Surely something like the following will happen: the search algorithm constructs a computer in the initial state which is stepped forward by the QC software. Then the bridging laws read off the state of the computer and return that as the prediction. The computer can implement a much better QFT approximation that doesn’t simulate all particles at the same level of detail – as we’d be forced to do when using the QC software as intended – and so these predictions would be more efficient and likely more accurate than what QC could ever do.
The search algorithm has abused our structure to implement something we didn’t intend.
This specific failure mode where a computer is constructed in the initial state can perhaps be prevented by imposing some constraints, but I’d predict you still wouldn’t get your intended semantics. The number of initial states and bridging laws that use the QC time evolution as intended seems small compared to those that don’t use it as intended. It’s not that the search algorithm hates you and makes your life difficult on purpose – it’s just that it will not magically guess what you wanted.
The problem is most severe when searching over data structures that are Turing-complete – and as we know, Turing machines can appear in surprising places – but above some threshold of expressive power anything can be abused. When you look at your data structure, while having the intended semantics in your mind, you might not spot the problem, but the optimizer will find it.
I’m not saying that it’s impossible and that you should give up on trying to learn structures with semantics. I’m, like, pretty sure it’s possible. And it seems like something you need, in order to create a safe superintelligence. But it requires a lot more work than just defining the structure and assigning meanings to the parts.
Consider one of the successful cases of restricting the optimizer’s search space: convolutional neural networks (CNNs). If you want your model to be translation-invariant, then a CNN will give you that. In fact, it can’t not give you that – it can only express translation-invariant solutions. This might be the level of restriction it takes to make your optimizer use the supplied structure as intended.
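To make that structural guarantee concrete: strictly speaking, the convolution layers are translation-equivariant, and pooling on top gives invariance. Here is a minimal demonstration, assuming PyTorch; the circular padding is an assumption chosen so the property holds exactly rather than only away from the image borders.

```python
# Minimal demonstration of the structural guarantee a CNN provides:
# convolution commutes with translation (equivariance), and global pooling
# on top yields translation invariance.

import torch
import torch.nn as nn

torch.manual_seed(0)

# Circular padding so that equivariance to circular shifts holds exactly (no boundary effects).
conv = nn.Conv2d(1, 4, kernel_size=3, padding=1, bias=False, padding_mode="circular")
pool = nn.AdaptiveAvgPool2d(1)   # global average pooling

x = torch.randn(1, 1, 16, 16)
x_shifted = torch.roll(x, shifts=(2, 3), dims=(2, 3))   # translate the input

y = conv(x)
y_shifted = conv(x_shifted)

# Equivariance: shifting the input shifts the feature maps in exactly the same way.
print(torch.allclose(torch.roll(y, shifts=(2, 3), dims=(2, 3)), y_shifted, atol=1e-5))

# Invariance: after global pooling, the shifted and unshifted inputs give the same output.
print(torch.allclose(pool(y), pool(y_shifted), atol=1e-5))
```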
(This property of CNNs corresponds to a sort of extreme inductive bias, and there seems to be some kind of connection between inductive biases and making an optimizer use your structure as intended, but I don’t fully grasp it.)
A way to simply avoid the problem of an optimizer “abusing” the structure you gave it is to abandon any ambition of giving the structure any semantics in the first place. This seems to mostly be the strategy in deep learning. In self-attention layers, for example, the parts get names like Key vector, Query vector and Value vector, implying perhaps some vague semantics, but there is – as far as I can tell – no real expectation associated with how the optimization process will use them. For example, I don’t think anybody predicted that this machinery would be used for implementing addition in Fourier space in models trained to do modular addition, but it’s not really an abuse, because there was never much expected meaning behind it.
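For reference, here is roughly what that machinery computes, as a minimal single-head self-attention sketch in NumPy (the random projection matrices stand in for learned parameters). Nothing in the computation forces the Query, Key and Value projections to carry the meanings their names suggest.

```python
# Minimal single-head self-attention sketch. The names Query, Key and Value
# label the three learned projections, but nothing in the computation forces
# them to carry any particular meaning.

import numpy as np

rng = np.random.default_rng(0)
seq_len, d_model = 5, 8

X = rng.normal(size=(seq_len, d_model))            # token representations
W_q, W_k, W_v = (rng.normal(size=(d_model, d_model)) for _ in range(3))

Q, K, V = X @ W_q, X @ W_k, X @ W_v                # "Query", "Key", "Value" vectors

scores = Q @ K.T / np.sqrt(d_model)                # scaled dot-product similarities
weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
weights /= weights.sum(axis=-1, keepdims=True)     # softmax over keys

output = weights @ V                               # attention-weighted mix of values
print(output.shape)                                # (5, 8)
```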
It seems possible to me that in a previous era there was more thought put into what different parts of the neural network architecture were supposed to mean. For example, in the LSTM paper from 1997, the authors talk about long- and short-term memory cells, and they have paragraphs like this (in Section 4):
I can’t remember any such section in a recent deep learning paper, but I don’t claim to know the literature well enough to draw strong conclusions here.
(Note, however, that 17 years later, Cho et al. proposed GRUs, which dropped about half of the mechanisms inside an LSTM cell and seem to work about as well. Which perhaps implies that LSTMs do not work for the reasons Hochreiter and Schmidhuber thought they do.)
One recent project in deep learning which did have high ambitions in assigning semantics was Anthropic’s interpretability team’s effort to make neurons monosemantic. We can say that these researchers intended each neuron in the network to correspond to exactly one concept – which would be convenient for understanding what is going on inside. Alas, a trained neural network is not like that by default, so we can say that the optimization process did not use the structure as intended – the structure was abused. The research team proposed a new activation function to nudge the optimization process in the right direction, but this was ultimately unsuccessful, and IIUC, the team abandoned this research direction. (The new strategy seems to be not to make the neurons in the trained network monosemantic, but to train a separate auto-encoder with monosemantic neurons.)
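For concreteness, here is a hedged sketch of that “separate auto-encoder” strategy, often called a sparse autoencoder: an overcomplete autoencoder is trained on a model’s activations with an L1 sparsity penalty, in the hope that each learned feature ends up monosemantic. The architecture and hyperparameters below are illustrative, not Anthropic’s actual setup.

```python
# Hedged sketch of the "separate autoencoder" strategy (a sparse autoencoder
# trained on a model's activations). Architecture and hyperparameters are
# illustrative, not a reproduction of Anthropic's setup.

import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    def __init__(self, d_activation: int, d_features: int):
        super().__init__()
        self.encoder = nn.Linear(d_activation, d_features)  # overcomplete: d_features > d_activation
        self.decoder = nn.Linear(d_features, d_activation)

    def forward(self, acts: torch.Tensor):
        features = torch.relu(self.encoder(acts))  # hopefully monosemantic feature activations
        recon = self.decoder(features)
        return recon, features

def loss_fn(acts, recon, features, l1_coeff=1e-3):
    # Reconstruction error plus an L1 penalty that pushes features to be sparse.
    return ((recon - acts) ** 2).mean() + l1_coeff * features.abs().mean()

# Toy usage: pretend these are activations collected from a trained model.
sae = SparseAutoencoder(d_activation=64, d_features=512)
opt = torch.optim.Adam(sae.parameters(), lr=1e-3)
acts = torch.randn(1024, 64)

for _ in range(10):  # a few illustrative training steps
    recon, features = sae(acts)
    loss = loss_fn(acts, recon, features)
    opt.zero_grad()
    loss.backward()
    opt.step()
```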
The general lesson is, if you hand an optimizer a structure that is meant to be used in a certain way, and you tell it to use this structure to build a model of the world, the optimizer will by default abuse this structure in ways you did not intend – not out of malice, but because it’s easier that way.
And if your structure is Turing complete, then god help you.
(Thanks to Vivek Hebbar for the main example in this post; thanks to Johannes C. Mayer for our discussions surrounding this issue; thanks to Simon Fischer for his comments on a draft.)
Technically, a Turing machine needs an infinite tape, which you cannot have within our finite simulation, but the tape can still be pretty big in any case.