
Epistemic Status: Highly Speculative. I spent less than a day thinking about this in particular, and though I have spent a few months studying large language models, I have never trained a language model. I am likely wrong about many things. I have not seen research on this, so it may be useful for someone to do a real deep dive.

Thanks to Anthony from the Center on Long Term Risk for sparking the discussion earlier today that led to this post. Also thanks to conversations with Evan Hubinger ~1 year ago that first got me thinking about this topic.

Summary

My vague suspicions at the moment are somewhat along the lines of:

  • Training an initial model: low to moderate path-dependence
  • Running a model: high "prompt-dependence"
  • Fine-tuning a model with reinforcement learning: moderate to high path-dependence

Definitions of "low" and "high" are somewhat arbitrary; what I mean is roughly how different the behaviours of the model can end up being. I expect some aspects to be quite path-dependent and others not so much, so this is an attempt to quantify, based on vibes, how many aspects might be path-dependent.

Introduction

Path dependence is thinking about the "butterfly effect" for machine learning models. For highly path-dependent models, small changes in how a model is trained can lead to big differences in how it performs. If a model is highly path-dependent, then if we want to understand how our model will behave and make sure it is doing what we want, we need to pay attention to the nitty-gritty details of the training process, like the order in which it learns things, or the random weight initialisation. And, if we want to influence the final outcome, we have to intervene early on in the training process.

I think having an understanding of path-dependence is likely useful, but I have not really seen any empirical results on the topic. I think it likely depends a lot on the training method, and in this post I will give some vague impressions I have on the path-dependence of Large Language Models (LLMs).

In this post, I will also include "prompt-dependence" as another form of "path-dependence" when it comes to the actual outputs of the models, though this is not technically correct, since it does not depend on the actual training of the model.

Initial Training of a Model

My Understanding: Low to Moderate Path-Dependence

So with Large Language Models at the moment, the main way they are trained is to take a very large dataset, randomise the order, and use each text exactly once. In practice, many datasets contain a lot of duplicates of things that are particularly common (possible example: transcripts of a well-known speech), though people try to avoid this. While this may seem like it should produce a large degree of path dependence, my general impression is that, at least in most current models, this does not happen that often. LLMs do tend to struggle with niche facts, so I would expect that one training run learns a niche fact that another does not, but the models seem to be at least directionally accurate. (An example I have seen is that a model might say "X did Mathematics at Cambridge" instead of "X did Physics at Oxford", but compared to the possibility space, that is not very far off.)
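
As a minimal sketch of that single-epoch setup (the toy model, random "corpus" and hyperparameters below are placeholders, not a real training configuration):

```python
# Minimal sketch of single-epoch pre-training: shuffle once, see each text exactly once.
# TinyLM, the corpus, and all hyperparameters are toy placeholders.
import random
import torch
import torch.nn as nn

VOCAB, DIM = 1000, 64

class TinyLM(nn.Module):
    """A stand-in next-token predictor (embedding followed by a linear head)."""
    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(VOCAB, DIM)
        self.head = nn.Linear(DIM, VOCAB)

    def forward(self, tokens):                 # tokens: (seq_len,)
        return self.head(self.embed(tokens))   # logits: (seq_len, VOCAB)

# Toy "corpus": each element stands in for a pre-tokenised document.
corpus = [torch.randint(0, VOCAB, (32,)) for _ in range(500)]

model = TinyLM()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()

random.shuffle(corpus)                         # randomise the order...
for doc in corpus:                             # ...and use each text exactly once (one epoch)
    logits = model(doc[:-1])                   # predict token t+1 from tokens up to t
    loss = loss_fn(logits, doc[1:])
    opt.zero_grad()
    loss.backward()
    opt.step()
```

The path-dependence question is then roughly: how much do the final weights and behaviour change if random.shuffle (or the weight initialisation) comes out differently?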

I suspect that training on a completely different dataset would impact the model outputs significantly, but from my understanding of path dependence, this does not really fall under its umbrella, since the model is then modelling a completely different distribution. Even in this case, though, I would suspect that for text from categories in the overlapping distribution, the models would have similar-looking outputs (though the one trained only on that specific category could give somewhat more detail).

I also think that, relative to the possibility space, most models are relatively stable in their possible outputs. Prompted with the title of an academic paper, a (non-fine-tuned) LLM can understand that it is an academic title, but might follow up with the text of the paper, with the names of the authors and paper followed by other references, or simply with titles of other papers. I tried this briefly on GPT-3 (davinci) and GPT-NeoX, and both would typically try to continue with a paper, but often in different formats on different runs. What seemed to matter more for narrowing down the search space was the specific punctuation placed after the title, such as a newline, " -" or " ,". (More on this in the next section.)
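
A rough sketch of that kind of probe, using the Hugging Face transformers library (GPT-2 here is just a lightweight stand-in for the models mentioned above, and the title and sampling settings are arbitrary examples):

```python
# Sketch of probing how the punctuation after a paper title steers the continuation.
# GPT-2 is a lightweight stand-in for GPT-3 (davinci) / GPT-NeoX.
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

title = "A Mathematical Framework for Transformer Circuits"
suffixes = ["\n", " -", " ,"]        # the punctuation that seemed to narrow the search space

for suffix in suffixes:
    inputs = tokenizer(title + suffix, return_tensors="pt")
    for _ in range(3):               # a few samples per suffix, to see the spread of continuations
        out = model.generate(
            **inputs,
            do_sample=True,
            temperature=0.8,
            max_new_tokens=40,
            pad_token_id=tokenizer.eos_token_id,
        )
        print(repr(suffix), "->", tokenizer.decode(out[0], skip_special_tokens=True))
```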

I would guess that things like activation functions and learning-rate parameters make a difference to how "good" the model gets, but for different models of the same "goodness", the internals will likely look different while there is not much difference in the actual outputs.

This is partially motivated by the Simulators model of how LLMs work. While imperfect, it seems somewhat close, and a much better description than any sort of "agentic" model of how LLMs work. The essential idea is that an LLM is not so much a unified model trying to write something; rather, it uses the information in the previous context window to model the person who wrote the words, and then simulates what they would be likely to write next. On this view, the LLM is not so much learning a unified whole of a next-token predictor, but rather building somewhat independent "simulacra" models and using relevant information to update each of them appropriately.

So the things I would expect to have a moderate impact are things like:

  • Dataset contents and size
  • Tokenizer implementation
  • Major architectural differences (eg: RETRO-like models)

And some things I think are less impactful to the output:

  • Specific learning rate/other hyperparameters
  • Specific Random Initialisation of the Weights/Biases
  • Dataset random ordering

Though this is all speculative, I think that if you have a set of 3 models trained on the same dataset, possibly with slightly different hyperparameters, the outputs for a well-specified prompt would be pretty similar for the most part. I expect the random variation from non-zero temperature to be much larger than the differences due to the specifics of training, for tasks similar to the training distribution. I would also expect that for tasks that have neat circuits, you would find quite similar transformer circuits across models.

It is possible, however, that since many circuits could use the same internal components, there might be "clusters" in circuit space. I suspect that the same task could be accomplished with multiple circuits, and that some are exponentially easier to find than others, though there might be some exceptions where two same-size circuits accomplish the same task.

Some exceptions where I might expect differences:

  • Niche specific facts only mentioned once in the dataset, at the start vs the end.
  • The specific facts learned might be different, even for models with lower loss.
  • The specific layers in which circuits form.
  • Formatting, which depends on how the data is scraped and tokenized.
  • Data far outside the training distribution (for example, a single character highly repeated, like ".................").

So I think it does make a difference how you set up the model, but the difference in behaviour is likely much smaller than the difference from which prompt you choose. I also suspect that most of the same holds for fine-tuning (as long as it does not use reinforcement learning).
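
One cheap way to poke at this would be to hold a prompt fixed and compare greedy continuations across comparably-trained models, then hold one model fixed and compare across prompt variants. A sketch (the checkpoint names and prompts are hypothetical placeholders):

```python
# Sketch: compare output variation across training runs (prompt fixed)
# versus across prompts (model fixed). Checkpoint names are hypothetical placeholders.
from transformers import AutoModelForCausalLM, AutoTokenizer

checkpoints = ["org/lm-seed0", "org/lm-seed1", "org/lm-seed2"]   # placeholders
prompts = ["The capital of France is", "France's capital city is"]

def greedy_continuation(model_name: str, prompt: str, n_tokens: int = 20) -> str:
    tok = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForCausalLM.from_pretrained(model_name)
    inputs = tok(prompt, return_tensors="pt")
    out = model.generate(**inputs, do_sample=False, max_new_tokens=n_tokens,
                         pad_token_id=tok.eos_token_id)
    return tok.decode(out[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True)

# Variation across training runs, prompt held fixed:
for name in checkpoints:
    print(name, "->", greedy_continuation(name, prompts[0]))

# Variation across prompts, model held fixed:
for prompt in prompts:
    print(repr(prompt), "->", greedy_continuation(checkpoints[0], prompt))
```

If the claim above is right, the second loop should show more variation than the first.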

Running a (pre-trained) Model

My Understanding: High "Prompt-Dependence"

When it comes to running a model, however, I think that dependence on the specific input is much higher. This is partially from just interacting with models somewhat, and also from what I have heard from other people. Depending on the way your sentences are phrased, the model could think it is simulating one context, or it could think it is simulating another.

For example, you could prompt it with two different prompts:

  • "New Study on Effects of Climate Change on Farming in Europe"
  • "Meta-analysis of climate impacts and uncertainty on crop yields in Europe "

While both are titles for texts conveying the same information, one could get outputs that differ completely, since the first would simulate something like a news article, while the latter sounds more like a scientific paper, and this could result in the facts it puts out being completely different.

[Image: examples where both the facts and the style differ for different prompts asking for the same information. Examples run on text-davinci-003.]

From talking to people who know a lot more than me about prompt engineering, even things like newline formatting, spelling, punctuation and spaces can make a big difference to the output, depending on how the text is formatted in the dataset. As described in the previous section, giving an academic title with different punctuation will make a big difference to the likely paths the continuation could take.

This is one of the main reasons I think that initial training has relatively little path dependence. Since prompting seems to make such a large difference, and the model seems to capture the dataset distribution quite well, I think the differences in output depend almost exclusively on the prompts being put in.

Fine-Tuning LLMs with Reinforcement Learning (RL)

My Understanding: Moderate to High Path-Dependence

My understanding of RL on language models is that the procedure is to take some possible inputs, generate responses, rate the outputs, and backpropagate a loss that accounts for the ratings. Rating could be done automatically (eg: a text adventure game, maths questions) or manually (eg: reinforcement learning from human feedback).
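
A heavily simplified sketch of that loop, in REINFORCE style, on a toy policy over canned answers to a maths question (this is not the actual RLHF pipeline, and the reward function is a made-up automatic rater):

```python
# Sketch of the generate -> rate -> backpropagate loop, REINFORCE-style.
# The "policy" is a toy categorical distribution over canned answers,
# and the reward function is a made-up stand-in for an automatic rater.
import torch

responses = ["3", "4", "5", "four"]                       # toy output space for "2 + 2 ="
logits = torch.zeros(len(responses), requires_grad=True)  # toy policy parameters
opt = torch.optim.Adam([logits], lr=0.1)

def reward(text: str) -> float:
    return 1.0 if text == "4" else 0.0                    # automatic rater ("maths question")

for step in range(200):
    dist = torch.distributions.Categorical(logits=logits)
    idx = dist.sample()                                   # generate a response
    r = reward(responses[idx.item()])                     # rate it
    loss = -r * dist.log_prob(idx)                        # reward-weighted log-probability
    opt.zero_grad()
    loss.backward()                                       # backpropagate to account for the rating
    opt.step()

print(torch.softmax(logits, dim=0))                       # mass shifts onto the rewarded answer
```

If the rater happens to favour particular phrasings (here, "4" over "four"), the loop will amplify them, which is the dynamic discussed below.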

I have not yet done any reinforcement learning on language models in particular, but from implementing RL in other settings I have learned that it can be quite brittle and finicky. RL on LLMs seems likely to suffer from this too, and different initial answers could likely sway the model quite quickly. Since the model is generating its own training set, the initial randomly-generated responses may happen to have somewhat random attributes (eg: tone/spelling/punctuation) correlated with the correctness of the outputs, and this could lead the model in the next epoch to reinforce cases where it uses these random attributes more, so they keep getting reinforced until the end of training.

As a toy example, one could imagine getting 10 outputs, 2 of which are "correct", and which both happen to have British English spelling. In this case, the model would learn that the output needs not only to be correct, but also to have British English spelling. From then on, it mostly answers in British English spelling, and each time this further reinforces the use of British English spelling.

While this may not be a particularly realistic example, the main point is that minor differences in the model are amplified later on in training.
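
A toy simulation of that amplification dynamic (pure numpy with made-up numbers; "attribute" stands in for something like British spelling, and it is generated independently of correctness, so any correlation within a batch is pure sampling noise):

```python
# Toy simulation: an attribute that is only coincidentally correlated with
# correctness in early batches gets locked in by naive reinforcement.
# All numbers are made up for illustration.
import numpy as np

rng = np.random.default_rng(0)
p_attribute = 0.2      # initial chance an output has the attribute (e.g. British spelling)
p_correct = 0.2        # chance an output is rated "correct", independent of the attribute

for epoch in range(20):
    has_attr = rng.random(10) < p_attribute   # 10 sampled outputs per epoch
    rewarded = rng.random(10) < p_correct     # which outputs get reinforced
    if rewarded.any():
        # Naive update: move the attribute rate toward its frequency among rewarded outputs.
        p_attribute = 0.5 * p_attribute + 0.5 * has_attr[rewarded].mean()
    print(f"epoch {epoch:2d}: p(attribute) = {p_attribute:.2f}")
```

With only a couple of rewarded samples per epoch, the estimate is noisy, so p(attribute) tends to drift and then lock in near 0 or 1, even though the attribute never affected the reward.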

I suspect, however, that there exist RL fine-tuning tasks that are less path-dependent. Depending on the reinforcement learning setup, it could make the model more or less "path-dependent" on the specific inputs it is prompted with, at least within the training distribution. Outside the training distribution, I would expect the randomly amplified behaviours to vary quite wildly between training runs.

Conclusion

Again, this writeup is completely speculative and not particularly based on evidence, but on intuitions. I have not seen strong evidence for most of these claims, but I think the ideas here are likely at least somewhat directionally correct, and I think this is an interesting topic on which people could do some relatively informative tests fairly easily, given the compute. One could even just look at the differences between similarly performing models of the same size, and come up with some sort of test for some of these things.

There might be existing studies into this which I have missed, and if not, I am sure there are people who have better intuitions than me on this, so I would be interested in hearing them.

References

"Path Dependence in ML Inductive Biases", by Vivek Hebbar, Evan Hubinger

"Simulators", by janus

"A Mathematical Framework for Transformer Circuits", by Nelson Elhage, Neel Nanda, Catherine Olsson, et al.

Comments

Excited to see people thinking about this! Importantly, there's an entire ML literature out there to get evidence from and ways to [keep] study[ing] this empirically. Some examples of the existing literature (also see Path dependence in ML inductive biases and How likely is deceptive alignment?): Linear Connectivity Reveals Generalization Strategies - on fine-tuning path-dependence, The Grammar-Learning Trajectories of Neural Language Models (and many references in that thread), Let's Agree to Agree: Neural Networks Share Classification Order on Real Datasets - on pre-training path-dependence. I can probably find many more references through my bookmarks, if there's interest in this.

janus:

Linear Connectivity Reveals Generalization Strategies suggests that models trained on the same data may fall into different basins associated with different generalization strategies depending on the init. If this is true for LLMs as well, this could potentially be a big deal. I would very much like to know whether that's the case, and if so, whether generalization basins are stable as models scale.