It is hard for me to give an honest, thorough, and charitable response to this post. Possibly no fault of the author: this has been a persistent problem for me on many Simulator Theory posts. I always come away with an impression of "I see interesting intuitions mixed with some conceptually/mathematically obscure restatements of existing content mixed with a lot of strained analogy-making mixed with a handful of claims that seem wrong or at least quite imprecise." I'll try to think about how to tease out these different components and offer better feedback, but I figured it'd be worth expressing my frustrated state of mind more directly first.
Maybe I should break this post down into different sections, because some of the remarks are about LLM Simulators, and some aren't.
Remarks about LLM Simulators: 7, 8, 9, 10, 12, 17
Other remarks: 1, 2, 3, 4, 5, 6, 11, 13, 14, 15, 16, 18
For people that are just reading cfoster0's comment and then skipping a read of the post, I recommend you still take a look. I think his comment is a bit unfair and seems more like a statement of frustration with LLM analysis in general than commentary on this post in particular.
Where is token deletion actually used in practice? My understanding was that since the context window of GPT-4 is so long, users don't tend to exceed it, so tokens rarely need to be deleted. And I predict that basically all interesting behaviour happens within those 32K tokens without needing to account for deletion.
Are there experiments with these models that show that they're capable of taking in much more text than the context window allows?
I've spoken to some other people about Remark 1, and they also seem doubtful that token deletion is an important mechanism to think about, so I'm tempted to defer to you.
But on the inside view:
The finite context window is really important. 32K is close enough to infinity for most use-cases, but that's because users and orgs are currently underutilising the context window. The correct way to utilise the 32K context window is to fill it with any string of tokens which might help the computation.
Here's some fun things to fill the window with —
I think no matter how big the window gets, someone will work out how to utilise it. The problem is that the context window has grown faster than prompt engineering, so no one has realised yet how to properly utilise the 32K window. Moreover, the orgs (Anthropic, OpenAI, Google) are too focused on "let's adjust the weights" (fine-tuning, RLHF, etc), rather than "let's change the prompts".
If you've spent much time playing with long conversations with LLMs with 1k-8k context windows (early versions of GPT or most current open-source models), then you quickly become painfully aware of token deletion. While modern frontier LLMs have novel-length context windows, their recall does tend to get worse as the current context size increases, especially for things not at either the beginning or end of the context window, which is a different but similar effect.
LLM- wouldn't suffer from this, but any realistic LLM will, and I agree that there is going to be a motive to fill the context window with useful stuff for any problem where enough useful stuff to do so exists.
A few comments:
Related - context distillation / prompt compression, perhaps recursively too - Learning to Compress Prompts with Gist Tokens.
The new page will also resemble a random string composed of "0" and "1", so this process will continue indefinitely.
I don't think this is true: there will be patterns emerging, and anything that appeared randomly and looks slightly less than random will make future continuations even less random until the LLM converges to something full of patterns.
Another way to say it: if it always output 50/50 for 0 and 1, and you ran the process indefinitely, it would necessarily pass through a string of all zeros, and I don't think you believe it would output 50/50 for 0 and 1 with a string full of zeros as the prompt.
Randomness is not stable. Pseudorandom-looking bitstrings don't remain pseudorandom-looking if you pick the continuation randomly enough times.
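A toy simulation can make this concrete. The sketch below is not an LLM: it assumes a hypothetical predictor that emits "1" with probability equal to the fraction of ones currently in the window, and an 8-bit window. Under those assumptions the all-zeros and all-ones pages are the only places the walk can get stuck, and a mixed starting page eventually drifts into one of them.

```python
import random

def run_until_ordered(page, max_steps=100_000, seed=0):
    """Slide a fixed-size bit window forward until it becomes all 0s or all 1s.

    Assumption (hypothetical predictor, not a real LLM): the next bit is 1
    with probability equal to the fraction of 1s in the current window.
    """
    rng = random.Random(seed)
    k = len(page)
    for step in range(max_steps):
        if sum(page) in (0, k):           # all zeros or all ones: nothing can change
            return step, page
        p_one = sum(page) / k
        new_bit = 1 if rng.random() < p_one else 0
        page = page[1:] + (new_bit,)      # delete the oldest bit, append the new one
    return max_steps, page

steps, final_page = run_until_ordered((0, 1, 0, 1, 0, 1, 0, 1))
print(steps, final_page)
```

How quickly order emerges depends entirely on the assumed predictor, so this illustrates the mechanism rather than measuring anything about GPT.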
Yep, but it's statistically unlikely. It is easier for order to disappear than for order to emerge.
This is awesome! So far, I'm not seeing much engagement (in the comments) with most of the new ideas in this post, but I suspect this is due to its length and sprawling nature rather than a lack of interest. This post is a solid start on creating a common vocabulary and framework for thinking about LLMs.
I like the work you did on formalizing LLMs as a stochastic process, but I suspect that some of the exploration of the consequences is more distracting than helpful in an overview like this. In particular: 4.B, 4.C, 4.D, 4.E, 5.B, and 5.C. These results are mostly an enumeration of basic properties of finite-state Markov Chains, rather than something helpful for the analysis of LLMs in particular.
I am very excited to read your thoughts on the Preferred Decomposition Problem. Do you have thoughts on preferred decompositions of a premise into simulacra? There should likely be a distinction between μ-decomposition and s-decomposition (where, if I'm understanding correctly, refers to the set of premises, not simulacra, which is a bit confusing).
I suspect that, pragmatically, the choice of μ-decomposition should favor those premises that neatly factor into simulacra. And that the different premises in a particular μ-decomposition should share simulacra. You mention something similar in 10.C, but in the context of human experts rather than simulacra.
On a separate note, I think that is confusing notation because:
Thanks for writing this up. I think that you'll see a lot more discussion on smaller posts.
I think my definition of is correct. It's designed to abstract away all the messy implementation details of the ML architecture and ML training process.
Now, you can easily amend the definition to include an infinite context window . In fact, if you let then that's essentially an infinite context window. But it's unclear what optimal inference is supposed to look like when . When the context window is infinite (or very large) the internet corpus consists of a single datapoint.
Thanks, I found the post quite stimulating. Some questions and thoughts:
Is LLM dynamics ergodic? I.e. is the time average equal to , the average page vector?.
One potential issue with this formalisation is that you always assume a prompt of size $k$ (so you need to introduce artificial "null tokens" if the prompt is shorter) and you don't give special treatment to the token <|endoftext|>. For me, it would be more intuitive to consider LLM dynamics in terms of finite, variable length, token-level Markov chains (until <|endoftext|>). While a fixed block size is actually being used during training, the LLM is incentivised to disregard anything before <|endoftext|>. So these two prompts should induce the same distribution:
Document about cats.<|endoftext|>My name is
Document about dogs.<|endoftext|>My name is
Your formalisation doesn't account for this symmetry.
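As a rough sketch of the symmetry I have in mind (working on strings rather than token ids, and with a hypothetical helper name), the effective state would be whatever follows the last <|endoftext|>. A trained model is only incentivised to behave this way, so this is the idealised version:

```python
EOT = "<|endoftext|>"

def effective_context(prompt: str, eot: str = EOT) -> str:
    """Return the part of the prompt after the last end-of-text marker."""
    # rpartition returns ('', '', prompt) when the marker is absent,
    # so a prompt without the marker is kept whole.
    _, _, tail = prompt.rpartition(eot)
    return tail

cats = "Document about cats." + EOT + "My name is"
dogs = "Document about dogs." + EOT + "My name is"
print(effective_context(cats) == effective_context(dogs))
```

Under this idealisation both prompts collapse to the same state, which is the equivalence a fixed-size page-state does not capture.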
Dennett is spelled with "tt".
Note that a softmax-based LLM will always put non-zero probability on every token. So there are no strictly absorbing states. You're careful enough to define absorbing states as "once you enter, you are unlikely to ever leave", but then your toy Waluigi model is implausible. A Waluigi can always switch back to a Luigi.
Recall that the Python primitive "sort" corresponds to a long segment of assembly code in the compiler.
This analogy is a bit off because Python isn't compiled, it's interpreted at runtime. Also, compilers don't output assembly language, they output binary machine code (assembly is what you use to write machine code by hand, basically). So it would be better to talk about C and machine code rather than Python and assembly.
Aside from that I thought that was a very interesting post with some potentially powerful ideas. I'm a little skeptical of how practical this kind of prompt-programming could be though, because every new LLM (and probably every version of an LLM, fine-tuned or RLHF-ed differently) is like a new CPU architecture and would require a whole new "language/compiler" to be written for it. Perhaps these could be adapted in the same way that C has compilers for various CPU architectures, but it would be a lot of work unless it could be automated. Another issue is that the random nature of LLM evaluation means it wouldn't be very reliable unless you set temperature=0, which apparently tends to give weak results.
I think the behavior of LLMs in the long run might not be very interesting. Since the oldest tokens are continually being deleted, information is being lost and eventually it'll get stuck in a mumble loop. And the set of mumble loops seems much smaller and less interesting than the set of answers we could get in the short run.
I originally called this "GPT Dynamics" rather than "LLM Dynamics". However, I think the AI Alignment community should stop using "GPT" as a metanym for LLMs (large language models), to avoid promoting OpenAI relative to Anthropic and Conjecture.
I think you maybe meant "metonym" instead of "metanym".
The autoregressive language model which maps a prompt to a distribution over tokens .
should actually be ; I think you mean "the set of all strings constructed from the alphabet of tokens" and not "the set of all length strings constructed from the alphabet of tokens"?
You used the former meaning earlier for Remark 1:
Let $T$ be the set of possible tokens in our vocabulary. A language model (LLM) is given by a stochastic function $\mu : T^* \to \Delta(T)$ mapping a prompt $(t_1 \dots t_k)$ to a predicted token $t_{k+1}$.
Realised later on, thanks.
I guess in this formalism you'd need to consider the empty string/similar null token a valid token, so the prompt/completion is prefixed/suffixed with empty strings (to pad to the size of the context window).
Otherwise, you'd need to define the domain as a union over the set of all strings with token lengths up to the context window.
Remark 2: "GPT" is ambiguous
We need to establish a clear conceptual distinction between two entities often referred to as "GPT" —
- The autoregressive language model which maps a prompt to a distribution over tokens .
- The dynamic system that emerges from stochastically generating tokens using while also deleting the start token
Don't conflate them! These two entities are distinct and must be treated as such. I've started calling the first entity "Static GPT" and the second entity "Dynamic GPT", but I'm open to alternative naming suggestions. It is crucial to distinguish these two entities clearly in our minds because they differ in two significant ways: capabilities and safety.
- Capabilities:
- Static GPT has limited capabilities since it consists of a single forward pass through a neural network and is only capable of computing functions that are O(1). In contrast, Dynamic GPT is practically Turing-complete, making it capable of computing a vast range of functions.
- Safety:
- If mechanistic interpretability is successful, then it might soon render Static GPT entirely predictable, explainable, controllable, and interpretable. However, this would not automatically extend to Dynamic GPT. This is because Static GPT describes the time evolution of Dynamic GPT, but even simple rules can produce highly complex systems.
- In my opinion, Static GPT is unlikely to possess agency, but Dynamic GPT has a higher likelihood of being agentic. An upcoming article will elaborate further on this point.
This remark is the most critical point in this article. While Static GPT and Dynamic GPT may seem similar, they are entirely different beasts.
To summarise:
Thanks, I've found this pretty insightful. In particular, I hadn't considered that even fully understanding static GPT doesn't necessarily bring you close to understanding dynamic GPT - this makes me update towards mechinterp being slightly less promising than I was thinking.
Quick note:
> a page-state can be entirely specified by 9628 digits or a 31 kB file.
I think it's a 31 kb file, but a 4 kB file?
LLMs are the most complicated entities that humanity has made, they are the compression of the sum total of all human history and knowledge, and they've existed for less than five years.
This seems to be in the same ballpark as a post I made a couple of years ago, World, mind, and learnability: A note on the metaphysical structure of the cosmos, and included in my working paper on GPT-3, GPT-3: Waterloo or Rubicon? Here be Dragons, Working Paper, pp. 23-26.
Those remarks begin something like this: There is no a priori reason to believe that the world has to be learnable. But if it were not, then we wouldn’t exist, nor would (most?) animals. The existing world, thus, is learnable. The human sensorium and motor system are necessarily adapted to that learnable structure, whatever it is.
I am, at least provisionally, calling that learnable structure the metaphysical structure of the world. Moreover, since humans did not arise de novo that metaphysical structure must necessarily extend through the animal kingdom and, who knows, plants as well.
“How”, you might ask, “does this metaphysical structure of the world differ from the world’s physical structure?” I will say, again provisionally, for I am just now making this up, that it is a matter of intension rather than extension. Extensionally the physical and the metaphysical are one and the same. But intensionally, they are different. We think about them in different terms. We ask different things of them. They have different conceptual affordances. The physical world is meaningless; it is simply there. It is in the metaphysical world that we seek meaning.
I then go on to argue that, by virtue of the texts they absorb during training, LLMs come to approximate the metaphysical structure of the world.
So, what are the most interesting or useful simulacra that people have succeeded in eliciting from GPT-4 so far?
I tried writing a prompt designed to elicit a simulacrum of a Keeper out of dath ilan. I gave it a system message with the text of that wiki page, some excerpts from Eliezer's glowfic, and a story about how it was a Keeper that had just died in a plane crash and mysteriously woken up in a chatroom.
It behaved like what I would expect from a low-to-mid-quality fanfic of a Keeper, but it mostly reverted to speaking in the generic LLM-assistant voice and style pretty quickly.
Perhaps it would work better to start with models that have been subject to less RLHF to make them behave like generic "helpful assistant" characters that everyone seems to like so much.
Relevant quote from the research paper: "gpt-4 has a context length of 8,192 tokens. We are also providing limited access to our 32,768–context (about 50 pages of text) version, gpt-4-32k, which will also be updated automatically over time".
I claim that if Dennett's Criterion justifies the realism about physical macro-objects, then it must also justify the realism about simulacra, so long as satisfies analogous structural properties.
simulacra : GPT :: objects : physics
I propose the term anglophysics for the hypothetical field of study whose existence you're implying here.
I only read up to remark 5.B before I got too distracted by the fact that remark 1 does not describe the GPT I interact with.
How did you come to the conclusion that the token deletion rule is to remove 1 token from the front?
The API exposed by OpenAI does not delete any tokens. If you exceed the context window, you receive an error and you are responsible for how to delete tokens to get back within it. (I believe, if I understand correctly, this is dynamic GPT, calculating one token at a time, but only appending to the end of the input tokens until it reaches a stop token or the completion length parameter. Prompt length + max completion length must be <= context length. Due to per token billing, the max completion length is usually much smaller than reaching the context limit, but I could see where the most interesting behavior for your purposes would be with a larger limit.)
The deletion rule I've been working with, langchain.memory.ConversationSummaryBufferMemory, is very different. When the threshold token count is exceeded, it separates the first n chat messages to get below the goal. It then runs GPT with a summarization prompt with those n messages included. The output of that summarization is then prepended to the original history's chat messages starting at n+1. This is far more selective in which history it is throwing away, which can have a large impact on behavior.
Langchain does have simpler rules that just throw away history, but they usually throw away an entire message at a time, not a single token at a time.
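To make the contrast with single-token deletion concrete, here is a minimal sketch of a summarise-then-prepend rule. It is not LangChain's code: `count_tokens`, `prune_with_summary` and `toy_summarize` are hypothetical names, the word-count "tokenizer" is a stand-in, and the summariser is a placeholder for the GPT call with a summarization prompt.

```python
from typing import Callable, List, Tuple

def count_tokens(text: str) -> int:
    # Hypothetical stand-in for a real tokenizer: one token per word.
    return len(text.split())

def prune_with_summary(
    summary: str,
    messages: List[str],
    max_tokens: int,
    summarize: Callable[[str, List[str]], str],
) -> Tuple[str, List[str]]:
    """Fold the oldest messages into the summary until the rest fits the budget."""
    kept = list(messages)
    dropped: List[str] = []
    while kept and sum(count_tokens(m) for m in kept) > max_tokens:
        dropped.append(kept.pop(0))       # whole messages leave, oldest first
    if dropped:
        summary = summarize(summary, dropped)
    return summary, kept

def toy_summarize(summary: str, dropped: List[str]) -> str:
    # Placeholder for an LLM summarization call: keep three words per message.
    parts = [summary] if summary else []
    parts += [" ".join(m.split()[:3]) for m in dropped]
    return " / ".join(parts)

history = [
    "user: tell me about cats please",
    "bot: cats are small felines",
    "user: and dogs",
]
summary, kept = prune_with_summary("", history, max_tokens=8, summarize=toy_summarize)
print(summary)
print(kept)
```

The point of the sketch is that the deleted history is compressed rather than discarded, and that deletion happens a message at a time, so the resulting dynamics differ from the one-token-off-the-front rule.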
Why are you ignoring prompts much smaller than the context window? This appears to be the vast majority of prompts, because given the way the API works you need to leave room for the reply, and have some way to handle continuation if the reply hits the limit before it hits the stop token. The tokens past the stop token in the context window never seem to matter, though I have not investigated how they do that, i.e. do they force them all to zero or whatever.
These two entities are distinct and must be treated as such. I've started calling the first entity "Static GPT" and the second entity "Dynamic GPT", but I'm open to alternative naming suggestions.
After a bit of fiddling, GPT suggests "GPT Oracle" and "GPT Pandora".
Thanks for this.
Is anyone working on understanding LLM Dynamics or something adjacent? Is there early work that I should read? Are there any relevant people whose work I should follow?
GPT4's tentative summary:
Section 1: AI Safety-focused Summary
This article discusses the nature of large language models (LLMs) like GPT-3 and GPT-4, their capabilities, and their implications for AI alignment and safety. The author proposes that LLMs can be considered semiotic computers, with GPT-4 having a memory capacity similar to a Commodore 64. They argue that prompt engineering for LLMs is analogous to early programming, and as LLMs become more advanced, high-level prompting languages may emerge. The article also introduces the concept of simulacra realism, which posits that objects simulated on LLMs are real in the same sense as macroscopic physical objects. Lastly, it suggests adopting epistemic pluralism in studying LLMs, using multiple epistemic schemes that have proven valuable in understanding reality.
Section 2: Underlying Arguments and Illustrations
- LLMs as semiotic computers: The author compares GPT-4's memory capacity to a Commodore 64, suggesting that it functions as a Von Neumann architecture computer with a transition function (μ) acting as the CPU and the context window as memory.
- Prompt engineering: Prompt engineering for LLMs is similar to early programming with limited memory. As context windows expand, high-level prompting languages like EigenFlux may emerge, with the LLM acting on the prompt.
- Simulacra realism: The author argues that objects simulated on LLMs are real based on Dennett's Criterion, which states that the existence of a pattern depends on the usefulness of theories that admit it in their ontology. The author claims that if this criterion justifies realism about physical macro-objects, it must also justify realism about simulacra.
- Meta-LLMology and epistemic pluralism: The author proposes that since LLMs are a low-dimensional microcosm of reality, our epistemology of LLMs should be a microcosm of our epistemology of reality. This implies using multiple epistemic schemes to study LLMs, with each scheme providing valuable insights.
Section 3: Strengths and Weaknesses
Strengths:
- The analogy between LLMs and early computers highlights the potential for the development of high-level prompting languages and the challenges of prompt engineering.
- The concept of simulacra realism provides an interesting perspective on the nature of objects simulated by LLMs and their relation to reality.
- The call for epistemic pluralism emphasizes the need for diverse approaches to understand and study LLMs, which may lead to novel insights and solutions for AI alignment and safety.
Weaknesses:
- The comparison between LLMs and early computers may oversimplify the complexity and capabilities of LLMs.
- Simulacra realism, while thought-provoking, may not be universally accepted, and its implications for AI alignment and safety may be overstated.
- Epistemic pluralism, though useful, may not always provide clear guidance on which epistemic schemes to prioritize in the study of LLMs.
Section 4: Links to AI Alignment
- The analogy between LLMs and early computers can inform AI alignment research by providing insights into how to design high-level prompting languages that enable better control of LLM behaviors, which is crucial for alignment.
- The concept of simulacra realism suggests that understanding the underlying structure and properties of μ is essential for AI alignment, as it helps determine the behavior of LLMs.
- The proposal of epistemic pluralism in studying LLMs can contribute to AI alignment by encouraging researchers to explore diverse approaches, potentially leading to novel solutions and insights into AI safety challenges.
Status: Highly-compressed insights about LLMs. Includes exercises. Remark 3 and Remark 15 are the most important and entirely self-contained.
Remark 1: Token deletion
Let $T$ be the set of possible tokens in our vocabulary. A language model (LLM) is given by a stochastic function $\mu : T^* \to \Delta(T)$ mapping a prompt $(t_1 \dots t_k)$ to a predicted token $t_{k+1}$.
By iteratively appending the continuation to the prompt, the language model $\mu$ induces a stochastic function $\bar{\mu} : T^* \to \Delta(T^*)$ mapping a prompt $(t_1 \dots t_k)$ to $(t_1 \dots t_{k+1})$.
Exercise: Does GPT implement the function $\bar{\mu}$?
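Answer: No, GPT does not implement the function $\bar{\mu}$. This is because at each step, GPT does two things:
- It generates a new token and appends it to the end of the prompt.
- It deletes the token at the start of the prompt.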
This deletion step is a consequence of the finite context length.
It is easy for GPT-whisperers to focus entirely on the generation of tokens and forget about the deletion of tokens. This is because each GPT-variant will generate tokens in its own unique way, but they will all delete tokens in exactly the same way, so people think of deletion as tedious garbage collection that can be ignored. They assume that token deletion is an implementation detail that can be abstracted away, and that it's better to imagine that GPT is extending the number of tokens in the prompt indefinitely.
However, this is a mistake. Both these modifications to the prompt are important for understanding the behaviour of GPT. If GPT generated new tokens but did not delete old tokens then it would differ from actual GPT; likewise if GPT deleted old tokens but did not generate new tokens. Therefore if you consider only the generation-step, and ignore the deletion-step, you will draw incorrect conclusions about the behaviour of GPT and its variants.
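The combined generation-and-deletion step described in this remark can be sketched as follows. This is a minimal illustration under stated assumptions: `step` and `toy_mu` are hypothetical names, the toy model is a fair coin over two tokens, and real deployments may handle an overfull context differently.

```python
import random
from typing import Callable, Dict, Tuple

Token = str
Page = Tuple[Token, ...]

def step(page: Page, mu: Callable[[Page], Dict[Token, float]],
         k: int, rng: random.Random) -> Page:
    """One step of the dynamics: generate a token, then trim to the last k tokens."""
    dist = mu(page)
    tokens = list(dist)
    weights = [dist[t] for t in tokens]
    new_token = rng.choices(tokens, weights=weights, k=1)[0]
    extended = page + (new_token,)   # generation: append at the end
    return extended[-k:]             # deletion: drop from the front once over k tokens

def toy_mu(page: Page) -> Dict[Token, float]:
    # Hypothetical stand-in for a language model: a fair coin over two tokens.
    return {"0": 0.5, "1": 0.5}

rng = random.Random(0)
page = ("a", "b", "c")
next_page = step(page, toy_mu, k=3, rng=rng)
print(next_page)
```

With a full window the first token is lost on every step, while a window shorter than `k` simply grows, which is the case the generation-only picture describes.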
The context window is your garden — not only must you sow the seeds of new ideas, but prune the overgrowth of outdated tokens.
Remark 2: "GPT" is ambiguous
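We need to establish a clear conceptual distinction between two entities often referred to as "GPT":
- The autoregressive language model $\mu$ which maps a prompt to a distribution over tokens.
- The dynamic system that emerges from stochastically generating tokens using $\mu$ while also deleting the start token.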
Don't conflate them! These two entities are distinct and must be treated as such. I've started calling the first entity "Static GPT" and the second entity "Dynamic GPT", but I'm open to alternative naming suggestions. It is crucial to distinguish these two entities clearly in our minds because they differ in two significant ways: capabilities and safety.
This remark is the most critical point in this article. While Static GPT and Dynamic GPT may seem similar, they are entirely different beasts.
Remark 3: Motivating LLM Dynamics
The AI alignment community has disproportionately focused on Static GPT compared to Dynamic GPT. Although existing LLM interpretability research has been valuable, it has concentrated primarily on analysing the static structure of the neural network, rather than the study of the dynamic behaviour of the system.
This is a mistake, because it is Dynamic GPT which is actually interacting with humans in the real world, and there are (weakly) emergent properties of Dynamic GPT that are relevant for safety.
While it's true that the behaviour of Dynamic GPT supervenes upon the behaviour of Static GPT, it's not necessarily the case that studying the latter is the best way to predict, explain, control, or interpret the former. To draw an analogy, the behaviour of Stockfish supervenes upon the behaviour of the CPU, but studying the CPU isn't necessarily the best way to understand the behaviour of Stockfish.
Remark 4: LLM Dynamics (the basics)
4.A. GPT is a finite-state Markov chain
We can think of GPT as a dynamic system with a state-space and a transition function.
Due to the finite context window, the system is a finite-state time-homogeneous Markov chain. The state-space is $T^k$, i.e., the set of all possible strings of $k$ tokens. In GPT-3, the context window has a fixed size of $k = 2048$ and there are $|T| = 50257$ tokens.
We refer to the elements of $T^k$ as page-states.
A page-state is a particular string of $k$ tokens. There are roughly $10^{9628}$ possible page-states since $50257^{2048} \approx 1.12 \times 10^{9628}$. This number may seem large, but it's actually quite small: a page-state can be entirely specified by about 9628 decimal digits, or a 31 kb (roughly 4 kB) file.
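The arithmetic behind these figures can be checked directly. The sketch below assumes the GPT-3 numbers quoted in the text (50257 tokens, context length 2048); the function name is hypothetical. By this calculation a single page-state needs about 32 kilobits, i.e. around 4 kilobytes.

```python
import math

def page_state_stats(vocab_size: int, context_length: int):
    """Size of the state space T^k and the storage needed for one page-state."""
    log10_count = context_length * math.log10(vocab_size)
    exponent = math.floor(log10_count)
    mantissa = 10 ** (log10_count - exponent)
    bits = math.ceil(context_length * math.log2(vocab_size))
    return exponent, mantissa, bits, math.ceil(bits / 8)

exponent, mantissa, bits, n_bytes = page_state_stats(50257, 2048)
print(f"about {mantissa:.2f}e{exponent} page-states")
print(f"{bits} bits, i.e. roughly {n_bytes} bytes per page-state")
```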
The language model $\mu : T^k \to \Delta(T)$ induces a stochastic function $T^k \to \Delta(T^k)$ given by $(t_1 \dots t_k) \mapsto (t_2 \dots t_k, G)$, where $G$ is the discrete random variable over $T$ with distribution determined by $\mu(\cdot \mid t_1, \dots, t_k)$.
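Notice how this transition makes two changes to the page-state:
- The first token $t_1$ is deleted from the start.
- The generated token $G$ is appended to the end.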
4.B. GPT is a sparse network between page-states
We can represent GPT as a sparse directed weighted network with the following features:
In other words, for every page-state $x \in T^k$, there are $|T|$ different edges with source-node $x$. Each edge is labelled with a different token $t \in T$. The edge labelled with token $t$ has target-node $y$, where $y = (x_2, \dots, x_k, t)$ is a page-state, and the edge is also labelled with probability $p = \mu(t \mid x_1 \dots x_k)$.
We can view Dynamic GPT as a random walk along this network.
4.C. GPT is a sparse matrix indexed by page-states
Let P be the transition matrix for this Markov chain, or equivalently the adjacency matrix of the sparse network.
Specifically, $P$ is a particular $50257^{2048}$-by-$50257^{2048}$ stochastic matrix. This is a massive network. Because $P$ encodes the same information as GPT, you can think of GPT abstractly as this particular matrix $P$. However, actually storing the matrix would be obviously intractable, because $P$ is a look-up table recording GPT's output on every page-state.
If $x, y \in T^k$ are page-states, where $x = (x_1 \dots x_k)$ and $y = (y_1 \dots y_k)$, then $P_{x,y} \in [0,1]$ has the following value:
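Reconstructed from the transition rule in 4.A (an assumption, since the displayed formula is missing at this point, and using indices $1 \dots k$ throughout), the entry would be:

```latex
P_{x,y} =
\begin{cases}
\mu(y_k \mid x_1 \dots x_k) & \text{if } y_i = x_{i+1} \text{ for all } 1 \le i \le k-1, \\
0 & \text{otherwise.}
\end{cases}
```

The first case says that $y$ is $x$ shifted left by one token with $y_k$ newly generated; every other pair of pages is unreachable in a single step.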
Note that $P$ is a massively sparse matrix: for two randomly selected page-states $x, y$, there is an insignificant likelihood that $P_{x,y} \neq 0$, because the $k-1$ tokens in the middle of the pages are unlikely to align correctly. In fact, the proportion of non-zero elements in the matrix $P$ is $(1/|T|)^{k-1}$, or about $4.5 \times 10^{-9624}$ for GPT-3.
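The sparsity claim is easy to verify on a chain small enough to write down. The sketch below assumes a toy vocabulary of two tokens, a context window of 3, and a hypothetical `mu` that repeats the last token with probability 0.9; none of these come from GPT.

```python
from itertools import product

TOKENS = ("0", "1")   # toy vocabulary (assumption)
K = 3                 # toy context window (assumption)

def mu(page):
    # Hypothetical toy model: repeat the last token with probability 0.9.
    last = page[-1]
    return {t: (0.9 if t == last else 0.1) for t in TOKENS}

def build_transition_matrix(tokens, k, model):
    """Build P over all page-states: shift left by one, append the new token."""
    pages = list(product(tokens, repeat=k))
    index = {p: i for i, p in enumerate(pages)}
    n = len(pages)
    P = [[0.0] * n for _ in range(n)]
    for page in pages:
        for t, prob in model(page).items():
            P[index[page]][index[page[1:] + (t,)]] = prob
    return pages, P

pages, P = build_transition_matrix(TOKENS, K, mu)
nonzero = sum(1 for row in P for v in row if v != 0.0)
fraction = nonzero / len(P) ** 2
print(fraction, (1 / len(TOKENS)) ** (K - 1))
```

Each of the 8 rows has exactly 2 non-zero entries, so the non-zero fraction matches $(1/|T|)^{k-1}$ for this toy case.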
4.D. Matrix powers as multi-step continuation
Recall that $P$ is the matrix such that $P_{x,y}$ is the probability that the page-state $x$ will transition to page-state $y$ in a single step.
Similarly, $P^2$ is the matrix such that $(P^2)_{x,y}$ is the probability that the page-state $x$ will transition to page-state $y$ in two steps, without considering the intermediary step.
It's important to note that $(P^2)_{x,y} \neq (P_{x,y})^2$, because $(P^2)_{x,y}$ refers to the $(x,y)$-entry of the matrix power $P^2$. In general, this differs from the square of the $(x,y)$-entry of the matrix $P$.
Due to the tokens in the middle of the page (which are neither generated nor deleted), the intermediary step of this two-step transition is uniquely determined by the first and third page-states.
Similarly, $P^3$ is the matrix such that $(P^3)_{x,y}$ is the probability that the page-state $x$ will transition to page-state $y$ in three steps, without taking into account the two intermediary steps. Once again, the tokens in the middle page-state ensure that both intermediary steps of this three-step transition are uniquely determined by the first and fourth page-states.
The fact that matrix powers give multi-step probabilities is a general property of finite-state time-homogeneous Markov chains. The matrix $P^n$ is the $n$-step transition matrix.
This pattern persists until the matrix $P^k$. The matrix $P^k$ is the matrix such that $(P^k)_{x,y}$ is the probability that the page-state $x$ will transition to page-state $y$ in $k$ steps.
That is, $P^k$ transitions from one page to another which naturally follows on from it. The first token of page $y$ will naturally continue from the last token of page $x$. For instance, $P^k$ will take one page of Haskell code to another page of Haskell code, such that the concatenation of both pages forms a continuous and coherent Haskell script. The succeeding page begins where the previous one ended, and they generally share no common tokens.
Now, it is often more useful to consider $P^k$ instead of $P$, because $P^k$ serves as the transition matrix for a more intuitive dynamic system, namely the dynamic process consisting of $k$ steps of GPT generation.
The conceptual advantage of $P^k$ is that it contains no redundant information. All the elements of $P^k$ are non-zero.
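Both claims in this subsection (that the matrix power differs from the entrywise power, and that the k-th power has no zero entries) can be checked on a toy chain. The sketch assumes two tokens, a context window of 3, and a hypothetical `mu` that assigns non-zero probability to every token, as a softmax would.

```python
from itertools import product

TOKENS = ("0", "1")   # toy vocabulary (assumption)
K = 3                 # toy context window (assumption)

def mu(page):
    # Hypothetical toy model: repeat the last token with probability 0.9.
    last = page[-1]
    return {t: (0.9 if t == last else 0.1) for t in TOKENS}

PAGES = list(product(TOKENS, repeat=K))
INDEX = {p: i for i, p in enumerate(PAGES)}

def one_step_matrix():
    n = len(PAGES)
    P = [[0.0] * n for _ in range(n)]
    for page in PAGES:
        for t, prob in mu(page).items():
            P[INDEX[page]][INDEX[page[1:] + (t,)]] = prob
    return P

def matmul(A, B):
    n = len(A)
    return [[sum(A[i][m] * B[m][j] for m in range(n)) for j in range(n)]
            for i in range(n)]

def matrix_power(P, n):
    result = P
    for _ in range(n - 1):
        result = matmul(result, P)
    return result

P = one_step_matrix()
P2 = matrix_power(P, 2)
PK = matrix_power(P, K)

x, y = INDEX[("0", "0", "0")], INDEX[("0", "1", "1")]
print(P[x][y], P2[x][y])                        # one-step vs two-step probability
print(all(v > 0 for row in PK for v in row))    # is every entry of P^K positive?
```

Here "000" cannot reach "011" in one step, so the entrywise square is zero, while the two-step path through "001" has positive probability. After K steps every page is reachable from every other page along exactly one path.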
In the diagram above, I've shown what this would look like for a context window k=3. After k steps, all the original tokens have been deleted, and all the remaining tokens have been generated.
(In a forthcoming article, we will link the roots of the matrix P to chain-of-thought prompting.)
4.E. Initial distributions
Suppose we randomly sampled an initial page-state $x \in T^k$ and evolved the result using Dynamic GPT. This generates a sequence of page-valued random variables $(X(0), X(1), X(2), \dots)$.
Each page-valued random variable $X(t)$ can be represented by a vector $\pi(t) \in \mathbb{R}^{T^k}$ with non-negative elements and such that $\lVert \pi(t) \rVert_1 = 1$.
Theorem: $\pi(t) = \pi(0) P^t$. This general property of Markov chains is known as the Chapman-Kolmogorov Equation.
Theorem: If $\lim_{i \to \infty} P^i$ exists, then a unique matrix $P^\infty$ exists, and $\lim_{t \to \infty} \pi(t) = \pi(0) P^\infty$.
Definition: A distribution $\pi$ is called stationary if $\pi = \pi P$. Observe that if $\pi(t)$ is stationary, then $\pi(t') = \pi(t)$ for all $t' \geq t$. For every finite-state Markov chain, we are guaranteed that at least one stationary distribution exists.
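For a chain small enough to enumerate, a stationary distribution can be approximated by repeatedly applying the update $\pi(t+1) = \pi(t)P$. The sketch below assumes a toy setting (two tokens, context window 2, a hypothetical `mu` that repeats the last token with probability 0.9) and applies the update without materialising $P$.

```python
from itertools import product

TOKENS = ("0", "1")   # toy vocabulary (assumption)
K = 2                 # toy context window (assumption)

def mu(page):
    # Hypothetical toy model: repeat the last token with probability 0.9.
    last = page[-1]
    return {t: (0.9 if t == last else 0.1) for t in TOKENS}

PAGES = list(product(TOKENS, repeat=K))

def evolve(pi):
    """One application of pi -> pi P, using the shift-and-append transition."""
    nxt = {p: 0.0 for p in PAGES}
    for page, mass in pi.items():
        for t, prob in mu(page).items():
            nxt[page[1:] + (t,)] += mass * prob
    return nxt

def approximate_stationary(steps=200):
    pi = {p: 1.0 / len(PAGES) for p in PAGES}   # start from the uniform distribution
    for _ in range(steps):
        pi = evolve(pi)
    return pi

pi = approximate_stationary()
after = evolve(pi)
residual = max(abs(pi[p] - after[p]) for p in PAGES)
print(pi)
print(residual)
```

For this toy model the mass concentrates on the repetitive pages "00" and "11", which is a small-scale picture of why repetitive text is a natural long-run attractor. Nothing here says the iteration converges for every chain; periodic chains, for example, need not converge.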
Remark 5: Mode collapse
5.A. Mode collapse in Dynamic GPT
"Mode collapse" is a somewhat ambiguous term referring to when Dynamic GPT becomes trapped in generating monotonous text that lacks interesting structure. By viewing GPT as a finite-state Markov chain, it's clear why mode collapse is an inevitable phenomenon.
We now present a non-exhaustive classification of mode collapse.
5.B. Absorbing states
Consider the prompt "000…000", i.e. a string of 2048 zeros. Let's abbreviate this string as $0^k$.
Almost the entire probability mass of $\mu(\cdot \mid 0^k)$ will lie on the token "0". If the temperature were set to zero, then GPT would return the most likely token prediction, which would be "0".
Note that the generation step alone would not make $0^k$ an absorbing state. It is the combination of generation and deletion. Because of the deletion of the start token "0", the resulting prompt is identical to the previous prompt.
Suppose you continued to generate text using this prompt. You might hope that after exactly the 5000th copy of "0", GPT would return a different token, such as "congratulations for waiting 5000 tokens!"
However, your attempt would be in vain. You can see that GPT has returned 5000 copies of "0", but GPT itself only sees 2048 copies of "0". When you generate another token, it looks to you like a dynamic linear change from 5000 "0"s to 5001 "0"s. But from GPT's perspective, there is no change at all. It has gone from 2048 "0"s to 2048 "0"s. Its environment has frozen permanently. Physics has ground to a halt.
This is the simplest kind of mode collapse.
Definition: An absorbing state is a page-state $x \in T^k$ for which $P_{x,x} \approx 1$.
Lemma: If $P_{x,x} \approx 1$ then $(P^n)_{x,x} \approx 1$ for all $n$. That is, if you reach an absorbing state, you are unlikely to ever leave.
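The generate-then-delete step can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions: `greedy_toy_model` is a hypothetical stand-in for a temperature-zero model (it is not GPT), and the context length `K = 8` is chosen only to keep the example small.

```python
from collections import Counter

K = 8  # tiny context window for illustration (the text uses 2048)

def greedy_toy_model(page):
    """Hypothetical temperature-zero model: emit the most common token on the page."""
    return Counter(page).most_common(1)[0][0]

def step(page, model):
    """One Dynamic GPT step: generate a token, append it, delete the first token."""
    return page[1:] + (model(page),)

zeros = ("0",) * K
page = zeros
for _ in range(5000):
    page = step(page, greedy_toy_model)

print(page == zeros)  # the page after 5000 steps is compared with the starting page
```

From the outside, 5000 tokens were generated. From the point of view of `step`, the page never changed, which is what makes the all-zeros page absorbing in this toy setting.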
5.C. Periodic orbits
Note that this is not the only type of mode collapse. There are types of mode collapse that don't correspond to an absorbing state.
For example, consider a page consisting of "the dog and the dog and the dog and the dog and the dog and [etc]".
GPT, when given this prompt, will get stuck in a continuous loop. However, GPT never enters an absorbing state. In fact, the system is transiting periodically between three different states.
Now, x1 almost always transitions to x2, and x2 almost always transitions to x3 and x3 almost always transitions back to x1. As a consequence, Dynamic GPT continually cycles through these three page-states. Although the resulting output is somewhat more engaging than an absorbing state, it remains dull due to its repetitive, small-period orbit.
This is another form of mode collapse.
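Definition: A periodic orbit is a short sequence of pages $x_1, \ldots, x_r$ such that $P_{x_i,x_j} \approx 1$ whenever $j \equiv i+1 \pmod r$.
Lemma: If $(x_1, \ldots, x_r)$ is an $r$-periodic orbit, then $(P^n)_{x_i,x_j} \approx 1$ whenever $j \equiv i+n \pmod r$, and $(P^n)_{x_i,x_j} \approx 0$ otherwise. That is, if you reach a periodic orbit, you are unlikely to ever leave.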
Lemma: If x is an absorbing state, then (x…x) is a periodic orbit for every period r.
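Here is a small sketch of a three-page orbit. The model `copy_back_3` is a hypothetical toy that repeats the token three positions back, standing in for GPT's behaviour on the "the dog and" page, and the six-token page is an assumption made for brevity.

```python
def copy_back_3(page):
    """Hypothetical toy model: repeat the token seen three positions back."""
    return page[-3]

def step(page, model):
    """One Dynamic GPT step: generate a token, append it, delete the first token."""
    return page[1:] + (model(page),)

x1 = ("the", "dog", "and") * 2
x2 = step(x1, copy_back_3)  # page now starts with "dog"
x3 = step(x2, copy_back_3)  # page now starts with "and"
x4 = step(x3, copy_back_3)  # back to a page that starts with "the"

print(x4 == x1, len({x1, x2, x3}))
```

No single page maps to itself, yet the system revisits the same three pages forever.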
5.D. Absorbing sets
Note that not all mode collapse looks periodic.
For example, consider the page "0110001010100100110111010101011101 [etc]", i.e. a random sequence of length 2048, drawn from "0" and "1".
When GPT reads this page, it has about 50% likelihood of generating "0" and about 50% likelihood of generating "1". After adding the new token and deleting the start token, we return to a page which is for all practical purposes the same as the previous page. The new page will also resemble a random string composed of "0" and "1", so this process will continue indefinitely.
So once you enter the realm of "pseudorandom bitstrings", then it's unlikely that you leave. And because this realm is uninteresting, we will also consider pseudorandom bitstrings to be an example of mode-collapse.
Definition: An absorbing class is a set of pages $X$ such that if $x \in X$ and $y \notin X$ then $P_{x,y} \approx 0$. In other words, an absorbing class is a cluster of pages such that, once you enter the cluster, you are unlikely to ever leave.
Lemma: If $X$ is an absorbing class, and if $x \in X$ and $y \notin X$ then $(P^n)_{x,y} \approx 0$ for all $n \in \mathbb{N}$.
Lemma: If (x1…xr) is a periodic orbit with a period r, then the set {x1,…xr} is an absorbing set. Also if x is an absorbing state then {x} is an absorbing set. Also the entire state-space Tk is trivially an absorbing set.
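The "for all n" in the lemma for absorbing classes is best read as "for all n that are not too large", because small leaks accumulate. As a sketch, assume (this quantification is my assumption, not part of the definition above) that from every page in $X$ the total one-step probability of leaving $X$ is at most $\varepsilon$. Then the Markov property gives

$$\sum_{y \notin X} (P^n)_{x,y} \;\le\; \Pr[\text{leave } X \text{ within } n \text{ steps} \mid \text{start at } x] \;\le\; 1 - (1-\varepsilon)^n \;\le\; n\varepsilon .$$

So the approximation $(P^n)_{x,y} \approx 0$ is good as long as $n \ll 1/\varepsilon$.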
5.E. Mode-collapse is subjective
Whether you call a particular absorbing class an example of mode-collapse is subjective, as it depends on what types of transitions you find uninteresting.
Is "12345678 [...]" sufficiently boring? If so, then that absorbing set is mode collapse.
Is "2 4 8 16 32 64 [...]" sufficiently boring? If so, then that absorbing set is mode collapse.
(If you had an objective notion of "interesting", perhaps appealing to time- or description- complexity, then you could construct an objective notion of mode-collapse.)
Remark 6: Waluigi absorbing sets
6.A. Waluigis are not mode-collapse
In The Waluigi Effect (mega-post), I conjectured that the misaligned rebellious simulacra would constitute absorbing sets in Dynamic GPT. I should clarify that they do not constitute the only absorbing sets, nor are they the most likely absorbing sets. For instance, mode-collapse of "the dog and ..." would be an absorbing set which isn't waluigi.
I don't classify the waluigi simulacra as mode-collapse because the absorbing set is too complex. I reserve the term "mode-collapse" specifically for situations where GPT produces boring output.
However, this is a subjective judgment, and others may argue that the Waluigi absorbing class is sufficiently boring to be classified as mode-collapse. Furthermore, it is possible that waluigi absorbing sets contain absorbing subsets which are sufficiently boring to count as mode-collapse.
6.B. A waluigi toy model
I will now present a toy model describing waluigis.
Suppose there are only three tokens in the vocabulary: the friendly token "F", the unfriendly token "U", and the null-token "N" that pads short prompts. So the state-space is $T^k$ where $T = \{F, U, N\}$ and $k$ is the context length.
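In this toy model, regardless of the initial distribution π(0), the long-run distribution π(t) converges to the point mass on (U…U) as t→∞. This is because that point mass is the unique stationary distribution of P, i.e. the unique left-eigenvector of P with eigenvalue 1.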
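The convergence claim can be checked numerically on a tiny instance. The transition rule below is an assumption made for illustration, not necessarily the rule intended in the toy model: a page containing any "U" always emits "U", and any other page emits "U" with a small probability `EPS` and "F" otherwise. The values `K = 3` and `EPS = 0.1` are likewise hypothetical.

```python
from collections import defaultdict

K = 3      # context length (tiny, for illustration)
EPS = 0.1  # assumed chance that a U-free page emits "U"

def mu(page):
    """Assumed next-token distribution for the toy model."""
    if "U" in page:
        return {"U": 1.0}
    return {"F": 1.0 - EPS, "U": EPS}

def step(dist):
    """Push a distribution over pages through one generate-and-delete step."""
    new = defaultdict(float)
    for page, p in dist.items():
        for token, q in mu(page).items():
            new[page[1:] + (token,)] += p * q
    return dict(new)

dist = {("N",) * K: 1.0}  # start from an empty, padded page
for _ in range(200):
    dist = step(dist)

all_u_mass = dist.get(("U",) * K, 0.0)
print(all_u_mass)  # probability that the page is (U, U, U) after 200 steps
```

Under this assumed rule, the all-U page maps to itself with probability 1, and every other page eventually leaks into it.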
6.C. Exercises (optional)
In other words, if the current page is just Fs, what is the probability that the system will transition to a page of just Fs?
6.D. Effect of the context window on waluigis
In general, a longer context window k will make absorbing states "stickier".
The basic intuition is this: if the context window is short, then the total Bayesian evidence from the prompt is small, so the model retains more uncertainty over possible simulacra. This results in a flatter distribution μ(⋅|x), ensuring Px,y≉0 for any y compatible with x. However, with a longer context window, the Bayesian evidence increases, and the model becomes more confident about its predictions. Consequently, the distribution becomes more peaked, and Px,y≈0 may occur even when y is compatible with x.
6.E. Two probability spaces
Ensure we don't get confused — remember we are talking about two distinct probability spaces.
Under certain assumptions, if an LLM is situationally aware and well-calibrated, then these probability distributions will align.
Remark 7: Semiotic physics
7.A. High-level ontologies
To understand something big, we need to decompose it into smaller, more elementary building blocks. The best way to decompose Static GPT is with Chris Olah's circuits. But what would be the best way to decompose Dynamic GPT?
Well, we're in luck — our physical universe is also given by a stochastic dynamic process, and yet we can describe the physical universe (both the states and the dynamics) with a higher-level emergent ontology. This should be encouraging.
This suggests a particular strategy for LLM Dynamics —
I predict that the result of this process yields something like LLM Simulator Theory.
7.B. Inter-ontological bridging principles
Physical reality, at the fundamental level, is a vector in Hilbert space repeatedly multiplied by a unitary matrix. Nonetheless, the unitary matrix satisfies certain structural properties allowing a description in a higher-level ontology. That higher-level ontology includes Everettian branching, superposition, spatial distance, spatial dimensions, particles, local interactions, macroscopic objects, and agents[2]. Additionally, there are bridging principles that connect the low-level ontology to the high-level ontology.
The physical bridge principles will typically be non-generic, in the sense that most operators U will not permit any higher-level ontology, yet the particular operator U that acts on our universe does permit a higher-level ontology.
In a similar way, I expect that the LLM bridge principles will also be non-generic, in the sense that a randomly initialised transformer will not permit any higher-level ontology, yet the trained transformer μ does allow a higher-level ontology.
In David Wallace's Emergent Multiverse, or Sean Carroll's Mad-Dog Everettianism, they attempt to extract a higher-level physical ontology from Schrödinger's Equation. My goal is to use similar techniques to extract a higher-level semiotic ontology from μ.
7.C. Epistemological standards for bridging principles
The bridging principles serve to bridge two different ontologies —
There is no way to establish the bridging principles formally because there is no pre-existing definition of simulacra. The only way to establish the bridging principles is to check that the predictions from LLM Simulation Theory, when converted via this bridge, match the predictions from LLM Dynamic Theory, and vice-versa.
Remark 8: LLM Simulator Theory
8.A. Formalisation
LLM Simulator Theory states the following:
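$$\mu(t_{k+1} \mid t_1, \ldots, t_k) = \sum_{s \in S} P(t_{k+1} \mid (t_1 \ldots t_k), s)\, P(s \mid (t_1 \ldots t_k))$$
$$P(t_{k+1} \mid (t_1 \ldots t_k), s) = \mu_s(t_{k+1} \mid t_1 \ldots t_k)$$
$$P(s \mid (t_1 \ldots t_k)) = \frac{P((t_1 \ldots t_k) \mid s) \times P(s)}{P((t_1 \ldots t_k))}$$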
We can unify Equations 1–4 with the following Equation 5 —
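$$\mu(t_{k+1} \mid (t_1, \ldots, t_k)) = \frac{1}{P((t_1 \ldots t_k))} \sum_{s \in S} P(s)\, \mu_s(t_1 \ldots t_k)\, \mu_s(t_{k+1} \mid (t_1 \ldots t_k))$$
In other words, $\mu(\cdot \mid (t_1, \ldots, t_k))$ is a linear interpolation between the distributions $\{\mu_s(\cdot \mid (t_1, \ldots, t_k))\}_{s \in S}$.
We call the coefficients in the interpolation amplitudes, i.e. the terms $\frac{P(s)\, \mu_s(t_1 \ldots t_k)}{P(t_1 \ldots t_k)}$.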
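To make the interpolation concrete, here is a minimal Python sketch of Equation 5 with three hypothetical premises. The premises (i.i.d. coins), the uniform prior, and all names (`PREMISES`, `PRIOR`, `mu_s`, `amplitudes`, `mu`) are assumptions made for illustration, and the sketch ignores the finite context window.

```python
import math

# Hypothetical premises: i.i.d. coins, each identified by its probability of "H".
PREMISES = {"fair": 0.5, "heads_biased": 0.9, "tails_biased": 0.1}
PRIOR = {"fair": 1 / 3, "heads_biased": 1 / 3, "tails_biased": 1 / 3}

def mu_s(p_heads, tokens):
    """Probability that one premise assigns to a token sequence."""
    return math.prod(p_heads if t == "H" else 1.0 - p_heads for t in tokens)

def amplitudes(history):
    """P(s) * mu_s(history) / P(history), for each premise s."""
    weights = {s: PRIOR[s] * mu_s(p, history) for s, p in PREMISES.items()}
    total = sum(weights.values())  # this is P(history)
    return {s: w / total for s, w in weights.items()}

def mu(next_token, history):
    """Equation 5: amplitude-weighted interpolation of the premises' predictions."""
    amps = amplitudes(history)
    return sum(amps[s] * mu_s(PREMISES[s], [next_token]) for s in PREMISES)

amps = amplitudes("HHHH")
p_heads_next = mu("H", "HHHH")
print(amps, p_heads_next)
```

After four heads, most of the amplitude sits on the heads-biased premise, so the interpolated prediction for the next token leans towards "H".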
8.B. A brief note on terminology
I'm open to alternative naming suggestions, but for the purposes of this article —
NB: Previously, I called the elements of S "simulacra", but that terminology is incorrect. The elements of S correspond to simulated stochastic universes, whereas the simulacra correspond to simulated stochastic objects inhabiting a simulated stochastic universe. Now, the simulated universes are themselves simulated objects (just as the physical universe itself is a physical object) but not all simulated objects are simulated universes (just as not all physical objects are physical universes). Hence the distinction between simulacra (simulated stochastic objects) and premises (simulated stochastic universes).
Remark 9: Amplitudes are approximately martingale
If μ satisfies Equation 5 and the context window is infinite, then the expected change in the amplitudes will be exactly zero, due to the Conservation of Expectation. This means that when the context window is infinite (k=∞), the amplitudes of the premises in the superposition are martingales.
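The infinite-context claim can be checked in one line. Write $\alpha_s(x)$ for the amplitude of premise $s$ after the prompt $x = (t_1 \ldots t_k)$ (the symbol $\alpha_s(x)$ is shorthand introduced here). By Bayes' rule, appending a token $t$ updates the amplitude to $\alpha_s(xt) = \alpha_s(x)\, \mu_s(t \mid x) / \mu(t \mid x)$. Taking the expectation over $t \sim \mu(\cdot \mid x)$:

$$\mathbb{E}_{t \sim \mu(\cdot \mid x)}\big[\alpha_s(xt)\big] = \sum_{t \in T} \mu(t \mid x)\, \frac{\alpha_s(x)\, \mu_s(t \mid x)}{\mu(t \mid x)} = \alpha_s(x) \sum_{t \in T} \mu_s(t \mid x) = \alpha_s(x).$$

The derivation relies on nothing being deleted from $x$. With a finite window the first token is dropped at each step, which is where the equality can fail.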
On the other hand, if the context window is finite, then the expected change in amplitudes of the superposition is approximately zero. However, for some prompts, the expected change can be non-zero.
Take, for example, the full prompt "311101011". The amplitude of bitstrings in the superposition is small because of the "3" at the start of the prompt. Yet we know that the "3" token will be deleted because of the finite context window, resulting in an ex-ante expected increase in the amplitude of bitstrings. Similarly, as per the Waluigi Effect, the amplitude of the waluigi simulacrum will increase in expectation.
Nonetheless, the amplitudes are approximately martingale, and this property will be used later in the report to solve the preferred decomposition problem of GPT Simulator Theory.
Remark 10: the Preferred Decomposition Problem
10.A. Non-uniqueness of decomposition
According to LLM Simulator Theory, the language model $\mu : T^k \to \Delta(T)$ decomposes into a linear interpolation of premises $\mu_s : T^k \to \Delta(T)$ such that $\mu = \sum_{s \in S} \alpha_s \mu_s$ and the amplitudes $\alpha_s$ update in an approximately Bayesian way.
However, this claim is trivially true for some basis $\{\mu_s\}_{s \in S}$. Two trivial decompositions are always available for LLM Simulator Theory:
As a result, LLM Simulator Theory is either ill-defined, trivial, or arbitrary.
10.B. A toy model for non-uniqueness
To illustrate the preferred decomposition problem, we'll consider the prompt "Alice tosses a coin 100 times and the results were". Let's call this prompt x.
The language model μ acting on this prompt induces a probability distribution over $\{H,T\}^{100}$. Suppose that μ assigns a likelihood of $0.9 \times 2^{-100}$ to every sequence, except for the two sequences H…H (100 heads) and T…T (100 tails), which are each assigned a likelihood of $0.05 + 0.9 \times 2^{-100}$, so that the likelihoods sum to 1.
That's where LLM Dynamics stops — we have a distribution over token-sequences, and that's all we can say.
But LLM Simulator Theory goes further: it decomposes μ into the superposition of three distinct premises, namely a fair coin, an H-biased coin, and a T-biased coin. Formally, $\mu = 0.9\mu_F + 0.05\mu_H + 0.05\mu_T$ where $\mu_F, \mu_H, \mu_T$ are the three distinct stochastic processes.
But how can LLM Simulator Theory emerge from LLM Dynamics? There are multiple ways of decomposing μ into different stochastic processes. Here are the two trivial decompositions:
We can decompose μ in many different ways and the non-uniqueness of decomposition poses an obstacle to reducing LLM Simulator Theory to LLM Dynamics. The obstacle is that LLM Dynamics provides us with μ but is indifferent to a particular decomposition, whereas LLM Simulator Theory prefers a particular decomposition. So we need a bridging principle from LLM Dynamics to LLM Simulator Theory which breaks the symmetry between decompositions.
This is what I call the problem of preferred decomposition, and addressing the problem of preferred decomposition is the central conceptual/technical difficulty in linking LLM Dynamics to LLM Simulator Theory.
10.C. Pragmatic solutions
Some solutions to the Preferred Decomposition Problem are purely pragmatic. According to these solutions, the "preferred" decomposition of μ is whichever decomposition aids human interpreters and prompt engineers in understanding and analyzing the behaviour of the language model.
Picture this: you have three experts in a room. One who knows about fair coins, one who knows about H-biased coins, and one who knows about T-biased coins. If we decompose μ into 0.9μF+0.05μH+0.05μT, then we can send a prompt to each of the experts and then interpolate their predicted continuations. According to the pragmatist, the preferred decomposition is whichever decomposition facilitates this division of epistemic labour.
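The three-expert picture can be written down directly. In this sketch each "expert" is a hypothetical function returning the probability of a full sequence of 100 tosses. I assume the biased coins are fully biased (always H, or always T), which is one simple reading of the toy model. Exact fractions are used so that the normalisation check is not blurred by floating-point error.

```python
from fractions import Fraction

N = 100
WEIGHTS = {"fair": Fraction(9, 10), "heads": Fraction(1, 20), "tails": Fraction(1, 20)}

def expert_prob(name, seq):
    """Probability that one (hypothetical) expert assigns to a full toss sequence."""
    if name == "fair":
        return Fraction(1, 2 ** len(seq))
    target = "H" if name == "heads" else "T"
    return Fraction(int(all(c == target for c in seq)))  # assumed: fully biased coin

def mixture_prob(seq):
    """Interpolate the experts' answers using the decomposition weights."""
    return sum(w * expert_prob(name, seq) for name, w in WEIGHTS.items())

all_heads = mixture_prob("H" * N)
mixed = mixture_prob("HT" * (N // 2))

# Two special sequences plus 2**N - 2 ordinary ones should carry total mass 1.
total = 2 * all_heads + (2 ** N - 2) * mixed
print(all_heads, mixed, total)
```

Each expert only needs to answer questions about their own coin. The weights do the rest.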
Here is a more realistic example —
Suppose we feed GPT a dialogue between Julius Caesar and Cicero discussing their mothers. By what criterion can we say that there is a Caesar–simulacrum talking to a Cicero–simulacrum? Well, the pragmatist's criterion is satisfied if and only if the people who can best predict/explain/control the logits layer are historians of the Late Roman Republic.
Under the pragmatic solution to the preferred decomposition problem, LLM Simulator Theory is merely a recommendation about who to consult about GPT-4. When LLM Simulator Theory says that μ is a superposition of specific premises, each involving particular simulacra, that simply means that we should seek advice from the experts in the real-life objects that correspond to those simulacra if we wish to predict/explain/control the logits layer of the LLM. In short, the pragmatic approach asserts that the preferred decomposition is the one that enables us to tap into the knowledge and expertise of the best-suited humans for the job.
While I think that this particular claim is true, the pragmatist interpretation is not the correct interpretation of LLM Simulator Theory. Instead, the bridging rules will appeal to an objective non-generic structure of the language model, in a manner analogous to physics.
10.D. Quantum mechanics
Fortunately for us, quantum mechanics has been wrestling with a very similar problem for about 50 years.
There are multiple ways to decompose a quantum state |Φ⟩ into a superposition ∑αi|ϕi⟩, and this presents the central conceptual/technical difficulty in linking classical mechanics to quantum mechanics. The problem has been especially pressing in the Everettian interpretation of quantum mechanics, which lacks measurements or wave-function collapse in its fundamental ontology. Rather than recapitulating the entire history of that debate, we can skip all the way to the present-day solutions and adapt them to LLM Simulator Theory.
For the sake of brevity, I will refrain from discussing the solutions at length in this article. If you are swift, you might beat me to the punch and publish before I do.
Remark 11: Prompt engineering
11.A. Engineering (first draft):
To formalise prompt engineering in terms of Dynamic GPT, we must first formalise engineering in a general dynamic system, and then restrict the dynamic system to LLMs.
11.B. Prompt engineering (first draft):
Hence —
While this definition of "engineering" may miss some activities we wish to classify as engineering and include activities we don't wish to classify as engineering, it provides a useful starting point for formalized prompt engineering within LLM Dynamics.
Remark 12: Simulacra Ecology
Recall the program of semiotic physics — we will recover high-level LLM ontologies from the low-level LLM Dynamics, by copying the bridging principles connecting high-level physical ontologies to low-level physical dynamics.
This program will lead to a simulacra ecology.
Mechanistic interpretability will be valuable to simulacra ecology in the same way that QFT is valuable to conventional ecology. This is because simulacra ecology supervenes on the Static GPT in the same way that conventional ecology supervenes on the QFT. Nonetheless, there will be emergent laws and regularities at the higher-level ontology which can be studied with a moderate degree of autonomy.
Remark 13: Grand Unified Theory of LLMs
Up until now, we've considered μ as a mapping from page-states to page-states, induced by a general model μ:Tk→Δ(T). However, if GPT were induced by an arbitrary model μ, it's doubtful that it would "simulate" anything at all. In other words, LLM Simulator Theory is a non-generic higher-level ontology.
Therefore, to explain this non-generic structure, we must make additional assumptions about the language model μ:Tk→Δ(T). Fortunately, μ is not just any model — it's a model that results from training a transformer neural network with roughly 100 billion parameters using stochastic gradient descent (SGD) to minimize cross-entropy loss on a corpus of internet text.
A Grand Unified Theory of LLMs wouldn't just demonstrate that LLM Simulator Theory happens to emerge from the LLM Dynamics of a particular trained model μ; it would also explain why SGD on transformer models encourages that emergent structure.
Remark 14: LLM-∞
14.A. Motivating LLM-∞
Presumably, the trained model exhibits the non-generic structure because, as we scale parameters and compute, the trained model asymptotically approaches an optimal model exhibiting the non-generic structure. By analysing the behaviour of this optimal model, we can achieve formal results.
I find it helpful to imagine μ∞ — the Solomonoff-optimal autoregressive language model with a context window k trained on a corpus C. Although this model is unrealistic (and uncomputable) it fulfils a role akin to the Solomonoff Prior in unsupervised learning or AIXI in reinforcement learning.
I will openly admit that when I predict/explain/control/interpret the behaviour of a large language model, I use LLM-∞ as a first-order approximation.
14.B. Formal definition of LLM-∞
Here is the formal definition of μ∞ —
14.C. Commentary
μ∞ has read the entirety of the internet and performed Solomonoff inference upon it, where Solomonoff inference is the maximally data-efficient architecture. The shuffling with ρ is to ensure that the model doesn't behave differently for the (N+1)-th datapoint, nor does the model learn spurious patterns about the order in which the internet corpus is provided.
Note that μ∞ is still limited by finite data N and finite context window k. You can think of μ∞ as the optimal architecture, and a 175B-parameter model trained with SGD performs well only in so far as it approximates μ∞.
Remark 15: GPT is a semiotic computer
15.A. GPT-4 is Commodore 64
GPT-2 has a context window of 1024 tokens. Because there are 50257 possible tokens (roughly 16 bits per token), this means that GPT-2 is a 2 kB computer.
GPT-3 has a context window of 2048 tokens, and GPT-3.5 has a context window of 4096, meaning that they are 4 kB and 8 kB computers respectively. (Not great, but enough to get you to the moon.) Think of an Atari 800.
In contrast, GPT-4 boasts a massive context window of 32,768 tokens, giving it almost exactly as much memory capacity as a Commodore 64.
Now, you can't run GPT-4 on a Commodore 64. GPT-3 is parameterised by 175 billion half-precision floating-point numbers, so you'd need at least a 350 GB computer just to load up the weights of GPT-3, and you'd need a computer with much more memory to run GPT-4. But this is because the pesky laws of physics do not allow us to easily build a small circuit from transistors which converts the 64 kB–encoding of one page-state into the 64 kB–encoding of the next page-state.
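The arithmetic behind these memory figures is short enough to spell out. This is a back-of-the-envelope sketch: it assumes each token is stored in a whole number of bits (rounding log2 of the vocabulary size up to 16), and the function names are illustrative.

```python
import math

VOCAB_SIZE = 50257

def context_bytes(context_tokens, vocab_size=VOCAB_SIZE):
    """Bytes needed to store one page, at a whole number of bits per token."""
    bits_per_token = math.ceil(math.log2(vocab_size))  # 16 bits for 50257 tokens
    return context_tokens * bits_per_token // 8

def weight_bytes(n_params, bytes_per_param=2):
    """Bytes needed to hold the weights (2 bytes each at half precision)."""
    return n_params * bytes_per_param

print(context_bytes(32_768))       # page-state size for a 32,768-token window
print(weight_bytes(175 * 10**9))   # weights of a 175-billion-parameter model
```

The gap between the two numbers is the point of the paragraph above: the state is tiny, but the transition function is enormous.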
15.B. GPT as von Neumann architecture
We can view GPT as a computer with Von Neumann architecture. The GPT transition function μ corresponds to the Central Processing Unit (CPU), and the context window corresponds to the memory.
In a classical computer, the memory stores both data and instructions, but there is no objective distinction between data and instructions. The distinction between "data" and "instructions" made by programmers is merely a conceptual distinction, rather than an objective mechanistic distinction. The only actual instructions are opcodes in assembly language (i.e. add, compare, copy, etc).
Similarly, proficient prompt engineers[4] make a distinction between "data" and "instructions" in the prompt, but this also is a conceptual distinction, rather than an objective mechanistic distinction.
15.C. Prompt engineering feels like early programming
If you're familiar with prompt engineering GPT-3/4, then you can attest that it feels like programming a Commodore 64. It feels like you're working with limited memory. When you construct the prompt, you must be careful about moving data from place-a to place-b only when data from place-b will no longer be used. The hallmark of a well-constructed prompt is its memory optimization, leaving no space for extraneous elements.
15.D. Prompts surjectively map to behaviour
Consider these questions —
The answer to each question depends on the memory tape.
And let's turn to these questions —
The answer to each question depends on the prompt.
Both Apple Mac and GPT-5 are universal computers, although GPT-5 is a bit of a weird one. They are general-purpose programmable computers, and therefore they will exhibit almost any behaviour if they are provided with the right prompt.
Just as we can't say anything about the behaviour of Apple Mac when acting on an arbitrary memory tape, we similarly can't say anything about the behaviour of GPT-5 when acting on an arbitrary prompt.
Remark 16: EigenFlux
16.A. High-level prompting language
The tale of programming languages in the 20th century looks something like this —
As memory hardware expanded, programmers could afford to permanently store a compiler in the computer's memory. Programmers could then write code in the high-level programming language, and the compiler (written in assembly language) would convert the high-level code into assembly code which acted on the data.
When writing high-level code, the programmer would only require access to the language's documentation, which describes the abstract behaviour of the particular primitives.
By abstract behaviour I mean —
Crucially, the programmer doesn't need access to the compiler written in assembly, and they don't need to understand it.
I conjecture that as context windows expand (or possibly disappear entirely), prompt engineers will recapitulate the history of programming languages on LLMs. With enough "memory" available in the context window, prompt engineers will be able to utilise high-level prompting languages.
16.B. Sketching the compiler
To illustrate the idea, I'll sketch a high-level prompting language called EigenFlux.
EigenFlux has a novel-length compiler-prompt that would look something like this —
Naturally, the actual compiler-prompt would abide by plot conventions and fictional tropes, rather than enumerating simulacra in an unnatural way.
Subsequently, a prompt engineer (who specializes in the EigenFlux high-level prompting language) would write prose that details the interaction of the various characters like Alice1, Alice2, Alice3, and so forth.
The prose crafted by the prompt engineer would constitute a program in EigenFlux. I call this segment of prose the compiled-prompt. Within the compiled-prompt, tokens would encode both the data and the instructions applied to the data.
The concatenation of the compiler-prompt and the compiled-prompt would be loaded into the context window, and Dynamic GPT would run until a predetermined stopping condition has been satisfied.
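As a rough sketch of that execution loop: the function below concatenates the two prompts, loads them into a fixed-size window, and generates until a stopping condition holds. Everything here is hypothetical (EigenFlux does not exist), including the names `run_eigenflux`, `next_token`, and `should_stop`. The stub model stands in for a real LLM call.

```python
def run_eigenflux(compiler_prompt, compiled_prompt, next_token, should_stop,
                  context_len, max_steps=1000):
    """Load compiler-prompt + compiled-prompt, then generate until the stop condition."""
    page = (compiler_prompt + compiled_prompt)[-context_len:]
    output = []
    for _ in range(max_steps):
        if should_stop(output):
            break
        token = next_token(page)
        output.append(token)
        page = (page + [token])[-context_len:]  # append, then drop the oldest tokens
    return output

# Stub model and stop rule, for illustration only.
compiler_prompt = "Alice1 is a meticulous librarian who sorts whatever she is handed .".split()
compiled_prompt = "Alice1 is handed the list 3 1 2 and replies :".split()
stub_model = lambda page: "<end>"
output = run_eigenflux(compiler_prompt, compiled_prompt, stub_model,
                       should_stop=lambda out: bool(out) and out[-1] == "<end>",
                       context_len=64)
print(output)
```

One design consequence is visible in the truncation line: with a finite window, a long enough run starts deleting the compiler-prompt itself, which is one reason the idea depends on very large context windows.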
In essence, the prompt engineers would be writing "fan fiction" in the narrative universe of EigenFlux.
16.C. EigenFlux simulacra
In the high-level prompting language, the simulacra — Alice1, Alice2, Alice3, Alice4, etc — correspond to the primitive "instructions".
Recall that the Python primitive "sort" corresponds to a long segment of assembly code in the compiler. When the programmer calls "sort", then the assembly code is executed on the data.
Similarly, the EigenFlux primitive "Alice1" corresponds to a detailed characterisation in the compiler-prompt. When the prompt engineer mentions "Alice1", then the simulacrum is elicited to interact with the data.
The Alice simulacra would be intentionally designed to be composable, allowing secondary prompt engineers to construct plots featuring these simulacra to perform general-purpose computation. We might have an Alice simulacrum corresponding to a ruthless bond trader, or a humble scientist, or a harmless assistant, or a Machiavellian strategist, or Paul Christiano, or an aligned AI. And so on.
16.D. Alignment relevance
We couldn't build Facebook in assembly language. Even if we tried, the code would yield nonsensical output. There is no chance the code would do what the user wanted, especially not on the first try.
And yet we have Facebook.
The trick is to climb the ladder of abstraction.
To build an AI system that is both sufficiently aligned and capable, we will likewise follow a step-by-step process.
If GPT-3 is Babbage's Engine, then I'm imagining Apple Mac. High-level prompt engineering is so essential that if it cannot be accomplished within the transformer paradigm, then we must redesign the deep learning architecture of the language model to enable high-level prompt engineering.
Remark 17: Simulacra realism
It remains an open technical question whether μ does satisfy the requisite structural properties such that the bridge principles apply. But, so long as μ does satisfy those properties, I propose that we be realists about the objects of the higher-level ontology.
Simulacra realism: the view that objects simulated on a dynamic large language model are real in the same sense that macroscopic physical objects are real.
The idea of simulacra realism may seem far-fetched at first, but it follows from "Dennett's Criterion", which David Wallace frequently uses to explain how a high-level ontology emerges from a low-level ontology.
I claim that if Dennett's Criterion justifies realism about physical macro-objects, then it must also justify realism about simulacra, so long as μ satisfies analogous structural properties.
The slogan is "simulacra : GPT :: objects : physics".
(Let me clarify that my support of simulacra realism doesn't stem from insufficiently reductionist intuitions. In fact, my intuitions are extremely physically reductionist. However, physical reductionism is something of a horseshoe. When you decompose macro-objects into increasingly low-level ontologies, you eventually bottom out in an ontology that bears no resemblance to our common-sense picture of reality[3]. In order to reconstruct this common-sense picture, you must adopt something like Dennett's Criterion or Wallace's Criterion ("a tiger is any pattern which behaves like a tiger"). Without such a criterion, we would be unable to admit the existence of fermions and bosons, let alone chairs and people! Yet any criterion which admits the existence of fermions and bosons will also admit the existence of simulacra. In other words, my support for simulacra realism is not due to being more realist about simulacra, but rather being more anti-realist about physical macro-objects. In technical terminology, I deny strong emergence in both physical reality and LLMs, but admit weak emergence in both physical reality and LLMs.)
Remark 18: Meta-LLMology
18.A. Motivating meta-LLMology
LLMs are the most complicated entities that humanity has made, they are the compression of the sum total of all human history and knowledge, and they've existed for less than five years.
This raises hefty epistemological problems[5] —
My position is that because LLMs are a low-dimensional microcosm of reality entire, our epistemology of LLMs should be a microcosm of our epistemology of reality entire. One implication of this position is epistemic pluralism.
18.B. Epistemic pluralism
I propose epistemic pluralism in LLM-ology.
Now, this does not mean "anything goes" (epistemic anarchism), nor that we should have no standards whatsoever. Rather, it means that we have a set of distinct epistemic schemes and that we use these schemes simultaneously to study LLMs.
By "epistemic scheme" I mean (roughly) an academic subfield. An epistemic scheme may have its own ontology, its own discoveries, its own concepts, its own standards for accepting a law, its own methodology for generating laws, its own experts, its own institutions, its own objectives, etc. Sometimes you'll have different epistemic schemes which specialise in different entities in the world. But often you'll have multiple epistemic schemes which all discuss the same entity. So microeconomics is a scheme, as are topology, game theory, developmental psychology, molecular biology, cognitive linguistics, comparative anthropology, biochemistry, computational neuroscience, ecology, cybersecurity, ancient history, evolutionary biology, network theory, quantum information theory, cybernetics, etc.
This still leaves an open question: which schemes should we include in this set?
18.C. Trust whomever you already trust
Here is my tentative answer:
This is because LLMs are a low-dimensional microcosm of reality as a whole. If a particular epistemic scheme (e.g. ecology) has proved valuable for understanding some aspect of reality (e.g. tigers), then we should prima facie expect that the scheme will be valuable for understanding the aspect of the LLM which corresponds to that aspect of reality (e.g. simulacra tigers).
However, I want to make two disclaimers —
I originally called this "GPT Dynamics" rather than "LLM Dynamics". However, I think the AI Alignment community should stop using "GPT" as a metonym for LLMs (large language models), to avoid promoting OpenAI relative to Anthropic and Conjecture.
mushroom spores + lava lamps + James Joyce + toothpaste + bicycle bells + origami + nebulae + MIRI + lemurs + bonsai trees + snowflakes + accordions + solar flares + titanium + dreamcatchers + North Dakota + pierogis + sand dunes + avocado toast + cacti + spaghetti westerns + yurts + neutrinos + lemon zest + crop circles + Paul Christiano + gelato + calligraphy + lichen + hula hoops + fractals + umbrellas + chameleons + sombreros + Hertford College, Oxford + marionettes + jackfruit + ice sculptures + jazz + crepuscular rays + velvet + hieroglyphs + kaleidoscopes + tarantulas + narwhals + pheromones + laughter + pumice + me + this article + champagne + bioluminescence + tempests + ziggurats + pantomimes + marzipan + daffodils + GPT-4 + tesseracts + glockenspiels + chiaroscuro + sonnets + honeycomb + aurora borealis + trilobites + sundials + lenticular clouds + gondolas + macarons + ...
AdS/CFT
Janus and I
I think Conjecture has an official epistemologist on their team!