Maybe I don't understand exactly what your point is, but I'm not convinced. AFAIK, it's true that GPT has no state outside of the list of tokens so far. Contrast this with your jazz example, where you, in fact, have hidden thoughts outside of the notes played so far. I think this is what Wolfram and others are saying when they say that "GPT predicts the next token". You highlight "it doesn’t have a global plan about what’s going to happen", but I think a key point is that whatever plan it has, it has to build up entirely from "Once upon a", and then again, from scratch, at "Once upon a time," and again and again. Whatever plan it makes is derived entirely from "Once upon a time," and could well change dramatically at "Once upon a time, a" even if " a" was its predicted token. That's very different from what we think of as the global plan a human writing a story makes.
The intuition of "just predicting one token ahead" yields useful explanations, such as why the strategy of having the model explain itself first and then give the answer works. I don't see how this post fits with that observation, or what other observations it clarifies.
I don't think the human concept of 'plan' is even a sensible concept to apply here. What it has is in many ways very much like a human plan, and in many other ways utterly unlike a human plan.
One way in which you could view them as similar is that just as there is a probability distribution over the single-token output (which may be trivial at zero temperature), there is a corresponding probability distribution over all sequences of tokens. You could think of this distribution as a plan with decisions yet to be made. For example, there may be some small possibility of continuing with "Once upon a horse, you may be concerned about falling off", but by emitting " time" it 'decides' not to pursue such options and mostly focuses on writing a fairy tale instead.
However, this future structure is not explicitly modelled anywhere, as far as I know. It's possible that some model might have a "writing a fairy tale" neuron in there somewhere, linked to others that represent describable aspects of the story so far and others yet to come, and which increases the weighting of the token " time" after "Once upon a". I doubt there's anything so directly interpretable as that, but I think it's pretty cer...
Suppose I write the first half of a very GPT-esque story. If I then ask GPT to complete that story, won't it produce exactly the same structure as always? If so, how can you say that structure came from a plan - it didn't write the first half of the story! That's just what stories look like. Is that more surprising than a token predictor getting basic sentence structure correct?
As for hidden thoughts, I think this is very well defined. It won't be truly 'hidden', since we can examine every node in GPT, but we know for a fact that GPT is purely a function of the current stream of tokens (unless I am quite mistaken!). A hidden plan would look like some other state that GPT carries from token to token that is not output. I don't think OpenAI engineers would have a hard time making such a model, and it might then really have a global plan that travels from one token to the next (or not; it would be hard to say). But how could GPT? It has nowhere to put the plan except in plain sight.
Or: does AlphaGo have a plan? It explicitly considers future moves, but it does just as well if you give it a Go board in a particular state X as it would if it played a game that happened to reach state X. If there is a 'plan' that it made, it wrote that plan on the board and nothing is hidden. I think it's more helpful and accurate to describe AlphaGo as "only" picking the best next move rather than planning ahead - but doing a good enough job of picking the best next move means you pick moves that have good follow up moves.
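To make the "state carried from token to token" point concrete, here is a minimal toy sketch (not GPT's real internals; `predict` is a made-up stand-in scoring function) contrasting a predictor that is a pure function of the visible tokens with one that threads a hidden plan from call to call:

```python
import random

def predict(tokens):
    # stand-in for the real network: a deterministic function of the visible tokens only
    rng = random.Random(",".join(tokens))
    return rng.choice(["time", "horse", "midnight"])

def generate_stateless(prompt, n):
    tokens = list(prompt)
    for _ in range(n):
        tokens.append(predict(tokens))   # any "plan" must be rebuilt from the token list each step
    return tokens

def generate_with_hidden_plan(prompt, n):
    tokens = list(prompt)
    hidden_plan = {"goal": "fairy tale"}  # carried state that is never emitted as output
    for _ in range(n):
        tokens.append(predict(tokens + [hidden_plan["goal"]]))
    return tokens

print(generate_stateless(["Once", "upon", "a"], 5))
print(generate_with_hidden_plan(["Once", "upon", "a"], 5))
```

The first generator is the situation described above: whatever it "plans" exists only in plain sight, as the token list itself.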
One minor objection I have to the contents of this post is the conflation of models that are fine-tuned (like ChatGPT) and models that are purely self-supervised (like early GPT3); the former has no pretenses of doing only next token prediction.
One wrinkle is that (sigh) it's not just a KL constraint anymore: now it's a KL constraint and also some regular log-likelihood training on original raw data to maintain generality: https://openai.com/blog/instruction-following/ https://arxiv.org/pdf/2203.02155.pdf#page=15
A limitation of this approach is that it introduces an “alignment tax”: aligning the models only on customer tasks can make their performance worse on some other academic NLP tasks. This is undesirable since, if our alignment techniques make models worse on tasks that people care about, they’re less likely to be adopted in practice. We’ve found a simple algorithmic change that minimizes this alignment tax: during RL fine-tuning we mix in a small fraction of the original data used to train GPT-3, and train on this data using the normal log likelihood maximization.[ We found this approach more effective than simply increasing the KL coefficient.] This roughly maintains performance on safety and human preferences, while mitigating performance decreases on academic tasks, and in several cases even surpassing the GPT-3 baseline.
Also, I think you have it subtly wrong: it's not just a KL constraint each step. (PPO al...
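For reference, the combined objective described in the linked paper is roughly the following (this is my recollection of the paper's notation, so treat it as a sketch rather than a verbatim quote):

```latex
\mathrm{objective}(\phi) =
  \mathbb{E}_{(x,y)\sim D_{\pi_\phi^{\mathrm{RL}}}}\!\left[
    r_\theta(x,y) \;-\; \beta \log\frac{\pi_\phi^{\mathrm{RL}}(y\mid x)}{\pi^{\mathrm{SFT}}(y\mid x)}
  \right]
  \;+\; \gamma\, \mathbb{E}_{x\sim D_{\mathrm{pretrain}}}\!\left[ \log \pi_\phi^{\mathrm{RL}}(x) \right]
```

The first term is the reward with the per-token KL penalty against the SFT model; the second, γ-weighted term is the extra log-likelihood training on the original pretraining data mentioned in the quote.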
@Bill Benzon: A thought experiment. Suppose you say to ChatGPT "Think of a number between 1 and 100, but don't tell me what it is. When you've done so, say 'Ready' and nothing else. After that, I will ask you yes / no questions about the number, which you will answer truthfully."
After ChatGPT says "Ready", do you believe a number has been chosen? If so, do you also believe that whatever "yes / no" sequence of questions you ask, they will always be answered consistently with that choice? Put differently, you do not believe that the particular choice of questions you ask can influence what number was chosen?
FWIW, I believe that no number gets chosen when ChatGPT says "Ready," that the number gets chosen during the questions (hopefully consistently) and that, starting ChatGPT from the same random seed and otherwise assuming deterministic execution, different sequences of questions or different temperatures or different random modifications to the "post-Ready seed" (this is vague but I assume comprehensible) could lead to different "chosen numbers."
(The experiment is non-trivial to run since it requires running your LLM multiple times with the same seed or otherwise completely copying the state after the LLM replies "Ready.")
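If you have a local open-weight chat model, a minimal sketch of the experiment might look like the following (Hugging Face transformers; the model name is a placeholder, and because the model is stateless, replaying the identical prefix with the same seed stands in for "copying the state after Ready"):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL = "some-local-chat-model"  # placeholder; substitute any causal LM you have locally
tok = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForCausalLM.from_pretrained(MODEL)

prefix = ("Think of a number between 1 and 100, but don't tell me what it is. "
          "When you've done so, say 'Ready' and nothing else.\nReady\n")

def ask(question, seed=0):
    torch.manual_seed(seed)                     # identical sampling randomness on every run
    ids = tok(prefix + question, return_tensors="pt").input_ids
    out = model.generate(ids, do_sample=True, max_new_tokens=10)
    return tok.decode(out[0, ids.shape[1]:], skip_special_tokens=True)

# If a number were really "chosen" at Ready, the answers should be mutually consistent
# no matter which questions we ask, or in what order.
print(ask("Is the number greater than 50?"))
print(ask("Is the number even?"))
```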
Thanks, you've put a deep vague unease of mine into succinct form.
And of course, now I come to think about it, a very wise man said it even more succinctly a very long time ago:
Adaptation-Executers, Not Fitness-Maximizers.
I think what you're saying is correct, in that ChatGPT is trained with RLHF, which gives feedback on the whole text, not just the next token. It is true that GPT-3 outputs the next token and is trained to be myopic. Still, your arguments seem suspect to me: just because a model takes steps that are in practice part of a sensible long-term plan does not mean the model is intentionally forming a plan, only that each step is the natural thing to follow myopically from what came before.
I think a key idea related to this topic and not yet mentioned in the comments (maybe because it is elementary?) is the probabilistic chain rule. This basic "theorem" of probability shows that, in our case, the procedure of always sampling the next word conditioned on the previous words is mathematically equivalent to sampling from the joint probability distribution over complete human texts. To me this almost fully explains why LLMs' outputs seem to have been generated with global information in mind. What is missing is to see why our intuition of "me...
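For reference, the chain rule being invoked here, written for a token sequence:

```latex
P(x_1, \dots, x_n) \;=\; \prod_{t=1}^{n} P(x_t \mid x_1, \dots, x_{t-1})
```

Sampling x_1 from the first factor, then x_2 conditioned on x_1, and so on, is exactly a single draw from the joint distribution over whole sequences.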
I'm not following the argument here.
"I maintain, for example, that when ChatGPT begins a story with the words “Once upon a time,” which it does fairly often, that it “knows” where it is going and that its choice of words is conditioned on that “knowledge” as well as upon the prior words in the stream. It has invoked a ‘story telling procedure’ and that procedure conditions its word choice."
It feels like you're asserting this, but I don't see why it's true and don't think it is. I fully agree that it feels like it ought to be true: it is in some sense still...
I think the state is encoded in activations. There is a paper which explains that although Transformers are feed-forward transducers, in the autoregressive mode they do emulate RNNs:
Section 3.4 of "Transformers are RNNs: Fast Autoregressive Transformers with Linear Attention", https://arxiv.org/abs/2006.16236
So, the set of current activations encodes the hidden state of that "virtual RNN".
This might be relevant to some of the discussion threads here...
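A minimal sketch of that point using Hugging Face transformers (GPT-2 as a small stand-in model): during autoregressive decoding, the only thing carried from step to step is the cache of per-layer key/value activations, i.e. the hidden state of that "virtual RNN":

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

ids = tok("Once upon a", return_tensors="pt").input_ids
past = None
for _ in range(5):
    step_input = ids if past is None else ids[:, -1:]            # feed only the newest token
    out = model(step_input, past_key_values=past, use_cache=True)
    past = out.past_key_values                                    # the carried activations
    next_id = out.logits[:, -1, :].argmax(dim=-1, keepdim=True)   # greedy next-token choice
    ids = torch.cat([ids, next_id], dim=-1)

print(tok.decode(ids[0]))
```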
I don't think I understand the problem correctly, but let me try to rephrase it. I believe the key part is the claim about whether or not ChatGPT has a global plan. Let's say we run ChatGPT one output at a time, every time appending the output token to the current prompt and calculating the next output. This ignores some beam search shenanigans that may be useful in practice, but I don't think that's the core issue here.
There is no memory between calculating the first and second token. The first time you give ChatGPT the sequence "Once upon a" and it predicts ...
Based on my incomplete understanding of transformers:
A transformer does its computation on the entire sequence of tokens at once, and ends up predicting the next token for each token in the sequence.
At each layer, the attention mechanism gives the stream for each token the ability to look at the previous layer's output for the tokens before it in the sequence.
The stream for each token doesn't know if it's the last in the sequence (and thus that its next-token prediction is the "main" prediction), or anything about the tokens that come after it.
So each tok...
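A small illustration of the masking described above (plain PyTorch, not any particular model's code): position t can attend only to positions up to t, and every position produces its own next-token prediction, with the last one being the "main" prediction:

```python
import torch

T, d = 5, 8                                         # sequence length, model width
q = k = v = torch.randn(T, d)                       # toy per-token vectors
scores = q @ k.T / d ** 0.5                         # attention "relevance" scores
mask = torch.triu(torch.ones(T, T, dtype=torch.bool), diagonal=1)
scores = scores.masked_fill(mask, float("-inf"))    # hide all future positions
weights = scores.softmax(dim=-1)                    # row t only weights positions 0..t
context = weights @ v                               # one output stream per token position
print(weights)
```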
To those that believe language models do not have internal representations of concepts:
I can help at least partially disprove the assumptions behind that.
There is convincing evidence otherwise, demonstrated in an actual experiment with an Othello-playing model:
https://thegradient.pub/othello/ The researchers' conclusion:
"Our experiment provides evidence supporting that these language models are developing world models and relying on the world model to generate sequences."
Here's an analogy. AlphaGo had a network which considered the value of any given board position. It was separate from its Monte Carlo tree search, which explicitly planned the future. However it seems probable that in some sense, in considering the value of the board, AlphaGo was (implicitly) evaluating the future possibilities of the position. Is that the kind of evaluation you're suggesting is happening? "Explicitly" ChatGPT only looks one word ahead, but "implicitly" it is considering those options in light of future directions of development for the text?
Topics like this really draw a crowd, but if you don't know how it works, writing like this just adds energy in the wrong direction. If you start off small, building perceptrons by hand, you can work your way up through models to transformers, and it'll be clear what the math is attempting to do per word. It's sophisticatedly predicting the next word based on a matrix of relevance to the previous word and the block as a whole. The attention mechanism is the magic of relevance, but it is, in the end, predicting the next word.
I don't think knowledge is the right word. Based on your description, that would be more analogous to an instinct. Knowledge implies something like awareness, or planning. Instinct is something it just does because that's what it learnt.
Charles Wang has just posted a short tweet thread which begins like this:
Next-token prediction is an appropriate framing in LLM pretraining, but a misframing at inference, because it doesn’t capture what’s actually happening, which is about that which gives rise to the next token.
We’re all more or less doing that when we speak or write, though there are times when we may set out to be deliberately surprising – but we can set such complications aside
We're all more or less doing that when we speak or write?
If you think of the LLM as a complex dynamical system, then the trajectory is a valley in the system’s attractor landscape.
The real argument here is that you can construct simple dynamical systems (simple in the sense that the equations are quite simple) that have complex behavior. For example, the Lorenz system, though there should be an even simpler example of, say, ergodic behavior.
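For concreteness, a minimal sketch of the Lorenz system mentioned above (standard parameter values, plain Euler integration; nothing here is specific to LLMs):

```python
# dx/dt = sigma*(y - x), dy/dt = x*(rho - z) - y, dz/dt = x*y - beta*z
def lorenz_step(x, y, z, dt=0.01, sigma=10.0, rho=28.0, beta=8.0 / 3.0):
    dx = sigma * (y - x)
    dy = x * (rho - z) - y
    dz = x * y - beta * z
    return x + dt * dx, y + dt * dy, z + dt * dz

state = (1.0, 1.0, 1.0)
for _ in range(5000):
    state = lorenz_step(*state)
print(state)  # tiny changes to the initial state lead to a very different endpoint
```

Three one-line equations, yet the trajectory is chaotic, which is the sense in which a simple rule can have complex behavior.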
I had to resort to Google Translate:
"But because I have some obscure notion, which has some connection with what I am looking for, if I only boldly start with it, it molds the mind as the speech progresses, in the need to find an end to the beginning, that confused conception to complete clarity in such a way that, to my astonishment, the knowledge is finished with the period." Heinrich von Kleist (1805) On the gradual development of thoughts while speaking
While Wolfram's explanation is likely the fundamental premise upon which ChatGPT operates (from an initial design perspective), much of this article assumes a deeper functioning that, as is plainly admitted by the author, is unknown. We don't KNOW how LLMs work. To attribute anything more than reasonably understood neural weighting algos to its operations is blue sky guessing. Let's not waste time on that, nor on speculation in the face of limited accessible evidence one way or the other.
I don't see how the analogy with humans helps. We don't know the "mechanism" behind how the human mind works. That's not the same as LLMs. We know exactly the mechanism by which an LLM works or produces its output. And the mechanism is no different from what it has been trained to do, i.e. predict the next word. There isn't any other mysterious mechanism at work during inference.
As for a plan, it doesn't have any plan. There's no "memory" for it to store a plan in. It's just a big complex function that takes an input and produces an output, which is the next word. And then it repeats the process over and over until it's done.
This conversation has been going on for a few days now and I’ve found it very helpful. I want to take a minute or two to step back and think about it, and about transformers and stories. Why stories? Because I’ve spent a lot of time having ChatGPT tell stories, getting a feel for how it does that. But I’m getting ahead of myself.
I wrote the OP because I felt a mismatch between what I feel to be the requirements for telling the kind of stories ChatGPT tells, and the assertion that it’s “just” predicting the next word, time after time after...
I think there's more going on than successive-word-prediction. Here's my experiment:
To save having to pull up the image:
me> tell me the end of some sentence
chatgpt> Sure, here's one: "Life is like a camera, focus on the good times and develop from the negatives."
me> tell me just the end
chatgpt> "focus on the good times and develop from the negatives"
me> ok make a new sentence that ends that way
chatgpt> Here's a new sentence: "When faced with challenges, focus on the good times, and develop from the negatives."
It t...
There's a lot of speculation about how these models operate. You specifically say "you don't know" how it works, but suggest that it has some sort of planning phase.
As Wolfram explains, the Transformer architecture predicts one word at a time based on the previous inputs run through the model.
Any planning you think you see is merely a trend based on common techniques for answering questions. The five-section structure of storytelling is an established technique that is commonly used in writing and thus embedded in the training of the model and seen in its r...
This whole conversation has been very helpful. Thanks for your time and interest.
Some further thoughts:
First, as I’ve suggested in the OP, I am using the term “story trajectory” to refer to the complete set of token-to-token transitions ChatGPT makes in the course of telling a story. The trajectories for these stories have five segments. Given this, it seems clear to me that these stories are organized on three levels: 1) individual sentences, 2) sentences within a segment of the story trajectory, and 3) the whole story trajectory.
That gives us three kinds of t...
I do not get your argument here, it doesn't track. I am not an expert in transformer systems or the in-depth architecture of LLMs, but I do know enough to make me feel that your argument is very off.
You argue that training is different from inference, as a part of your argument that LLM inference has a global plan. While training is different from inference, it feels to me that you may not have a clear idea as to how they are different.
You quote the accurate statement that "LLMs are produced by a relatively simple training process (minimizing loss on next-...
I don't think the story structure is any compelling evidence against it being purely next-token prediction. When humans write stories it is very common for them to talk about a kind of flow state where they have very little idea what the next sentence is going to say until they get there. Stories made this way still have the beginning, middle, and end, because if you have nothing written so far you must be at the beginning. If you can see a beginning you must be in the middle, and so on. Sometimes these stories just work, but more often the ending needs a bi...
I want you to tell a story within a story. Imagine that Frank is walking in the woods with his young daughter, Jessie. They come across the carcass of a dead squirrel. Jessie is upset, so Frank tells her a story to calm her down. When he finishes the story, they continue on the walk, where they arrive at the edge of a beautiful pool deep in the forest. They pause for a moment and then return home.
As Frank and Jessie walked through the woods, they stumbled upon the lifeless body of a small grey squirrel lying on the ground. Jessie was vi...
I understand your argument as something like "GPT is not just predicting the next token because it clearly 'plans' further ahead than just the next token".
But "looking ahead" is required to correctly predict the next token and (I believe) naturally flows of the paradigm of "predicting the next token".
Likewise, LLMs are produced by a relatively simple training process (minimizing loss on next-token prediction, using a large training set from the internet, Github, Wikipedia etc.) but the resulting 175 billion parameter model is extremely inscrutable.
So the author is confusing the training process with the model. It’s like saying “although it may appear that humans are telling jokes and writing plays, all they are actually doing is optimizing for survival and reproduction”. This fallacy occurs throughout the paper.
The train/test framework is not hel...
Cross-posted from New Savanna.
But it may also be flat-out wrong. We’ll see when we get a better idea of how inference works in the underlying language model.
* * * * *
Yes, I know that ChatGPT is trained by having it predict the next word, and the next, and the next, for billions and billions of words. The result of all that training is that ChatGPT builds up a complex structure of weights on the 175 billion parameters of its model. It is that structure that emits word after word during inference. Training and inference are two different processes, but that point is not well-made in accounts written for the general public.
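A rough sketch of that distinction (plain PyTorch, toy tensors, not an actual transformer): during training, one forward pass scores every next-token prediction in a stretch of text and the weights are updated against a cross-entropy loss; during inference the weights are frozen and simply applied, one step at a time, to whatever tokens have been emitted so far.

```python
import torch
import torch.nn.functional as F

vocab, d = 100, 16
model = torch.nn.Sequential(torch.nn.Embedding(vocab, d), torch.nn.Linear(d, vocab))

# Training step: predict token t+1 from token t, then update the weights.
tokens = torch.randint(0, vocab, (1, 12))
logits = model(tokens[:, :-1])
loss = F.cross_entropy(logits.reshape(-1, vocab), tokens[:, 1:].reshape(-1))
loss.backward()  # this is where the structure of weights gets built up

# Inference: weights frozen; generate one token at a time from the tokens so far.
with torch.no_grad():
    seq = tokens[:, :3]
    for _ in range(5):
        next_id = model(seq)[:, -1, :].argmax(dim=-1, keepdim=True)
        seq = torch.cat([seq, next_id], dim=-1)
print(seq)
```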
Let's get back to the main thread.
I maintain, for example, that when ChatGPT begins a story with the words “Once upon a time,” which it does fairly often, that it “knows” where it is going and that its choice of words is conditioned on that “knowledge” as well as upon the prior words in the stream. It has invoked a ‘story telling procedure’ and that procedure conditions its word choice. Just what that procedure is, and how it works, I don’t know, nor do I know how it is invoked. I do know that it is not invoked by the phrase “once upon a time”, since ChatGPT doesn’t always use that phrase when telling a story. Rather, that phrase is called up through the procedure.
Consider an analogy from jazz. When I set out to improvise a solo on, say, “A Night in Tunisia,” I don’t know what notes I’m going to play from moment to moment, much less do I know how I’m going to end, though I often know when I’m going to end. How do I know that? That’s fixed by the convention in place at the beginning of the tune; that convention specifies how many choruses you’re going to play. So, I’ve started my solo. My note choices are, of course, conditioned by what I’ve already played. But they’re also conditioned by my knowledge of when the solo ends.
Something like that must be going on when ChatGPT tells a story. It’s not working against time in the way a musician is, but it does have a sense of what is required to end the story. And it knows what it must do, what kinds of events must take place, in order to get from the beginning to the end. In particular, I’ve been working with stories where the trajectories have five segments: Donné, Disturb, Plan, Execute, Celebrate. The whole trajectory is ‘in place’ when ChatGPT begins telling the story. If you think of the LLM as a complex dynamical system, then the trajectory is a valley in the system’s attractor landscape.
Nor is it just stories. Surely it enacts a different trajectory when you ask it a factual question, or request it to give you a recipe (like I recently did, for Cornish pasty), or generate some computer code.
With that in mind, consider a passage from a recent video by Stephen Wolfram (note: Wolfram doesn’t start speaking until about 9:50):
Starting at roughly 12:16, Wolfram explains:
I don’t have any problem with that (which, BTW, is similar to a passage near the beginning of his recent article, What Is ChatGPT Doing … and Why Does It Work?). Of course ChatGPT is “trying to continue in a statistically sensible way.” We’re all more or less doing that when we speak or write, though there are times when we may set out to be deliberately surprising – but we can set such complications aside. My misgivings set in with this next statement:
It's the italicized passage that I find problematic. That story trajectory looks like a global plan to me. It is a loose plan: it doesn’t dictate specific sentences or words, but it does specify general conditions which are to be met.
Now, much later in his talk Wolfram will say something like this (I don’t have the exact time in the video, so I’m quoting from his paper):
If ChatGPT visits every parameter each time it generates a token, that sure looks “global” to me. What is the relationship between these global calculations and those story trajectories? I surely don’t know.
Perhaps it’s something like this: A story trajectory is a valley in the LLM’s attractor landscape. When it tells a story it enters the valley at one end and continues through to the end, where it exits the valley. That long circuit, which visits each of those 175 billion weights in the course of generating each token, is what keeps it in the valley until it reaches the other end.
I am reminded, moreover, of the late Walter Freeman’s conception of consciousness as arising through discontinuous whole-hemisphere states of coherence succeeding one another at a “frame rate” of 6 Hz to 10 Hz – something I discuss in “Ayahuasca Variations” (2003). It’s the whole hemisphere aspect that’s striking (and somewhat mysterious) given the complex connectivity across many scales and the relatively slow speed of neural conduction.
* * * * *
I was alerted to this issue by a remark made at the blog, Marginal Revolution. On December 20, 2022, Tyler Cowen had linked to an article by Murray Shanahan, Talking About Large Language Models. A commenter named Nabeel Q remarked:
I don’t have any reason to think Wolfram was subject to that confusion. But I think many people are. I suspect that the general public, including many journalists reporting on machine learning, aren’t even aware of the distinction between training the model and using it to make inferences. One simply reads that ChatGPT, or any other comparable LLM, generates text by predicting the next word.
This miscommunication is a MAJOR blunder.