I seem to be in the process of figuring out what I’ve learned about Large Language Models from playing around with ChatGPT since December of last year. I’ve already written three posts during this phase, which I’ll call my Entanglement phase, since this re-thinking started with the idea that entanglement is the appropriate way to think about word meaning in LLMs. This post has three sections.
The first section is stuff from Linguistics 101 about form and meaning in language. The second argues that LLMs are an elaborate structure of relational meaning between words and higher order structures. The third is about the distinction between sentences and higher-level structures and the significance that has for learning. I conjecture that there will come a point during training when the engine learns to make that distinction consistently, and that that point will lead to a phase change – grokking? – in its behavior.
Language: Form and Meaning
Let us start with basics: Linguists talk of form and meaning; Saussure talked of signifier and signified. That is to say, words consist of a form, or signifier, a physical signal such as a sound or a visual image, which is linked to or associated with a meaning, or signified, which is not so readily characterized and, in any event, is to be distinguished from the referent or interpretant (to use Peirce’s term). Whatever meaning is, it is something that exists in the minds/brains of speakers and only there.
Large Language Models are constructed over collections of linguistic forms or signifiers. When humans read texts generated by LLMs, we supply those strings of forms with meanings. Does the LLM itself contain meanings? That’s a tricky question.
On one sort of account, favored by at least some linguists and others, no, they do not contain meanings. On a different sort of account, yes, they do. For the LLM is a sophisticated and complicated structure based on co-occurrence statistics of word forms. This is sometimes referred to in the literature as inferential meaning, as opposed to referential meaning. I prefer the term relational meaning, and see it in contrast to both adhesion and intention.
While I do not believe that relational meaning is fully equivalent to meaning, as the term is ordinarily used (and in academic discourse as well), I don’t wish to discuss that matter here. See my blog post, The issue of meaning in large language models (LLMs), for a discussion of these terms. In this post I’m concerned with what LLMs can accomplish through relational meaning alone.
Relational meaning in an LLM [+ recursion]
The primary vehicle for relational meaning is a word embedding vector associated with each word. It is my understanding that in the case of the GPT-3 series, including ChatGPT, that vector has roughly 12 thousand components (12,288 in the largest GPT-3 model). So, the word embedding vector locates each word in a roughly 12K-dimensional space that characterizes relationships among words.
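To make that concrete, here is a minimal sketch, not GPT’s actual machinery: each token id simply indexes a row of a large embedding matrix, and “relational meaning” shows up as geometry, for example as cosine similarity between rows. The matrix here is random and the dimensions merely illustrative.

```python
# Minimal sketch (illustrative, not GPT's actual weights): each token id
# indexes one row of the embedding matrix, and relationships between words
# are carried by the geometry of those rows.
import numpy as np

vocab_size, d_model = 50_000, 12_288   # d_model matches the largest GPT-3 model
rng = np.random.default_rng(0)
embedding = rng.normal(size=(vocab_size, d_model)).astype(np.float32)

def vector_for(token_id: int) -> np.ndarray:
    """Look up the embedding row for a token id."""
    return embedding[token_id]

def relatedness(id_a: int, id_b: int) -> float:
    """Cosine similarity: one crude way to read off 'relational meaning'."""
    a, b = vector_for(id_a), vector_for(id_b)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# With random weights this is near zero; in a trained model, related words
# end up with systematically higher similarity.
print(relatedness(101, 202))
```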
Words are not, however, represented in LLMs as alphanumeric ASCII strings. Rather, they are tokenized. In GPT, byte pair encoding (BPE) is used. The details are irrelevant here. What matters for my purposes is that the BPE tokens function as mediators between the alphanumeric strings at input and output and the meaning-bearing vectors.
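Here is a small illustration of that mediating role. It assumes OpenAI’s tiktoken tokenizer library is installed, and the particular encoding name is just an example: the string is split into subword tokens, and only the resulting integer ids ever reach the model.

```python
# Illustrative only: BPE tokens mediate between text and the model's vectors.
# Assumes the tiktoken library (pip install tiktoken); "gpt2" names one of its
# bundled BPE encodings.
import tiktoken

enc = tiktoken.get_encoding("gpt2")
text = "Relational meaning is woven from co-occurrence."
ids = enc.encode(text)                     # alphanumeric string -> token ids
pieces = [enc.decode([i]) for i in ids]    # the text fragment each id stands for

print(ids)      # a list of integers
print(pieces)   # subword fragments, not necessarily whole words
assert enc.decode(ids) == text             # the ids map back to the string losslessly
```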
While one might be tempted to think of the relationship between a token and its associated vector as being like the form/meaning or signifier/signified relationship, that is not the case. We can think of word forms or signifiers as forming an index over the space of meanings/signifieds – see the discussion of indexing in the paper David Hays and I published in 1988, Principles and Structure of Natural Intelligence. The tokens do not index anything. Their sole function is to mediate between alphanumeric strings and meaning vectors. From this it follows that an LLM is a structure of pure relational meaning.
Think about that for a minute. The model is a structure of relationships, no more, no less. Those relationships ‘encode’ not only word meanings, but the meanings of higher order structures as well, sentences and even whole texts.
This implies, in turn, that, whatever other differences there are between human memory and language (as realized in the brain) and that of LLMs, there is a fundamental architectural difference. LLMs are single-stream processors while the human system is a double-stream processor. The world of signifieds is a single stream unto itself; call it the primary stream. The addition of signifiers adds a secondary stream that can act on the world of signifieds and manipulate it – see, e.g., Vygotsky’s account of language acquisition. Note, however, that as signifiers themselves can be objects of perception and conceptualization, the primary stream can perceive and conceptualize the secondary stream – Jakobson’s metalingual function. Thus recursion is explicitly introduced into the system.
How is this structure of relationships created? [grokking]
We’re told that it’s created by having the engine predict the next token in a text. The parameter weights of the model are then adjusted according to whether the prediction was correct, which requires one kind of adjustment, or incorrect, which requires a different kind. This continues, word after word, through thousands and millions of texts.
The predictions are based on the state of the model at the time the prediction is made, but they also take into account the embedding vector for the word that is the “jumping off point” for the prediction. Once a prediction has been made, its success appraised, and the model adjusted, the next word in the input string becomes the jumping off point for a new prediction. And so on.
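A schematic sketch of that loop, in PyTorch and purely illustrative (the model here stands for any network that maps token ids to next-token logits), might look like this:

```python
# Schematic next-token training step (illustrative, not OpenAI's code).
# `model` is assumed to be any network that maps token ids to next-token logits.
import torch
import torch.nn.functional as F

def training_step(model, optimizer, token_ids: torch.Tensor) -> float:
    """One pass over a batch of token-id sequences, shape [batch, seq_len]."""
    inputs = token_ids[:, :-1]    # each position is a "jumping off point"
    targets = token_ids[:, 1:]    # ...and the following token is what it must predict
    logits = model(inputs)        # [batch, seq_len - 1, vocab_size]
    loss = F.cross_entropy(       # how far off were the predictions?
        logits.reshape(-1, logits.size(-1)),
        targets.reshape(-1),
    )
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()              # adjust the parameter weights accordingly
    return loss.item()
```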
In this way a fabric of relationships is woven among words and strings. Next-word-prediction is a device for weaving this fabric.
Now, I have read (though I cannot offer a citation at the moment) that language syntax tends to be constructed in the first few layers of deep neural nets. As there is a major difference between syntactic structure and discourse structure, it makes sense that syntactic structure should be realized in a specific part of the model.
Transitions within a sentence are tightly constrained by the topic and syntax of the sentence. Transitions from one sentence to the next, however, are considerably looser. There are no syntactic constraints at all across the sentence boundary; the constraints are entirely semantic and thematic. Just how tight those constraints are depends on the structure of the document, something I discussed in ChatGPT tells stories, and a note about reverse engineering: A Working Paper, pp. 3 ff.
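One rough way to see that difference empirically, a sketch only, assuming the Hugging Face transformers library and the small GPT-2 checkpoint, is to compare the model’s next-token uncertainty mid-sentence with its uncertainty just after a sentence-ending period:

```python
# Rough probe (illustrative): next-token entropy should generally be higher at
# sentence boundaries, where the constraints are semantic and thematic rather
# than syntactic. Assumes Hugging Face transformers and the "gpt2" checkpoint.
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")
model.eval()

text = ("The cat sat on the mat. It was warm in the sun. "
        "A dog walked past the window.")
ids = tokenizer(text, return_tensors="pt").input_ids

with torch.no_grad():
    logits = model(ids).logits[0]                      # [seq_len, vocab]

probs = torch.softmax(logits, dim=-1)
entropy = -(probs * torch.log(probs)).sum(dim=-1)      # uncertainty at each position

tokens = tokenizer.convert_ids_to_tokens(ids[0])
for tok, h in zip(tokens, entropy):
    marker = "  <- predicting across a sentence boundary" if tok == "." else ""
    print(f"{tok!r:>12}  next-token entropy = {h.item():.2f}{marker}")
```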
What I’m wondering is whether there is a certain point during the training process at which the model realizes there is a distinction between transitions from one word to the next within a sentence and transitions from the end of a sentence to the beginning of the next. I would think that realizing that would increase the accuracy of the engine’s predictions. If the engine doesn’t recognize that distinction and take it into account, its predictions within sentences will be needlessly scattershot, leading to a high error rate, and perhaps its predictions between sentences will be too constrained, forcing it to ‘waste’ predictions while exploring the upcoming semantic space.
Would consistently recognizing the distinction between these two kinds of predictions lead to such dramatically improved performance that we can talk of a phase shift? Would that be the kind of phase shift referred to as grokking in the interpretability literature (Nanda, Chan, et al. 2023)? That kind of behavior has been observed in a recent study:
Angelica Chen, Ravid Schwartz-Ziv, Kyunghyun Cho, Matthew L. Leavitt, and Naomi Saphra, Sudden Drops in the Loss: Syntax Acquisition, Phase Transitions, and Simplicity Bias in MLMs, arXiv:2309.07311v1 [cs.CL] 13 Sept 2023.
Most interpretability research in NLP focuses on understanding the behavior and features of a fully trained model. However, certain insights into model behavior may only be accessible by observing the trajectory of the training process. In this paper, we present a case study of syntax acquisition in masked language models (MLMs). Our findings demonstrate how analyzing the evolution of interpretable artifacts throughout training deepens our understanding of emergent behavior. In particular, we study Syntactic Attention Structure (SAS), a naturally emerging property of MLMs wherein specific Transformer heads tend to focus on specific syntactic relations. We identify a brief window in training when models abruptly acquire SAS and find that this window is concurrent with a steep drop in loss. Moreover, SAS precipitates the subsequent acquisition of linguistic capabilities. We then examine the causal role of SAS by introducing a regularizer to manipulate SAS during training, and demonstrate that SAS is necessary for the development of grammatical capabilities. We further find that SAS competes with other beneficial traits and capabilities during training, and that briefly suppressing SAS can improve model quality. These findings reveal a real-world example of the relationship between disadvantageous simplicity bias and interpretable breakthrough training dynamics.
Cross posted from New Savanna.
* * * * *
More later.