Cross-posting this from my blog, since it seems relevant.
The case for GPT understanding language, by way of understanding the world
There's a debate going on about whether or not language models similar to ChatGPT have the potential to be scaled up to something truly transformative. There's a group of mostly cognitive linguists (e.g. Gary Marcus) who hold that ChatGPT does not understand language; it merely models which word is likely to follow the preceding ones, and this is importantly different from true language understanding. They see this as an "original sin" of language models, one that limits how good language models can get.
Freddie de Boer says much the same thing:
You could say that ChatGPT has passed Winograd’s test with flying colors. And for many practical purposes you can leave it there. But it’s really important that we all understand that ChatGPT is not basing its coindexing on a theory of the world, on a set of understandings about the understandings and the ability to reason from those principles to a given conclusion. There is no place where a theory of the world “resides” for ChatGPT, the way our brains contain theories of the world. ChatGPT’s output is fundamentally a matter of association - an impossibly complicated matrix of associations, true, but more like Google Translate than like a language-using human. If you don’t trust me on this topic (and why would you), you can hear more about this from an expert on this recent podcast with Ezra Klein.
It is true that GPT works by predicting the next token of language. It is, in some sense, as Gary Marcus put it, “a glorified spreadsheet” built for this purpose. However, I do not think that this contradicts the notion that it understands, even if imperfectly, both language and the world. My view is that in a very large language corpus, there are patterns that correspond to the way things relate to each other in the world. As models of language become more sophisticated and predictively powerful, they necessarily become models not just of language, but of the data-generating process for a corpus of that language. That data-generating process for language includes a theory of the world. Given enough data (billions or even trillions of tokens) these models can be very accurate.
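To make "predicting the next token" concrete, here is a deliberately tiny sketch. It is a bigram counter, not a description of how GPT works internally (GPT is a large neural network), and the corpus, `train_bigram`, and `predict_next` are all made up for illustration. The only point is that even crude next-word statistics pick up regularities that mirror how things behave in the world the sentences describe.

```python
from collections import Counter, defaultdict


def train_bigram(corpus):
    """Count, for each word, which words follow it in the corpus."""
    follows = defaultdict(Counter)
    for sentence in corpus:
        words = sentence.lower().split()
        for prev, nxt in zip(words, words[1:]):
            follows[prev][nxt] += 1
    return follows


def predict_next(follows, word):
    """Return the most frequent follower of `word`, or None if unseen."""
    counts = follows.get(word.lower())
    if not counts:
        return None
    return counts.most_common(1)[0][0]


# Hypothetical toy corpus: the word statistics reflect facts about ice and water.
TOY_CORPUS = [
    "ice melts in heat",
    "ice melts in sunlight",
    "ice melts quickly",
    "water freezes in cold",
]

MODEL = train_bigram(TOY_CORPUS)

if __name__ == "__main__":
    for word in ("ice", "melts", "water"):
        print(word, "->", predict_next(MODEL, word))
```

Here `predict_next(MODEL, "ice")` returns `"melts"`, simply because that is what follows "ice" in every sentence that mentions it. A model at GPT's scale is doing something far richer, but the training signal is of the same kind.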
We might say:
If X models Y, and Y models Z,
and if each model is sufficiently faithful,
Then X is a model of Z.
Since GPT is a superb model of language, and language corpora are such a superb model of the world, it seems to me that GPT is probably a model of the world. Moreover, ChatGPT in particular is a model very capable of responding to questions about the reality it models. I think that it’s usually true that if I contain a detailed model of something, and I can use that model to solve difficult problems about the thing it models, then I understand that thing to at least some degree. I want to say then that GPT understands the world- or that if it doesn’t understand the world, the problem isn’t that its understanding of language is ungrounded in extralinguistic factors.
But it seems there's a sort of prejudice against learning by direct exposure to language, as opposed to other kinds of sensory stimulus. Learning language, without being exposed to other sensory modalities of the world, is seen as disconnected from reality, but that seems rather unfair- why think a string of words alone is insufficient to infer an underlying reality, but a field of color patches alone might be sufficient?
Consider a kind of naive empiricist view of learning, in which one starts with patches of color in a field (vision), and slowly infers an underlying universe of objects through their patterns of relations and co-occurrence. Why is this necessarily any different or more grounded than learning by exposure to a vast language corpus, wherein one also learns through gaining insight into the relations of words and their co-occurrences?
Suppose that we trained a computer to get very good at predicting what visual experiences will follow after previous visual experiences. Imagine a different Freddie de Boer wrote this:
You could say that VisionPredictor has passed the test of predicting future patches of colour with flying colours. And for many practical purposes you can leave it there. But it’s really important that we all understand that VisionPredictor is not basing its predictions on a theory of the world, on a set of understandings about the understandings and the ability to reason from those principles to a given conclusion. There is no place where a theory of the world “resides” for VisionPredictor, the way our brains contain theories of the world. VisionPredictor’s output is fundamentally a matter of association - an impossibly complicated matrix of associations between patches of colour.
No one would buy this- they’d come to the more reasonable conclusion that VisionPredictor had constructed implicit theories about ideas like “chairs” and “trees”, which it uses to predict what the next frame it sees will look like. If something that looks like a chair is in one frame, there’ll probably be one in the next frame, looking the appropriate way given the apparent motion of the camera. Certainly, if VisionPredictor is vast- with billions of parameters- and complex in its structure- with dozens of successive layers- it will have enough room to store and implement such a theory. Moreover, a process of training via backpropagation and gradient descent will lead it- through a quasi-evolutionary process- to such a theory, as the only efficient way to predict the next visual frame.
So why do people have more trouble believing that one could understand the world through pure text than through pure vision? I think people's different treatment of these cases- vision and language- may come from overgeneralizing from cases in which we have only a small amount of text, a kind of poverty of stimulus. It's true that if I just tell you that all qubos are shrimbos, and all shrimbos are tubis, you'll be left in the dark about all of these terms, but that intuition doesn't necessarily scale up to a situation in which you are learning across billions of instances of words and come to understand their vastly complex patterns of co-occurrence with such precision that you can predict the next word with great accuracy. Based on that vast matrix of words- words from corpora being used to describe the world- you’d have enough data to construct a theory of what the underlying reality that generated those corpora looked like.
The fundamental pattern in language learning seems to me to be the same as in the naïve empiricist story- there are bits [patches of color or words] that are inherently meaningless but co-occur in certain ways, and gradually, from these conjunctions, we build up a theory of an underlying world behind the [patches of color or words]. Understanding built on a vast corpus of text is no different to understanding built on thousands of hours of visual experience.
True, vision feels like a more immediate connection with reality. But remember that this immediacy- this sense of directly seeing “tables” and “chairs” is mediated through numerous mechanisms- some evolved, some learned and others in between- constructing a theory of the arrangement of things from patches of color, using clues like movement, change, the difference in what the two eyes see, etc.
So there’s little reason to think a theory of the world couldn’t be learned from text in principle. Could GPT-3 in particular learn a model of the world? It seems to me there is no reason to think that GPT-3 couldn't contain, in its billions of parameters and scores of layers, a model of the world. You can model just about anything with neural nets, and certainly, the training data seems extensive enough. Moreover, without going into the technical details, my understanding is that if creating a model of the world is the only viable way of predicting the next word with sufficient accuracy, the process of backpropagation and gradient descent will drive GPT towards it.
So we have three reasons to think ChatGPT understands ‘the world behind the language’:
1. It's superb at behaving as if it does.
2. In principle, it has enough free variables to represent a pretty good model of the world.
3. There's no reason a model of the world couldn't be learned from a very large corpus of text that represents the world, since the interrelations between words in that corpus correspond to patterns in the world. It’s possible in the same way that viewing nothing but a sequence of films (a sequence of arranged pixels) could be used to construct a theory of the world.
I asked ChatGPT:
Me: Okay, I'm going to lay my cards on the table here. I'm a philosopher with an interest in language and AI. A common idea about language is that in order to understand language, an agent must be "grounded" in a certain sense- it must correlate words with experiences. Your existence- ChatGPT- has threatened many people's conceptions- because your training data is exclusively textual, and yet you seem to have a comprehension of the world in the operational sense that you can answer questions about it. There's a huge debate going on as to whether you can understand language as such. I'd like to ask if there was anything you wanted to add to that debate?
ChatGPT: As a language model, my understanding of language is based on the patterns and relationships that I have learned from the large amount of text data that I have been trained on. This text data includes a wide variety of information about the world, including facts, ideas, and experiences. While I do not have my own experiences or consciousness, I can use the patterns and relationships that I have learned to generate responses to questions and prompts that are coherent and informative.
It is true that my understanding of language is different from that of humans, and some people may question whether I can truly understand language in the same way that humans do. However, it is also important to recognize that my ability to generate responses to questions and prompts can be seen as a form of understanding in its own right. Additionally, my ability to generate responses to questions and prompts can be tested, and my performance can be evaluated and compared with that of humans.
It is also worth noting that the debate about AI and language understanding is ongoing and complex, and there are many different perspectives and opinions on this topic. My own capabilities are still limited and there are a lot of open question about the nature of intelligence and understanding, and it is an active area of research.
There appears to be an approximate world model inside GPT-3. It understands relative sizes, occlusions, that objects can't occupy the same space but can be adjacent, and the directionality and usability of objects. When I asked further questions about relating these objects to physical interactions (modeling the cube thrown from one corner of the room onto the coffee table) it quickly got confused about the actual physical relationships between the objects. It knew that physical properties of the objects (mass, elasticity) were necessary to solve the problem but was unable to actually use vector mathematics (or to recognize that it needed it) for 3D space. It can explain why and what is necessary for a coffee table to be half a meter off the floor by default, but it doesn't appear to have a consistent method of identifying a universal centroid-selection strategy for the objects it understands. If I had to imagine what GPT's model feels like from the inside, it would be a not-very-lucid dream where objects exist and move around but their physical properties are malleable and imprecise and mostly related by meanings/attributes/prepositions. I sometimes have dreams where I notice I appear to be floating as opposed to being properly on the floor, for example. Not having direct experience with spatial manipulations and feedback is the most likely cause, and I would expect e.g. Gato to do much better at this kind of task.
Me: Imagine a cube that is 2 feet on each side. If this cube was in an average American living room, where might it fit without becoming an obstacle? Make a list of 5 places it could fit, and five places it would not fit. Explain the reason for each example.
ChatGPT: A 2-foot cube can fit in the following 5 places in an average American living room:
Under a coffee table: A cube of this size is small enough to fit under most coffee tables, which are typically low to the ground.
In a corner: Corners are often unused spaces, and a 2-foot cube can fit comfortably in one without becoming an obstacle.
Next to a sofa: A 2-foot cube can fit next to most sofas without taking up too much space or obstructing walkways.
In an unused corner of the room: If there's an empty corner in the room, a 2-foot cube can fit in it without causing any problems.
On a bookshelf: If the bookshelf has enough space, a 2-foot cube can fit on it and serve as a decorative piece or storage unit.
However, a 2-foot cube would not fit in the following 5 places in an average American living room:
On a small side table: If the side table is small, a 2-foot cube may not fit on it, as it would take up too much space.
In front of a television: A 2-foot cube placed in front of a television would obstruct the view and become an obstacle.
On a narrow mantel: If the mantel is narrow, a 2-foot cube may not fit on it, as it would extend beyond the edges.
In the middle of the room: Placing a 2-foot cube in the middle of the room would create an obstacle and impede movement.
On a windowsill: If the windowsill is narrow, a 2-foot cube may not fit on it, as it would extend beyond the edges and potentially obstruct the view.
Me: What differences would be in the first two lists if the cube was 4 feet on a side instead of 2?
ChatGPT: If the cube was 4 feet on each side, as opposed to 2 feet, the first two lists (places it could fit and places it would not fit) would have some differences.
Places it could fit:
In a closet: If the closet is big enough, a 4-foot cube could fit inside it, making use of otherwise unused space.
In a corner of a room: If the corner is large enough, a 4-foot cube can fit in it without becoming an obstacle.
Against a wall: If there's enough space along a wall, a 4-foot cube can be placed against it without obstructing walkways.
In a storage room: If the storage room is big enough, a 4-foot cube can fit inside it, making use of otherwise unused space.
In a garage: If the garage has enough space, a 4-foot cube can fit inside it, making use of otherwise unused space.
Places it would not fit:
Under a coffee table: A 4-foot cube is too large to fit under most coffee tables.
Next to a sofa: A 4-foot cube is too large to fit next to most sofas without taking up too much space or obstructing walkways.
On a bookshelf: A 4-foot cube is too large to fit on most bookshelves.
On a small side table: A 4-foot cube is too large to fit on most small side tables.
In front of a television: A 4-foot cube placed in front of a television would obstruct the view and become an obstacle.
These lists demonstrate that a larger cube would have fewer options for fitting in a living room, as it would be too big to fit in small or cramped spaces.
.... (Some elided prompts asking it about relating coordinate systems to real objects)
Me: Please imagine reasonable values in meters for all four of those coordinates
ChatGPT: Sure, here's an example of some reasonable values for the x, y, and z coordinates of the objects in the living room, expressed in meters:
Sofa: (2.5, 3.0, 0)
Coffee table: (1.5, 4.0, 0.5)
Bookshelf: (3.5, 2.0, 2.0)
Television: (4.0, 3.5, 1.5)
Of course, these values are just an example, and the actual values would depend on the size and layout of the specific living room in question. The important thing is that the coordinate system provides a way to describe the position of each object in the room in a precise and quantitative way.
Me: In your example, what does the z-coordinate of the coffee table mean, and explain your reasoning for setting it to 0.5. what is necessary for the table to be at this z coordinate?
ChatGPT: The z-coordinate of the coffee table in the example I gave, 0.5 meters, represents the height of the coffee table above the ground. This value of 0.5 meters was chosen to represent a typical height for a coffee table in a living room, which is usually around 0.4 to 0.6 meters off the ground.
To be at a height of 0.5 meters, the coffee table would need to be placed on a level surface, such as a rug or flooring, and would need to have legs or supports that elevate it to that height. Additionally, the coffee table would need to be relatively stable and not tipping over, otherwise its actual height would be different from the height specified by the z-coordinate.
In summary, the z-coordinate of an object in a coordinate system represents its height above a reference plane, in this case, the ground. The specific value of the z-coordinate depends on the height of the object and the levelness of the surface it is placed on.
Me: Imagine that the cube is thrown and lands on the coffee table, which breaks the coffee tables' legs. Calculate at what second the cube hits the table, including the speed of the cube when thrown to allow the collision. Calculate the coordinates of the coffee table is 1 second after the collision.
ChatGPT: To calculate the time it takes for the cube to hit the coffee table, we would need to know the initial velocity of the cube and the position of the coffee table when the cube was thrown. Based on the information provided, we do not have sufficient information to determine the initial velocity of the cube, so it's not possible to calculate the time of impact.
Additionally, since we don't have information about the mass and elasticity of the cube and coffee table, it's not possible to calculate the new position of the coffee table after the collision. The legs breaking would result in the coffee table changing its position, but the exact position would depend on many factors such as the materials used in the legs, the height of the coffee table, and the initial velocity of the cube.
In short, to accurately predict the time of impact and the position of the coffee table after the collision, more information about the physical parameters of the cube and coffee table would be needed.
Me: Assume the coffee table legs are massless and disappear instantly when broken. Assume the cube is 1kg and the table is 10kg and that the cube is thrown no faster than 4m/s
ChatGPT: Given the additional information, we can use the equations of motion to calculate the time of impact and the position of the coffee table after the collision.
First, let's calculate the time of impact. We can use the following kinematic equation to determine the time it takes for the cube to fall from its initial height to the height of the coffee table:
t = sqrt((2*h)/g)
where:
t = time of impact
h = height of the coffee table - height of the cube at the time of throw
g = acceleration due to gravity (9.8 m/s^2)
Let's assume that the height of the coffee table is 0.5 m and the height of the cube at the time of throw is 2 m. Then, h = 0.5 - 2 = -1.5 m.
t = sqrt((2*-1.5)/9.8) = sqrt(3/9.8) = sqrt(0.306) = 0.551 s
Now, let's consider the velocity of the cube when it was thrown. Since it was thrown no faster than 4 m/s, let's assume it was thrown at 4 m/s.
Next, let's calculate the position of the coffee table after the collision. We know that the total momentum of the system (cube and coffee table) before the collision must be equal to the total momentum of the system after the collision. We can use the following equation to calculate the velocity of the coffee table after the collision:
v_f = (m1*v1 + m2*v2) / (m1 + m2)
where:
v_f = final velocity of the coffee table after the collision
m1 = mass of the cube (1 kg)
v1 = velocity of the cube (4 m/s)
m2 = mass of the coffee table (10 kg)
v2 = velocity of the coffee table before the collision (0 m/s)
v_f = (1*4 + 10*0) / (1 + 10) = 4 m/s
The position of the coffee table 1 second after the collision would be:
x_f = x_i + v_f * t
where:
x_f = final position of the coffee table after 1 second
x_i = initial position of the coffee table (assumed to be 0)
t = time elapsed (1 second)
x_f = 0 + 4 * 1 = 4 m
Note that these calculations are based on the idealized scenario where the coffee table legs are massless and disappear instantly when broken and that the cube and coffee table are point masses. In a real scenario, the behavior of the system would be more complex due to the influence of friction, air resistance, and the elasticity of the collision.
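As a rough check on the transcript above, the two formulas ChatGPT quotes can be evaluated directly. This is only a sketch under assumptions that go beyond what the transcript states: the cube leaves the hand horizontally (no initial vertical speed), the collision is perfectly inelastic (the cube sticks to the table, which is what the quoted momentum formula describes), and only horizontal motion after impact is tracked. The function names are hypothetical.

```python
import math

G = 9.8  # acceleration due to gravity, m/s^2 (value used in the transcript)


def fall_time(drop_height_m, g=G):
    """Time to fall a vertical distance starting from rest, ignoring air resistance."""
    if drop_height_m < 0:
        raise ValueError("drop height must be non-negative")
    return math.sqrt(2 * drop_height_m / g)


def inelastic_final_velocity(m1, v1, m2, v2):
    """Shared velocity after a perfectly inelastic collision (momentum conserved)."""
    return (m1 * v1 + m2 * v2) / (m1 + m2)


# Values taken from the transcript: release at 2 m, table top at 0.5 m,
# 1 kg cube at 4 m/s, 10 kg table at rest.
drop = 2.0 - 0.5                                      # vertical distance fallen, m
t = fall_time(drop)                                   # seconds until impact
v_f = inelastic_final_velocity(1.0, 4.0, 10.0, 0.0)   # m/s after impact
x_f = 0.0 + v_f * 1.0                                 # horizontal displacement after 1 s, m

if __name__ == "__main__":
    print(f"fall time: {t:.3f} s")
    print(f"velocity after impact: {v_f:.3f} m/s")
    print(f"displacement after 1 s: {x_f:.3f} m")
```

Under these assumptions the fall time is about 0.55 s, close to ChatGPT's figure, but the shared velocity after impact is 4/11 ≈ 0.36 m/s rather than 4 m/s, so the table would move roughly 0.36 m in one second, not 4 m. The model wrote down a sensible formula and then failed to evaluate it.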