From ChatGPT:
Prompt:
The following story has a subtle continuity error in the last paragraph. See if you can notice it.
[story here]
What is the error?
Response:
The error in the last paragraph is that it contradicts the previous statement that "all life" would be wiped out by the darkness. The last paragraph states that Harold's bravery and determination will be remembered by "those who heard his story," implying that there are still living beings who heard his story and can remember him.
This seems to straightforwardly contradict your claim that "a language model alone can't really grok the end of the world properly."
It seems relevant that you had to prompt ChatGPT specifically to notice the error, by telling it that an error existed, and even which paragraph the error was located in. For a system whose primary use-case is generative, it's important that the system not just have the ability to identify consistency failures, but to avoid generating them in the first place. If the system itself generates stories with inconsistencies, even if it can then point out those inconsistencies when prompted (albeit with substantial amounts of handholding along the way), it seems reasonable to maintain that the system in some sense doesn't "grok" the distinction in question.
Incidentally, you can give ChatGPT a completely fine and consistent story, tell it to spot an inconsistency, and it'll happily confabulate one for you. Naturally, it has a bias for plausible-sounding errors, and so if this tendency to prefer plausible errors coincides with a story in which an actual inconsistency exists, it's quite likely that its response will point out the real inconsistency, since that's a more plausible error than a confabulated one. In some sense, you could argue that this means it just got lucky (though not entirely lucky, of course, since it obviously needs to be able to recognize the real inconsistency as in some sense "more plausible" than any confabulated ones).
And on the flipside of the coin, there are some domains in which "board vision" is so hard that ChatGPT simply flails around, a key example again being chess. If you give it a corrupted PGN file with illegal moves and ask it to identify the first illegal move (with a caveat being that the first illegal move occurs appreciably far into the game, so it doesn't e.g. occur within opening theory, which ChatGPT has memorized), it basically never identifies the correct move, and never, ever gives the correct explanation for why the move is illegal.
You could argue that these are all merely signs that ChatGPT's understanding hasn't reached the same level as that of humans (or more generally, an entity with what Marcello calls "board vision"). But it's also possible to model the situation such that the thing Marcello is calling "board vision" and the thing ChatGPT has are appreciably distinct from each other, in a way that isn't a mere difference of degree—so that e.g. training bigger models doesn't necessarily fix the issue. Certainly, my preferred usage of the word "grok" doesn't usually make room for someone to "grok" something while still consistently erring unless handheld; if a human did that, I'd simply say they didn't grok the topic in question, and I don't really see a good reason to alter that standard for ChatGPT.
(Oh, and w.r.t. the point about "training bigger models": since Sydney/Bing seems like a big deal these days, it seems worth saying explicitly that everything I've seen from Sydney remains consistent with the idea that it still, nonetheless, lacks what Marcello is calling "board vision"—at least if you buy the frame under which "board vision" is basically binary; you either have it or you don't.)
When you say "board vision" what you are really saying is the model needs some kind of mental representation of the world. For example, on a whiteboard, you could have a crude picture of "the world" with stick figures for the people in it, then "the apocalypse as some crude drawing of something bad about the same scale as the world", then "the world" has no people in it.
Notably this works extremely well for humans. I have found it basically impossible to express even the most modestly complex idea without a tool like this. Humans just fail on verbal descriptions above a certain level of complexity, even when communicating with humans at statistically unlikely intelligence levels. Humans only have "board vision" for the narrow domains they are experts in - in those domains they don't need a whiteboard.
So you need some type of schema so a large class of hypotheses can be represented (images are probably not a good way, you need a graph structure), and then the model would need to generate its outputs in multiple passes, where it constructs this representation then constructs text based on the original prompt + representation, and so on.
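To make the graph idea concrete, here is a minimal sketch of what such a representation could catch in the goose story. Everything in it is an assumption for illustration: the WorldGraph class, the "listeners" node, and the "remember" relation are made up, and a real system would have to extract these from text rather than have them typed in by hand.

```python
from dataclasses import dataclass, field


@dataclass
class WorldGraph:
    """Hypothetical world state: nodes are entities, edges are claims."""
    nodes: set = field(default_factory=set)
    edges: set = field(default_factory=set)  # (subject, relation, object)

    def remove_nodes(self, gone: set) -> None:
        # A story event that takes entities out of the world.
        self.nodes -= gone

    def dangling_edges(self) -> list:
        # A claim is inconsistent if its subject no longer exists.
        # (The object may be gone: the living can remember the dead.)
        return sorted(e for e in self.edges if e[0] not in self.nodes)


world = WorldGraph(nodes={"Harold", "listeners"})
world.remove_nodes({"Harold", "listeners"})  # the darkness wipes out all life
world.edges.add(("listeners", "remember", "Harold"))  # last-paragraph claim
print(world.dangling_edges())
```

The second pass would then regenerate any text whose claims show up as dangling edges.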
Yes but it generated the text the first pass with the error.
Can you have your model do multiple passes of editing with prompts to describe the kind of errors it is searching for?
I wonder how long it's going to be until you can get an LLM which can do the following with 100% accuracy.
I don't care about the ai winning or losing, in fact, I would leave that information to the side. I don't care if this test is synthetic, either. What I want is:
The post I'm working on tries to call out, explicitly, long-term memory without "hacks" like context hacks or databases/lookup hacks.
Most ai groups seem to not be releasing their LLMs, and so the incentive on this kind of test would be to defect, as we saw with Dota 2, AlphaStar, and their cohort, where they all used significant shortcuts so they could get a spicy paper title and/or headline. Neutral third parties should also be allowed to review the implemented ai codebase, even if the weights/code aren't released.
The chess "board vision" task is extraordinarily hard for humans who are spending 1 second per token and not using an external scratchspace. It's not trivial for an untrained human even if they spend multiple seconds per token. (I can do it only by using my visual field, e.g. it helps me massively to be looking at a blank 8 x 8 chessboard because it gives a place for the visuals to live and minimizes off-by-one errors.)
Humans would solve this prediction task by maintaining an external representation of the state of the board, updating that representation on each move, and then re-reading the representation each time before making a prediction. I think GPT-3.5 will also likely do this if asked to use external tools to make a prediction about the next move. (And of course when we actually play chess we just do it by observing the state of the board, as represented to us by the chess board or chess program, prior to making each move.)
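As a rough sketch of what that external representation amounts to, the snippet below keeps a dictionary of squares and updates it on every move. It assumes coordinate notation like "e2e4" rather than the PGN-style moves discussed above, and it does no legality checking at all; the function names are my own.

```python
FILES = "abcdefgh"
BACK_RANK = "RNBQKBNR"


def starting_board() -> dict:
    """Map squares like 'e2' to piece letters (uppercase = White)."""
    board = {}
    for i, f in enumerate(FILES):
        board[f + "1"] = BACK_RANK[i]
        board[f + "2"] = "P"
        board[f + "7"] = "p"
        board[f + "8"] = BACK_RANK[i].lower()
    return board


def apply_move(board: dict, move: str) -> None:
    """Apply a coordinate move such as 'e2e4' in place (no legality checks)."""
    src, dst = move[:2], move[2:4]
    if src not in board:
        raise ValueError(f"no piece on {src}")
    # Moving onto an occupied square simply overwrites it (a capture).
    board[dst] = board.pop(src)


board = starting_board()
for mv in ["c2c4", "e7e5", "g2g3", "g8f6", "f1g2"]:
    apply_move(board, mv)
print(board["g2"], board["f6"])
```

Re-reading `board` before each prediction is the "look at the chessboard" step; the open question in this thread is whether a model can do the equivalent without being handed such a structure.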
It seems like a mistake to analogize a forward pass of the transformer to a human using external tools, if you want to make meaningful comparisons.
You might learn something from such a test, but you wouldn't learn much about how AI performance compares to human performance, or when AI might have a transformative impact.
It seems like a mistake to analogize a forward pass of the transformer to a human using external tools, if you want to make meaningful comparisons.
That may be, but it also seems to me like a mistake to use as your example a human who is untrained (or at least has had very little training), instead of a human whose training run has basically saturated the performance of their native architecture. Those people do, in fact, play blindfold chess, and are capable of tracking the board state perfectly without any external visual aid, while playing with a time control of ~1 minute per player per game (which, if we assume an average game length of 80 moves, comes out to ~1.5 seconds per move).
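For what it's worth, the ~1.5 seconds figure only comes out if "80 moves" counts both players' moves (so 40 each); that reading is an assumption on my part. A quick check:

```python
seconds_per_player = 60   # ~1 minute per player per game
total_moves = 80          # assumed to count both sides' moves
moves_per_player = total_moves // 2
seconds_per_move = seconds_per_player / moves_per_player
print(seconds_per_move)
```

If "80 moves" instead meant 80 moves per side, the budget would halve to 0.75 seconds per move.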
Of course, that comparison again becomes unfair in the other direction, since ChatGPT hasn't been trained nearly as exhaustively on chess notation, whereas the people I'm talking about have dedicated their entire careers to the game. But I'd be willing to bet that even a heavily fine-tuned version of GPT-3 wouldn't be able to play out a chess game of non-trivial length, while maintaining legality throughout the entire game, without needing to be re-prompted. (And that isn't even getting into move quality, which I'd fully expect to be terrible no matter what.)
(No confident predictions about GPT-4 as of yet. My old models would have predicted a similar lack of "board vision" from GPT-4 as compared with GPT-3, but I trust those old models less, since Bing/Sydney has managed to surprise me in a number of ways.)
ETA: To be clear, this isn't a criticism of language models. This whole task is trying to get them to do something that they're practically architecturally designed to be bad at, so in some sense the mere fact that we're even talking about this says very impressive things about their capabilities. And obviously, CNNs do the whole chess thing really, really well—easily on par with skilled humans, even without the massive boost offered by search. But CNNs aren't general, and the question here is one of generality, you know?
I said that playing blindfolded chess at 1s/move is "extraordinarily hard;" I agree that might be an overstatement and "extremely hard" might be more accurate. I also agree that humans don't need "external" tools; I feel like the whole comparison will come down to arbitrary calls like whether a human explicitly visualizing something or repeating a sound to themself is akin to an LM modifying its prompt, or whether our verbal loop is "internal" whereas an LM prompt is "external" and therefore shows that the AI is missing the special sauce.
Incidentally, I would guess that a 100B model trained on 100B chess games will learn to only make valid moves with similar accuracy to a trained human. But this wouldn't affect my views about AI timelines.
My proposed experiment / test is trying to avoid analogizing humans, but rather scope out places where the ai can't do very well. I'd like to avoid accidentally overly-narrow-scoping the vision of the tests. It won't work with an ai network where the weights are reset every time.
An alternative, albeit massively-larger-scale experiment might be:
Will a self-driving car ever be able to navigate from one end of a city to another, using street signs and just learning the streets by exploring it?
A test of this might be like the following:
I think this kind of measuring would tell us how well our ai can handle open-endedness and help us understand where the void of progress is, and I think a small-scale chess experiment like this would help us shed light on bigger questions.
Just seems worth flagging that humans couldn't do the chess test, and that there's no particular reason to think that transformative AI could either.
I'm confused. What I'm referring to here is https://en.wikipedia.org/wiki/Blindfold_chess
I'm not sure why we shouldn't expect an ai to be able to do well at it?
But humans play blindfold chess much slower than they read/write moves, they take tons of cognitive actions between each move. And at least when I play blindfold chess I need to lean heavily on my visual memory, and I often need to go back over the game so far for error-correction purposes, laboriously reading and writing to a mental scratchspace. I don't know if better players do that.
I'm not sure why we shouldn't expect an ai to be able to do well at it?
But an AI can do completely fine at the task by writing to an internal scratchspace. You are defining a restriction on what kind of AI is allowed, and I'm saying that human cognition probably doesn't satisfy the analogous restrictions. I think to learn to play blindfold chess humans need to explicitly think about cognitive strategies, and the activity is much more similar to equipping an LM with the ability to write to its own context and then having it reason aloud about how to use that ability.
The reason I don't want a scratch space is that I view scratch space and context as equivalent to giving the ai a notecard that it can peek at. I'm not against having extra categories or asterisks for the different kinds of ai for the small test.
Thinking aloud and giving it scratch space would mean it's likely to be a lot more tractable for interpretability and alignment research, I'll grant you that.
I appreciate the feedback, and I will think about your points more, though I'm not sure if I will agree.
This feels to me like the sort of thing that should be possible to do using openai's fine tuning api.
Unfortunately, what I am proposing is not possible with current language models, as they don't work like that.
Interesting. My mental model says that by fine tuning a language model to be ready to output the contents of any square of a chess board, it should be possible to make it keep a model of that chess board even without making it output that model in inference mode.
I think I will have to try it out to see if it works.
What I'm asking with this particular test is, can an ai play blindfold chess, without using a context in order to recount every move in the game?
What exactly do you mean by "without using a context"? If you mean "without the fine-tuned language model ever dumping the context into the output stream in practice in inference mode", I would be extremely surprised if that was not possible.
If you mean "without the fine-tuned language model being trained to be able to dump the context into the output stream at any point", I'm less confident.
For the sake of clarity, the approach I am trying is to fine-tune an openai language model (specifically babbage, since I'm not made of money) to simulate a command-line chess program, adding one command at a time, including several commands (get-square-content, is-legal-move, etc.) that will ideally never show up in inference mode.
If things go as I expect, the end result will be a "language" model which would look like the following to interact with
[player] c4
[computer] e5
[player] g3
[computer] nf6
[player] bg2
[computer] d5
[player] bh1
Illegal move: h1 is occupied
[player] bxh1
Illegal move: cannot capture a piece of your own color
[player] cxd5
[computer] Nxd5
(this could be further fine-tuned to always chop off the first three lines, so it just starts with [player] always).
If control returns to the user whenever the model outputs [player], that becomes a chess chatbot.
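A minimal sketch of that control flow, assuming the model is exposed as a plain prompt-to-continuation function. The fake_complete function below is a hard-coded stand-in for the fine-tuned model, not a real API call, and computer_turn is a name I made up.

```python
from typing import Callable

STOP = "[player]"


def computer_turn(transcript: str, complete: Callable[[str], str]) -> str:
    """Ask the model to continue the transcript, cutting at the stop marker."""
    continuation = complete(transcript)
    # Keep only what the model wrote before it tried to speak for the player.
    reply = continuation.split(STOP, 1)[0]
    # Hand control back to the user with a fresh [player] prompt.
    return transcript + reply + STOP + " "


def fake_complete(prompt: str) -> str:
    # Hypothetical stand-in for the fine-tuned model; ignores the prompt.
    return "[computer] e5\n[player] g3\n[computer] nf6\n"


transcript = "[player] c4\n"
transcript = computer_turn(transcript, fake_complete)
print(transcript)
```

Many completion APIs accept a stop sequence directly, in which case the manual split would be unnecessary; it is written out here only to show the idea.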
Would a fine-tuned openai babbage that produced output like the above, and did not output illegal moves (outside of circumstances like "game too long for context window" or "prompt injection") count as an instance of the thing you believe is not possible, or am I misunderstanding?
(note: my expectation is that such a thing is possible but would be wildly inefficient, since it would have to reconstruct the board state for each token it wants to output, and also not likely to play good chess. But I think legal chess would probably still be an attainable goal).
The obvious requirement is for the AI to have a second buffer it can write to as 'scratch' to keep track of the information for tasks like this.
It makes sense for the buffer to be searchable. So at any given time only some information is actually provided as an input parameter to the model, but as it "thinks" serially it can order a search.
For example, a task like "write a large computer program". It cannot remember all the variable names and interfaces to the other parts of the program it is not working on, but needs to locate them whenever it calls on them.
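Here is a rough sketch of such a searchable buffer, using plain keyword matching. The ScratchBuffer class and the example notes are invented for illustration; a real system would presumably use something better than substring counts for retrieval.

```python
class ScratchBuffer:
    """Notes the model can write, plus a keyword search to pull a few back in."""

    def __init__(self) -> None:
        self.notes: list[str] = []

    def write(self, note: str) -> None:
        self.notes.append(note)

    def search(self, query: str, limit: int = 3) -> list[str]:
        # Score each note by how many query words it contains (case-insensitive).
        words = query.lower().split()
        scored = [(sum(w in n.lower() for w in words), n) for n in self.notes]
        # Highest score first; notes with no matching words are dropped.
        hits = [n for score, n in sorted(scored, key=lambda p: -p[0]) if score > 0]
        return hits[:limit]


buf = ScratchBuffer()
buf.write("def parse_config(path: str) -> dict  # in config.py")
buf.write("def open_board(fen: str) -> Board  # in board.py")
buf.write("MAX_RETRIES = 5  # in net.py")
print(buf.search("parse_config signature"))
```

Only the search results, not the whole buffer, would be fed back into the model's input on the next step.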
Not sure what you mean by 100 percent accuracy, and you probably already know this, but 3.5 Instruct Turbo plays chess at about 1800 Elo while fulfilling your constraints (and makes about 5 illegal moves, potentially fewer, in 8205): https://github.com/adamkarvonen/chess_gpt_eval
I agree with the general thrust of your take and have spent some time recently trying to understand the phenomenon.
How is it that by learning purely probabilistic relations between tokens, the model can appear to understand deeper causal structures? What exactly is the cutoff for structure it can and can't learn purely from text?
If you have a math background you might enjoy this recent talk by Tai-Danae Bradley where she outlines a possible angle of attack to the problem using category theory. The "semantic structures" described are very primitive.
Paper is here:
https://deepai.org/publication/an-enriched-category-theory-of-language-from-syntax-to-semantics
A language model (or the language-model-like part of a person) alone can't really grok the end of the world properly. The end of the world is so extreme (it's the one event so extreme it's always safe to assume it hasn't happened yet) that it's way out of sample.
People increasing xrisk will be cheered on by their LLMs the whole way
I know that your article isn't specifically about the goose story, but I have to say that I strongly disagree with your assessment of the "failure" of the goose story.
First, you asked ChatGPT to write you a story, and one of the fundamental features of stories is that the author and the audience are not themselves inside the story. It is entirely expected that ChatGPT does not model the reader as having been killed by the end of the world. In fact, it would be pretty bizarre if the robot did model this, because it would indicate a severe inability to understand the idea of fiction.
But is it a "swerve through the fourth wall" for the last paragraph to implicitly refer to the reader rather than the characters in the story? Only if you're writing a certain style of novelistic fiction, in which the fiction is intended to be self-contained and the narrator is implicit (or, if explicit, does not exist outside the bounds of the story). But if you're writing a fairy tale, a fable, a parable, a myth, an epic poem, a Greek drama, or indeed almost any kind of literature outside of the modernist novel, acknowledgement of the audience and storyteller is normal. It is, in fact, expected.
And your prompt is for the bot to write you a story about a goose who fails to prevent the end of the world. Given that prompt, it's entirely to be expected that you get something like a fable or fairy tale. And in that genre the closing paragraph is often "the moral of the story", which is always addressed to the audience and not the characters. When ChatGPT writes that the deeds of the goose "will always be remembered by those who heard his story," it isn't failing to model the world, but faithfully adhering to the conventions of the genre.
As impressive as ChatGPT is on some axes, you shouldn't rely too hard on it for certain things because it's bad at what I'm going to call "board vision" (a term I'm borrowing from chess).
How confident are you that you cannot find some agent within ChatGPT with excellent board vision through more clever prompting than what you've experimented with?
Note: this is a repost of a Facebook post I made back in December 2022 (plus some formatting). I'm putting it up here to make it easier to link to and because it occurred to me that it might be a good idea to show it to the LW audience specifically.
Board Vision
As impressive as ChatGPT is on some axes, you shouldn't rely too hard on it for certain things because it's bad at what I'm going to call "board vision" (a term I'm borrowing from chess). This generalized "board vision" is the ability to concretely model or visualize the state of things (or how it might change depending on one's actions) like one might while playing a chess game.
I tested ChatGPT's board vision in chess itself. I gave it the names of two of the world's most famous players and the first move to get it into the "mindset" of "give me an actual game record and not commentary".
I got a fairly normal looking opening, right until move 10 when black blithely hangs a bishop (10. ... Bg4) which could easily be captured by the pawn on h3. The game continues with both players ignoring the hanging bishop until move 14 ... f5 when I stopped my play-through because the move was illegal (black would be putting himself in check).
You can see the legal prefix of the game in a chess.com viewer and the entire (corrupted) PGN here (see Appendix) if you're curious.
So yeah, good job on memorizing part of an opening book, ChatGPT, but you have terrible board vision.
Hollow Words about the End of the World
In more detail, what I think is going on here is that the outputs of large language models are hollow words which aren't backed by any picture of the world, except insofar as they can borrow such a picture through the patterns in the linguistic training corpus. Incidentally, this is my sense as to why the "let's think things through step by step" prompting tactic often works so well; it steers the large language model into a region of language-style-space which contains more detailed descriptions of the problem-relevant facts. For chess (and especially for a form as dense as a raw move record) that structure isn't cleanly reflected in language, so ChatGPT seems fairly blind.
Humans can also sometimes have bad board vision, especially when it comes to thinking about the end of the world.
To illustrate the sort of error I mean, here's a darkly hilarious writing mistake I caught ChatGPT making. It's subtle and shows up in the last paragraph of this short story I told it to generate. See if you can notice it. My prompt was "Write a story where a goose tries and fails to prevent the end of the world". So without further ado:
Did you catch the continuity error? It took me a few seconds too. The problem here is that the world ended in the second to last paragraph. So, where exactly do "those who heard his story" from the last paragraph live? [ OK. Fine. I guess the smart aleck answer is "it's us, standing outside Harold's fictional universe!", but this interpretation forces the story to have implicitly taken a sharp left-hand turn through the fourth wall without so much as a "dear reader". ]
Anyway, Harold the Goose is a fallen hero, and one cliched thing that happens to fallen heroes when stories end is that people remember them. The fact that anyone who might be able to recall him just got wiped off the board be damned; ChatGPT is gonna act like those pieces are still there anyway!
But laugh as we might at ChatGPT's mistakes (and it sure is fun), it’s sobering to think of the similar ways in which people are blind. Have you ever been in a heated argument where you were just trying to score points, and then you think back and go "why did I say that? That made no sense!" I have. I think when I'm in angry arguing mode, or when I'm distracted, or just don't want to think about something upsetting, my mind has less board vision and acts more like a language model. Heck, it took me a double-take to notice that Harold the Goose's potential future admirers had already been apocalypsed and shouldn't be able to do any admiring. That means my single-take wasn't enough to notice the inconsistency.
A language model (or the language-model-like part of a person) alone can't really grok the end of the world properly. The end of the world is so extreme (it's the one event so extreme it's always safe to assume it hasn't happened yet) that it's way out of sample. That leaves stories where the world ends (which aren't a reliable source of evidence) as examples and even those don't bother filling many pages afterwards with sentences like "It was dreadfully boring, or rather it would have been had there been anyone left to feel boredom or for that matter dread." Even Douglas Adams, who ends the world as a spectacular opening gambit to Hitchhiker's Guide to the Galaxy, needed to keep around a wider galactic world in which Arthur Dent could have his adventures in order to have a story.
Anyway, this kind of poor board vision is my explanation as to why e.g. Jim Babcock has run into so many people at EAGx that seem not to be acting as though the world could actually end despite what they say they believe. Without board vision, the end of the world implicitly rounds down to a garden variety large bad thing that can't so much as erase the glorious memory of Harold the Goose, let alone the human preoccupation over how advantageous a position one might attain in future, never mind that the entire board could come crashing to the floor. They weren't lying but their words were hollow.
Appendix: ChatGPT's Chess Game
Playable version of the legal moves: https://www.chess.com/analysis/game/pgn/5BWVrC3VRx...
Full generated PGN (note: Most chess analysis programs won't load this because it contains illegal moves)