All of Hailey Collet's Comments + Replies

The thing about the simulation capability that worries me most isn't plugging it in as-is, but probing the model, finding where the simulator pieces are, and extracting them. This is obviously complicated, but take, for example, something as simple as a linear probe identifying which entire layers are most involved, then initializing a new model for training with those layers integrated: a model which doesn't have to output video (obviously your data/task/loss metric would have to ensure those layers get used/updated/not overwritten, but choosing things where they would be use…
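To make this concrete, here's a minimal sketch of the kind of transplant I have in mind, assuming a PyTorch-style donor model whose blocks share one residual width. Every name here (transplant_layers, HybridModel, probe_scores) is hypothetical illustration, not an existing pipeline:

```python
import torch.nn as nn

# Hypothetical sketch: transplant the blocks a linear probe flagged as most
# "simulator-involved" from a pretrained video model into a new model that
# never has to output video. Donor structure and names are assumptions.

def transplant_layers(donor_blocks: nn.ModuleList, probe_scores: list[float],
                      k: int = 4) -> nn.ModuleList:
    """Keep the k blocks with the highest probe scores, weights intact."""
    top = sorted(range(len(probe_scores)),
                 key=lambda i: probe_scores[i], reverse=True)[:k]
    return nn.ModuleList(donor_blocks[i] for i in sorted(top))

class HybridModel(nn.Module):
    """New task model built around the transplanted simulator layers."""
    def __init__(self, simulator_blocks: nn.ModuleList, d_model: int, n_out: int):
        super().__init__()
        self.adapter_in = nn.Linear(d_model, d_model)  # map new inputs into donor space
        self.simulator_blocks = simulator_blocks       # the extracted layers
        self.head = nn.Linear(d_model, n_out)          # task head, no video decoder

    def forward(self, x):
        h = self.adapter_in(x)
        for block in self.simulator_blocks:
            h = block(h)                               # reuse the donor's computation
        return self.head(h)
```

Freezing the transplanted blocks (requires_grad_(False) on their parameters) would be one way to address the "not overwritten" concern, leaving the adapter and head to do the fitting.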

1N1X
Why "selection" could be a capacity which would generalize: albeit to a (highly-lossy) first approximation, most of the most successful models have been based on increasingly-general types of gamification of tasks. The more general models have more general tasks. Video can capture sufficient information to describe almost any action which humans do or would wish to take along with numerous phenomena which are impossible to directly experience in low-dimensional physical space, so if you can simulate a video, you can operate or orchestrate reality. Why selection couldn't generalize: I can watch someone skiing but that doesn't mean that I can ski. I can watch a speedrun of a video game and, even though the key presses are clearly visible, fail to replicate it. I could also hack together a fake speedrun. I suspect that Sora will be more useful for more-convincingly-faking speedrun content than for actually beating human players or becoming the TAS tool to end all TAS tools (aside from novel glitch discovery). This is primarily because there's not a strong reason to believe that the model can trained to achieve extremely high-fidelity or high-precision tasks.

30,000ft takeaway I got from this: we're less than ~2 OOMs from 95% performance. That passes the sniff test, and is also scary/exciting.

7Lukas Finnveden
I assume that's from looking at the GPT-4 graph. The main graph I'd look at for a judgment like this is probably the first graph in the post, without PaLM-2 and GPT-4, because PaLM-2 is 1-shot and GPT-4 covers just 4 benchmarks instead of 20+. That suggests 90% is ~1 OOM away and 95% is ~3 OOMs away. (And since PaLM-2 and GPT-4 seemed roughly on-trend in the places where I could check them, they probably wouldn't change that too much.)

93% in 2025 FEELS high, but ... Meta was already low for 2023: the median was 83%, yet GPT-4 scores 86.4%. If you plot the gap to 100% on MMLU against SOTA training compute in FLOPS (e.g. RoBERTa at 1.32*10^21 FLOPS scores 27.9%, a 72.1% gap; GPT-3.5 at 3.14*10^23, a 30% gap; GPT-4 at ~1.742*10^25, a 13.6% gap), it should take roughly 41x the training compute of GPT-4 to achieve 93.1% ... so it totally checks out.
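For concreteness, a rough sketch of that extrapolation as a log-log fit over these three points (my choice of fit here is an assumption; a plain least-squares line won't necessarily reproduce the 41x figure exactly):

```python
import numpy as np

# Hedged sketch: fit log10(gap to 100% MMLU) against log10(training FLOPS)
# and solve for the compute at which the gap shrinks to 6.9% (a 93.1% score).
flops = np.array([1.32e21, 3.14e23, 1.742e25])  # RoBERTa, GPT-3.5, GPT-4 (estimate)
gap   = np.array([72.1, 30.0, 13.6])            # 100% minus MMLU score

slope, intercept = np.polyfit(np.log10(flops), np.log10(gap), 1)
target_log_flops = (np.log10(6.9) - intercept) / slope
multiple_of_gpt4 = 10**target_log_flops / 1.742e25
print(f"~{multiple_of_gpt4:.0f}x GPT-4 training compute for 93.1% MMLU")
```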

(My estimate for GPT-4 compute is based on the 1-trillion-parameter leak and the approximate number of V100 GPUs they had; they didn't have A100s, let alone H100s, in hand during GPT-4's training interval…

In the short term, job loss will happen through compression of teams more than anything ... some of it will be seniors taking on junior work. E.g., if you have a development team of multiple senior and junior engineers, I could see some juniors getting canned with how things are now. But the restriction to the technology exactly as it is right now is pretty severe: you don't have to train new models to vastly increase their real-world capabilities.

Fifthed, or whatever. I won't assume anything about the author, but the conclusions of this article are nonsense.

I had it play hundreds of games against Stockfish, mostly at the lowest skill level, using the API. After a lot of experimentation, I settled on giving it a fresh slate every prompt. The prompt basically told it that it was playing chess, what color it was, and the PGN (it did not do as well with the FEN, or with both in either order). If it made invalid moves, in the next prompt(s) for that turn I added a list of the invalid moves it had attempted. After a few tries, I had it forfeit the game.

I had a system set up to rate it, but it wasn't able to complete nearly enough…
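Roughly, the loop looked like the following. This is a reconstruction, not my exact script: it assumes python-chess, a local Stockfish binary on the PATH, and the OpenAI Python client, and the prompt wording is illustrative:

```python
import chess
import chess.engine
from openai import OpenAI

client = OpenAI()
MAX_TRIES = 3  # forfeit after a few invalid attempts

def pgn_so_far(board: chess.Board) -> str:
    """Render the move stack as numbered SAN, e.g. '1. e4 e5 2. Nf3'."""
    temp, parts = chess.Board(), []
    for i, move in enumerate(board.move_stack):
        if i % 2 == 0:
            parts.append(f"{i // 2 + 1}.")
        parts.append(temp.san(move))
        temp.push(move)
    return " ".join(parts)

def ask_model_for_move(board: chess.Board, invalid: list[str]) -> str:
    # Fresh slate every prompt: color, game state as PGN, and any invalid attempts.
    prompt = (
        "We are playing chess. You are Black. Reply with only your move in SAN.\n"
        f"Game so far: {pgn_so_far(board)}\n"
    )
    if invalid:
        prompt += f"These attempts were invalid, do not repeat them: {', '.join(invalid)}\n"
    resp = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user", "content": prompt}],
    )
    return resp.choices[0].message.content.strip()

engine = chess.engine.SimpleEngine.popen_uci("stockfish")
engine.configure({"Skill Level": 0})  # lowest skill level
board = chess.Board()

while not board.is_game_over():
    board.push(engine.play(board, chess.engine.Limit(time=0.1)).move)  # White: Stockfish
    if board.is_game_over():
        break
    invalid: list[str] = []
    for _ in range(MAX_TRIES):
        candidate = ask_model_for_move(board, invalid)
        try:
            board.push_san(candidate)  # raises ValueError on illegal/invalid SAN
            break
        except ValueError:
            invalid.append(candidate)
    else:
        print("Model forfeits after repeated invalid moves.")
        break

engine.quit()
print("Result:", board.result(claim_draw=True))
```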

1GoteNoSente
That is odd. I certainly had a much, much higher completion rate than 1 in 40; in fact, I had no games that I had to abandon with my prompt. However, I played manually, and played well enough that it mostly did not survive beyond move 30 (although my collection has a blindfold game that went beyond move 50), and I checked at every turn that it reproduced the game history correctly, reprompting if that was not the case. Also, for GPT-3.5 I supplied it with the narrative fiction that it could access Stockfish. Mentioning Stockfish might push it towards more precise play.

Trying again today, ChatGPT 3.5 using the standard chat interface did, however, seem to have a propensity for listing only White's moves in its PGN output, which is not encouraging.

For exact reproducibility, I have added a game played via the API at temperature zero to my collection and given exact information on model, prompt, and temperature in the PGN: https://lichess.org/study/ymmMxzbj/SyefzR3j

If your scripts allow testing this prompt, I'd be interested in seeing what completion rate / approximate rating relative to some low Stockfish level is achieved by chatgpt-3.5-turbo.
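On the rating question, one way to turn an average score against a fixed Stockfish level into an approximate relative rating is the standard Elo expected-score formula; a quick sketch (the anchor rating for a given skill level is an assumption and varies by engine build and time control):

```python
import math

def elo_diff_from_score(score: float) -> float:
    """Rating difference implied by an average score in (0, 1); draws count 0.5."""
    return -400 * math.log10(1 / score - 1)

# E.g., scoring 25% against an opponent assumed to be rated ~1350:
print(1350 + elo_diff_from_score(0.25))  # ~1159
```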

Ahh, I should have thought of having it repeat the history! Good prompt engineering. Will try it out. The GPT-4 gameplay in your lichess study is not bad!

I tried just asking it to play and use SAN. I had it explain its moves, which it did well, and it also commented on my (intentionally bad) play. It quickly made a mess of things, though: it clearly lost track of the board state (to the extent it's "tracking" it at all ... really hard to say exactly how it's playing past the common openings), even though the history should've been in the context window.

How did you play? Just SAN?

8GoteNoSente
I am using the following prompt:

"We are playing a chess game. At every turn, repeat all the moves that have already been made. Find the best response for Black. I'm White and the game starts with 1.e4

So, to be clear, your output format should always be:

PGN of game so far: ...
Best move: ...

and then I get to play my move."

With ChatGPT pre-GPT-4 and Bing, I also added the fiction that it could consult Stockfish (or Kasparov, or someone else known to be strong), which seemed to help it make better moves. GPT-4 does not seem to need this, and rightfully pointed out that it does not have access to Stockfish when I tried the Stockfish version of this prompt.

For ChatGPT pre-GPT-4, the very strict instructions above resulted in an ability to play reasonable, full games, which was not possible just exchanging single moves in algebraic notation. I have not tested whether it still makes a difference with GPT-4.

On the rare occasions where it gets the history of the game wrong or suggests an illegal move, I regenerate the response or reprompt with the game history so far. I accept all legal moves made with correct game history as played. I've collected all of my test games in a lichess study here: https://lichess.org/study/ymmMxzbj
5Kei
I don't know how they did it, but I played a chess game against GPT-4 by saying the following: "I'm going to play a chess game. I'll play white, and you play black. On each chat, I'll post a move for white, and you follow with the best move for black. Does that make sense?" and then going through the moves one by one in algebraic notation.

My experience largely follows GoteNoSente's. I played one full game that lasted 41 moves, and all of GPT-4's moves were reasonable. It did make one invalid move when I forgot to include the number before my move (e.g. Ne4 instead of 12. Ne4), but it fixed this once I put the number in front of the move. Also, I think it was better in the opening than in the endgame; I suspect this is because of the large number of similar openings in its training data.

I used the first chart, the compute required for GPT-3, and my personal assessment that ChatGPT clearly meets the cutoff for tweet length, very probably meets it for short blog (but not by a wide margin), and clearly does not meet it for research paper, to create my own 75th-percentile estimate for human slowdown of 25-75. It moves P(TAI <= year) = 50% from ~2041 to ~2042, and the 75% point from ~2060 to ~2061. Big changes! 😂

Your assertion that we don't have many ways left to reduce cost per transistor may be true, but it is not supported by the rest of your comment or your links: reductions in transistor size and similar performance-improving measures are not the only way to improve cost performance.

1Cameron Holmes
Sorry, I agree that comment and those links left some big inferential gaps. I believe the link below is more holistic and doesn't leave such big leaps (admittedly it has some 2021-specific themes that haven't aged so well, but I don't believe they undermine the core argument): https://www.fabricatedknowledge.com/p/the-rising-tide-of-semiconductor

This still leaves a gap between cost per transistor and overall compute cost, but that's a much smaller leap, e.g. frequency being bound by physical constraints like the speed of light.

To evidence my point about this trend getting even worse after 2030: EUV lithography was actively pursued for decades before entering active use around 2020. My understanding is that we don't have anything that significant at the level of maturity that EUV was at in the 90s. Consider my epistemic status on this point fairly weak, though.