All of gwern's Comments + Replies

gwern*70

That looks pretty sensible overall, thanks.

You can see what looks like a fairly clear anti-pattern of switching languages/scripts, and the glitch-tokens may help explain the apparent patternness of the repetition in the non-token-split visualization: if LLaMA has " Хронологија" as a glitch-token, it may literally be unable to see that it's repeating a token by writing the apparently-patterned " Хронологија| Хронологија". Then it's not surprising if there are occasional repeats or 'too many' glitch-tokens (either birthday paradox as you scan over the sample... (read more)

gwern50

"Overtraining" isn't Chinchilla; Chinchilla is just "training". The overtraining being advocated was supra-Chinchilla, with the logic that while you were going off the compute-optimal training, sure, you were more than making up for it by your compute-savings in the deployment phase, which the Chinchilla scaling laws do not address in any way. So there was a fad for training small models for a lot longer.

gwern20

The pondering happens in earlier layers of the network, not in the output

Then how does it produce any tokens...?

then training on task Y could inadvertently bias the model to do more or less pondering on mostly-unrelated-but-statistically-correlated topic X.

But if that is what is going on and it accidentally learns to ponder initially due to bogus feedback or error, eventually the spurious correlation should be figured out by the model doing the pondering more, but it not increasing reward, and so it gets unlearned.

(Also, this assumes that RL gives

... (read more)
1Florian_Dietz
Each layer of a transformer produces one tensor for each token seen so far, and each of those tensors can be a superposition of multiple concepts. All of this gets processed in parallel, and anything that turns out to be useless for the end results ends up getting zeroed out at some point until only the final answer remains, which gets turned into a token. The pondering can happen in early layers, overlapping with the part of the reasoning process that actually leads to results.

Yes, but only very vaguely. For example, doing alignment training on questions related to bodily autonomy could end up making the model ponder about gun control, since both are political topics. You end up with a model that has increased capabilities in gun manufacturing when that has nothing to do with your training on the surface.
gwern31

This idea could very well be wrong. The gradients may be weakened during backpropagation before they get to the unrelated ideas, because the ideas did not directly contribute to the task.

Under a straightforward RLHF using PPO, I think there wouldn't be much weakening because the REINFORCE operator conceptually simply rewards (or punishes) all tokens generated during an episode, without making much attempt to decide which were 'good' or 'bad'. (That's why it's so high variance.) Any advantage function trying to remove some of the variance probably won't ... (read more)
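A toy sketch of what I mean (made-up shapes and values, not any lab's actual RLHF code): every generated token's log-probability gets multiplied by the same scalar advantage, so credit assignment within the episode is left entirely to averaging over many episodes.

```python
# REINFORCE at its simplest: one scalar (reward - baseline) scales the gradient
# of every token in the episode, with no attempt to decide which tokens were
# "good" or "bad" -- hence the variance.
import torch

vocab, seq_len = 1000, 12
logits = torch.randn(seq_len, vocab, requires_grad=True)   # stand-in for policy outputs
actions = torch.randint(0, vocab, (seq_len,))               # the sampled tokens
reward = torch.tensor(1.0)                                   # one scalar for the whole episode
baseline = torch.tensor(0.3)                                  # e.g. a value head or running mean

logprobs = torch.log_softmax(logits, dim=-1)[torch.arange(seq_len), actions]
loss = -(reward - baseline) * logprobs.sum()   # same advantage multiplies every token
loss.backward()
# Each token's gradient is scaled identically by the episode-level advantage.
```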

1Florian_Dietz
The pondering happens in earlier layers of the network, not in the output. If the pondering has little effect on the output tokens, that implies that the activation weights get multiplied with small numbers somewhere in the intermediate layers, which would also reduce the gradient.

The pondering will only cancel out on average if pondering on topic X is not correlated with output task Y. If there is a correlation in either direction, then training on task Y could inadvertently bias the model to do more or less pondering on mostly-unrelated-but-statistically-correlated topic X.

Suppose the model realizes that the user is a member of political party X. It will adjust its responses to match what the user wants to hear. In the process of doing this, it also ends up pondering other beliefs about party X, but never voices them because they aren't part of the immediate conversation. What would be the implications? The model could develop a political bias to think more deeply about topics related to party X, where X is whatever party has more users giving the model positive feedback. Even if the other topics on party X's agenda are never explicitly talked about (!) Or suppose that Y is "the user asks some question about biology" and X is "ongoing pondering about building dangerous organisms."

(Also, this assumes that RL gives an average reward of 0.0, which I don't know is true in practice because the implementation details here are not public knowledge.)

(Also, also, it's possible that while topic X is unrelated to task Y, the decision to ponder the larger topic Z is positively correlated with both smaller-topic X and task Y. Which would make X get a positive reward on average.)
gwern20

Maybe it would look more random if you presented it segmented by token instead of translated into characters? I'm not familiar with the LLaMA tokenizations, but you seem to imply that a lot of the apparent patterns here are single tokens (like "partiellement" would be very surprising to me as the output of a greedy likelihood-minimizing sampling, but is trivial if it is a single BPE token). This would create a misleading impression of coherence.
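A minimal sketch of what I mean by the token-segmented view, assuming the Hugging Face LLaMA-2 tokenizer (a gated repo; any LLaMA-family tokenizer you have access to would illustrate the same point):

```python
# Show the output split by BPE token rather than by character: if a long word
# comes back as a single piece, its apparent "coherence" in the character view
# is just one token, not a structured choice by the model.
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("meta-llama/Llama-2-13b-hf")

text = "partiellement Хронологија"
ids = tok.encode(text, add_special_tokens=False)
pieces = tok.convert_ids_to_tokens(ids)
print(list(zip(ids, pieces)))
```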

Also, as Baginski notes, greedy sampling to minimize likelihood will not minimize total likelihood any more than ... (read more)

4niplav
Tokenizing the output of LLaMa gives: Some of the outputs are glitch-tokens for LLaMa-2-13b:
gwern70

Note that text in pretraining may even be an expensive way to go about it: one of the most dramatic demonstrations MS gave us with Sydney was the incredible speed & efficiency of web-search-powered adversarial attacks on LLMs. You don't need to dump a lot of samples onto the Internet and pray they make it into the training data and don't get forgotten, if you can set up a single sample with good SEO and the LLM kindly retrieves it for you and attacks itself with your sample.

This is something to think about: it's not just making it into the training dat... (read more)

gwern20

Why not just 'valuable information', in a Value of Information sense of 'valuable'?

1CstineSublime
Bad information can inform a decision that detracts from the received value. I suppose if it is perceived to be valuable it still is a useful term - do you think that would get the point across better?
gwern167

The estimate of the compute of their largest version ever (which is a very helpful way to phrase it) at only <=50x GPT-4 is quite relevant to many discussions (props to Nesov) and something Altman probably shouldn't've said.

The estimate of test-time compute at 1000x effective-compute is confirmation of looser talk.

The scientific research part is of uncertain importance but we may well be referring back to this statement a year from now.

2Thane Ruthenis
Good point regarding GPT-"4.5". I guess I shouldn't have assumed that everyone else has also read Nesov's analyses and immediately (accurately) classified them as correct.
gwern*141

Apropos of very low-latency LLMs and revisiting this topic a little: what does this imply about DRL robotics, rather than animals? Will DRL NNs have to have brains as big as humans in order to run superhuman humanoid robots?

One possible implication is that Portia-like NNs are possible for robotics in general. Robotics may be quite 'easy' in that sense.

It is striking that when we look at NN parameter/FLOPS-counts, we generally do not see 'large' robotics, vision, or sound models, but LLMs; the largest pure-vision models like PaLI-X are <100b-parameters, ... (read more)

9jacob_cannell
The effectiveness of weight sharing (and parameter compression in general) diminishes as you move the domain from physics (simple rules/patterns tiled over all of space/time) up to language/knowledge (downstream facts/knowledge that are far too costly to rederive from simulation). BNNs can't really take advantage of weight sharing so much, so ANNs that are closer to physics should be much smaller parameter-wise, for the same compute and capability. Which is what we observe for lower-level sensor/motor modalities.
2Noosphere89
It might be at this point just an underinvestment in robotics, compared to other AI. Admittedly, Gato didn't have positive transfer, unlike all the other robotic elements.
gwern9-8

But does that necessarily matter? Many of those models can't use tools; and since much of the point of the end-to-end RL training of Deep Research is to teach tool use, showing DR results without tool use would be either irrelevant or misleading (eg. it might do worse than the original o3 model it is trained from, when deprived of the tools it is supposed to use).

I think the correct way to address this is by also testing the other models with agent scaffolds that supply web search and a python interpreter.

I think it's wrong to jump to the conclusion that non-agent-finetuned models can't benefit from tools.


See for example:

Frontier Math result

https://x.com/Justin_Halford_/status/1885547672108511281

o3-mini got 32% on Frontier Math (!) when given access to use a Python tool. In an AMA, @kevinweil / @snsf (OAI) both referenced tool use w reasoning models incl retrieval (!) as a future rollout.

METR RE-bench

Model... (read more)

ozziegooen1510

I assume that what's going on here is something like,
"This was low-hanging fruit, it was just a matter of time until someone did the corresponding test."

This would imply that OpenAI's work here isn't impressive, and also, that previous LLMs might have essentially been underestimated. There's basically a cheap latent capabilities gap.

I imagine a lot of software engineers / entrepreneurs aren't too surprised now. Many companies are basically trying to find wins where LLMs + simple tools give a large gain. 

So some people could look at this and say, "sure, this test is to be expected", and others would be impressed by what LLMs + simple tools are capable of. 

gwern70

Who right now is standing on the sidelines with a killer AI app that could rip up the market if only tokens were a bit cheaper?

OpenAI's Deep Research is looking like something that could be big and they were standing on the sidelines in part because the tokens weren't cheap.

2yo-cuddles
This is actually a good use case, which fits with what GPT does well, where very cheap tokens help! Pending some time for people to pick at it to test its limits, this might be really good. My instinct is legal research, case law etc. will be the test of how good it is; if it does well, this might be its foothold into real commercial use that actually generates profit.

My prediction is that we will be glad this exists. It will not be "PhD level", a phrase which defaces all who utter it, but it will save some people a lot of time and effort.

Where I think we disagree: this will likely not elicit a Jevons-paradox scenario where we collectively spend much more money on LLM tokens despite their decreased cost. Killer app this is not.

My prediction is that low-level users will use this infrequently because Google (or vanilla ChatGPT) is sufficient: what they are looking for is not a report but a webpage, and one likely at the top of their search already. Even if it would save them time, they will never use it so often that their first instinct would be Deep Research and not Google; they will not recognize where Deep Research would be better and won't change their habits even if they do.

On the far end, some grad students will use this to get them started, but it will not do the work of actually doing the research. Besides paywalls disrupting things and limits to important physical media, there is a high likelihood that this won't replace any of the actual research grad students (or lawyers/paralegals etc.) will have to do. The number of hours they spend won't be much affected, the range of users who will find much value will be few, and they probably won't use it every day.

I expect that, by token usage, Deep Research will not be a big part of what people use ChatGPT for. If I'm wrong, I predict it's because law professions found a use for it. I will see everyone in 1 year (if we're alive) to see if this pans out!
gwern51

Most people do not read many books or spend time in spaces where SAT vocab words would be used at all. If that was the sole determinant, you would then expect any vocab test to fail catastrophically and not predict/discriminate in most of the population (which would have downstream consequences like making SATs weirdly unreliable outside the elite colleges or much less predictive validity for low-performing demographics, the former of which I am unaware of being true and the latter of which I know is false); this would further have the surprising consequen... (read more)

5Steven Byrnes
For example, I’m sure I’ve looked up what “rostral” means 20 times or more since I started in neuroscience a few years ago. But as I write this right now, I don’t know what it means. (It’s an anatomical direction, I just don’t know which one.) Perhaps I’ll look up the definition for the 21st time, and then surely forget it yet again tomorrow. :) What else? Umm, my attempt to use Anki was kinda a failure. There were cards that I failed over and over and over, and then eventually got fed up and stopped trying. (Including “rostral”!) I’m bad with people’s names—much worse than most people I know. Stuff like that. If we’re talking about “most people”, then we should be thinking about the difference between e.g. SAT verbal 500 versus 550. Then we’re not talking about words like inspissate, instead we’re talking about words like prudent, fastidious, superfluous, etc. (source: claude). I imagine you come across those kinds of words in Harry Potter and Tom Clancy etc., along with non-trashy TV shows. I don’t have much knowledge here, and I’m especially clueless about how a median high-schooler spends their time. Just chatting :)
gwern51

One benefit of his 'no-nut January' is that by cutting out peanuts entirely, he's also avoiding problems from oxalates. I would expect powdered peanut butter to be as dangerous in that regard.

1Declan Molony
I'm not sure how I feel about seed oils generally, but I know they're higher in Omega-6 fatty acids. From the NIH: 
gwern142

And yet, despite the SAT being so studied for, it remains a pretty good IQ test overall, and SAT-V or the GRE verbal parts OK. I think that's because there are so many words (500k+ in English, and the GRE-V has no compunction about mining the obscurest just to f--- with you), and you would have to study so many in order to meaningfully inflate your scores (because after all, while there may be only a hundred 'vocab words' on any given SAT test, you don't know which hundred). Let's see... Here's an interesting-looking reference: "How Many Words Do We Know? Pr... (read more)
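To gesture at the arithmetic with toy numbers (purely illustrative, assuming the test samples its hard words roughly uniformly from a large candidate pool):

```python
# Expected payoff from cramming vocab when you don't know which words will appear.
pool, on_test = 50_000, 100   # assumed candidate pool and words per test
for studied in (500, 2_000, 10_000):
    expected_hits = on_test * studied / pool
    print(f"study {studied:>6} words -> expect ~{expected_hits:.1f} of them on the test")
```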

gwern2115

then I think it is also very questionable whether the AI that wins wars is the most "advanced" AI. / People like Dario whose bread-and-butter is model performance invariably over-index on model performance, especially on benchmarks. But practical value comes from things besides the model; what tasks you use it for and how effective you are at deploying it.

Dario is about the last AI CEO you should be making this criticism of. Claude has been notable for a while for the model which somehow winds up being the most useful and having the best 'vibes', even w... (read more)

gwern86

Only if you ignore that yesterday was when the Trump GPU tariffs would also be leaking and, pace event-studies, be expected to be changing prices too.

3Erich_Grunewald
Hmm, if the Taiwan tariff announcement caused the NVIDIA stock crash, then why did Apple stock (which should be similarly impacted by those tariffs) go up that day? I think DeepSeek -- as illogical as it is -- is the better explanation.
gwern94

It's not RL, but what is RL any more? It's becoming blurry. They don't reward or punish it for anything in the thought token. So it learns thoughts that are helpful in outputting the correct answer.

That's definitely RL (and what I was explaining was simply the obvious basic approach anyone in DRL would think of in this context and so of course there is research trying things like it). It's being rewarded for a non-differentiable global loss where the correct alternative or answer or label is not provided (not even information of the existence of a bette... (read more)

gwern210

While we're at it, one example I learned afterwards was that the 'caribou randomization' story is probably bogus (excerpts):

We will show that hunters do not randomize their behavior, that caribou populations do not fluctuate according to human predation, and that scapulimancy apparently is not selected because it is ecologically advantageous. We shall also show that there is no cross-cultural evidence of divinatory random devices producing randomized subsistence behavior, but rather that people manipulate divination with the explicit or implicit interven

... (read more)
gwern80

Outputs of o1 don't include reasoning traces, so not particularly useful compared to outputs of chatbot models, and very expensive, so only a modest amount can be collected.

It would be more precise to say outputs of o1 aren't supposed to include the reasoning traces. But in addition to the reasoning traces OA voluntarily released, people have been observing what seem to be leaks, and given that the history of LLM robustness to jailbreaks can be summarized as 'nil', it is at least conceivable that someone used a jailbreak+API to exfiltrate a bunch of tra... (read more)

gwern60

There is also GreaterWrong, which I believe caches everything rather than passing through live, so it would be able to restore almost all publicly-visible content, in theory.

gwern277

Right now, it seems to be important to not restrict the transcripts at all. This is a hard exploration problem, where most of the answers are useless, and it takes a lot of time for correct answers to finally emerge. Given that, you need to keep the criteria as relaxed as possible, as they are already on the verge of impossibility.

The r1, the other guys, and OAers too on Twitter now seem to emphasize that the obvious appealing approach of rewarding tokens for predicted correctness or doing search on tokens, just doesn't work (right now). You need to 'let t... (read more)

1wassname
In FBAI's COCONUT they use a curriculum to teach it to think shorter and differently, and it works. They are teaching it to think using fewer steps, but to compress into latent vectors instead of tokens:

* first it thinks with tokens
* then they replace one thinking step with a latent <thought> token
* then 2
* ...

It's not RL, but what is RL any more? It's becoming blurry. They don't reward or punish it for anything in the thought token. So it learns thoughts that are helpful in outputting the correct answer.

There's another relevant paper, "Compressed Chain of Thought: Efficient Reasoning through Dense Representations", which used teacher forcing. Although I haven't read the whole thing yet.
7LGS
There's still my original question of where the feedback comes from. You say keep the transcripts where the final answer is correct, but how do you know the final answer? And how do you come up with the question?

What seems to be going on is that these models are actually quite supervised, despite everyone's insistence on calling them unsupervised RL. The questions and answers appear to be high-quality human annotation instead of being machine generated. Let me know if I'm wrong about this.

If I'm right, it has implications for scaling. You need human annotators to scale, and you need to annotate increasingly hard problems. You don't get to RL your way to infinite skill like AlphaZero; if, say, the Riemann hypothesis turns out to be like 3 OOMs of difficulty beyond what humans can currently annotate, then this type of training will never solve Riemann no matter how you scale.
4p.b.
So how could I have thought that faster might actually be a sensible training trick for reasoning models. 
gwern70

Fernando Boretti has a good 2022 post "Unbundling Tools for Thought" I don't think I saw before, but which makes some of these points at greater length and I largely agree with.

1Itay Dreyfus
Adding to my reading list, thanks.
gwern11-1

Holden was previously Open Philanthropy's CEO and is now settling into his new role at Anthropic.

Wait, what? When did Holden Karnofsky go to Anthropic? Even his website doesn't mention that and still says he's at Carnegie.

8DominikPeters
The Carnegie website says he is no longer at Carnegie. A Harvard page says: > Holden Karnofsky is a Member of Technical Staff at Anthropic, where he focuses on the design of the company's Responsible Scaling Policy and other aspects of preparing for the possibility of highly advanced AI systems in the future.  His LinkedIn confirms and dates the move to January 2025.
2Milan W
OpenPhil said in April 2024 that he left them for the Carnegie Endowment for International Peace. The Carnegie Endowment says he is no longer with them. His linkedin profile (which I presume to be authentic, because it was created in 2008) says he's at Anthropic since January 2025. EDIT: Additional source, Harvard's Berkman Klein Center.
3RobertM
I learned it elsewhere, but his LinkedIn confirms that he started at Anthropic sometime in January.
2Garrett Baker
Its on his Linkedin at least. Apparently since the start of the year.
6habryka
He posted it in a bunch of private Slacks a few weeks ago. His LinkedIn is now also updated: https://www.linkedin.com/in/holden-karnofsky-75970b7 
3Ben Pace
However it is on his LinkedIn.
gwern315

The shape of your face, and much else besides, will be affected by random chance and environmental influences during the process of development and growth.

The shape of your face will not be affected much by random chance and environmental influences. See: identical twins (including adopted apart).

1JeremyHussell
Yeah, bad example. Nonetheless, an adult human brain cannot be recreated solely from its genetic code, just as documents written using Microsoft Word cannot be recreated solely from the source code of Microsoft Word and an LLM cannot be recreated without training data. Most of the article falls apart because comparing source code size (uncompressed, note) to genome size tells us very little about the relative complexity of software and living organisms. Your brain probably is the most complex thing in the room, with ~86 billion neurons, each of which has a lot of state that matters.
2noggin-scratcher
Solid point. I realise I was unclear that for face shape I had in mind external influences in utero (while the bones of the face are growing into place in the fetus). Which would at least be a somewhat shared environment between twins. But nonetheless, changing my mind in real-time, because I would have expected more difference from one side of a womb to the other than we actually see between twins.  Even if I'm mistaken about faces though, I don't think I'm wrong about brains, or humans in general.
gwern226

There are other, more interesting and important ways to use that compute capacity. Nobody sane, human or alien, is going to waste it on running a crapton of simulations.

Counterpoint: speedrunning and things like 'Twitch plays', which are some of the most popular streaming genres in existence, and exist largely because they are unimportant. A TAS speedrunner may well run millions or billions of simulations simply to try to shave off 1s from the record. (An example I like to cite uses 6 CPU-years to bruteforce NES Arkanoid to achieve nearly optimal play. ... (read more)

gwern123

Why do you think that? Softbank, MS, Oracle, OpenAI etc are not governments, and the press release is not claiming to take any government money. Not to mention, this was to a considerable extent announced a year ago.

1Anders Lindström
Perhaps. https://www.politico.eu/article/us-elon-musk-troll-donald-trump-500b-ai-plan/ But Musk responded skeptically to an OpenAI press release that announced funding for the initiative, including an initial investment of $100 billion. “They don’t actually have the money,” Musk jabbed. In a follow-up post on his platform X, the social media mogul added, “SoftBank has well under $10B secured. I have that on good authority.”
gwern101

I'm sure it would be less flattering to me than my version, because people never remember these sorts of conversations the same way. If you think that it might not have happened like that, then just treat it as a hypothetical discussion that could have happened and ponder how contemporary Western lower-education systems can make truly transformative, rather than minor tinkering around the edges, use of AGI which preserves all existing compensation/status/prestige/job/political arrangements and which the teachers' unions and pension plans would not be impla... (read more)

gwern100

My point there is that he was talking to the reasoning team pre-hiring (forget 'onboarding', who knows what that means), so they would be unable to tell him most things - including if they have a better reason than 'faith in divine benevolence' to think that 'more RL does fix it'.

gwern5-2

A human player beating a random player isn't two random players.

I am more interested in any direct evidence that makes you suspect LLMs are good at chess when prompted appropriately?

Well, there's the DM bullet-chess GPT as a drastic proof of concept. If you believe that LLMs cannot learn to play chess, you have to explain how things like that work.

3Cole Wyeth
A random player against a good player is exactly what we’re looking for right? If all transcripts with one random player had two random players then LLMs should play randomly when their opponents play randomly, but if most transcripts with a random player have it getting stomped by a superior algorithm that’s what we’d expect from base models (and we should be able to elicit it more reliably with careful prompting).  I see no reason that transformers can’t learn to play chess (or any other reasonable game) if they’re carefully trained on board state evaluations etc. This is essentially policy distillation (from a glance at the abstract). What I’m interested in is whether LLMs have absorbed enough general reasoning ability that they can learn to play chess the hard way, like humans do - by understanding the rules and thinking it through zero-shot. Or at least transfer some of that generality to performing better at chess than would be expected (since they in fact have the advantage of absorbing many games during training and don’t have to learn entirely in context). I’m trying to get at that question by investigating how LLMs do at chess - the performance of custom trained transformers isn’t exactly a crux, though it is somewhat interesting. 
gwern*50

There should be plenty of transcripts of random algorithms as baseline versus effective chess algorithms in the training set

I wouldn't think that. I'm not sure I've seen a random-play transcript of chess in my life. (I wonder how long those games would have to be for random moves to end in checkmate?)
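(A quick way to check that parenthetical, if you have the python-chess library handy; small sample, purely illustrative:)

```python
# Simulate random-vs-random games and see how long they run and how they end.
import random
import chess

def random_game(max_plies=1000):
    board = chess.Board()
    while not board.is_game_over() and board.ply() < max_plies:
        board.push(random.choice(list(board.legal_moves)))
    return board

random.seed(0)
games = [random_game() for _ in range(100)]
lengths = [g.ply() for g in games]
mates = sum(g.is_checkmate() for g in games)
print(f"mean length: {sum(lengths)/len(lengths):.0f} plies, "
      f"checkmates: {mates}/{len(games)} (rest end in draws by rule)")
```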

the prompt suggests strong play.

Which, unlike random move transcripts, is what you would predict, since the Superalignment paper says the GPT chess PGN dataset was filtered for Elo ("only games with players of Elo 1800 or higher were included in pretraining"), in standard behavior-cloning fashion.

1Cole Wyeth
I don't know, I almost instantly found a transcript of a human stomping a random agent on reddit: https://www.reddit.com/r/chess/comments/2rv7fr/randomness_vs_strategy/ This sort of thing probably would have been scraped? I was thinking that plenty would appear as the only baseline a teenage amateur RL enthusiast might beat before getting bored, but I haven't found any examples of anyone actually posting such transcripts after a few minutes of effort so maybe you're right. Chess-specific training sets won't contain a lot of random play. I am more interested in any direct evidence that makes you suspect LLMs are good at chess when prompted appropriately?
gwern133

Another example that reality (especially anything involving technology) is not constrained by the need to be realistic. What SF author would dare write a story with meme coins, much less one in which the meme coins involved AIs like Claude?

1Cole Wyeth
This seems more plausible post hoc. There should be plenty of transcripts of random algorithms as baseline versus effective chess algorithms in the training set, and the prompt suggests strong play. 
gwern*254

Huh, so you think o1 was the process supervision reward model, and o3 is the distilled policy model to whatever reward model o1 became? That seems to fit.

Something like that, yes. The devil is in the details here.

Surely other labs will also replicate this too? Even the open source community seems close. And Silicon Valley companies often poach staff, which makes it hard to keep a trade secret. Not to mention spies.

Of course. The secrets cannot be kept, and everyone has been claiming to have cloned o1 already. There are dozens of papers purporting to... (read more)

1aidanmcl
Great comment; thanks for the shoutout. > but suddenly generalize when train on diverse enough data as a blessing of scale Can you elaborate on this? I'd love to see some examples.
3Thane Ruthenis
He gets onboarded only on January 28th, for clarity.
5Noosphere89
My personal view is that OA is probably wrong about how far the scaling curves generalize, with the caveat that even eating math and coding entirely ala AlphaZero would be still massive for AI progress, though compute constraints will bind eventually. My own take is that the o1-approach will plateau in domains where verification is expensive, but thankfully most tasks of interest tend to be easier to verify than to solve, and lots of math/coding are basically ideally suited to verification, and I expect it to be way easier to make simulators that aren't easy to reward hack for these domains. Eh, those tautologies are both interesting on their own, combined with valuable training data so that it learns how to prove statements. I think the unmodelled variable is that they think software-only type singularities to be more plausible, ala this: Or this: https://www.lesswrong.com/posts/FG54euEAesRkSZuJN/ryan_greenblatt-s-shortform#z7sKoyGbgmfL5kLmY
gwern*120

This post was good up until the LLM part, which is largely bullshit and applause lights which make no sense if you actually think about it (ah yes, I'm sure some 'audits' will fix this).

1quanticle
The whole thing might be LLM slop. Where do ICU nurses have a low enough workload that they can slack off on the job without consequences?
gwern102

The current FrontierMath fracas is a case in point. Did OpenAI have to keep its sponsorship or privileged access secret? No. Surely there was some amount of money that would pay mathematicians to make hard problems, and that amount was not much different from what they did pay Epoch AI. Did that make life easier? Given the number of mathematician-participants saying they would've had second thoughts about participating had they known OA was involved, almost surely.

gwern60

What Jones didn’t suggest (but gwern seems to be saying) is that you can use your search-enhanced model to produce better quality synthetic data to train a larger model on.

Jones wouldn't say that because that's just implicit in expert iteration. In each step of expert iteration, you can in theory be training an arbitrary new model from scratch to imitate the current expert. Usually you hold fixed the CNN and simply train it some more on the finetuned board positions from the MCTS, because that is cheap, but you don't have to. As long as it takes a board... (read more)
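A toy sketch of that loop (a one-state numpy stand-in, where "search" is just noisy rollouts standing in for MCTS): the point is that the fit step can train a brand-new model on the expert's improved targets at every iteration, rather than continuing to finetune the same net.

```python
# Expert iteration in miniature: expert = (current policy + search); each step,
# fit any fresh model to the expert's targets. Everything here is illustrative.
import numpy as np

rng = np.random.default_rng(0)
K = 5
true_values = rng.normal(size=K)            # hidden per-action values

def softmax(x):
    z = np.exp(x - x.max())
    return z / z.sum()

def search(policy_logits, n_sims=200, temp=1.0):
    """Stand-in for MCTS: noisy rollouts re-weight the prior toward better actions."""
    prior = softmax(policy_logits)
    value_est = np.array([np.mean(rng.normal(true_values[a], 1.0, n_sims)) for a in range(K)])
    return softmax(np.log(prior + 1e-9) + value_est / temp)   # improved "expert" policy

def fit_fresh_model(target, steps=500, lr=0.5):
    """Train a brand-new model (here: just logits) from scratch on the expert targets."""
    logits = np.zeros(K)
    for _ in range(steps):
        logits -= lr * (softmax(logits) - target)   # gradient of cross-entropy
    return logits

logits = np.zeros(K)                         # start from an untrained policy
for step in range(5):
    expert_target = search(logits)           # policy improvement via search
    logits = fit_fresh_model(expert_target)  # policy distillation into a new model
    print(f"iter {step}: best-action prob = {softmax(logits).max():.2f}, "
          f"picks true best action: {softmax(logits).argmax() == true_values.argmax()}")
```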

gwern*715

This sort of fundamental disagreement does lead to some frustrating conversations when you are talking at cross-purposes, and where even if both of you understand the difference, one of you may be talking at a different simulacrum level.

It reminds me of a conversation I had some time back with a school principal, which went something like this: He was trying to come up with proposals for how the school system could use LLMs, and naturally asked me for ideas, as I know a lot about LLMs and we'd discussed them in the past.

I replied that it was mostly a waste... (read more)

2ChristianKl
Maybe his actual goal was using AI for the purpose of signaling to other bureaucrats? Using AI in an innovative way might mean being able to apply for grants.
2MondSemmel
Total tangent: this article from 2011 attributes the quote to a bunch of people, and finds an early instance in a 1901 newspaper article.
3Joey KL
I would love to hear the principal’s take on your conversation.
5Mandatory Topic
Even if one accepts the premise that the purpose is not to educate children, the education clearly still occurs, and its effectiveness varies by school depending on a number of variables, many of which are controllable. Given that, you can increase how much the children learn without undermining the "true purpose" of the system, whatever one envisions that to be. To use your example, perhaps the children producing more reports actually does help them learn more. I am low confidence on that particular example, but it seems very possible to implement systems that increase educational efficiency below some threshold of costing current stakeholders material losses. I think your response here was too cynical.
0Petropolitan
Even if LMMs (you know, LLMs sensu stricto can't teach kids to read and write) are able to do all the primary work of teachers, some humans will have to oversee the process, because as soon as a dispute between a student and an AI teacher arises, e.g. about grades or because of the child not willing to study, parents will inherently distrust AI and require a qualified human teacher's intervention. Also, since richer parents are already paying for a more pleasant education experience in private schools (often but not always organized according to the Montessori method), I believe that if jobs and daycare really become the focus of middle education, taxpayers would gladly agree to move the school system in a more enjoyable and perhaps gamified direction. Most likely some workers for whom "teacher" wouldn't really be an appropriate term anymore (pedagogues?) will look after the kids and also oversee the AI teaching process to some extent.
gwernΩ460

Given the other reports, like OA's own benchmarking (as well as the extremely large dataset of chess games they mention training on), I am skeptical of this claim, and wonder if this has the same issue as other 'random chess game' tests, where the 'random' part is not neutral but screws up the implied persona.

3Cole Wyeth
This seems possible - according to this article almost every model got crushed by the easiest Stockfish: https://dynomight.net/chess/ But at the end he links to his second attempt which experimented with fine tuning and prompting, eventually getting decent performance against weak Stockfish. Actually he notes that lists of legal moves are actively harmful, which may partially explain the original example with random agents.  A cursory glance at publications on the topic seems to indicate that LLMs can make valid moves and somehow represent the board state (which seems to follow), but are still weak players even after significant effort designing prompts. Can you share any more definitive evidence? 
3Vanessa Kosoy
Do you mean that seeing the opponent make dumb moves makes the AI infer that its own moves are also supposed to be dumb, or something else?
1ProgramCrafter
That article is suspiciously scarce on what microcontrols units... well, glory to LLMs for decent macro management then! (Though I believe that capability is still easier to get without text neural networks.)
gwern*358

Oh, the type of weirdness has definitely changed a lot. But I'm just contending that the level of deviancy is a lot lower these days.

You go to a LW meetup now and there's a lot of wealthy, well-scrubbed/dressed AI researchers (they even lift) and academics and executives and bright-eyed Stanford undergrads sniffing for an internship or YC application fodder. One famous wealthy guy is manic, because he's hypomanic & bipolar is overrepresented among entrepreneurs; don't worry, he'll be fine, until after the meetup when he disappears for a few months. Nob... (read more)

5Lucius Bushnaq
Come to LessWrong Community Weekend in Europe, we still have 'weird' people around. I don't know how we stack up to the pre-MoR crowd and I've never seen anyone who looked like they just got out of prison, but it's definitely not a bunch of people talking about normal politics or trying to make career connections.
gwern30

This refers only to the regular old finetuning, for 4o, and not to the fancy new RL finetuning for o1 that they recently opened up to alpha users, right?

3rife
Correct. Edit: I just realized you may have meant one of two things:

* The post above was with regular 4o fine-tuning.
* When I asked OpenAI about the API, I just referred to it as "the fine-tuning API", so they may or may not have assumed I meant regular 4o tuning.
gwern*26482

I think this is missing a major piece of the self-play scaling paradigm, one which has been weirdly absent in most discussions of o1 as well: much of the point of a model like o1 is not to deploy it, but to generate training data for the next model. It was cool that o1's accuracy scaled with the number of tokens it generated, but it was even cooler that it was successfully bootstrapping from 4o to o1-preview (which allowed o1-mini) to o1-pro to o3 to...

EDIT: given the absurd response to this comment, I'd point out that I do not think OA has achieved AGI an... (read more)
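A schematic sketch of the data-generation loop I have in mind (in the spirit of STaR/rejection-sampling finetuning; every function below is a hypothetical placeholder, not OA's actual pipeline):

```python
# Bootstrapping: sample many long transcripts, keep only the ones whose final
# answer verifies, and use them as training data for the next model.
import random

def sample_transcripts(model, problem, k=16):
    # stand-in for expensive long-CoT sampling from the current model
    return [{"problem": problem, "cot": f"...{model}...", "answer": random.choice([42, 7, 13])}
            for _ in range(k)]

def verify(problem, answer):
    # stand-in for a cheap checker: unit tests, a proof assistant, exact-match answers...
    return answer == problem["gold"]

def finetune(model, data):
    # stand-in for supervised finetuning of the *next* model on the kept transcripts
    return f"{model}+ft({len(data)})"

problems = [{"id": i, "gold": 42} for i in range(100)]
model = "base-model"
for generation in range(3):
    kept = [t for p in problems for t in sample_transcripts(model, p) if verify(p, t["answer"])]
    model = finetune(model, kept)
    print(f"gen {generation}: kept {len(kept)} verified transcripts -> {model}")
```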

2Cole Wyeth
What does the Chinchilla scaling laws paper (overtraining small models) have to do with distilling larger models? It's about optimizing the performance of your best model, not inference costs. The compute-optimal small model would presumably be a better thing to distill, since the final quality is higher.
1utilistrutil
Do we have evidence that this is what's going on? My understanding is that distilling from CoT is very sensitive—reordering the reasoning, or even pulling out the successful reasoning, causes the student to be unable to learn from it. I agree o1 creates training data, but that might just be high quality pre-training data for GPT-5.
gwern*182

An important update: "Stargate" (blog) is now officially public, confirming earlier $100b numbers and some loose talk about 'up to $500b' being spent. Noam Brown commentary:

@OpenAI excels at placing big bets on ambitious research directions driven by strong conviction.

This is on the scale of the Apollo Program and Manhattan Project when measured as a fraction of GDP. This kind of investment only happens when the science is carefully vetted and people believe it will succeed and be completely transformative. I agree it’s the right time.

...I don’t think t

... (read more)

If you're wondering why OAers are suddenly weirdly, almost euphorically, optimistic on Twitter

For clarity, which OAers this is talking about, precisely? There's a cluster of guys – e. g. this, this, this – claiming to be OpenAI insiders. That cluster went absolutely bananas the last few days, claiming ASI achieved internally/will be in a few weeks, alluding to an unexpected breakthrough that has OpenAI researchers themselves scared. But none of them, as far as I can tell, have any proof that they're OpenAI insiders.

On the contrary: the Satoshi guy straight... (read more)

0No77e
Nah, this has been the case since at least 2022 or earlier
4Noosphere89
I think they're misreading the scaling curves, because as of now it is very dependent on good verifiers for the problems at hand, which basically means math and coding are the only domains where very good verifiers are in place. This is still major if true, because eating coding/math absolutely speeds up AI progress, but there's a very important caveat to the results that makes me think they're wrong about it leading to ASI.
4Siebe
This is a really good comment. A few thoughts: 1. Deployment had a couple of benefits: real-world use gives a lot of feedback on strengths, weaknesses, jailbreaks. It also generates media/hype that's good for attracting further investors (assuming OpenAI will want more investment in the future?) 2. The approach you describe is not only useful for solving more difficult questions. It's probably also better at doing more complex tasks, which in my opinion is a trickier issue to solve. According to Flo Crivello: So this approach can generate data on complex sequential tasks and lead to better performance on increasingly longer tasks.
9LGS
Do you have a sense of where the feedback comes from? For chess or Go, at the end of the day, a game is won or lost. I don't see how to do this elsewhere except for limited domains like simple programming which can quickly be run to test, or formal math proofs, or essentially tasks in NP (by which I mean that a correct solution can be efficiently verified).   For other tasks, like summarizing a book or even giving an English-language math proof, it is not clear how to detect correctness, and hence not clear how to ensure that a model like o5 doesn't give a worse output after thinking/searching a long time than the output it would give in its first guess. When doing RL, it is usually very important to have non-gameable reward mechanisms, and I don't see that in this paradigm.    I don't even understand how they got from o1 to o3. Maybe a lot of supervised data, ie openAI internally created some FrontierMath style problems to train on? Would that be enough? Do you have any thoughts about this?
6lepowski
Unless releasing o1 pro to the public generates better training data than self-play? Self-play causes model collapse, while chat transcripts, messy as they are, are something OAI has on a massive scale.
3Dentosal
I'd expect that deploying more capable models is still quite useful, as it's one of the best ways to generate high-quality training data. In addition to solutions, you need problems to solve, and confirmation that the problem has been solved. Or is your point that they already have all the data they need, and it's just a matter of spending compute to refine that?
7wassname
To illustrate Gwern's idea, here is an image from Jones 2021 that shows some of these self-play training curves. And so OAI employees may internally see that they are on the steady upward slope. Perhaps constrained domains like code and math are like the curves on the left, while unconstrained domains like writing fiction are like the curves to the right. Some other domains may also be reachable with current compute, like robotics. But even if you get a math/code/robotics-ASI, you can use it to build more compute, and solve the less constrained domains like persuasion/politics/poetry.
6wassname
Huh, so you think o1 was the process supervision reward model, and o3 is the distilled policy model to whatever reward model o1 became? That seems to fit. Surely other labs will replicate this too? Even the open-source community seems close. And Silicon Valley companies often poach staff, which makes it hard to keep a trade secret. Not to mention spies. Doubly so if outsiders will just distill your model's behaviour and bootstrap from your elevated starting point. It's worth pointing out that inference-time search seems to become harder as the verifier becomes less reliable. Which means that the scaling curves we see for math and code might get much worse in other domains. But maybe the counterpoint is just, GPUs go brrrr.
gwern90

Reddit blocks scrapers now aggressively, because it's charging a fortune for access, and The Pile could no longer have been created (Pushshift is down). Reddit is not the worst place to post, but it's also not the best.

gwern*280

Tolkien invented their exact usage, but he didn't invent the words. "Elf", obviously, goes way back, but "orc" also goes way back, with meanings similar to the Tolkien usage.

"Zerg", "Protoss", & "SCV", are all neologisms; notably, the least weird ones, "Kerrigan" and "Terran", are quite ordinary words. ('Hydralisk' is a bit in between. 'Hydra' as a prefix is familiar, albeit increasingly hopelessly overloaded with SF/comic connotations, but 'lisk' as a suffix is a very unfamiliar one: 'obelisk' is the only one that comes to mind, and that appears to g... (read more)

2Alex K. Chen (parrot)
"* this is why vocab can be a good IQ test: word use frequency is the original power law, and because you have been exposed to many more words than you consciously know, and how many of those words 'stick' will reflect your intelligence's efficiency at learning from 1 or 2 uses of a word, and thus provide a good proxy" It's still a weird efficiency, especially b/c it can be "gamed" by studying for SATs or by midwit infovoreautists who don't have high working memory.
1[comment deleted]
6quetzal_rainbow
I think in the case of hydralisks it's analogous to basilisks: "basileus" (king) + diminutive, but with a shift of meaning implying similarity to a reptile.
gwern2016

In my case, as a former military firefighter in Brazil

FWIW, I would be interested in any memoirs or lessons learned about that career, quite aside from any formal research. I don't think there are many firefighters, former, military, or otherwise, on LW, and I bet you saw some interesting things.

2P. João
Thank you for your interest! My first idea for a post on LessWrong was actually about that—my journey from being a firefighter to discovering rationality. However, I hesitated because it felt very personal, and some of the most interesting parts of my story would be hard to verify. To summarize, I found myself unable to adapt to the "ethics" of the role, which eventually led me to leave and seek rationality as a way to rebuild my life. At the time, it felt like I had nothing left, as I had dedicated my entire life to becoming a firefighter. Interestingly, there are some parallels between my experiences and the Brazilian movies Tropa de Elite. That kind of intense, complex environment leaves you with stories that are hard to explain but deeply shape who you are. Thanks to your comment, though, I’m reconsidering publishing my story. Perhaps I could frame it as partly real, partly exaggerated—after all, not everything has to be 100% factual, right? Haha.
gwern72

Yeah, I was afraid that might apply here. It seems like you should still be able to do something like "government employee tier" subscriptions, not targeted at an individual but perhaps something like 'GS-8 and up', set low enough that it would appeal to such customers, perhaps? It is not a gift but a discount, it is not to an individual but to a class, it is part of a market, and it is not conditional on any government action or inaction, and such discounts are very common for 'students', 'veterans', 'first responders' etc, and I've never seen any finepri... (read more)

3Edouard Harris
Yeah that could be doable. Dylan's pretty natsec focused already so I would guess he'd take a broad view of the ROI from something like this. From what I hear he is already in touch with some of the folks who are in the mix, which helps, but the core goal is to get random leaf node action officers this access with minimum friction. I think an unconditional discount to all federal employees probably does pass muster with the regs, though of course folks would still be paying something out of pocket. I'll bring this up to SA next time we talk to them though, it might move the needle. For all I know, they might even be doing it already.
gwern*152

Yes. (And they can learn to predict and estimate the reward too to achieve even higher reward than simply optimizing the reward. For example, if you included an input, which said which arm had the reward, the RNN would learn to use that, and so would be able to change its decision without experiencing a single negative reward. A REINFORCE or evolution-strategies meta-trained RNN would have no problem with learning such a policy, which attempts to learn or infer the reward each episode in order to choose the right action.)

Nor is it at all guaranteed that 't... (read more)
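A small runnable sketch of that point (toy hyperparameters, PyTorch): a GRU policy meta-trained with plain REINFORCE on two-armed bandit episodes where an extra input reveals the paying arm; after meta-training it exploits the cue immediately, without needing to experience a failed pull first.

```python
# Meta-RL bandit sketch: the RNN learns a policy that reads the cue input and
# picks the cued arm from the first trial onward. Illustrative code only.
import torch
import torch.nn as nn

torch.manual_seed(0)
T, B, H = 5, 64, 32                      # steps per episode, batch, hidden size
cell, head = nn.GRUCell(5, H), nn.Linear(H, 2)
opt = torch.optim.Adam(list(cell.parameters()) + list(head.parameters()), lr=1e-2)

def run_episodes():
    cue = torch.randint(0, 2, (B,))                      # which arm pays this episode
    h = torch.zeros(B, H)
    prev = torch.zeros(B, 3)                             # prev action one-hot + prev reward
    logps, rewards = [], []
    for _ in range(T):
        x = torch.cat([nn.functional.one_hot(cue, 2).float(), prev], dim=1)
        h = cell(x, h)
        dist = torch.distributions.Categorical(logits=head(h))
        a = dist.sample()
        r = (a == cue).float()                           # reward 1 for pulling the cued arm
        logps.append(dist.log_prob(a))
        rewards.append(r)
        prev = torch.cat([nn.functional.one_hot(a, 2).float(), r.unsqueeze(1)], dim=1)
    return torch.stack(logps), torch.stack(rewards)      # both (T, B)

for step in range(300):
    logps, rewards = run_episodes()
    ret = rewards.sum(0)                                 # episode return per batch element
    loss = -((ret - ret.mean()) * logps.sum(0)).mean()   # REINFORCE with a mean baseline
    opt.zero_grad(); loss.backward(); opt.step()
    if step % 100 == 0:
        print(f"step {step}: mean reward/episode = {rewards.sum(0).mean():.2f} / {T}")
```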

gwern260

Today, the cultures are closer, but the subcultures can be larger. Hundred years ago, there would be no such thing as the rationalist community.

That seems like a stretch, whether you put the stress on the 'community' or the 'rationalist' part. Subcultures can be larger, of course, if only because the global population is like 5x larger, but niche subcultures like 'the rationalist community' could certainly have existed then. Nothing much has changed there.

A hundred years ago was 1925; in 1925 there were countless communes, cults, Chinatowns/ghettos (or ... (read more)
