"Overtraining" isn't Chinchilla; Chinchilla is just "training". The overtraining being advocated was supra-Chinchilla, with the logic that while you were going off the compute-optimal training, sure, you were more than making up for it by your compute-savings in the deployment phase, which the Chinchilla scaling laws do not address in any way. So there was a fad for training small models for a lot longer.
The pondering happens in earlier layers of the network, not in the output
Then how does it produce any tokens...?
then training on task Y could inadvertently bias the model to do more or less pondering on mostly-unrelated-but-statistically-correlated topic X.
But if that is what is going on, and it accidentally learns to ponder initially due to bogus feedback or error, then eventually the spurious correlation should get figured out: the model does the pondering more, the pondering does not increase reward, and so it gets unlearned.
...(Also, this assumes that RL gives
This idea could very well be wrong. The gradients may be weakened during backpropagation before they get to the unrelated ideas, because the ideas did not directly contribute to the task.
Under a straightforward RLHF using PPO, I think there wouldn't be much weakening because the REINFORCE operator conceptually simply rewards (or punishes) all tokens generated during an episode, without making much attempt to decide which were 'good' or 'bad'. (That's why it's so high variance.) Any advantage function trying to remove some of the variance probably won't ...
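A minimal sketch of that point (not any lab's actual RLHF pipeline; `model` and `reward_fn` are stand-ins, and this is vanilla single-episode REINFORCE with no PPO clipping or learned baseline):

```python
import torch

# Toy single-episode REINFORCE update for a sequence model: one scalar reward
# for the whole episode, credited identically to every generated token's
# log-probability. `model` (returns [batch, seq, vocab] logits), `prompt_ids`,
# and `reward_fn` are stand-ins, not a real API; batch of 1 for simplicity.

def reinforce_step(model, prompt_ids, reward_fn, optimizer, max_new_tokens=64):
    ids = prompt_ids
    log_probs = []
    for _ in range(max_new_tokens):
        logits = model(ids)[:, -1, :]                 # next-token logits
        dist = torch.distributions.Categorical(logits=logits)
        tok = dist.sample()
        log_probs.append(dist.log_prob(tok))
        ids = torch.cat([ids, tok[:, None]], dim=1)

    reward = reward_fn(ids)      # one scalar judgment of the whole episode
    baseline = 0.0               # an advantage/baseline would be subtracted here
    # No per-token credit assignment: every token gets the same (reward - baseline),
    # which is exactly why the gradient estimate is so high-variance.
    loss = -(reward - baseline) * torch.stack(log_probs).sum()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```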
Maybe it would look more random if you presented it segmented by token instead of translated into characters? I'm not familiar with the LLaMA tokenizations, but you seem to imply that a lot of the apparent patterns here are single tokens (like "partiellement" would be very surprising to me as the output of a greedy likelihood-minimizing sampling, but is trivial if it is a single BPE token). This would create a misleading impression of coherence.
Also, as Baginski notes, greedy sampling to minimize likelihood will not minimize total likelihood any more than ...
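To make the greedy-vs-global distinction concrete, a toy sketch (the `next_logprobs` interface is hypothetical): picking the least-likely token at each step is a greedy heuristic, and a small beam search over total sequence log-probability can find sequences that are less likely overall, for the same reason greedy argmax decoding does not maximize total likelihood.

```python
import heapq

# Toy contrast: greedy per-token likelihood minimization vs. a small beam search
# over total sequence log-probability. `next_logprobs(seq)` is a hypothetical
# stand-in returning {token: logprob} for the model's next-token distribution.

def greedy_min(next_logprobs, seq, steps):
    total_logp = 0.0
    for _ in range(steps):
        tok, lp = min(next_logprobs(seq).items(), key=lambda kv: kv[1])
        seq, total_logp = seq + [tok], total_logp + lp
    return seq, total_logp

def beam_min(next_logprobs, seq, steps, beam=5):
    frontier = [(0.0, seq)]
    for _ in range(steps):
        candidates = [(logp + lp, s + [tok])
                      for logp, s in frontier
                      for tok, lp in next_logprobs(s).items()]
        # keep the `beam` partial sequences with the *lowest* total log-probability
        frontier = heapq.nsmallest(beam, candidates, key=lambda c: c[0])
    best_logp, best_seq = min(frontier, key=lambda c: c[0])
    return best_seq, best_logp  # often strictly less likely than the greedy result
```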
Note that text in pretraining may even be an expensive way to go about it: one of the most dramatic demonstrations MS gave us with Sydney was the incredible speed & efficiency of web-search-powered adversarial attacks on LLMs. You don't need to dump a lot of samples onto the Internet and pray they make it into the training data and don't get forgotten, if you can set up a single sample with good SEO and the LLM kindly retrieves it for you and attacks itself with your sample.
This is something to think about: it's not just making it into the training dat...
Why not just 'valuable information', in a Value of Information sense of 'valuable'?
The estimate of the compute of their largest version ever (which is a very helpful way to phrase it) at only <=50x GPT-4 is quite relevant to many discussions (props to Nesov) and something Altman probably shouldn't've said.
The estimate of test-time compute at 1000x effective-compute is confirmation of looser talk.
The scientific research part is of uncertain importance but we may well be referring back to this statement a year from now.
Apropos of very low-latency LLMs and revisiting this topic a little: what does this imply about DRL robotics, rather than animals? Will DRL NNs have to have brains as big as humans in order to run superhuman humanoid robots?
One possible implication is that Portia-like NNs are possible for robotics in general. Robotics may be quite 'easy' in that sense.
It is striking that when we look at NN parameter/FLOPS-counts, we generally do not see 'large' robotics, vision, or sound models, but LLMs; the largest pure-vision models like PaLI-X are <100b-parameters, ...
But does that necessarily matter? Many of those models can't use tools; and since much of the point of the end-to-end RL training of Deep Research is to teach tool use, showing DR results without tool use would be either irrelevant or misleading (eg. it might do worse than the original o3 model it is trained from, when deprived of the tools it is supposed to use).
I think the correct way to address this is by also testing the other models with agent scaffolds that supply web search and a python interpreter.
I think it's wrong to jump to the conclusion that non-agent-finetuned models can't benefit from tools.
See for example:
https://x.com/Justin_Halford_/status/1885547672108511281
o3-mini got 32% on Frontier Math (!) when given access to use a Python tool. In an AMA, @kevinweil / @snsf (OAI) both referenced tool use w reasoning models incl retrieval (!) as a future rollout.
Model...
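For what it's worth, the kind of agent scaffold meant above can be quite small; a hedged sketch (the tool protocol, `llm()`, and `web_search()` here are hypothetical stand-ins, not any particular lab's API):

```python
import json
import subprocess

# Hypothetical scaffold: give a non-agent-finetuned model a web-search tool and a
# Python interpreter by looping on tool-call markers in its output.
# `llm(messages) -> str` and `web_search(query) -> str` are stand-ins, not real APIs.

def run_python(code: str) -> str:
    proc = subprocess.run(["python", "-c", code], capture_output=True, text=True, timeout=30)
    return proc.stdout + proc.stderr

def agent_loop(llm, web_search, question: str, max_steps: int = 10) -> str:
    messages = [
        {"role": "system", "content":
         'You may call a tool by replying with JSON: '
         '{"tool": "search" or "python", "input": "..."}. Otherwise, answer directly.'},
        {"role": "user", "content": question},
    ]
    for _ in range(max_steps):
        reply = llm(messages)
        messages.append({"role": "assistant", "content": reply})
        try:
            call = json.loads(reply)
        except json.JSONDecodeError:
            return reply  # not a tool call: treat as the final answer
        result = web_search(call["input"]) if call["tool"] == "search" else run_python(call["input"])
        messages.append({"role": "user", "content": f"TOOL RESULT:\n{result}"})
    return "No final answer within the step budget."
```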
I assume that what's going on here is something like,
"This was low-hanging fruit, it was just a matter of time until someone did the corresponding test."
This would imply that OpenAI's work here isn't impressive, and also, that previous LLMs might have essentially been underestimated. There's basically a cheap latent capabilities gap.
I imagine a lot of software engineers / entrepreneurs aren't too surprised now. Many companies are basically trying to find wins where LLMs + simple tools give a large gain.
So some people could look at this and say, "sure, this test is to be expected", and others would be impressed by what LLMs + simple tools are capable of.
Who right now is standing on the sidelines with a killer AI app that could rip up the market if only tokens were a bit cheaper?
OpenAI's Deep Research is looking like something that could be big and they were standing on the sidelines in part because the tokens weren't cheap.
Most people do not read many books or spend time in spaces where SAT vocab words would be used at all. If that were the sole determinant, you would then expect any vocab test to fail catastrophically and not predict/discriminate in most of the population (which would have downstream consequences like making SATs weirdly unreliable outside the elite colleges, or giving them much less predictive validity for low-performing demographics, the former of which I am unaware of being true and the latter of which I know is false); this would further have the surprising consequen...
One benefit of his 'no-nut January' is that by cutting out peanuts entirely, he's also avoiding problems from oxalates. I would expect powdered peanut butter to be as dangerous in that regard.
And yet, despite the SAT being so studied for, it remains a pretty good IQ test overall, and the SAT-V or GRE verbal sections remain OK ones. I think that's because there are so many words (500k+ in English, and the GRE-V has no compunction about mining the obscurest just to f--- with you), and you would have to study so many in order to meaningfully inflate your scores (because after all, while there may be only a hundred 'vocab words' on any given SAT test, you don't know which hundred). Let's see... Here's an interesting-looking reference: "How Many Words Do We Know? Pr...
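Back-of-the-envelope, with entirely made-up pool sizes (not taken from that reference): if the ~100 test words are drawn from a candidate pool of tens of thousands of 'hard' words, cramming only buys a handful of expected hits per test.

```python
# Expected number of crammed words that actually appear on a given test, if the
# ~100 test words are drawn from a large candidate pool. Pool size and cramming
# effort are illustrative guesses, not estimates from the reference above.
pool, on_test = 30_000, 100
for crammed in (500, 2_000, 10_000):
    expected_hits = on_test * crammed / pool
    print(f"crammed {crammed:>6}: ~{expected_hits:.1f} of the {on_test} test words")
# ~1.7, ~6.7, ~33.3: you have to memorize a substantial fraction of the whole
# pool before cramming moves your vocab score much.
```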
then I think it is also very questionable whether the AI that wins wars is the most "advanced" AI. / People like Dario whose bread-and-butter is model performance invariably over-index on model performance, especially on benchmarks. But practical value comes from things besides the model; what tasks you use it for and how effective you are at deploying it.
Dario is about the last AI CEO you should be making this criticism of. Claude has been notable for a while as the model which somehow winds up being the most useful and having the best 'vibes', even w...
Only if you ignore that yesterday was also when the Trump GPU tariffs were leaking and would, pace event-studies, be expected to be changing prices too.
It's not RL, but what is RL any more? It's becoming blurry. They don't reward or punish it for anything in the thought tokens. So it learns thoughts that are helpful in outputting the correct answer.
That's definitely RL (and what I was explaining was simply the obvious basic approach anyone in DRL would think of in this context and so of course there is research trying things like it). It's being rewarded for a non-differentiable global loss where the correct alternative or answer or label is not provided (not even information of the existence of a bette...
While we're at it, one example I learned afterwards was that the 'caribou randomization' story is probably bogus (excerpts):
...We will show that hunters do not randomize their behavior, that caribou populations do not fluctuate according to human predation, and that scapulimancy apparently is not selected because it is ecologically advantageous. We shall also show that there is no cross-cultural evidence of divinatory random devices producing randomized subsistence behavior, but rather that people manipulate divination with the explicit or implicit interven
Outputs of o1 don't include reasoning traces, so not particularly useful compared to outputs of chatbot models, and very expensive, so only a modest amount can be collected.
It would be more precise to say outputs of o1 aren't supposed to include the reasoning traces. But in addition to the reasoning traces OA voluntarily released, people have been observing what seem to be leaks, and given that the history of LLM robustness to jailbreaks can be summarized as 'nil', it is at least conceivable that someone used a jailbreak+API to exfiltrate a bunch of tra...
There is also GreaterWrong, which I believe caches everything rather than passing through live, so it would be able to restore almost all publicly-visible content, in theory.
Right now, it seems to be important to not restrict the transcripts at all. This is a hard exploration problem, where most of the answers are useless, and it takes a lot of time for correct answers to finally emerge. Given that, you need to keep the criteria as relaxed as possible, as they are already on the verge of impossibility.
The r1 paper, the other guys, and OAers on Twitter too now seem to emphasize that the obviously appealing approaches of rewarding tokens for predicted correctness or doing search on tokens just don't work (right now). You need to 'let t...
Fernando Borretti has a good 2022 post, "Unbundling Tools for Thought", which I don't think I saw before, but which makes some of these points at greater length and which I largely agree with.
Holden was previously Open Philanthropy's CEO and is now settling into his new role at Anthropic.
Wait, what? When did Holden Karnofsky go to Anthropic? Even his website doesn't mention that and still says he's at Carnegie.
The shape of your face, and much else besides, will be affected by random chance and environmental influences during the process of development and growth.
The shape of your face will not be affected much by random chance and environmental influences. See: identical twins (including adopted apart).
There are other, more interesting and important ways to use that compute capacity. Nobody sane, human or alien, is going to waste it on running a crapton of simulations.
Counterpoint: speedrunning and things like 'Twitch plays', which are some of the most popular streaming genres in existence, and exist largely because they are unimportant. A TAS speedrunner may well run millions or billions of simulations simply to try to shave off 1s from the record. (An example I like to cite uses 6 CPU-years to bruteforce NES Arkanoid to achieve nearly optimal play. ...
Why do you think that? Softbank, MS, Oracle, OpenAI etc are not governments, and the press release is not claiming to take any government money. Not to mention, this was to a considerable extent announced a year ago.
An important update: "Stargate" (blog) is now officially public, confirming earlier $100b numbers and some loose talk about 'up to $500b' being spent. Noam Brown commentary:
...@OpenAI excels at placing big bets on ambitious research directions driven by strong conviction.
This is on the scale of the Apollo Program and Manhattan Project when measured as a fraction of GDP. This kind of investment only happens when the science is carefully vetted and people believe it will succeed and be completely transformative. I agree it’s the right time.
...I don’t think t
I'm sure it would be less flattering to me than my version, because people never remember these sorts of conversations the same way. If you think that it might not have happened like that, then just treat it as a hypothetical discussion that could have happened and ponder how contemporary Western lower-education systems can make truly transformative, rather than minor tinkering around the edges, use of AGI which preserves all existing compensation/status/prestige/job/political arrangements and which the teachers' unions and pension plans would not be impla...
My point there is that he was talking to the reasoning team pre-hiring (forget 'onboarding', who knows what that means), so they would be unable to tell him most things - including if they have a better reason than 'faith in divine benevolence' to think that 'more RL does fix it'.
A human player beating a random player isn't two random players.
I am more interested in any direct evidence that makes you suspect LLMs are good at chess when prompted appropriately?
Well, there's the DM bullet-chess GPT as a drastic proof of concept. If you believe that LLMs cannot learn to play chess, you have to explain how things like that work.
There should be plenty of transcripts of random algorithms as baseline versus effective chess algorithms in the training set
I wouldn't think that. I'm not sure I've seen a random-play transcript of chess in my life. (I wonder how long those games would have to be for random moves to end in checkmate?)
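Out of curiosity, that is easy to estimate empirically with the `python-chess` package; a quick sketch (note that many uniformly-random games end by an automatic draw rule rather than checkmate):

```python
import random
import chess  # pip install python-chess

# How long does a uniformly-random game of chess last, and how often does it even
# end in checkmate (vs. stalemate, insufficient material, 75-move rule, repetition)?
def random_game(rng):
    board = chess.Board()
    plies = 0
    while not board.is_game_over():
        board.push(rng.choice(list(board.legal_moves)))
        plies += 1
    return plies, board.is_checkmate()

rng = random.Random(0)
games = [random_game(rng) for _ in range(200)]
lengths = [plies for plies, _ in games]
mates = sum(mate for _, mate in games)
print(f"mean length: {sum(lengths) / len(games):.0f} plies; checkmates: {mates}/{len(games)}")
```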
the prompt suggests strong play.
Which, unlike random move transcripts, is what you would predict, since the Superalignment paper says the GPT chess PGN dataset was filtered for Elo ("only games with players of Elo 1800 or higher were included in pretraining"), in standard behavior-cloning fashion.
Another example that reality (especially anything involving technology) is not constrained by the need to be realistic. What SF author would dare write a story with meme coins, much less one in which the meme coins involved AIs like Claude?
Huh, so you think o1 was the process supervision reward model, and o3 is the distilled policy model to whatever reward model o1 became? That seems to fit.
Something like that, yes. The devil is in the details here.
Surely other labs will replicate this too? Even the open source community seems close. And Silicon Valley companies often poach staff, which makes it hard to keep a trade secret. Not to mention spies.
Of course. The secrets cannot be kept, and everyone has been claiming to have cloned o1 already. There are dozens of papers purporting to...
This post was good up until the LLM part, which is largely bullshit and applause lights which make no sense if you actually think about it (ah yes, I'm sure some 'audits' will fix this).
The current FrontierMath fracas is a case in point. Did OpenAI have to keep its sponsorship or privileged access secret? No. Surely there was some amount of money that would pay mathematicians to make hard problems, and that amount was not much different from what they did pay Epoch AI. Did that make life easier? Given the number of mathematician-participants saying they would've had second thoughts about participating had they known OA was involved, almost surely.
What Jones didn’t suggest (but gwern seems to be saying) is that you can use your search-enhanced model to produce better quality synthetic data to train a larger model on.
Jones wouldn't say that because that's just implicit in expert iteration. In each step of expert iteration, you can in theory be training an arbitrary new model from scratch to imitate the current expert. Usually you hold fixed the CNN and simply train it some more on the finetuned board positions from the MCTS, because that is cheap, but you don't have to. As long as it takes a board...
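Schematically, the expert-iteration loop being described (pure sketch; the search and training callables are placeholders, and the point is only that the 'student' can be any network, including a fresh, larger one trained from scratch):

```python
def expert_iteration(net, mcts_improve, train_to_imitate, new_net_factory=None, n_iters=10):
    """Schematic AlphaZero-style expert iteration; all heavy lifting lives in the callables."""
    for _ in range(n_iters):
        # 1. Amplification: search guided by the current net plays games and yields
        #    improved move distributions / value targets for the visited positions.
        positions, search_targets = mcts_improve(net)
        # 2. Choose the student: usually the same net (cheap to keep finetuning), but
        #    it can just as well be a fresh -- even larger -- net trained from scratch.
        student = new_net_factory() if new_net_factory is not None else net
        # 3. Distillation: train the student to imitate the search-improved expert.
        net = train_to_imitate(student, positions, search_targets)
    return net
```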
This sort of fundamental disagreement does lead to some frustrating conversations when you are talking at cross-purposes, and where even if both of you understand the difference, one of you may be talking at a different simulacrum level.
It reminds me of a conversation I had some time back with a school principal, which went something like this: He was trying to come up with proposals for how the school system could use LLMs, and naturally asked me for ideas, as I know a lot about LLMs and we'd discussed them in the past.
I replied that it was mostly a waste...
Given the other reports, like OA's own benchmarking (as well as the extremely large dataset of chess games they mention training on), I am skeptical of this claim, and wonder if this has the same issue as other 'random chess game' tests, where the 'random' part is not neutral but screws up the implied persona.
Oh, the type of weirdness has definitely changed a lot. But I'm just contending that the level of deviancy is a lot lower these days.
You go to a LW meetup now and there's a lot of wealthy, well-scrubbed/dressed AI researchers (they even lift) and academics and executives and bright-eyed Stanford undergrads sniffing for an internship or YC application fodder. One famous wealthy guy is manic, because he's hypomanic & bipolar is overrepresented among entrepreneurs; don't worry, he'll be fine, until after the meetup when he disappears for a few months. Nob...
This refers only to the regular old finetuning, for 4o, and not to the fancy new RL finetuning for o1 that they recently opened up to alpha users, right?
I think this is missing a major piece of the self-play scaling paradigm, one which has been weirdly absent in most discussions of o1 as well: much of the point of a model like o1 is not to deploy it, but to generate training data for the next model. It was cool that o1's accuracy scaled with the number of tokens it generated, but it was even cooler that it was successfully bootstrapping from 4o to o1-preview (which allowed o1-mini) to o1-pro to o3 to...
EDIT: given the absurd response to this comment, I'd point out that I do not think OA has achieved AGI an...
An important update: "Stargate" (blog) is now officially public, confirming earlier $100b numbers and some loose talk about 'up to $500b' being spent. Noam Brown commentary:
...@OpenAI excels at placing big bets on ambitious research directions driven by strong conviction.
This is on the scale of the Apollo Program and Manhattan Project when measured as a fraction of GDP. This kind of investment only happens when the science is carefully vetted and people believe it will succeed and be completely transformative. I agree it’s the right time.
...I don’t think t
If you're wondering why OAers are suddenly weirdly, almost euphorically, optimistic on Twitter
For clarity, which OAers is this talking about, precisely? There's a cluster of guys – e.g. this, this, this – claiming to be OpenAI insiders. That cluster went absolutely bananas the last few days, claiming ASI achieved internally/will be in a few weeks, alluding to an unexpected breakthrough that has OpenAI researchers themselves scared. But none of them, as far as I can tell, have any proof that they're OpenAI insiders.
On the contrary: the Satoshi guy straight...
Reddit blocks scrapers now aggressively, because it's charging a fortune for access, and The Pile could no longer have been created (Pushshift is down). Reddit is not the worst place to post, but it's also not the best.
Tolkien invented their exact usage, but he didn't invent the words. "Elf", obviously, goes way back, but "orc" also goes way back, with meanings similar to the Tolkien usage.
"Zerg", "Protoss", & "SCV", are all neologisms; notably, the least weird ones, "Kerrigan" and "Terran", are quite ordinary words. ('Hydralisk' is a bit in between. 'Hydra' as a prefix is familiar, albeit increasingly hopelessly overloaded with SF/comic connotations, but 'lisk' as a suffix is a very unfamiliar one: 'obelisk' is the only one that comes to mind, and that appears to g...
In my case, as a former military firefighter in Brazil
FWIW, I would be interested in any memoirs or lessons learned about that career, quite aside from any formal research. I don't think there are many firefighters, former, military, or otherwise, on LW, and I bet you saw some interesting things.
But humans don't seem to optimize for reward all that often!
You might be interested in an earlier discussion on whether "humans are a hot mess": https://www.lesswrong.com/posts/SQfcNuzPWscEj4X5E/the-hot-mess-theory-of-ai-misalignment-more-intelligent https://www.lesswrong.com/posts/izSwxS4p53JgJpEZa/notes-on-the-hot-mess-theory-of-ai-misalignment
Yeah, I was afraid that might apply here. It seems like you should still be able to do something like "government employee tier" subscriptions, not targeted at an individual but perhaps something like 'GS-8 and up', set low enough that it would appeal to such customers, perhaps? It is not a gift but a discount, it is not to an individual but to a class, it is part of a market, and it is not conditional on any government action or inaction, and such discounts are very common for 'students', 'veterans', 'first responders' etc, and I've never seen any finepri...
Yes. (And they can learn to predict and estimate the reward too to achieve even higher reward than simply optimizing the reward. For example, if you included an input, which said which arm had the reward, the RNN would learn to use that, and so would be able to change its decision without experiencing a single negative reward. A REINFORCE or evolution-strategies meta-trained RNN would have no problem with learning such a policy, which attempts to learn or infer the reward each episode in order to choose the right action.)
Nor is it at all guaranteed that 't...
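A toy numpy sketch of the meta-trained bandit RNN described two paragraphs up (evolution strategies as the outer loop, one of the two trainers named there; all sizes and hyperparameters are arbitrary choices for illustration, not from any paper):

```python
import numpy as np

# Tiny RNN policy, meta-trained by evolution strategies on 2-armed bandit episodes,
# where an extra "cue" input says which arm pays this episode. A policy that learns
# to read the cue can pick the right arm without having to try the wrong one first.

N_IN, N_H, N_A = 5, 16, 2  # input = [prev_action (2), prev_reward (1), cue (2)]
SHAPES = [(N_H, N_IN), (N_H, N_H), (N_H,), (N_A, N_H), (N_A,)]
N_PARAMS = sum(int(np.prod(s)) for s in SHAPES)

def unpack(theta):
    out, i = [], 0
    for shape in SHAPES:
        size = int(np.prod(shape))
        out.append(theta[i:i + size].reshape(shape))
        i += size
    return out

def episode_return(theta, steps=10, seed=0):
    r = np.random.default_rng(seed)
    Wx, Wh, bh, Wo, bo = unpack(theta)
    good_arm = r.integers(2)            # which arm pays this episode...
    cue = np.eye(2)[good_arm]           # ...and the policy is told so directly
    h, prev_a, prev_r, total = np.zeros(N_H), np.zeros(2), 0.0, 0.0
    for _ in range(steps):
        x = np.concatenate([prev_a, [prev_r], cue])
        h = np.tanh(Wx @ x + Wh @ h + bh)
        logits = Wo @ h + bo
        p = np.exp(logits - logits.max()); p /= p.sum()
        a = r.choice(2, p=p)
        rew = float(a == good_arm)
        total += rew
        prev_a, prev_r = np.eye(2)[a], rew
    return total

# Evolution-strategies outer loop ("meta-training" of the RNN weights).
rng = np.random.default_rng(0)
theta = rng.normal(0, 0.1, N_PARAMS)
sigma, lr, pop = 0.05, 0.03, 64
for gen in range(300):
    eps = rng.normal(size=(pop, N_PARAMS))
    rets = np.array([episode_return(theta + sigma * e, seed=1000 * gen + k)
                     for k, e in enumerate(eps)])
    adv = (rets - rets.mean()) / (rets.std() + 1e-8)
    theta += lr / (pop * sigma) * eps.T @ adv
    if gen % 50 == 0:
        print(gen, rets.mean())  # should climb toward 10/10 as it learns to follow the cue
```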
Today, the cultures are closer, but the subcultures can be larger. A hundred years ago, there would be no such thing as the rationalist community.
That seems like a stretch, whether you put the stress on the 'community' or the 'rationalist' part. Subcultures can be larger, of course, if only because the global population is like 5x larger, but niche subcultures like 'the rationalist community' could certainly have existed then. Nothing much has changed there.
A hundred years ago was 1925; in 1925 there were countless communes, cults, Chinatowns/ghettos (or ...
That looks pretty sensible overall, thanks.
You can see what looks like a fairly clear anti-pattern of switching languages/scripts, and the glitch-tokens may help explain the apparent patternness of the repetition in the non-token-split visualization: if LLaMA has " Хронологија" as a glitch-token, it may literally be unable to see that it's repeating a token by writing the apparently-patterned " Хронологија| Хронологија". Then it's not surprising if there are occasional repeats or 'too many' glitch-tokens (either birthday paradox as you scan over the sample...