Comments

"Overtraining" isn't Chinchilla; Chinchilla is just "training". The overtraining being advocated was supra-Chinchilla, with the logic that while you were departing from compute-optimal training, sure, you were more than making up for it with the compute savings in the deployment phase, which the Chinchilla scaling laws do not address in any way. So there was a fad for training small models for a lot longer.

The pondering happens in earlier layers of the network, not in the output

Then how does it produce any tokens...?

then training on task Y could inadvertently bias the model to do more or less pondering on mostly-unrelated-but-statistically-correlated topic X.

But if that is what is going on, and the model initially learns to ponder by accident due to bogus feedback or error, then the spurious correlation should eventually be figured out: the model ponders more, but the reward does not increase, and so the pondering gets unlearned.

(Also, this assumes that RL gives an average reward of 0.0, which I don't know if that's true in practice.)

I think the mean would be taken out by the advantage estimation, so the RLHF continues to increase the probability of generating the tokens from episodes with above-average reward, and decrease the probability of generating the tokens from the below-average-reward episodes. The effect is as if the average reward were always 0.
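
A minimal sketch of that mean-centering point (the rewards here are made-up numbers, not from any real RLHF run): subtracting a group-mean baseline makes the advantages sum to zero, so whatever the raw reward scale, above-average episodes get reinforced and below-average ones punished.

```python
# Toy illustration: a mean baseline centers the rewards, so the update
# behaves as if the average reward were always 0.
def advantages(rewards):
    baseline = sum(rewards) / len(rewards)
    return [r - baseline for r in rewards]

rewards = [2.0, 5.0, 8.0]      # hypothetical raw rewards, all positive
adv = advantages(rewards)      # [-3.0, 0.0, 3.0]
assert abs(sum(adv)) < 1e-9    # zero mean: below-average episodes get negative advantage
```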

What would be the implications? The model could develop a political bias to think more deeply about topics related to party X, where X is whichever party has more users giving the model positive feedback. Even if the other topics on party X's agenda are never explicitly talked about (!)

That sounds like the pondering's conclusions are then related to the task.

This idea could very well be wrong. The gradients may be weakened during backpropagation before they get to the unrelated ideas, because the ideas did not directly contribute to the task.

Under a straightforward RLHF using PPO, I think there wouldn't be much weakening, because the REINFORCE operator conceptually just rewards (or punishes) all tokens generated during an episode, without making much attempt to decide which were 'good' or 'bad'. (That's why it's so high-variance.) Any advantage function trying to remove some of that variance probably won't do a good job.
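
A toy sketch of that credit-assignment point (the log-probs and advantages are invented numbers): in the plain REINFORCE surrogate loss, every token in an episode receives the same scalar advantage, so all of them are pushed up or down uniformly, whether or not any individual token helped.

```python
# Sketch of REINFORCE's blanket credit assignment: one episode-level
# advantage multiplies the log-probs of *all* sampled tokens.
def reinforce_loss(token_logprobs, advantage):
    # Surrogate loss: -A * sum(log pi(token)); its gradient moves every
    # token's probability in the same direction, scaled by A.
    return -advantage * sum(token_logprobs)

logprobs = [-0.5, -1.2, -0.3]   # hypothetical log-probs of the sampled tokens
loss_good = reinforce_loss(logprobs, advantage=+1.0)
loss_bad = reinforce_loss(logprobs, advantage=-1.0)
assert loss_good == -loss_bad   # same tokens, opposite update direction
```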

More problematically for your idea, if the conclusions are indeed 'unrelated to the task', then shouldn't they be just as likely to arise in every episode - including the ones where it got negative reward? That would seem like it ought to exactly cancel out any learning of 'pondering'.

You need some incentive somewhere to learn good 'pondering'. (I have an example proposal for 'free play' which tries to teach a sort of 'pondering', but by stopping gradients, so anything learned in the initial steps is 'free', and so it can meta-learn to screw around and get something useful for free.)

Maybe it would look more random if you presented it segmented by token instead of decoded into characters? I'm not familiar with the LLaMA tokenizations, but you seem to imply that a lot of the apparent patterns here are single tokens (e.g. "partiellement" would be very surprising to me as the output of greedy likelihood-minimizing sampling, but is trivial if it is a single BPE token). Decoding such tokens into ordinary words would create a misleading impression of coherence.
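
One way to check, sketched with a made-up token list (not real LLaMA tokenizer output): print the sample with explicit token boundaries rather than as a decoded string, so an apparently coherent word that is really a single BPE token stands out as one sampling step.

```python
# Hypothetical check: rendering token boundaries explicitly reveals when
# an apparently coherent word was produced by a single sampling step.
def show_tokens(tokens):
    return "|".join(tokens)

# If "partiellement" is one token, the segmented view shows the sample is
# three draws, not a dozen coherent character choices:
sample = ["partiellement", "衡", "д"]
print(show_tokens(sample))   # partiellement|衡|д
```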

Also, as Baginski notes, greedily sampling to minimize likelihood will no more minimize the total likelihood than greedily maximizing likelihood would maximize the total likelihood. So it would be worth trying at least 'worst-of-n' sampling to see if it looks more like what you expect, in the same way that best-of-n often helps produce more expected LLM output. (After all, you would expect the tiniest logits to be the worst-estimated of all logits, right? Full of sheer numerical noise and error, given that this is pushing 'dark knowledge' to its extremes. Who can really say how much better or worse an answer '衡' is than 'д' when following '*', etc.? So if best-of-n can make such a qualitative difference when greedily sampling from the best-estimated logits...)
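
A hedged sketch of the worst-of-n idea (the candidate sequences and scores are stand-ins for real sampler output): draw n sequences however you like, score each by its total log-likelihood, and keep the lowest-scoring one, mirroring best-of-n with the comparison inverted.

```python
# Worst-of-n selection over pre-scored candidates: pick the sequence with
# the lowest total log-likelihood, the inverse of best-of-n reranking.
def worst_of_n(candidates):
    # candidates: list of (sequence, total_logprob) pairs
    return min(candidates, key=lambda c: c[1])[0]

candidates = [
    ("plausible text", -12.3),   # hypothetical scores from n sampled sequences
    ("weirder text", -45.6),
    ("weirdest text", -78.9),
]
assert worst_of_n(candidates) == "weirdest text"
```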

Note that text in pretraining may even be an expensive way to go about it: one of the most dramatic demonstrations MS gave us with Sydney was the incredible speed & efficiency of web-search-powered adversarial attacks on LLMs. You don't need to dump a lot of samples onto the Internet and pray they make it into the training data and don't get forgotten, if you can set up a single sample with good SEO and the LLM kindly retrieves it for you and attacks itself with your sample.

This is something to think about: it's not just making it into the training data, it's making it into the agent's prompt or context that can matter. People are currently talking about how Deep Research is an example of the AI trend which will drive paywalls everywhere... which may happen, but consider the positives for people who don't put up paywalls.

Why not just 'valuable information', in a Value of Information sense of 'valuable'?

The estimate of the compute of their largest version ever (which is a very helpful way to phrase it) at only <=50x GPT-4 is quite relevant to many discussions (props to Nesov) and something Altman probably shouldn't've said.

The estimate of test-time compute at 1000x effective-compute is confirmation of looser talk.

The scientific research part is of uncertain importance but we may well be referring back to this statement a year from now.

Apropos of very low-latency LLMs and revisiting this topic a little: what does this imply about DRL robotics, rather than animals? Will DRL NNs have to have brains as big as humans in order to run superhuman humanoid robots?

One possible implication is that Portia-like NNs are possible for robotics in general. Robotics may be quite 'easy' in that sense.

It is striking that when we look at NN parameter/FLOPS-counts, we generally do not see 'large' robotics, vision, or sound models, but LLMs; the largest pure-vision models like PaLI-X are <100b parameters, the largest robotics models are usually <10b, with Gato 1's ~1b having been, if anything, unusually large because of all the other stuff it was doing. (I'm very behind on the robotics literature, so maybe there are now much larger 100b-parameter models as they move into the 'foundation model' multi-modal/task scaling paradigm, but I'd bet that there still are none >1,000b.) Even sound/image/video generative models, which would be expected to be much larger than necessary for robotics tasks, are often still small enough to run on a single consumer GPU. And these are usually trained with scaling laws now, so these are compute-optimal sizes, and it is not just that they are wildly under-parameterized (the way almost all models were pre-2020).

So, if robotics is intrinsically easy, but animal brains do not show this because of their latency requirements, which forces them into misleadingly expensive brains, the implication is that we can do robotics by lifting the limitations of biological brains, like being forced to learn in realtime, in the real world, one animal at a time, without any sharing.

We should be able to train deep but small NNs in silico: turning all animal problems into Portia problems, if you will, pausing the simulation to let the NNs think & act for as long as necessary to plan the right action, and only then letting time flow to see what happens, and reset it to try again.

We remove all burdens of wallclock time or caloric consumption or childhood development, which are powerful general robotic controllers, and only then use these teacher-models to optimize low-latency controllers. The wider low-latency student models will be easier to train when they simply must imitate the teacher in a supervised-learning setting instead of RL from scratch, and so the size should be a lot better. (If nothing else, the student models can't 'die' if they make a mistake like breaking a latency constraint, so this learning setting is way easier than an animal's task.)


On a related note, it is also striking how far down in size LLMs can be pushed. You can get good reasoning out of tiny billion-parameter LLMs trained hard enough on high-quality-enough data, and the 'densifying experience curve' is steady and rapid (halving period of ~4 months), so we can expect that at some point we may have superhuman reasoning LLMs in the billion or sub-billion parameter range... which are just very, very ignorant, perhaps even more ignorant than you or me, of all the real-world knowledge & text that a proper LLM has. We can't train those from scratch, but we can train trillion-parameter LLMs to suck in all the text in the world, and then exhale training data for small fast cheap models.

So it seems that Moravec's Paradox remains undefeated: as difficult as we find abstract intellectual capabilities like the process of doing math or reasoning (so difficult that we struggle to even write them down to train LLMs on, so difficult to train on that we need giant gigawatt datacenters just to get started), they are not intrinsically difficult, and in the long run do not require big expensive NNs.

But does that necessarily matter? Many of those models can't use tools; and since much of the point of the end-to-end RL training of Deep Research is to teach tool use, showing DR results without tool use would be either irrelevant or misleading (eg. it might do worse than the original o3 model it is trained from, when deprived of the tools it is supposed to use).

Who right now is standing on the sidelines with a killer AI app that could rip up the market if only tokens were a bit cheaper?

OpenAI's Deep Research is looking like something that could be big and they were standing on the sidelines in part because the tokens weren't cheap.