All of Josh You's Comments + Replies

I think there are two models you measured time horizons for, Claude 3 Opus and GPT-4 Turbo, that didn't make it onto the main figure. Is that right? Figure 5, which shows the time horizon curves for a bunch of models across the full test suite, has 13 models, but Figure 1 has only 11 dots.

AI has probably increased valuations for Big Tech (particularly Nvidia) by at least a few trillion dollars over the past two years. So part of this is that investors think OpenAI/Anthropic will only capture around 10% of total AI profits.

65T tokens doesn't get you to 1e26 FLOP with 100B active params? You'd need well over 100T tokens: 6 * 100 billion * 65 trillion is 3.9e25 FLOP.
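As a rough check, here is a minimal sketch of that arithmetic, assuming the standard C ≈ 6·N·D approximation for dense training compute (N = active parameters, D = training tokens):

```python
N = 100e9   # 100B active parameters
D = 65e12   # 65T training tokens

print(f"{6 * N * D:.1e} FLOP")                  # 3.9e+25, short of 1e26
print(f"{1e26 / (6 * N) / 1e12:.0f}T tokens")   # ~167T tokens needed for 1e26 FLOP at 100B params
```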

GPT-4.5 being trained on fewer tokens than GPT-4o doesn't really make sense. GPT-4.5 only having 5x more active params than GPT-4o doesn't quite make sense either, though I'm not as confident that's wrong.

1e26 FLOP would have had a significant opportunity cost. Remember that OpenAI was and is very GPU constrained and may have valued GPU hours in a large-scale cluster a lot more than $2/hour. It would be worth it to... (read more)

2Vladimir_Nesov
At the end of 2023 Microsoft had 150K+ H100s, so reserving 30K doesn't seem like too much (especially as they can use non-H100 and possibly non-Microsoft compute for research experiments). It's difficult to get a lot of a new chip when it just comes out, or to get a lot in a single training system, or to suddenly get much more if demand surges. But for a frontier training run, there would've been months of notice. And the opportunity cost of not doing this is being left with an inferior model (or a less overtrained model that costs more in inference, and so requires more GPUs to serve for inference).

The main anchors are 32K H100s in a single training system, and frontier training compute scaling 4x per year. Currently, a year later, 3e26-6e26 FLOPs models are getting released (based on 100K H100s in Colossus and numbers in the Grok 3 announcement, 100K H100s at Goodyear site, 100K TPUv6e datacenters, Meta's 128K H100s). The $3bn figure was just to point out that $140m following from such anchors is not a very large number.
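A quick sanity check of the 3e26-6e26 figure, as a sketch assuming ~1e15 dense BF16 FLOP/s per H100 and 40% utilization on a 100K-GPU cluster:

```python
gpus, peak, util = 100_000, 1e15, 0.40   # assumed H100 dense BF16 throughput and utilization
for months in (3, 6):
    seconds = months * 30 * 24 * 3600
    print(f"{months} months: {gpus * peak * util * seconds:.1e} FLOP")
# 3 months: 3.1e+26 FLOP, 6 months: 6.2e+26 FLOP
```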
2Vladimir_Nesov
Right, 45T-65T is for a compute optimal 1e26 model; I did the wrong calculation when editing in this detail. For a 10x overtrained model, it's 3x more data than that, so for 150T total tokens you'd need 5 epochs of 30T tokens, which is still feasible (with almost no degradation compared to 150T unique tokens of that quality). The aim was to calculate this from 260B and 370B reduced 3x (rather than from 100B).

How so? If it uses 3x more compute but isn't 10x overtrained, that means less data (with multiple epochs, it would probably use exactly the same unique data, repeated a bit less). The video presentation on GPT-4.5 mentioned work on lower precision in pretraining, so it might even be a 6e26 FLOPs model (though a priori it would be surprising if the first foray into this scale isn't taken at the more conservative BF16). And it would still be less data (square root of 6x is less than 3x).

Overtraining has a large effect on both the number of active parameters and the needed number of tokens, at a relatively minor cost in effective compute, thus it's a very salient thing for use in production models.
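A minimal sketch of the overtraining arithmetic in this exchange, assuming C ≈ 6·N·D and Chinchilla-style scaling where compute-optimal N and D each grow roughly as √C (the 300B figure below is just an illustrative midpoint of the 260B-370B range):

```python
import math

C = 1e26
N_opt = 300e9                 # illustrative compute-optimal active params at 1e26 FLOP
D_opt = C / (6 * N_opt)       # ~56T tokens, i.e. the 45T-65T ballpark

# "10x overtrained": 3x fewer params, 3x more data, same total compute.
N_ot, D_ot = N_opt / 3, D_opt * 3
print(f"tokens/param: optimal ~{D_opt / N_opt:.0f}, overtrained ~{D_ot / N_ot:.0f}")  # ~185 vs ~1667

# Compute-optimal data grows ~sqrt(C): 6x more compute means ~2.4x more data, not 6x.
print(f"sqrt(6) = {math.sqrt(6):.2f}")
```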

I don't think GPT-4o was trained on 1e26 FLOP or particularly close to it. Overtraining is common but GPT-4o being overtrained by 10x for 1e26 FLOP is kind of a strong and surprising claim (some models like Llama 3 8b are extremely overtrained but they're small so this overtraining is cheap). I think a more natural explanation is that it improves on GPT-4 because of superior post-training and other innovations.

2Vladimir_Nesov
There is a report that OpenAI might've been intending to spend $3bn on training in 2024 (presumably mostly for many smaller research experiments), and a claim that the Goodyear site has 3 buildings hosting 100K H100s. One of these buildings is 32K H100s, which at 40% utilization in 3 months produces 1e26 FLOPs (in BF16), which in GPU-time at $2/hour costs $140m. So it seems plausible that Azure already had one of these (or identical) datacenter buildings when GPT-4o was ready to train, and that $140m wasn't too much for a flagship model that carries the brand for another year.

With this amount of compute and the price of $2.5 per 1M input tokens, it's unlikely to be compute optimal. For MoEs at 1e26 FLOPs, it might be compute optimal to have 120-240 tokens/parameter (for 1:8-1:32 sparsity), which is 370B active parameters for a 1:8 sparse MoE or 260B for a 1:32 sparse MoE. Dense Llama-3-405B was $5 per 1M input tokens at probably slimmer margins, so GPT-4o needs to be more like 100B active parameters. Thus 3x less parameters than optimal and 3x more data than optimal (about ~~45T-65T~~ 135T-190T trained-on tokens, which is reasonable as ~~3-4~~ 5 epochs of ~~15T-20T~~ 25T-40T unique tokens), giving 10x overtraining in the value of tokens/parameter compared to compute optimal.

The penalty from 10x overtraining is a compute multiplier of about 0.5x, so a 5e25 FLOPs compute optimal model would have similar performance, but it would have 2x more active parameters than a 10x overtrained 1e26 FLOPs model, which at $70m difference in cost of training should more than pay for itself.
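As a sketch of the GPU-time arithmetic above, assuming ~1e15 dense BF16 FLOP/s per H100:

```python
gpus = 32_000
seconds = 90 * 24 * 3600    # ~3 months
utilization = 0.40

flops = gpus * 1e15 * utilization * seconds
gpu_hours = gpus * seconds / 3600

print(f"{flops:.1e} FLOP")                              # ~1.0e+26
print(f"${gpu_hours * 2 / 1e6:.0f}M at $2/GPU-hour")    # ~$138M, i.e. the ~$140m figure
```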
Josh You*40

The high cost and slow speed of GPT-4.5 seem like a sign that OpenAI is facing data constraints, though we don't actually know the parameters, and OpenAI might be charging a bigger margin than usual (it's a "research preview", not a flagship commercial product). If data were more abundant, wouldn't GPT-4.5 be more overtrained and have fewer parameters?

edit: FWIW Artificial Analysis measures GPT-4.5 at a not-that-bad 50 tokens per second whereas I've been experiencing a painfully slow 10-20 tokens/second in the chat app. So may just be growing pains until t... (read more)

Josh You108

> if OpenAI follows the usual naming convention of roughly 100x in raw compute.

I doubt this is a real convention. I think OpenAI wanted to call Orion GPT-5 if they thought it was good enough to deserve the name.

4Vladimir_Nesov
I'm merely referring to the historical precedent, whether there are informal commitments in the minds of the leadership is not something I can speak to. This pattern might continue or it might break. What I'm guessing about training system buildout from vague clues seems to be consistent with it continuing, so the naming pattern can be used as another clue to make a point estimate prediction that's more concrete.

In Holden Karnofsky's "AI Could Defeat All Of Us Combined" a plausible existential risk threat model is described, in which a swarm of human-level AIs outmanoeuvre humans due to AI's faster cognitive speeds and improved coordination, rather than qualitative superintelligence capabilities. This scenario is predicated on the belief that "once the first human-level AI system is created, whoever created it could use the same computing power it took to create it in order to run several hundred million copies for about a year each." If the first AGIs are as expe

... (read more)
Josh You*50

By several reports, (e.g. here and here) OpenAI is throwing enormous amounts of training compute at o-series models. And if the new RL paradigm involves more decentralized training compute than the pretraining paradigm, that could lead to more consolidation into a few players, not less, because pretraining* is bottlenecked by the size of the largest cluster. E.g. OpenAI's biggest single compute cluster is similar in size to xAI's, even though OpenAI has access to much more compute overall. But if it's just about who has the most compute then the biggest pl... (read more)

AI systems can presumably be given at least as much access to company data as human employees at that company. So if rapidly scaling up the number and quality of human workers at a given company would be transformative,  AI agents with >=human-level intelligence can also be transformative.

I think a little more explanation is required on why there isn't already a model with 5-10x* more compute than GPT-4 (which would be "4.5 level" given that GPT version numbers have historically gone up by 1 for every two OOMs, though I think the model literally called GPT-5 will only be a roughly 10x scale-up). 

You'd need around 100,000 H100s (or maybe somewhat fewer; Llama 3.1 was 2x GPT-4 and trained using 16,000 H100s) to train a model at 10x GPT-4.  This has been available to the biggest hyperscalers since sometime last year. Naively it might... (read more)
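Two rough ways to arrive at that GPU count, as a sketch (the FLOP-based version assumes GPT-4 ≈ 2e25 FLOP, ~1e15 dense BF16 FLOP/s per H100, 40% utilization, and a ~90-day run):

```python
# (a) Scale linearly from the Llama 3.1 data point: 16K H100s for ~2x GPT-4.
print(f"~{16_000 * 10 / 2:,.0f} H100s")                 # ~80,000

# (b) From raw FLOP for a 10x GPT-4 (~2e26 FLOP) run.
gpus = 10 * 2e25 / (1e15 * 0.40 * 90 * 24 * 3600)
print(f"~{gpus:,.0f} H100s")                            # ~64,000

# Either way, roughly consistent with "around 100,000 H100s (or maybe somewhat fewer)".
```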

6Nathan Helm-Burger
Yes, good point Josh. If the biggest labs had been pushing as fast as possible, they could have a next model by now. I don't have a definite answer to this, but I have some guesses. It could be a combination of any of these:
* Keeping up with inference demand, as Josh mentioned
* Wanting to focus on things other than getting the next big model out ASAP: multimodality (e.g. GPT-4o), better versions of cheaper smaller models (e.g. Sonnet 3.5, Gemini Flash), non-capabilities work like safety or watermarking
* Choosing to put more time and effort into improving the data/code/training process which will be used for the next large model run. Potentially including: smaller scale experiments to test ideas, cleaning data, improving synthetic data generation (Strawberry?), gathering new data to cover specific weak spots (perhaps by paying people to create it), developing and testing better engineering infrastructure to support larger runs
* Wanting to spend extra time evaluating performance of the checkpoints partway through training to make sure everything is working as expected. Larger scale means mistakes are much more costly; mistakes caught early in the training process are less costly overall.
* Wanting to spend more time and effort evaluating the final product. There were several months where GPT-4 existed internally and got tested in a bunch of different ways. Nathan Labenz tells interesting stories of his time as a pre-release tester. Hopefully, with the new larger generation of models the companies will spend even more time and effort evaluating the new capabilities. If they scaled up their evaluation time from 6-8 months to 12-18 months, then we'd expect that much additional delay. We would only see a new next-gen model publicly right now if they had started on it ASAP and then completely skipped the safety testing. I really hope no companies choose to skip safety testing!
* If safety and quality testing is done (as I expect it will be), then flaws fo

GPT-4 (Mar 2023 version) is rumored to have been trained on 25K A100s for 2e25 FLOPs, and Gemini 1.0 Ultra on TPUv4s (this detail is in the report) for 1e26 FLOPs. In BF16, A100s give 300 teraFLOP/s, TPUv4s 270 teraFLOP/s, H100s 1000 teraFLOP/s (marketing materials say 2000 teraFLOP/s, but that's for sparse computation that isn't relevant for training). So H100s have 3x advantage over hardware that trained GPT-4 and Gemini 1.0 Ultra. Llama-3-405b was trained on 16K H100s for about 2 months, getting 4e25 BF16 FLOPs at 40% compute utilization.

With 100K H100s... (read more)

Josh You178

Llama 405B was trained on a bunch of synthetic data in post-training for coding, long-context prompts, and tool use (see section 4.3 of the paper).

AI that can rewrite CUDA is a ways off. It's possible that it won't be that far away in calendar time, but it is far away in terms of AI market growth and hype cycles. If GPT-5 does well, Nvidia will reap the gains more than AMD or Google.

The US is currently donating doses to other countries in large quantities. Domestically, it has around 54m doses distributed but not used right now. (https://covid.cdc.gov/covid-data-tracker/#vaccinations). Some but certainly not all of those are at risk of expiration. If US authorities recommended booster shots for the general population then that would easily use up the currently unused supply and reduce vaccine exports.

Josh You390

I did it, I did it, I did it, yay!

Josh You120

A compromise that I find appealing and might implement for myself is giving a fixed percentage over a fixed amount, with that fixed percentage being relatively high (well above ten percent). You could also have multiple "donation brackets" with an increased marginal donation rate as your income increases.
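A minimal sketch of what such "donation brackets" could look like, computed like marginal tax brackets; the thresholds and rates below are made-up illustrations, not a recommendation:

```python
# Hypothetical donation brackets: each marginal rate applies only to income above its threshold.
BRACKETS = [
    (50_000, 0.00),        # nothing on the first $50k
    (100_000, 0.20),       # 20% of income between $50k and $100k
    (float("inf"), 0.40),  # 40% of income above $100k
]

def donation(income: float) -> float:
    total, lower = 0.0, 0.0
    for upper, rate in BRACKETS:
        if income <= lower:
            break
        total += rate * (min(income, upper) - lower)
        lower = upper
    return total

print(donation(150_000))   # 0.2 * 50k + 0.4 * 50k = 30000.0
```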

-3Tedav
I like this approach. It makes sense, and it mostly dodges the problem that other "simple" formulae for charity have - namely that most simple systems tend to be essentially voluntary regressive taxation. This is why the 10% rule has always bugged me - it is a culturally accepted voluntary regressive tax, and as such it exacerbates social inequality. [Also, one of my friends likes to joke that our culture holds that you give 10% of your income to charity, but capital gains are exempt...]

I'm always on the lookout for things that seem innocuous or even beneficial but actually are ways of enforcing the social structure and preventing upwards mobility, like our strange insistence on prescriptive rules of language, and upon the necessity of "sounding intelligent". Languages are evolved social constructs, and "correct grammar" is determined by native speakers. However, we impose additional rules that stray from the natural form of the language, and develop a notion that certain ways of speaking/writing are proper, and that other ways are ignorant. To learn how to speak in a way that sounds intelligent requires additional investment of time and effort, and those that cannot afford to do so (can't afford to spend as much time reading, or come from an area with worse schools) will grow up speaking a completely intelligible version of the language, but one that is generally recognized as sounding like a marker of ignorance, and thus limits possibilities for advancement.

Ok, I really got off topic there, but my point was that our cultural construct that people should give a fixed percentage of their income to charity might very well not be a force for good, but rather a force opposing good. It is a regressive taxation system, but one that is culturally supported. Further, because so many people feel like everyone is already voluntarily consenting to give to charity (especially through religious organizations) that actual taxation is an unnecessary imposition. If we d
2Peter Wildeford
This is what Peter Singer proposes.
4gjm
Yes, I like this approach too.

I doubt an IQ test would be useful at all. One has to be quite intelligent to be a real candidate for presidency.

0James_Miller
But it could make a big difference if a president is only two as opposed to, say, three standard deviations above the mean.

He also likes arguing with Jeff Kaufman about effective altruism.

7jefftk
I went to college with him. More of us arguing: http://www.jefftk.com/p/value-and-money

He doesn't like lesswrong at all, for reasons similar to Apophemi's.
0syllogism
Can't say I'm impressed with his reasoning there. Interesting.

Probably shouldn't say someone "probably" has an IQ between 145 and 160 unless you have pretty good evidence.

4Jayson_Virissimo
Well, according to this popular science article, the average physics Ph.D. has an I.Q. of 143+.
Josh You130

I think it makes a big difference whether the preferred theory is gender/racial equality as opposed to fundamentalist Christianity, and whether the opposition to those perceived challenges results from emotional sensitivity as opposed to blind faith. At the very least, the blog post doesn't indicate that the author would be irrational about issues other than marginalization.

5Watercressed
Does fundamentalist Christianity indicate that the believer would be irrational about issues other than religion? If yes, what's the difference?

I don't see how the fact that the permissiveness principle is only based on one (two, actually, including the third one) of the six foundations would imply that it's not a widely-held intuition.

3knb
You're falsely conflating the permissiveness principle with those moral foundations. The permissiveness principle is a much stronger position, which states that things are only immoral if they cause harm.

How risk-averse are you? But even if you aren't, I suspect that right now bitcoins aren't a great investment strictly in expected-value terms due to the high risk that they will decline in value by a lot. No one really knows what will happen, though.

6Ishaan
I'm fairly risk averse in the sense that I wouldn't ordinarily feel confident enough in my knowledge to play on speculative markets... I just happen to possess BTC because someone liked something I wrote and kindly tipped me in BTC a couple years ago. I'd forgotten about it for a while, and today realized that it's worth 20x more than it was originally. So basically my inaction caused me to inadvertently make a good investment. And even though I wouldn't ordinarily have tried investing in BTC, now that the decision to invest has essentially been made for me I curiously haven't immediately decided to convert to dollars. Status quo bias is interesting - I'm tempted just to keep it and watch what happens... so far it's still rising. I guess the amount on the line isn't actually large enough for loss aversion fear to kick in.

Another possible critique is that the philosophical arguments for ethical egoism are (I think) at least fairly plausible. The extent to which this is a critique of EA is debatable (since people within the movement state that it's compatible with non-utilitarian ethical theories and that it appeals to people who want to donate for self-interested reasons) but it's something which merits consideration.

Ehh, I think that's pretty much what rule util means, though I'm not that familiar with the nuances of the definition so take my opinion with a grain of salt. Rule util posits that we follow those rules with the intent of promoting the good; that's why it's called rule utilitarianism.

0DanielLC
Wikipedia link. I'm pretty sure the first time I read this, it specifically stated that using rules of thumb is not the same as rule utilitarianism. As it is, it's less clear, but I'm pretty sure it's still saying that they are two different ideals, rather than just different strategies.

That would be a form of deontology, yes. I'm not sure which action neo-Kantians would actually endorse in that situation, though.

I think that's accurate, though maybe not, because the programming jargon is unnecessarily obfuscating. The basic point is that following the rule is good in and of itself. You shouldn't kill people, because there is a value in not killing that is independent of the outcome of that choice.

0DaFranker
As an attempt to remove the programming jargon (I don't know of any words or expressions which express the same concept without math or programming jargon of some kind): For that example, skipping the traditional "Kill this one or five others die!" dilemma, if we suppose the person to be killed will revive on their own and thereby become immortal, with no additional side effects, the deontological rule still takes precedence and therefore it is good to let the person later die of old age. Rule consequentialism, in such a corner case, would want the person to end up immortal. Correct?

Your description of deontological ethics sounds closer to rule consequentialism, which is a different concept. Deontology means that following certain rules is good in and of itself, not because they lead to better decisionmaking (in terms of promoting some other good) in situations of uncertainty.

0DanielLC
It sounds more like act utilitarianism to me. Rule utilitarianism is when you notice that lying usually has bad consequences, and therefore decide not to lie even when lying has good consequences. Coming up with heuristics like "don't lie, unless you have a really good reason" or even "don't lie, even if you think you have a really good reason" is still something you do with the sole intent of improving the consequences. It is therefore act utilitarianism.
0DaFranker
Just to see if I'm following correctly: If I want to follow the rule "optimize actions for some utility function X", rule consequentialism says I do this because of the result of the utility function X, and my terminal value is X()¹, which I am trivially doing better decisionmaking for by using the aforementioned rule. On the other hand, deontology says that I'm following that rule because X itself is good, regardless of whether I value X() or not. This may be because that is simply how human brains are programmed and that is what they do, or by some philosophically-vague decree from higher powers, or something else, but the key point being that X() is completely irrelevant?

1) Programmer slang. If I say my value is "X", that means I value the function, but if I say X(), that means I value the output of the function.
Josh You490

Survey taken. Defected since I'm neutral as to whether the money goes to Yvain or a random survey-taker, but would prefer the money going to me over either of those two.

1christopherj
It seems that the fate of the prize money is having a huge effect on people's choice to cooperate or defect. Yvain could modify the numbers by some potentially large percentage by offering to either donate the remainder of the prize to a charity, or do something near-equivalent to burning it. I chose to cooperate because the good feelings are worth more to me than a fraction of a cent, and I expect people to prefer cooperation even if it is the anti-game-theory response.