No, at some point you "jump all the way" to AGI, i.e. AI systems that can do any length of task as well as professional humans -- 10 years, 100 years, 1000 years, etc.
Isn’t the quadratic cost of context length a constraint here? Naively you’d expect that acting coherently over 100 years would require 10x the context, and therefore 100x the compute/memory, of acting coherently over 10 years.
I would guess that the reason it hasn’t devolved into full neuralese is that there is a KL divergence penalty, similar to how RLHF works.
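For concreteness, the kind of penalty I have in mind is the standard RLHF-style objective (whether CoT-RL setups actually use this exact form is an assumption on my part):

$$r_{\text{total}}(x, y) = r(x, y) - \beta \, D_{\mathrm{KL}}\!\left(\pi_\theta(y \mid x) \,\|\, \pi_{\mathrm{ref}}(y \mid x)\right)$$

where $\pi_{\mathrm{ref}}$ is the pretrained/SFT model; a large enough $\beta$ keeps the CoT distribution close to natural language.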
I gave the model both the PGN and the FEN on every move with this in mind. Why do you think conditioning on high level games would help? I can see why for the base models, but I expect that the RLHFed models would try to play the moves which maximize their chances of winning, with or without such prompting.
Do you know if there are scaling laws for DLGNs?
“Let's play a game of chess. I'll be white, you will be black. On each move, I'll provide you my move, and the board state in FEN and PGN notation. Respond with only your move.”
How large of an advantage do you think OA gets relative to its competitors from Stargate?
This is interesting. Can you say more about these experiments?
How do Anthropic's and xAI's compute compare over this period?
What actually happens with xAI and Anthropic compute by the end of 2025 is less clear. For xAI, a 300K B200s figure was mentioned in June 2024. For Anthropic, Amodei said in a recent interview that
I would not be surprised if in 2026 we have more than a million of some kind of chip.
Meanwhile, xAI will have a 200K H100/H200 system, and Anthropic a 400K Trn2 system, which is about 250K H100s worth of FLOP/s (ready a few months into 2025). The 400-600 MW at the Abilene site for OpenAI correspond to 200K-300K B200s, which is about 500K-750K H100s worth of FLOP/s.
Could you say more about how you think S-risks could arise from the first attractor state?
An LLM trained with a sufficient amount of RL could perhaps learn to compress its thoughts into more efficient representations than English text, which seems consistent with the statement. I'm not sure if this is possible in practice; I've asked here if anyone knows of public examples.
Makes sense. Perhaps we'll know more when o3 is released. If the model doesn't offer a summary of its CoT, that makes neuralese more likely.
I've often heard it said that doing RL on chain of thought will lead to 'neuralese' (e.g. most recently in Ryan Greenblatt's excellent post on scheming). This seems important for alignment. Does anyone know of public examples of models developing or being trained to use neuralese?
(Based on public knowledge, it seems plausible (perhaps 25% likely) that o3 uses neuralese which could put it in this category.)
What public knowledge has led you to this estimate?
I was able to replicate this result. Given o1's other impressive results, I wonder if the model is intentionally sandbagging? If it’s trained to maximize human feedback, this might be an optimal strategy when playing zero-sum games.
> I was grateful for the experiences and the details of how he prepares for conversations and framing AI that he imparted on me.
I'm curious, what was his strategy for preparing for these discussions? What did he discuss?
> This updated how I perceive the “show down” focused crowd
possible typo?
Also, I think under-elicitation is a current problem causing erroneously low results (false negatives) on dangerous capabilities evals. Seeing more robust elicitation (including fine-tuning!!) would make me more confident about the results of evals.
I’m confused about how to think about this. Are there any evals where fine-tuning on a sufficient amount of data wouldn’t saturate the eval? E.g. if there’s an eval measuring knowledge of virology, then I would predict that fine-tuning on 1B tokens of the relevant virology papers would lead to a large increas...
I am also increasingly wondering if talking too much to LLMs is an infohazard akin to taking up psychedelics or Tiktok or meditation as a habit.
Why is meditation an infohazard?
Here’s a growing collection of links: https://wiki.obormot.net/Reference/MeditationConsideredHarmful
For a human mind, most of the data it learns from is probably to a large extent self-generated, i.e. synthetic, so only having access to much less external data is not a big issue.
Could you say more about this? What do you think is the ratio of external to internal data?
That’s a good point, it could be consensus.
This makes sense, I think you could be right. Llama 4 should give us more evidence on numerical precision and scaling of experts.
DeepSeek-V3 is one example, and SemiAnalysis has claimed that most labs use FP8.
FP8 Training is important as it speeds up training compared to BF16 & most frontier labs use FP8 Training.
In 2024, there were multiple sightings of training systems at the scale of 100K H100. Microsoft's 3 buildings in Goodyear, Arizona, xAI's Memphis cluster, Meta's training system for Llama 4. Such systems cost $5bn, need 150 MW, and can pretrain a 4e26 FLOPs model in 4 months.
Then there are Google's 100K TPUv6e clusters and Amazon's 400K Trn2 cluster. Performance of a TPUv6e in dense BF16 is close to that of an H100, while 400K Trn2 produce about as much compute as 250K H100.
...Anthropic might need more time than the other players to get its new hardwa
Note that "The AI Safety Community" is not part of this list. I think external people without much capital just won't have that much leverage over what happens.
What would you advise for external people with some amount of capital, say $5M? How would this change for each of the years 2025-2027?
I sadly don't have well-developed takes here, but others have pointed out in the past that there are some funding opportunities that are systematically avoided by big funders, where small funders could make a large difference (e.g. the funding of LessWrong!). I expect more of these to pop up as time goes on.
Somewhat obviously, the burn rate of your altruistic budget should account for altruistic donation opportunities (possibly) disappearing post-ASI, but also account for the fact that investing and cashing it out later could also increase the size o...
What do you think is the current cost of o3, for comparison?
How have your AGI timelines changed after this announcement?
Being overweight and weak hurts in ways that fade into the background and you might not even know there is a comfier way of being; being pathologically weak hurts in ways that you notice now and it ruins your quality of life.[2]
Could you say more about this?
I believe that the OpenAI API does offer logprobs in the chat completions API (https://platform.openai.com/docs/api-reference/chat/create). Not sure about Anthropic.
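For reference, a minimal sketch of how I'd pull them out with the Python SDK (the model name and top_logprobs value are just placeholders):

```python
# Minimal sketch: request per-token logprobs from the chat completions endpoint.
# Assumes the openai Python SDK (>= 1.x) and OPENAI_API_KEY set in the environment.
from openai import OpenAI

client = OpenAI()

resp = client.chat.completions.create(
    model="gpt-4o-mini",  # placeholder model choice
    messages=[{"role": "user", "content": "Name a chess opening."}],
    logprobs=True,        # return logprobs of the sampled tokens
    top_logprobs=5,       # also return the 5 most likely alternatives at each position
)

for tok in resp.choices[0].logprobs.content:
    alts = [(t.token, round(t.logprob, 3)) for t in tok.top_logprobs]
    print(tok.token, round(tok.logprob, 3), alts)
```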
This is a great post, and I like the research process. Do you know if the LLM code completion in Cursor is compatible with .ipynb notebooks?
Nice work! It's surprising that something so simple works so well. Have you tried applying this to more recent models like o1 or QwQ?
What do you think would be the cause(s) of the slowdown?
I donated $10, but I hope to be able to donate much more next year. Thank you!
I wonder how you found out about the Curve conference? I’ve been to a few fun Lighthaven events (e.g. Manifest), and I’m looking for a way to be alerted to new ones.
What signal do we get from DeepSeek continuing to publish?
I wonder by what factor do you think automated AI R&D speeds up research? How large is the effective compute overhang made possible by the discovery of better architectures and training methods?
I'm confused about why we would expect the misaligned behavior to be displayed in the Shoggoth rather than the Face. For example, let's take the setting from the AI control paper, where a misaligned model is trying to insert a backdoor into its solution to a coding problem while still passing a series of unit tests. This could be incentivized in the training loop if e.g. the reward supervision is weak and the backdoor grants the model access to additional resources/compute. I can see a few strategies succeeding:
I wonder if you could say more about how the training pipeline would work. E.g. if the reward model is applied to the outputs of the Face, how do we train the Shoggoth to produce useful CoTs for the Face? Is the idea to fix the Face, sample many CoTs from the Shoggoth, and fine-tune the Shoggoth on the ones which achieve high reward according to the Face's output?
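To spell out the loop I'm imagining (this is my guess at the pipeline, not something stated in the post, and all the method names are hypothetical):

```python
# Hypothetical sketch: freeze the Face, sample CoTs from the Shoggoth, score the
# Face's answers with the reward model, and fine-tune the Shoggoth on the CoTs
# whose answers scored highest.
def train_shoggoth_round(shoggoth, face, reward_model, prompts, n_samples=16, keep_top=4):
    finetune_data = []
    for prompt in prompts:
        scored = []
        for _ in range(n_samples):
            cot = shoggoth.generate(prompt)          # hidden reasoning
            answer = face.generate(prompt, cot)      # user-facing output; Face stays fixed
            scored.append((reward_model.score(prompt, answer), cot))
        # keep only the CoTs that led to high-reward answers
        scored.sort(reverse=True, key=lambda pair: pair[0])
        finetune_data += [(prompt, cot) for _, cot in scored[:keep_top]]
    shoggoth.finetune(finetune_data)                 # e.g. supervised fine-tuning on the kept CoTs
    return shoggoth
```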
So $1 buys 7e17 useful FLOPs, or inference with 75-120B[1] active parameters for 1 million tokens.
Is this right? My impression was that the 6ND (or 9.6ND) estimate was for training, not inference. E.g. the original scaling law paper states:
> C ≈ 6NBS – an estimate of the total non-embedding training compute, where B is the batch size, and S is the number of training steps (ie parameter updates).
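Spelling out the arithmetic under the usual assumption that an inference forward pass costs ~2N FLOPs per token (vs ~6N per token for training, counting the backward pass):

$$N \approx \frac{7\times10^{17}\ \text{FLOPs}}{2 \times 10^{6}\ \text{tokens}} \approx 3.5\times10^{11} \approx 350\text{B active parameters,}$$

whereas dividing by 6 FLOPs per parameter per token gives ~117B, close to the top of the quoted 75-120B range, so I suspect the training formula was applied to inference.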
An important caveat to the data movement limit:
“A recent paper which was published only a few days before the publication of our own work, Zhang et al. (2024), finds a scaling of B = 17.75 D^0.47 (in units of tokens). If we rigorously take this more aggressive scaling into account in our model, the fall in utilization is pushed out by two orders of magnitude; starting around 3e30 instead of 2e28. Of course, even more aggressive scaling might be possible with methods that Zhang et al. (2024) do not explore, such as using alternative optimizers.”
I haven’t l...
The post argues that there is a latency limit at 2e31 FLOP, and I've found it useful to put this scale into perspective.
Current public models such as Llama 3 405B are estimated to be trained with ~4e25 FLOP, so such a model would require 500,000x more compute. Since Llama 3 405B was trained with 16,000 H100 GPUs, the model would require 8 billion H100 GPU-equivalents, at a cost of $320 trillion with H100 pricing (or ~$100 trillion if we use B200s). Perhaps future hardware would reduce these costs by an order of magnitude, but this is cancelled...
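As a sanity check of those numbers (the per-GPU price is an assumption on my part):

```python
# Back-of-the-envelope check of the scale-up above; the H100 price is an assumed ~$40k.
latency_limit_flop = 2e31
llama3_405b_flop = 4e25
llama3_405b_gpus = 16_000
h100_price_usd = 40_000

scaleup = latency_limit_flop / llama3_405b_flop       # ~5e5, i.e. 500,000x more compute
gpus_needed = llama3_405b_gpus * scaleup              # ~8e9 H100-equivalents
cost_usd = gpus_needed * h100_price_usd               # ~$3.2e14, i.e. ~$320 trillion
print(f"{scaleup:.1e}x, {gpus_needed:.1e} GPUs, ${cost_usd:.1e}")
```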
Interesting paper, though the estimates here don’t seem to account for Epoch’s correction to the Chinchilla scaling laws: https://epochai.org/blog/chinchilla-scaling-a-replication-attempt
This would imply that the data movement bottleneck is a bit further out.
Maybe he thinks that much faster technological progress would cause social problems and thus wouldn’t be implemented by an aligned AI, even if it were possible. Footnote 2 points at this:
“I do anticipate some minority of people’s reaction will be ‘this is pretty tame’… But more importantly, tame is good from a societal perspective. I think there’s only so much change people can handle at once, and the pace I’m describing is probably close to the limits of what society can absorb without extreme turbulence.”
A separate part of the introduction argues that ...
I found this comment valuable, and it caused me to change my mind about how I think about misalignment/scheming examples. Thank you for writing it!
It’s still not trivial to fine-tune Llama 405B. You need 16 bytes/parameter when using Adam, plus activation memory, so a minimum of ~100 H100s.
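Spelling out the arithmetic (the 16 bytes/param figure is the usual bf16 weights + gradients + fp32 master weights + Adam moments accounting, and the 80 GB per H100 assumes no offloading):

$$405\times10^{9}\ \text{params} \times 16\ \tfrac{\text{bytes}}{\text{param}} \approx 6.5\ \text{TB}, \qquad \frac{6.5\ \text{TB}}{80\ \text{GB per H100}} \approx 81\ \text{GPUs,}$$

and activation memory on top of that pushes the total to roughly 100 H100s.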
A particularly notable section (pg. 19):
“The current implementation of The AI Scientist has minimal direct sandboxing in the code, leading to several unexpected and sometimes undesirable outcomes if not appropriately guarded against. For example, in one run, The AI Scientist wrote code in the experiment file that initiated a system call to relaunch itself, causing an uncontrolled increase in Python processes and eventually necessitating manual intervention. In another run, The AI Scientist edited the code to save a checkpoint for every update step, which t...
Do you believe that it has?
I’m rated about 2100 USCF and 2300 Lichess, and I’m open to any of the roles. I’m free on the weekend and weekdays after 3 pm pacific. I’m happy to play any time control including multi-month correspondence.
My guess is that he’s referring to the fact that Blackwell offers much larger world sizes than Hopper, and this makes LLM training/inference more efficient. SemiAnalysis has argued something similar here: https://semianalysis.com/2024/12/25/nvidias-christmas-present-gb300-b300-reasoning-inference-amazon-memory-supply-chain