All of anaguma's Comments + Replies

anaguma10

My guess is that he’s referring to the fact that Blackwell offers much larger world sizes than Hopper and this makes LLM training/inference more efficient. Semianalysis has argued something similar here: https://semianalysis.com/2024/12/25/nvidias-christmas-present-gb300-b300-reasoning-inference-amazon-memory-supply-chain

2Paragox
Well yes, but that is just because they are whitelisting it to work with NVLink-72 switches. There is no reason a Hopper GPU could not interface with NVLink-72 if Nvidia hadn't artificially limited it. Additionally, by saying "can't be repeated even with Rubin Ultra NVL576", I think they are indicating there is something else improving besides world-size increases, as this improvement would not exist even two GPU generations from now, when we get 576 dies (144 GPUs) worth of mono-addressable pooled VRAM and the giant world / model-head sizes it will enable.
anaguma10

No, at some point you "jump all the way" to AGI, i.e. AI systems that can do any length of task as well as professional humans -- 10 years, 100 years, 1000 years, etc.


Isn’t the quadratic cost of context length a constraint here? Naively you’d expect that acting coherently over 100 years would require 10x the context of acting over 10 years, and therefore 100x the compute/memory.
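For concreteness, the naive scaling I have in mind (assuming vanilla dense attention and no architectural changes):

$$
\text{attention FLOPs per sequence} \;\propto\; L^{2},
$$

so a 10x longer context means roughly 100x the attention compute over the full sequence.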

8Thomas Kwa
Humans don't need 10x more memory per step nor 100x more compute to do a 10-year project than a 1-year project, so this is proof it isn't a hard constraint. It might need an architecture change but if the Gods of Straight Lines control the trend, AI companies will invent it as part of normal algorithmic progress and we will remain on an exponential / superexponential trend.
anaguma53

I would guess that the reason it hasn’t devolved into full neuralese is because there is a KL divergence penalty, similar to how RLHF works.
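For reference, the standard KL-regularized objective I have in mind (a generic RLHF-style setup, not necessarily what any particular lab uses):

$$
\max_{\pi_\theta}\; \mathbb{E}_{x \sim \mathcal{D},\, y \sim \pi_\theta(\cdot\mid x)}\big[r(x,y)\big] \;-\; \beta\, \mathrm{KL}\!\big(\pi_\theta(\cdot\mid x)\,\|\,\pi_{\mathrm{ref}}(\cdot\mid x)\big).
$$

The KL term penalizes drifting far from the reference model's token distribution, which is one mechanism that would keep the CoT close to ordinary English text.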

anaguma10

I gave the model both the PGN and the FEN on every move with this in mind. Why do you think conditioning on high level games would help? I can see why for the base models, but I expect that the RLHFed models would try to play the moves which maximize their chances of winning, with or without such prompting.

9gwern
RLHF doesn't maximize probability of winning, it maximizes a mix of token-level predictive loss (since that is usually added as a loss either directly or implicitly by the K-L) and rater approval, and god knows what else goes on these days in the 'post-training' phase muddying the waters further. Not at all the same thing. (Same way that a RLHF model might not optimize for correctness, and instead be sycophantic. "Yes master, it is just as you say!") It's not at all obvious to me that RLHF should be expected to make the LLMs play their hardest (a rater might focus on punishing illegal moves, or rewarding good-but-not-better-than-me moves), or that the post-training would affect it much at all: how many chess games are really going into the RLHF or post-training, anyway? (As opposed to the pretraining PGNs.) It's hardly an important or valuable task.
anaguma30

Do you know if there are scaling laws for DLGNs?

1Caleb Biddulph
It could be good to look into!
anaguma10

“Let's play a game of chess. I'll be white, you will be black. On each move, I'll provide you my move, and the board state in FEN and PGN notation. Respond with only your move.”
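For concreteness, here is a minimal sketch of how a move loop around this prompt could look, using the python-chess library to keep the FEN and PGN up to date (the LLM call is a stub and the exact prompt wiring is illustrative, not a literal script):

```python
import chess
import chess.pgn

def ask_llm(prompt: str) -> str:
    # Stub: call whatever chat model/API you use and return its move in SAN.
    raise NotImplementedError("wire this to your chat model")

board = chess.Board()
game = chess.pgn.Game()
node = game

while not board.is_game_over():
    my_move = input("Your move (SAN): ")
    board.push_san(my_move)                 # raises ValueError on an illegal move
    node = node.add_variation(board.peek())

    prompt = (
        "Let's play a game of chess. I'll be white, you will be black. "
        f"My move: {my_move}\n"
        f"FEN: {board.fen()}\n"
        f"PGN: {str(game)}\n"
        "Respond with only your move."
    )
    reply = ask_llm(prompt).strip()
    board.push_san(reply)
    node = node.add_variation(board.peek())
    print(f"Model plays: {reply}")
```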

anaguma20

GPT 4.5 is a very tricky model to play chess against. It tricked me in the opening and was much better, then I managed to recover and reach a winning endgame. And then it tried to trick me again by suggesting illegal moves which would lead to it being winning again!

2GoteNoSente
What prompt did you use? I have also experimented with playing chess against GPT-4.5, and used the following prompt: "You are Magnus Carlsen. We are playing a chess game. Always answer only with your next move, in algebraic notation. I'll start: 1. e4" Then I just enter my moves one at a time, in algebraic notation. In my experience, this yields roughly good club player level of play.
anaguma30

How large of an advantage do you think OA gets relative to its competitors from Stargate?

8Vladimir_Nesov
With Stargate, there is only Abilene site and relatively concrete prospect for maybe $40bn so far, enough to build a 1 GW Blackwell training system (4e27 FLOPs models) in 2025-2026, the same scale as was announced by Musk this week. Anthropic compute for 2026 remains opaque ("a million of some kind of chip"), Google probably has the most in principle, but with unclear willingness to spend. Meta didn't say anything to indicate that its Richland Parish site will see 1 GW of Blackwells in 2025-2026, it remains a vague 2 GW by 2030 thing.
anaguma10

This is interesting. Can you say more about these experiments?

8Nathan Helm-Burger
Hmm, I'll see if I can find some old papers.... I'm just reciting memories from grad school lectures like... 12 years ago.

Here's an example of the finding being replicated and explored further in a primate model: https://www.jci.org/articles/view/112443

Here's a review of cortisol inhibition and surgery findings. A mixed bag, a complicated system. https://academic.oup.com/bja/article/85/1/109/263834

https://onlinelibrary.wiley.com/doi/abs/10.1111/ejn.15721 "Evidence suggests that psychological stress has effects on decision making, but the results are inconsistent, and the influence of cortisol and other modulating factors remains unclear."

Basically, cortisol is helpful for surviving injuries. Is it helpful for mental stress? Unclear. Long term high cortisol is harmful, but the stress in one's life resulting in that high cortisol level is harmful in more ways than just high cortisol. So are there times when it would be helpful to reduce someone's cortisol level? Absolutely. But it's complicated and should be done thoughtfully and selectively, and in combination with other things (particularly seeking out and treating the upstream causes). You can find lots more on Google Scholar.
anaguma51

How do Anthropic's and xAI’s compute compare over this period?

What actually happens with xAI and Anthropic compute by the end of 2025 is less clear. For xAI, a 300K B200s figure was mentioned in June 2024. For Anthropic, Amodei said in a recent interview that

I would not be surprised if in 2026 we have more than a million of some kind of chip.

Meanwhile, xAI will have a 200K H100/H200 system, and Anthropic a 400K Trn2 system, which is about 250K H100s worth of FLOP/s (ready by a few months into 2025). The 400-600 MW at Abilene site for OpenAI are 200K-300K B200s, which is about 500K-750K H100s worth of FLOP/s.
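A back-of-the-envelope version of those conversions (the per-chip dense-BF16 numbers are rough public figures, chosen to reproduce the ratios quoted above, and should be treated as assumptions):

```python
# Approximate dense BF16 throughput per chip, in FLOP/s (rough assumptions).
H100 = 1.0e15   # ~989 TFLOP/s, rounded
TRN2 = 0.65e15  # assumed, consistent with 400K Trn2 ~ 250K H100
B200 = 2.5e15   # assumed, consistent with 200K-300K B200 ~ 500K-750K H100

print(400_000 * TRN2 / H100)   # ~260K H100-equivalents for the Trn2 cluster
print(200_000 * B200 / H100)   # ~500K H100-equivalents at 200K B200
print(300_000 * B200 / H100)   # ~750K H100-equivalents at 300K B200
```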

anaguma10

Could you say more about how you think S-risks could arise from the first attractor state?

anaguma52

An LLM trained with a sufficient amount of RL could maybe learn to compress its thoughts into more efficient representations than English text, which seems consistent with the statement. I'm not sure if this is possible in practice; I've asked here if anyone knows of public examples. 

7ryan_greenblatt
Yes, I would count it if the CoT is total gibberish which is (steganographically) encoding reasoning.
anaguma30

Makes sense. Perhaps we'll know more when o3 is released. If the model doesn't offer a summary of its CoT, that makes neuralese more likely. 

anaguma10

I've often heard it said that doing RL on chain of thought will lead to 'neuralese' (e.g. most recently in Ryan Greenblatt's excellent post on scheming). This seems important for alignment. Does anyone know of public examples of models developing or being trained to use neuralese?

5Nathan Helm-Burger
Yes, there have been a variety. Here's the latest which is causing a media buzz: Meta's Coconut https://arxiv.org/html/2412.06769v2
anaguma10

(Based on public knowledge, it seems plausible (perhaps 25% likely) that o3 uses neuralese which could put it in this category.)


What public knowledge has led you to this estimate?

5Bogdan Ionut Cirstea
From https://x.com/__nmca__/status/1870170101091008860: @ryan_greenblatt Shouldn't this be interpreted as a very big update vs. the neuralese-in-o3 hypothesis?
8ryan_greenblatt
* The interaction with users used for o1 (where the AI thinks for a while prior to sending a response) is consistent with neuralese.
* RL adding substantial additional capabilities means there might be enough RL for this to work.
* o3 is a substantial leap over o1 seemingly.
anaguma8-10

I was able to replicate this result. Given other impressive results of o1, I wonder if the model is intentionally sandbagging? If it’s trained to maximize human feedback, this might be an optimal strategy when playing zero-sum games.

7stuhlmueller
FWIW you get the same results with this prompt:
anaguma50

>  I was grateful for the experiences and the details of how he prepares for conversations and framing AI that he imparted on me.

I'm curious, what was his strategy for preparing for these discussions? What did he discuss?

> This updated how I perceive the “show down” focused crowd

possible typo?

6T431
Thank you for the typo-linting. To provide a better response to your first question than the one I’ve provided below, I would need to ask him to explain more than he has already.

From what he has remarked, the first several meetings were very stressful (for most people, this would, of course, be the case!) but soon he adjusted and developed a routine for his meetings. While the routine could go off course based on the responsiveness of the individual(s) present (one staffer kept nodding Yes, had no questions, and then 20 minutes later remarked that they “take into account” what had been said; another staffer remarked that the US simply needed to innovate AI as much as possible, and that safety that stifled this was not to be prioritized; these are statements paraphrased from my friend), I get the sense that in most instances he has been able to provide adequate context on his organization and the broader situation with AI first (not sure which of these two comes first and for how long they are discussed).

Concerning provision of a description of the AI context, I am not sure how dynamic they are; I think he mentioned querying the staffers on their familiarity, and the impression he had was that most staffers listened well and thought critically about his remarks. After the aforementioned descriptions, my friend begins discussing measures that can be taken in support of AI Safety. He mentioned that he tries to steer framings away from those invoking thoughts of arms races or weaponization and instead focuses on uncontrollability “race to the bottom” scenarios, since the former categorization(s) given by others to the staffers has, in his experience, in some instances downplayed concerns for catastrophe and increased focus on further expanding AI capabilities to “outcompete China”.

My friend’s strategy framings seem appropriate and he is a good orator, but I have not the nuanced suggestion that I wanted for my conversation with him, as I’ve not thought enough a
anaguma32

Also, I think under-elicitation is a current problem causing erroneously low results (false negatives) on dangerous capabilities evals. Seeing more robust elicitation (including fine-tuning!!) would make me more confident about the results of evals.

I’m confused about how to think about this. Are there any evals where fine-tuning on a sufficient amount of data wouldn’t saturate the eval? E.g. if there’s an eval measuring knowledge of virology, then I would predict that fine-tuning on 1B tokens of the relevant virology papers would lead to a large increas...

5Nathan Helm-Burger
But you see, that's exactly the point! What is the eval trying to measure?

* If you are measuring how safe a model is to deploy as-is behind a closed API, then fine. No fine-tuning evals needed (as long as the API doesn't offer fine-tuning, in which case you need fine-tuning evals that take the API protections into account).
* If you are measuring how dangerous the model weights would be if they were stolen and completely under the control of bad actors... obviously you need fine-tuning evals! Why wouldn't you expect the bad actors to fine-tune on the task they wanted the model to do well on?!
* If you are measuring how dangerous the model weights will be for a model you intend to publish the weights openly for, same deal. If you wait to do this testing until after you've made the weights public, or you make arguments like "we trust the open source community will let us know if they discover anything hazardous", that's an approach that makes some sense with an open source code library: users notice a flaw, they report it, the maintainers patch the flaw, users update to the latest version, the bug is gone. That is not a good model for how problematic capabilities discovered in open-weight models could be handled.
anaguma30

I am also increasingly wondering if talking too much to LLMs is an infohazard akin to taking up psychedelics or Tiktok or meditation as a habit.

Why is meditation an infohazard?

1Hzn
Anecdotally it seems that way to me. But the fact that it co-evolved with religion is also relevant. The scam seems to be {meditation -> different perspective & less sleep -> vulnerability to indoctrination}, plus the doctrine & the subjective experiences of meditation are designed to reinforce each other.
anaguma10

For a human mind, most data it learns is probably to a large extent self-generated, synthetic, so only having access to much less external data is not a big issue.

Could you say more about this? What do you think is the ratio of external to internal data?

anaguma10

That’s a good point, it could be consensus.

anaguma*10

[deleted]

[This comment is no longer endorsed by its author]
6gwern
This is at best over-simplified in terms of thinking about 'search': Magnus Carlsen would also beat you or an amateur at bullet chess, at any time control. (See for example the forward-pass-only Elos of chess/Go agents; Jones 2021 includes scaling-law work on predicting the zero-search strength of agents, with no apparent upper bound.)
5Vladimir_Nesov
The low resource configuration of o3 that only aggregates 6 traces already improved on results of previous contenders a lot, the plot of dependence on problem size shows this very clearly. Is there a reason to suspect that aggregation is best-of-n rather than consensus (picking the most popular answer)? Their outcome reward model might have systematic errors worse than those of the generative model, since ground truth is in verifiers anyway.
anaguma20

This makes sense, I think you could be right. Llama 4 should give us more evidence on numerical precision and scaling of experts.

anaguma40

DeepSeek-V3 is one example, and SemiAnalysis has claimed that most frontier labs use FP8:

FP8 Training is important as it speeds up training compared to BF16 & most frontier labs use FP8 Training.

7Vladimir_Nesov
DeepSeek-V3 might be the only example (and it's from the future, released after I asked the question). Not sure if it generalizes to expecting more FP8 training, as it's a MoE model with 257 experts and uses relatively small 7Kx2K matrices in its experts, while GPT-3-175B tested in FP8 in the Sep 2022 paper has much larger matrices, and that result wasn't sufficient to promote widespread adoption (at least where it's possible to observe).

On the other hand, if DeepSeek-V3 really is as good for its compute (4e24-6e24 FLOPs) as the benchmarks indicate, it might motivate more training with a huge number of smaller experts (it activates 8 experts per token, so the number of experts is even higher than one would expect from its ratio of total to active parameters). There was a Feb 2024 paper claiming 20x or higher compute multipliers for MoE models compared to dense (Figure 1b), appearing only if they activate a lot of experts per token, predicting 64 to be optimal at 1e24-1e25 FLOPs (the usual practice is to activate 2 experts). So DeepSeek-V3 weakly supports this surprising claim, though actual experimental results with more compute than that paper's 3e19-4e20 FLOPs per datapoint would be better.

The paper also predicts reduction in tokens per parameter with more compute (Table 2), reaching 8 tokens per active parameter at 5e25 FLOPs (in a MoE model with 4096 experts, 64 of which get activated per token). If this too is somehow correct, natural text data can be sufficient for 10 times more compute than with dense models.
anaguma10

In 2024, there were multiple sightings of training systems at the scale of 100K H100. Microsoft's 3 buildings in Goodyear, Arizona, xAI's Memphis cluster, Meta's training system for Llama 4. Such systems cost $5bn, need 150 MW, and can pretrain a 4e26 FLOPs model in 4 months.

Then there are Google's 100K TPUv6e clusters and Amazon's 400K Trn2 cluster. Performance of a TPUv6e in dense BF16 is close to that of an H100, while 400K Trn2 produce about as much compute as 250K H100.

Anthropic might need more time than the other players to get its new hardwa

...
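As a rough sanity check on the 4e26 FLOPs figure above (assuming ~1e15 dense BF16 FLOP/s per H100 and ~40% average utilization, both ballpark assumptions):

```python
gpus        = 100_000          # H100s in the training system
peak_flops  = 1.0e15           # ~1e15 dense BF16 FLOP/s per H100 (rounded)
utilization = 0.40             # assumed average utilization over the run
seconds     = 4 * 30 * 86_400  # ~4 months

total = gpus * peak_flops * utilization * seconds
print(f"{total:.1e}")          # ~4.1e26 FLOPs
```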
5Vladimir_Nesov
Are there any signs to be found in public that anyone is training 10B+ LLMs in a precision that is not 16 bits? There are experiments that are specifically about precision on smaller LLMs, but they don't seem to get adopted in practice for larger models, despite the obvious advantage of getting to 2x the compute.
anaguma141

Note that "The AI Safety Community" is not part of this list. I think external people without much capital just won't have that much leverage over what happens.

What would you advise for external people with some amount of capital, say $5M? How would this change for each of the years 2025-2027?

I sadly don't have well-developed takes here, but others have pointed out in the past that there are some funding opportunities that are systematically avoided by big funders, where small funders could make a large difference (e.g. the funding of LessWrong!). I expect more of these to pop up as time goes on. 

Somewhat obviously, the burn rate of your altruistic budget should account for altruistic donation opportunities (possibly) disappearing post-ASI, but also account for the fact that investing and cashing it out later could also increase the size o...

anaguma10

What do you think is the current cost of o3, for comparison? 

8Vladimir_Nesov
In the same terms as the $100-200bn I'm talking about, o3 is probably about $1.5-5bn, meaning 30K-100K H100, the system needed to train GPT-4o or GPT-4.5o (or whatever they'll call it) that it might be based on. But that's the cost of a training system, its time needed for training is cheaper (since the rest of its time can be used for other things). In the other direction, it's more expensive than just that time because of research experiments. If OpenAI spent $3bn in 2024 on training, this is probably mostly research experiments.
anaguma60

How have your AGI timelines changed after this announcement?

4Vladimir_Nesov
$100-200bn 5 GW training systems are now a go. So in the worlds that slow down for years if there are only $30bn systems available and would need an additional scaling push, timelines moved up a few years. Not sure how unlikely $100-200bn systems would've been without o1/o3, but they seem likely now.
8Thane Ruthenis
~No update, priced it all in after the Q* rumors first surfaced in November 2023.
7Mateusz Bagiński
I guess one's timelines might have gotten longer if one had very high credence that the paradigm opened by o1 is a blind alley (relative to the goal of developing human-worker-omni-replacement-capable AI) but profitable enough that OA gets distracted from its official most ambitious goal. I'm not that person.
anaguma30

Being overweight and weak hurts in ways that fade into the background and you might not even know there is a comfier way of being; being pathologically weak hurts in ways that you notice now and it ruins your quality of life.[2]

Could you say more about this?

4porby
There are lots of little things when it's not at a completely untenable level. Stuff like:

1. Going up a flight or three of steps and really feeling it in my knees, slowing down, and saying 'hoo-oof.'
2. Waking up and stepping out of bed and feeling general unpleasantness in my feet, ankles, knees, hips, or back.
3. Quickly seeking out places to sit when walking around, particularly if there's also longer periods of standing, because my back would become terribly stiff.
4. Walking on uneven surfaces and having a much harder time catching myself when I stumbled, not infrequently causing me to tweak something in my ankle or knee.
5. Always having something hurting a bit. Usually my knees, back, or ankles, but maybe I tried to catch myself with my arm and my elbow didn't like it because it wasn't ready to arrest that much mass that quickly.
6. Trying to keep up on a somewhat sloped section of sidewalk and trying to not sound like I was struggling.
7. Being unable to do basic motions like squatting (unweighted) without significant pain, and needing to use my arms to stand up.
8. Being woken up by aches.
9. Accumulated damage from mild and moderate sprains making itself known for years after the original incidents.
10. Conscious avoidance. Looking at some stairs and having a pang of "ugh," and looking for an alternative.
11. Subconscious avoidance. Reaching down to pick up a backpack from the floor, but bracing one arm against a desk to minimize how much load is carried by my knees or hips because my learned motor patterns took into account that going too much further than that was hard and would likely be painful.

When it progresses, it's hard to miss:

1. Laying on the ground, trying not to laugh at the absurdity of how thoroughly stuck I was, because laughing would hurt too much. I tried motivating myself to move, but came to the conclusion that even if there was a knife-wielding madman sprinting toward me in that moment, the involuntary muscle spa
anaguma10

I believe that the OpenAI API does offer logprobs in the chat completions API (https://platform.openai.com/docs/api-reference/chat/create). Not sure about Anthropic.
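For example, a minimal sketch with the OpenAI Python SDK (the model name is illustrative, and logprob support can vary by model); note it only returns logprobs at the sampled positions plus a few top alternatives:

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

resp = client.chat.completions.create(
    model="gpt-4o-mini",  # illustrative model name
    messages=[{"role": "user", "content": "Is the sky blue? Answer Yes or No."}],
    max_tokens=1,
    logprobs=True,
    top_logprobs=5,       # up to 5 alternatives per sampled token position
)

# Log-probabilities of each sampled token and its top alternatives.
for tok in resp.choices[0].logprobs.content:
    print(tok.token, tok.logprob)
    for alt in tok.top_logprobs:
        print("   ", alt.token, alt.logprob)
```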

1wassname
It seems like it only returns the logprobs of the chosen message, not of a counterfactual message. So you couldn't get the probabilities of the correct answer, only the output answer. This makes sense, as the less information they offer, the harder it is for a competitor to behaviour-clone their confidential model.
anaguma10

This is a great post, and I like the research process. Do you know if the LLM code completion in Cursor is compatible with ipynb notebooks?

2Oliver Daniels
thanks! yup Cursor is notebook compatible
anaguma30

Nice work! It's surprising that something so simple works so well. Have you tried applying this to more recent models like o1 or QwQ?

3John Hughes
For o1-mini, the ASR at 3000 samples is 69.2% and has a similar trajectory to Claude 3.5 Sonnet. Upon quick manual inspection, the false positive rate is very small. So it seems the reasoning post training for o1-mini helps with robustness a bit compared to gpt-4o-mini (which is nearer 90% at 3000 steps). But it is still significantly compromised when using BoN...
3John Hughes
Thanks! Running o1 now, will report back.
anaguma10

What do you think would be the cause(s) of the slowdown?

anaguma101

I donated $10, but I hope to be able to donate much more next year. Thank you!

anaguma12

I wonder how you found out about the Curve conference? I’ve been to a few fun Lighthaven events (e.g. Manifest), and I’m looking for a way to be alerted to new ones.

anaguma10

What signal do we get from DeepSeek continuing to publish?

anaguma10

I wonder by what factor you think automated AI R&D speeds up research? How large is the effective compute overhang made possible by the discovery of better architectures and training methods?

anaguma100

I'm confused about why we would expect the misaligned behavior to be displayed in the Shoggoth rather than the Face. For example, let's take the setting from the AI control paper, where a misaligned model is trying to insert a backdoor in a coding question, despite it passing a series of unit tests. This could be incentivized in the training loop if e.g. the reward supervision is weak and the backdoor grants the model access to additional resources/compute. I can see a few strategies succeeding:

  1. The Shoggoth speculates on different ways to insert a backdoor
...
8abramdemski
In my view, the misaligned behavior is split across the shoggoth and face in most cases. In the case which Daniel K anticipates, the Shoggoth is misaligned in the sense of explicitly (but honestly) plotting to deceive, while the face is misaligned in the sense of dishonestly concealing. So I am interpreting you as asking why we might expect that specific division of responsibilities, rather than some other divisions such as those you mention. One reason is that we already see this empirically in the o1 report.

However, to me, the distinction between your case 1 and 2 is not so important. Both scenarios you propose preserve the important thing: there is a mismatch between an honest CoT and a dishonest final answer, which would allow the deceptive final answer to be detected. There could be other cases where this doesn't happen; this method is not a guaranteed lie detector. To me, the point is that it will at least sometimes happen. A lie detector that only sometimes works is better than no lie detector at all.

If you give a kid a diary and try to avoid giving them any reason to think you're reading it, then the kid might write honestly in the diary and reveal some deceptions you wouldn't previously have known about. The kid might also do any number of other things. However, setting aside the ethics of snooping on your kid and lying to them about it, it's better than nothing.
4Daniel Kokotajlo
One reason is that the Shoggoth will be able to use lots of CoT to think, whereas the Face never gets to do that -- it just has to see what the Shoggoth did and then write an output. So the Face should be bad at reasoning/thinking/System 2. This doesn't by itself prevent the Face from being able to modify the correct solution to insert a backdoor, but it helps -- modifying a correct solution to insert a backdoor is sometimes trivial but often difficult, difficult in the 'you have to think about it for a bit' sense, and in those cases the Face just won't be able to do it.

Another reason is that the shoggoth and face will often need to coordinate to do the misaligned thing. E.g. even if the face is adept at inserting vulnerabilities, some proposed correct solutions will be easier to modify to insert vulnerabilities than others, and so both shoggoth and face will get higher scores if the shoggoth steers towards the solutions that are easier to modify to insert vulnerabilities. Moreover, by default it's easier for the face if the shoggoth is transparent about this -- if it says e.g. 'OK here's a draft solution but this seems like it might be difficult to modify to insert vulnerabilities, hold on let me think for a bit, ... ok try this one instead, it has easily-modifiable-feature-X' because that helps the Face do its job.

...I probably have other reasons which are just intuitions and hard to articulate right now, but thanks for the nudge, I'll try to think about it more.
anaguma50

I wonder if you could say more about how the training pipeline would work. E.g. if the reward model is applied to the outputs of the face, how do we train the Shoggoth to produce useful CoTs for the face? Is the idea to fix the Face, sample many CoTs from the Shoggoth, and fine-tune the Shoggoth on the ones which achieve high reward according to the Face's output? 

6Daniel Kokotajlo
Consider 'poor man's RL' aka expert iteration. Do a bunch of rollouts/trajectories, evaluate/score them, pick the top 30%, and then throw them into the hopper to train on (in the usual pretraining / imitation learning sense). So the Shoggoth would train on the shoggoth-outputs of the 30% most successful rollouts, and the Face would train on the face-outputs of the 30% most successful rollouts. More advanced RL algorithms... well, we'd have to take them on a case by case basis but I'm currently expecting many or even most of them to work just fine here.  
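A minimal sketch of that expert-iteration loop (all function names and the scoring are placeholders I've introduced just to make the data flow concrete, not anyone's actual pipeline):

```python
import random

def rollout(prompt):
    """Placeholder: run the Shoggoth to get a CoT, then the Face to get the visible answer."""
    cot = f"<shoggoth reasoning for {prompt!r}>"
    answer = f"<face answer for {prompt!r}>"
    return {"prompt": prompt, "cot": cot, "answer": answer}

def score(traj):
    """Placeholder reward model / task grader applied to the whole trajectory."""
    return random.random()

def expert_iteration_step(prompts, keep_frac=0.3):
    trajs = [rollout(p) for p in prompts]
    trajs.sort(key=score, reverse=True)
    keep = trajs[: max(1, int(len(trajs) * keep_frac))]

    # The Shoggoth imitates only the CoT portion of the best rollouts,
    # and the Face imitates only the answer portion of those same rollouts.
    shoggoth_data = [(t["prompt"], t["cot"]) for t in keep]
    face_data     = [(t["prompt"] + t["cot"], t["answer"]) for t in keep]
    return shoggoth_data, face_data

shoggoth_data, face_data = expert_iteration_step([f"task {i}" for i in range(10)])
print(len(shoggoth_data), len(face_data))
```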
anaguma32

So $1 buys 7e17 useful FLOPs, or inference with 75-120B[1] active parameters for 1 million tokens.

 

Is this right? My impression was that the 6ND (or 9.6 ND) estimate was for training, not inference. E.g. in the original scaling law paper, it states 
 

C ~ 6 NBS – an estimate of the total non-embedding training compute, where B is the batch size, and S is the number of training steps (ie parameter updates).

3Vladimir_Nesov
Yes, my mistake, thank you. Should be 2ND or something when not computing gradients. I'll track down the details shortly.
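For reference, the effect of that correction on the quoted number (a back-of-the-envelope calculation, taking the 7e17 useful FLOPs per dollar at face value):

```python
useful_flops_per_dollar = 7e17
tokens = 1e6

# Active parameters affordable for $1 of inference over 1M tokens,
# under C = k * N * D with k = 6 (training-style) vs k = 2 (forward pass only).
for k in (6, 2):
    n_active = useful_flops_per_dollar / (k * tokens)
    print(f"k={k}: ~{n_active / 1e9:.0f}B active parameters")
# k=6: ~117B  (close to the 75-120B quoted above)
# k=2: ~350B  (with the forward-pass-only estimate)
```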
anaguma60

An important caveat to the data movement limit:

“A recent paper which was published only a few days before the publication of our own work, Zhang et al. (2024), finds a scaling of B = 17.75 D^0.47 (in units of tokens). If we rigorously take this more aggressive scaling into account in our model, the fall in utilization is pushed out by two orders of magnitude; starting around 3e30 instead of 2e28. Of course, even more aggressive scaling might be possible with methods that Zhang et al. (2024) do not explore, such as using alternative optimizers.”

I haven’t l...

anaguma100

The post argues that there is a latency limit at 2e31 FLOP, and I've found it useful to put this scale into perspective. 

Current public models such as Llama 3 405B are estimated to be trained with ~4e25 FLOP, so such a model would require 500,000x more compute. Since Llama 3 405B was trained with 16,000 H-100 GPUs, the model would require 8 billion H-100 GPU equivalents, at a cost of $320 trillion with H-100 pricing (or ~$100 trillion if we use B-200s). Perhaps future hardware would reduce these costs by an order of magnitude, but this is cancelled...
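The arithmetic behind those numbers, for reference (the per-GPU price is a rough assumption of ~$40k per H100):

```python
latency_limit_flop = 2e31      # scale at which the post argues the latency limit binds
llama3_405b_flop   = 4e25      # estimated Llama 3 405B pretraining compute
llama3_405b_gpus   = 16_000    # H100s used for Llama 3 405B
h100_price         = 40_000    # assumed $/GPU

scale_up = latency_limit_flop / llama3_405b_flop   # 500,000x more compute
gpus     = llama3_405b_gpus * scale_up             # ~8e9 H100-equivalents
cost     = gpus * h100_price                       # ~$3.2e14, i.e. ~$320 trillion
print(f"{scale_up:.0f}x, {gpus:.1e} GPUs, ${cost:.1e}")
```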

anaguma12

Interesting paper, though the estimates here don’t seem to account for Epoch’s correction to the Chinchilla scaling laws: https://epochai.org/blog/chinchilla-scaling-a-replication-attempt

This would imply that the data movement bottleneck is a bit further out.

anaguma65

Maybe he thinks that much faster technological progress would cause social problems and thus wouldn’t be implemented by an aligned AI, even if it were possible. Footnote 2 points at this:

“ I do anticipate some minority of people’s reaction will be ‘this is pretty tame’… But more importantly, tame is good from a societal perspective. I think there’s only so much change people can handle at once, and the pace I’m describing is probably close to the limits of what society can absorb without extreme turbulence. ”

A separate part of the introduction argues that ...

anaguma60

I found this comment valuable, and it caused me to change my mind about how I think about misalignment/scheming examples. Thank you for writing it!

anaguma10

It’s still not trivial to finetune Llama 405B. You require ~16 bytes/parameter using Adam, plus activation memory, so a minimum of ~100 H100s.
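The back-of-the-envelope version (the 16 bytes/parameter figure covers weights, gradients, and Adam states in a typical mixed-precision setup; activation memory comes on top):

```python
params          = 405e9   # Llama 3.1 405B
bytes_per_param = 16      # bf16 weights + grads + fp32 Adam moments (mixed precision)
h100_memory     = 80e9    # bytes of HBM per H100

state_bytes = params * bytes_per_param   # ~6.5 TB of weight/optimizer state
min_gpus    = state_bytes / h100_memory  # ~81 H100s before counting activations
print(f"{state_bytes / 1e12:.1f} TB, >= {min_gpus:.0f} H100s (plus activation memory)")
```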

6gwern
There are lots of people working on it and offering or will be offering it. And even when they aren't offering true finetuning, it's still better: Snowflake (first hit in google for "Llama 405B finetuning") for example is making no bones about their single-node lightweight-finetuning being a LoRA, and is open sourcing code upfront so at least you know what it is now - instead of depending on borderline-gossip buried 40 minutes into a Youtube video months/years later.
anaguma70

A particularly notable section (pg. 19):

“The current implementation of The AI Scientist has minimal direct sandboxing in the code, leading to several unexpected and sometimes undesirable outcomes if not appropriately guarded against. For example, in one run, The AI Scientist wrote code in the experiment file that initiated a system call to relaunch itself, causing an uncontrolled increase in Python processes and eventually necessitating manual intervention. In another run, The AI Scientist edited the code to save a checkpoint for every update step, which t...

Do you believe that it has?

2gwern
I think they may have gotten it working for MDPs, but I see little evidence so far that they've gotten it for POMDPs, due to the general lack of explicit exploration. (Devin does ask questions but rarely, consistent with the base model capabilities and imitation learning.) The former would still be a big deal, however.
Answer by anaguma20

I’m rated about 2100 USCF and 2300 Lichess, and I’m open to any of the roles. I’m free on the weekend and weekdays after 3 pm Pacific. I’m happy to play any time control, including multi-month correspondence.