All of Steven Byrnes's Comments + Replies

I think the entire crux is that all of those robots/solar cell chips you referenced currently depend on human industry/modern civilization to actually work, and they'd quickly degrade and become non-functional on the order of weeks or months if modern civilization didn't exist, and this is arguably somewhat inevitable due to economics (until you can have tech that obviates the need for long supply chains).

OK, imagine (for simplicity) that all humans on Earth drop dead simultaneously, but there’s a John-von-Neumann-level AI on a chip connected to a solar pa... (read more)

Right, there’s a possible position which is: “I’ll accept for the sake of argument your claim there will be an egregiously misaligned ASI requiring very little compute (maybe ≲1 chip per human equivalent including continuous online learning), emerging into a world not terribly different from today’s. But even if so, that’s OK! While the ASI will be a much faster learner than humans, it will not magically know things that it has no way to have figured out (§1.8.1), and that includes developing nanotechnology. So it will be reliant on humans and human infras... (read more)

2Noosphere89
Basically this, and in particular I'm willing to grant the premise that for the sake of argument there is technology that eliminates the need for most logistics, but that all such technology will take at least a year or more of real-world experimentation, which means that the AI can't immediately take over.

On this: I think the entire crux is that all of those robots/solar cell chips you referenced currently depend on human industry/modern civilization to actually work, and they'd quickly degrade and become non-functional on the order of weeks or months if modern civilization didn't exist, and this is arguably somewhat inevitable due to economics (until you can have tech that obviates the need for long supply chains).

And in particular, in most takeover scenarios where AIs don't automate the economy first, I don't expect AIs to be able to keep producing robots for a very long time, and I'd bump it up to 300-3,000 years at minimum because there are fewer easily accessible resources, combined with AIs being much less capable due to having very little compute relative to modern civilization.

In particular, I think that disrupting modern civilization to a degree such that humans are disempowered (assuming no tech that obviates the need for logistics) pretty much as a consequence breaks the industries/logistics needed to fuel further AI growth, because there's no more trade, which utterly fucks up modern economies.

And your references argue that human civilization wouldn't go extinct very soon because of civilizational collapse, and that AIs can hack existing human industry to help them, and I do think this is correct (modulo the issue that defense is easier than offense for the cybersecurity realm specifically, and importantly, a key reason for this is that once you catch the AI doing it, there are major consequences for AIs and humans, which actually matter for AI safety): https://x.com/MaxNadeau_/status/1912568930079781015

I actually agree cyber-attacks to subvert h

(partly copying from other comment)

I would have assumed that this Python would be impossible to get right

I don’t think writing the reward function is doomed. For one thing, I think that the (alignment-relevant parts of the) human brain reward function is not super complicated, but humans at least sometimes have good values. For another (related) thing, if you define “human values” in an expansive way (e.g. answers to every possible Trolley Problem), then yes they’re complex, but a lot of the complexity comes from within-lifetime learning and thinking—and i... (read more)

My claim was “I think that, once this next paradigm is doing anything at all that seems impressive and proto-AGI-ish,[12] there’s just very little extra work required to get to ASI (≈ figuring things out much better and faster than humans in essentially all domains).”

I don’t think anything about human brains and their evolution cuts against this claim.

If your argument is “brain-like AGI will work worse before it works better”, then sure, but my claim is that you only get “impressive and proto-AGI-ish” when you’re almost done, and “before” can be ... (read more)

4Lukas Finnveden
To be clear: I'm not sure that my "supporting argument" above addressed an objection to Ryan that you had. It's plausible that your objections were elsewhere. But I'll respond with my view.

Ok, so this describes a story where there's a lot of work to get proto-AGI and then not very much work to get superintelligence from there. But I don't understand what's the argument for thinking this is the case vs. thinking that there's a lot of work to get proto-AGI and then also a lot of work to get superintelligence from there. Going through your arguments in section 1.7:

* "I think the main reason is what I wrote about the “simple(ish) core of intelligence” in §1.3 above."
  * But I think what you wrote about the simple(ish) core of intelligence in 1.3 is compatible with there being like (making up a number) 20 different innovations involved in how the brain operates, each of which gets you a somewhat smarter AI, each of which could be individually difficult to figure out. So maybe you get a few, you have proto-AGI, and then it takes a lot of work to get the rest.
  * Certainly the genome is large enough to fit 20 things.
  * I'm not sure if the "6-ish characteristic layers with correspondingly different neuron types and connection patterns, and so on" is complex enough to encompass 20 different innovations. Certainly seems like it should be complex enough to encompass 6.
  * (My argument above was that we shouldn't expect the brain to run an algorithm that only is useful once you have 20 hypothetical components in place, and does nothing beforehand. Because it was found via local search, so each of the 20 things should be useful on their own.)
* "Plenty of room at the top" — I agree.
* "What's the rate limiter?" — The rate limiter would be to come up with the thinking and experimenting needed to find the hypothesized 20 different innovations mentioned above. (What would you get if you only had some of the innovations? Maybe AGI that's incredibly expensive. O
  • In §2.4.1 I talk about learned reward functions.
  • In §2.3.5 I talk about whether or not there is such a thing as “RLVR done right” that doesn’t push towards scheming / treacherous turns. My upshot is:
    • I’m mildly skeptical (but don’t feel super-strongly) that you can do RLVR without pushing towards scheming at all.
    • I agree with you that there’s clearly room for improvement in making RLVR push towards scheming less on the margin.

much of this also applies to new architectures, since a major lab can apply RL via LLM-oriented tools to them

If the plan is what I call... (read more)

I’m worried about treacherous turns and such, and I don’t think any of the things you mentioned are relevant to that:

It's often the case that evaluation is easier than generation which would give the classifier an edge over the generator.

It’s not easy to evaluate whether an AI would exfiltrate a copy of itself onto the internet given the opportunity, if it doesn’t actually have the opportunity. Obviously you can (and should) try honeypots, but that’s a sanity-check not a plan, see e.g. Distinguishing test from training.

It's possible to make the classifie

... (read more)

In the post you say that human programmers will write the AI's reward function and there will be one step of indirection (and that the focus is the outer alignment problem).

That’s not quite my position.

Per §2.4.2, I think that both outer alignment (specification gaming) and inner alignment (goal misgeneralization) are real problems. I emphasized outer alignment more in the post, because my goal in §2.3–§2.5 was not quite “argue that technical alignment of brain-like AGI will be hard”, but more specifically “argue that it will be harder than most LLM-focuse... (read more)

5Stephen McAleese
Thank you for the reply! Ok but I still feel somewhat more optimistic about reward learning working. Here are some reasons:

* It's often the case that evaluation is easier than generation which would give the classifier an edge over the generator.
* It's possible to make the classifier just as smart as the generator: this is already done in RLHF today: the generator is an LLM and the reward model is also based on an LLM.
* It seems like there are quite a few examples of learned classifiers working well in practice:
  * It's hard to write spam that gets past an email spam classifier.
  * It's hard to jailbreak LLMs.
  * It's hard to write a bad paper that is accepted to a top ML conference or a bad blog post that gets lots of upvotes.

That said, from what I've read, researchers doing RL with verifiable rewards with LLMs (e.g. see the DeepSeek R1 paper) have only had success so far with rule-based rewards rather than learned reward functions. Quote from the DeepSeek R1 paper:

So I think we'll have to wait and see if people can successfully train LLMs to solve hard problems using learned RL reward functions in a way similar to RL with verifiable rewards.
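For concreteness, here is a minimal sketch of the second bullet above: a reward model that shares the generator's kind of backbone, trained on pairwise human preferences with a Bradley-Terry loss. The tiny GRU backbone is a stand-in assumption so the sketch is self-contained; a real RLHF reward model would reuse a pretrained LLM.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class RewardModel(nn.Module):
    """Toy reward model: an encoder backbone plus a scalar value head.
    In real RLHF the backbone is a pretrained LLM; a small GRU is used here
    only to keep the sketch self-contained."""
    def __init__(self, vocab_size=1000, d_model=64):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        self.encoder = nn.GRU(d_model, d_model, batch_first=True)
        self.value_head = nn.Linear(d_model, 1)

    def forward(self, token_ids):                       # (batch, seq_len) int64
        hidden, _ = self.encoder(self.embed(token_ids))
        return self.value_head(hidden[:, -1]).squeeze(-1)  # one score per sequence

def preference_loss(rm, chosen_ids, rejected_ids):
    """Bradley-Terry pairwise loss: push the score of the human-preferred
    completion above the score of the rejected completion."""
    return -F.logsigmoid(rm(chosen_ids) - rm(rejected_ids)).mean()
```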

Thanks!

This is a surprising prediction because it seems to run counter to Rich Sutton's bitter lesson which observes that, historically, general methods that leverage computation (like search and learning) have ultimately proven more effective than those that rely on human-designed cleverness or domain knowledge. The post seems to predict a reversal of this long-standing trend (or I'm just misunderstanding the lesson), where a more complex, insight-driven architecture will win out over simply scaling the current simple ones.

No, I’m also talking about “gene... (read more)

Part of the problem IMO is that both sides of that conversation seem to be mainly engaging in a debate over “yay Bengio” versus “boo Bengio”. That kind of debate (I call it “vibes-based meaningless argument”) is not completely decision-irrelevant, but IMO people in general are motivated to engage in it way out of proportion to its very slight decision-relevance†. Better to focus on what’s really decision-relevant here: mainly “what is the actual truth about reward tampering?” (and perhaps “shall I email the authors and bug them to rewrite that section?” or... (read more)

2Gunnar_Zarncke
Yes, that happens a lot. The question is then maybe a differential one: What is the responsibility of the author in a political post vs. one that tries to improve the discourse?

Good questions!

Do you have an idea of how the Steering Subsystem can tell that Zoe is trying to get your attention with her speech?

I think you’re thinking about that kinda the wrong way around.

You’re treating “the things that Zoe does when she wants to get my attention” as a cause, and “my brain reacts to that” as the effect.

But I would say that a better perspective is: everybody’s brain reacts to various cues (sound level, pitch, typical learned associations, etc.), and Zoe has learned through life experience how to get a person’s attention by tapping int... (read more)

Thanks!

Hmm, here’s a maybe-interesting example (copied from other comment):

If an ASI wants me to ultimately wind up with power, that’s a preference about the distant future, so its best bet might be to forcibly imprison me somewhere safe, gather maximum power for itself, and hand that power to me later on. Whereas if an ASI wants me to retain power continuously, then presumably the ASI would be corrigible to me. 

What’s happening is that this example is in the category “I want the world to continuously retain a certain property”. That’s a non-indexical... (read more)

3Jeremy Gillen
I agree that goals like this work well with self-modification and successors. I'd be surprised if Eliezer didn't. My issue is that you claimed that Eliezer believes AIs can only have goals about the distant future, and then contrasted your own views with this. It's strawmanning. And it isn't supported by any of the links you cite. I think you must have some mistaken assumption about Eliezer's views that is leading you to infer that he believes AIs must only have preferences over the distant future. But I can't tell what it is. One guess is: to you, corrigibility only looks hard/unnatural if preferences are very strictly about the far future, and otherwise looks fairly easy.

I would still call those preferences consequentialist, since the consequences are the primary factor that determines the actions. I.e. the behaviour is complicated, but in a way that is easy to explain once you know what the behaviour is aimed at achieving. They're even approximately long-term consequentialist, since the actions are (probably?) mostly aimed at the long-term future. The strict definition you call "pure consequentialism" is a good approximation or simplification of this, under some circumstances, like when value adds up over time and therefore the future is a bigger priority than the immediate present. No one I know has argued that AI or rational people can only care about the distant future. People spend money to visit a theme park sometimes, in spite of money being instrumentally convergent.

Some versions of that do have loopholes, but overall I think I agree that you could get a lot of stability that way. (But as far as I can tell, the versions with fewer loopholes look more like consequence-based goals rather than rules that say which kinds of local actions-sequences are good and bad). Yeah this is exactly what I had an issue with in my sibling discussion with Ryan. He seems to think {integrity,honesty,loyalty} are deontological, whe

I would think that reproducing cortical learning would require a good deal of work and experimentation, and I wouldn't expect working out the "algorithm" to happen all at once

I agree; working out the "algorithm" is already happening, and has been for decades. My claim instead is that by the time you can get the algorithm to do something importantly useful and impressive—something that LLMs and deep learning can’t already do much cheaper and better—then you’re almost at ASI. Note that we have not passed this threshold yet (no offense). See §1.7.1.

or to be v

... (read more)

In §1.8.1 I also mentioned going from zero to beyond-world-expert-level understanding of cryptocurrency over the course of 24 hours spent reading up on the topic and all its prerequisites, and playing with the code, etc.

And in §3.2 here I talked about a hypothetical AI that, by itself, is running the equivalent of the entire global human R&D enterprise, i.e. the AI is running hundreds of thousands of laboratory experiments simultaneously, 24 hours a day, publishing millions of papers a year, and synthesizing brilliant new insights in every field at onc... (read more)

1pmarc
With regards to the super-scientist AI (the global human R&D equivalent), wouldn't we see it coming based on the amount of resources it would need to hire? Are you claiming that it could reach the required AGI capacity in its "brain in a box in a basement" state, and only scale up its resource use afterwards? The part I'm most skeptical about remains this idea that the resource use to get to human-level performance is minimal if you just find the right algorithm, because at least in my view it neglects the evaluation step in learning, which can be resource intensive from the start and maybe can't be done "covertly".

That said, I want to stress that I agree with the conclusion:

But then, if AI researchers believe a likely scenario is:

Does that imply that the people who work on technical alignment, or at least their allies, need to also put effort to "win the race" for AGI? It seems the idea that "any small group could create this with no warning" could motivate acceleration in that race even from people who are well-meaning in terms of alignment.

Thanks! Here’s a partial response, as I mull it over.

Also, I'd note that the brain seems way more complex than LLMs to me!

See “Brain complexity is easy to overstate” section here.

basically all paradigms allow for mixing imitation with reinforcement learning

As in §2.3.2, if an LLM sees output X in context Y during pretraining, it will automatically start outputting X in context Y. Whereas if smart human Alice hears Bob say X in context Y, Alice will not necessarily start saying X in context Y. Instead she might say “Huh? Wtf are you talking about Bob?”

L... (read more)

5ryan_greenblatt
Sure, but I still think it's probably way more complex than LLMs even if we're just looking at the parts key for AGI performance (in particular, the parts which learn from scratch). And, my guess would be that performance is substantially degraded if you only take as much complexity as the core LLM learning algorithm. This isn't really what I'm imagining, nor do I think this is how LLMs work in many cases. In particular, LLMs can transfer from training on random github repos to being better in all kinds of different contexts. I think humans can do something similar, but have much worse memory. I think in the case of humans and LLMs, this is substantially subconscious/non-explicit, so I don't think this is well described as having a shoulder Bob. Also, I would say that humans do learn from imitation! (You can call it prediction, but it doesn't matter what you call it as long as it implies that data from humans makes things scale more continuously through the human range.) I just think that you can do better at this than humans based on the LLM case, mostly because humans aren't exposed to as much data. Also, I think the question is "can you somehow make use of imitation data" not "can the brain learning algorithm immediately make use of imitation"? Notably this analogy implies LLMs will be able to automate substantial fractions of human work prior to a new paradigm which (over the course of a year or two and using vast computational resources) beats the best humans. This is very different from the "brain in a basement" model IMO. I get that you think the analogy is imperfect (and I agree), but it seems worth noting that the analogy you're drawing suggests something very different from what you expect to happen. It's substantially proprietary, but you could consider looking at the Deepseek V3 paper. We don't actually have great understanding of the quantity and nature of algorithmic improvement after GPT-3. It would be useful for someone to do a more

The second line is about learning speed and wall-clock time. Of course AI can communicate and compute orders of magnitude faster than humans, but there are other limiting factors to learning rate. At some point, the AI has to go beyond the representations that can be found or simulated within the digital world and get its own data / do its own experiments in the outside world.

Yup, I addressed that in §1.8.1:

To be clear, the resulting ASI after those 0–2 years would not be an AI that already knows everything about everything. AGI and ASI (in my opinion

... (read more)
1pmarc
Maybe the problem is that we don't have a good metaphor for what the path for "rapidly shooting past human-level capability" is like in a general sense, rather than on a specific domain. One domain-specific metaphor you mention is AlphaZero, but games like chess are an unusual domain of learning for the AI, because it doesn't need any external input beyond the rules of the game and objective, and RL can proceed just by the program playing against itself. It's not clear to me how we can generalize the AlphaZero learning curve to problems that are not self-contained games like that, where the limiting factor may not be computing power or memory, but just the availability (and rate of acquisition) of good data to do RL on.  

Cool paper, thanks!

training a group of LLMs…

That arxiv paper isn’t about “LLMs”, right? Really, from my perspective, the ML models in that arxiv paper have roughly no relation to LLMs at all.

Is this a load-bearing part of your expectation of why transformer-based LLMs will hit a scaling wall?

No … I brought this up to make a narrow point about imitation learning (a point that I elaborate on much more in §2.3.2 of the next post), namely that imitation learning is present and very important for LLMs, and absent in human brains. (And that arxiv paper is unrela... (read more)

My current take is that the sandwich thing is such a big problem that it sinks the whole proposal. You can read my various comments on their lesswrong cross-posts: 1, 2 

Yeah, that’s part of it. Also maybe that it can have preferences about “state of the world right now and in the immediate future”, and not just “state of the world in the distant future”.

For example, if an ASI wants me to ultimately wind up with power, that’s a preference about the distant future, so its best bet might be to forcibly imprison me somewhere safe, gather maximum power for itself, and hand that power to me later on. Whereas if an ASI wants me to retain power continuously, then presumably the ASI would be corrigible. But “me retaining power” is... (read more)

I don’t think GPUs would be the best of all possible chip designs for the next paradigm, but I expect they’ll work well enough (after some R&D on the software side, which I expect would be done early on, during the “seemingly irrelevant” phase, see §1.8.1.1). It’s not like any given chip can run one and only one algorithm. Remember, GPUs were originally designed for processing graphics :) And people are already today running tons of AI algorithms on GPUs that are not deep neural networks (random example).
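As a trivial illustration of that flexibility (a sketch added for concreteness, not the linked example): a classic non-neural algorithm, one k-means assignment-and-update step, running on a GPU through a generic array library.

```python
import torch

# One k-means assignment/update step on a GPU: the chip is just doing parallel
# arithmetic; nothing ties it to deep neural networks. (Toy data; empty clusters
# are ignored for brevity.)
device = "cuda" if torch.cuda.is_available() else "cpu"
points = torch.rand(100_000, 8, device=device)                  # hypothetical data
centroids = points[torch.randperm(len(points), device=device)[:16]]

assignments = torch.cdist(points, centroids).argmin(dim=1)      # nearest centroid
new_centroids = torch.stack(
    [points[assignments == k].mean(dim=0) for k in range(len(centroids))]
)
```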

Can you expand your argument why LLMs will not reach AGI?

I’m generally not very enthusiastic about arguing with people about whether LLMs will reach AGI.

  • If I’m talking to someone unconcerned about x-risk, just trying to make ASI as fast as possible, then I sure don’t want to dissuade them from working on the wrong thing (see §1.6.1 and §1.8.4).
  • If I’m talking to someone concerned about LLM x-risk, and thus contingency planning for LLMs reaching AGI, then that seems like a very reasonable thing to do, and I would feel bad about dissuading them too. After all,
... (read more)

New large-scale learning algorithms can in principle be designed by (A) R&D (research taste, small-scale experiments, puzzling over the results, iterating, etc.), or (B) some blind search process. All the known large-scale learning algorithms in AI to date, from the earliest Perceptron to the modern Transformer, have been developed by (A), not (B). (Sometimes a few hyperparameters or whatever are set by blind search, but the bulk of the real design work in the learning algorithm has always come from intelligent R&D.) I expect that to remain the cas... (read more)

1Knight Lee
To be honest I'm very unsure about all of this. I agree that (B) never happened. Another way of saying this is that "algorithms for discovering algorithms" have only ever been written by humans, and never directly discovered by another "algorithm for discovering algorithms." The LLM+RL "algorithm for discovering algorithms" is far less powerful than the simple core of intelligence, but far more powerful than any other "algorithm for discovering algorithms" we ever had before, since it has discovered the algorithms for solving IMO-level math problems. Meanwhile, the simple core of intelligence may also be the easiest "algorithm for discovering algorithms" to discover (by another such algorithm). This is because evolution found it (and the entire algorithm fits inside the human genome), and the algorithm seems to be simple. The first time (B) happens may be the only time (B) happens (before superintelligence). I think it's both plausible that the simple core of intelligence is found by human researchers, and that it just emerges inside an LLM with much greater effective scale (due to being both bigger and more efficient), subject to much greater amounts of chain-of-thought RL.

I definitely don’t think we’ll get AGI by people scrutinizing the human genome and just figuring out what it’s doing, if that’s what you’re implying. I mentioned the limited size of the genome because it’s relevant to the complexity of what you’re trying to figure out, for the usual information-theory reasons (see 1, 2, 3).  “Machinery in the cell/womb/etc.” doesn’t undermine that info-theory argument because such machinery is designed by the genome. (I think the epigenome contains much much less design information than the genome, but someone can tel... (read more)

I think your comment is poorly worded, in that you’re stating certain trend extrapolations as facts rather than hypotheses. But anyway, yes my position is that LLMs (including groups of LLMs) will be unable to autonomously write a business plan and then found a company and grow it to $1B/year revenue, all with zero human intervention, 25 years from now.

6Raemon
The thing I actually expect is "LLMs with lots of RL training on diverse gamelike environments and problem sets, and some algorithmic tweaks". Do you not expect that to work, or just that by the time it does work, it will have evolved sufficiently beyond the current LLM paradigm that the resulting model will be better thought of as a new kind of thing?

My discussion in §2.4.1 is about making fuzzy judgment calls using trained classifiers, which is not exactly the same as making fuzzy judgment calls using LLMs or humans, but I think everything I wrote still applies.

“completely and resoundingly fucked” is mildly overstated but mostly “Yes, that's my position”, see §1.6.1, 1.6.2, 1.8.4.

It sounds like you're suggesting that inventing grammar is the convergent result of a general competency?

There are some caveats, but more-or-less, yeah. E.g. the language-processing parts of the cortex look pretty much the same as every other part of the neocortex. E.g. some people talk about how language is special because it has “recursion”, but in fact we can also handle “recursion” perfectly well in vision (e.g. we can recognize a picture inside a picture), planning (e.g. we can make a plan that incorporates a sub-plan), etc.

Yeah to some extent, although that can be a motivation problem as well as a capability problem. Depends on how large is the “large scale project”.

I think almost all humans can and do “autonomously” execute projects that are well beyond today’s LLMs. I picked a hard example (founding and growing a company to $1B/year revenue) just for clarity.

Random website says 10% of the USA workforce (and 50% of the global workforce!?) is self-employed.

I think a big difference between functional organizations and dysfunctional bureaucracies is that the employees at funct... (read more)

Thanks! It’s a bit hard for me to engage with this comment, because I’m very skeptical about tons of claims that are widely accepted by developmental psychologists, and you’re not.

So for example, I haven’t read your references, but I’m immediately skeptical of the claim that the cause of kids learning object permanence is “gradual exposure in cultural environments where adults consistently treat objects as permanent entities”. If people say that, what evidence could they have? Have any children been raised in cultural environments where adults don’t treat ... (read more)

1Jonas Hallgren
I will fold on the general point here; it is mostly the case that it doesn't matter and the motivations come from the steering sub-system anyhow, and that as a consequence it is foundationally different from how LLMs learn. I'm however not certain if I agree with this point: if you're in a fully cooperative game, is it your choice that you choose to cooperate? If you're an agent who uses functional or evidential decision theory and you choose to cooperate with yourself in a black box prisoner's dilemma, is that really a choice then? Like your initial imitations shape your steering system to some extent, and so there could be culturally learnt social drives, no? I think culture might be conditioning the initial states of your learning environment and that still might be an important part of how social drives are generated? I hope that makes sense and I apologise if it doesn't.

your argument here is modus tollens…

From my perspective you’re being kinda nitpicky, but OK sure, I have now reworded from:

  • “Remember, if the theories were correct and complete, the corresponding simulations would be able to do all the things that the real human cortex can do…”, to:
  • “Remember, if the theories were correct and complete, then they could be turned into simulations able to do all the things that the real human cortex can do…”

…and the “could” captures the fact that a simulation can also fail in other ways, e.g. you need to ensure adequa... (read more)

21a3orn
I think that's a characteristic of people talking about different things from within different basins of Traditions of Thought. The points one side makes seem either kinda obvious or weirdly nitpicky in a confusing and irritating way to people on the other side. Like to me, what I'm saying seems obviously central to the whole issue of high p-dooms genealogically descended from Yudkowsky, and confusions around this seem central to stories about high p-doom, rather than nitpicky and stupid. Thanks for amending though, I appreciate. :) The point about Nicaraguan Sign Language is cool as well.

I just reworded from “as a failed prediction” to “as evidence against Eliezer’s judgment and expertise”. I agree that the former was not a good summary, but am confident that the latter is what Paul intended to convey and expected his readers to understand, based on the context of disagreement 12 (which you quoted part but not all of). Sorry, thanks for checking.

I find your comment kinda confusing.

My best guess is: you thought that I was making a strong claim that there is no aspect of LLMs that resembles any aspect of human brains. But I didn’t say that (and don’t believe it). LLMs have lots of properties. Some of those LLM properties are similar to properties of human brains. Others are not. And I’m saying that “the magical transmutation of observations into behavior” is in the latter category.

Or maybe you’re saying that human hallucinations involve the “the magical transmutation of observations into behavior”? ... (read more)

1S. Alex Bradt
Right! Eh, maybe "observations into predictions into sensations" rather than "observations into behavior;" and "asking if you think" rather than "saying;" and really I'm thinking more about dreams than hallucinations, and just hoping that my understanding of one carries over to the other. (I acknowledge that my understanding of dreams, hallucinations, or both could be way off!) Joey Marcellino's comment said it better, and you left a good response there.
Steven Byrnes

Thanks! I suppose I didn’t describe it precisely, but I do think I’m pointing to a real difference in perspective, because if you ask this “LLM-focused AGI person” what exactly the R&D work entails, they’ll almost always describe something wildly different from what a human skill acquisition process would look like. (At least for the things I’ve read and people I’ve talked to; maybe that doesn’t generalize though?)

For example, if the task is “the AI needs to run a restaurant”, I’d expect the “LLM-focused AGI person” to talk about an R&D project tha... (read more)

6ryan_greenblatt
I agree there is a real difference, I just expect it to not make much of a difference to the bottom line in takeoff speeds etc. (I also expect some of both in the short timelines LLM perspective at the point of full AI R&D automation.) My view is that on hard tasks humans would also benefit from stuff like building explicit training data for themselves, especially if they had the advantage of "learn once, deploy many". I think humans tend to underinvest in this sort of thing. In the case of things like restaurant sim, the task is sufficiently easy that I expect AGI would probably not need this sort of thing (though it might still improve performance enough to be worth it). I expect that as AIs get smarter (perhaps beyond the AGI level) they will be able to match humans at everything without needing to do explicit R&D style learning in cases where humans don't need this. But, this sort of learning might still be sufficiently helpful that AIs are ongoingly applying it in all domains where increased cognitive performance has substantial returns. Sure, but we can still loosely evaluate sample efficiency relative to humans in cases where some learning happens (potentially including stuff like learning on the job). As in, how well can the AI learn from some data relative to humans. I agree that if humans aren't using learning in some task then this isn't meaningful (and this distinction between learning and other cognitive abilities is itself a fuzzy distinction).

I was talking about existing models in the literature of what the 6ish different layers of the cortex do and how. These models are so extremely basic that it’s obvious to everyone, including their authors, that they are incomplete and basically useless, except as a step towards a future better model. I am extremely confident that there is no possible training environment that would lead a collaborative group of these crappy toy models into inventing language, science, and technology from scratch, as humans were able to do historically.

Separately, when some... (read more)

41a3orn
I feel like we're failing to communicate. Let me recapitulate. So, your argument here is modus tollens:

1. If we had a "correct and complete" version of the algorithm running in the human cortex (and elsewhere) then the simulations would be able to do all that a human can do.
2. The simulations cannot do all that a human can do.

Therefore we do not, etc.

I'm questioning 1, by claiming that you need a good training environment + imitation of other entities in order for even the correct algorithm for the human brain to produce interesting behavior. You respond to this by pointing out that bright, intelligent, curious children do not need school to solve problems. And this is assuredly true. Yet: bright, intelligent, curious children still learned language and an enormous host of various high-level behaviors from imitating adults; they exist in a world with books and artifacts created by other people, from which they can learn; etc, etc. I'm aware of several brilliant people with relatively minimal conventional schooling; I'm aware of no brilliant people who were feral children. Saying that humans turn into problem solving entities without plentiful examples to imitate seems simply not true, and so I remain confident that 1 is a false claim, and the point that bright people exist without school is entirely compatible with this.

Maybe so, but that's a confidence that you have entirely apart from providing these crappy toy models an actual opportunity to do so. You might be right, but your argument here is still wrong. Humans did not, really, "invent" language, in the same way that Dijkstra invented an algorithm. The origin of language is subject to dispute, but it's probably something that happened over centuries or millennia, rather than all at once. So -- if you had an algorithm that could invent language from scratch, I don't think it's reasonable to expect it to do so unless you give it centuries or millennia of compute, in a richly textured environment where i

When you say “the effects of RL in LLMs”, do you mean RLHF, RLVR, or both?

2AnthonyC
I hadn't intended to specify, because I'm not completely sure, and I don't expect the analogy to hold that precisely. I'm thinking there are elements of both in both analogies.

Let’s distinguish “motivation theory” (savants spend a lot of time practicing X because they find it motivating, and get really good at X) from “learning algorithm hyperparameter theory” (savants have systematically different … ML learning rates? neural architectures (e.g. fiber densities, dendrite branching properties, etc.)? loss functions? etc.). (Needless to say, these are not mutually exclusive.)

I interpret your comment as endorsing motivation theory for explaining savants. Whereas it seems to me that at least for memory savants like Kim Peek (who mem... (read more)

1Bunthut
To clarify the question: I agree that there is variation in talent and that some very talented people can do things most could never. My question is, if you look at the distribution of talent among normal people, and then check how many standard deviations  out our savant candidate is, then what's the chance at least one person with that talent would exist? Basically, is this just the normal right tail that's expected from additive genetic reshuffling, or an "X-man".
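One way to put rough numbers on that question (a back-of-the-envelope sketch assuming a purely additive, normal talent distribution, a threshold of z standard deviations, and population size N):

```latex
\Pr(\text{at least one person} \ge z\sigma)
  \;=\; 1 - \Phi(z)^{N}
  \;\approx\; 1 - e^{-N\,(1-\Phi(z))}
% e.g. with N = 8 \times 10^9: z = 6 gives roughly 8 expected such people,
% while z = 7 gives roughly 0.01 expected, i.e. almost certainly nobody.
```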

Thanks!

I can think of a few different explanations:

  • Even extreme childhood abuse doesn't have a major effect on life outcomes.
    • (Including this one for completeness though I consider it obviously implausible.)
  • The level of abuse that would affect life outcomes is rare enough not to be picked up on in the studies.
  • The methodology of the studies creates a floor on the badness of outcomes that gets picked up; e.g. maybe adoptive parents are screened well enough to make the worst abuse not happen, and the people drawn from national twin registers and contacted to
... (read more)
2Kaj_Sotala
Thanks! Hmm, I think it might be good to sharpen the context a bit more, as I feel we might be slightly talking past each other.

The argument that I'm the most focused on questioning is, to be clear, one that you haven't made and which isn't in your writings on this topic. That argument goes something like, "Kaj, you've written all these articles about emotional learning and about how people's unconscious motives on behavior often go back to childhood and especially to people's interactions with their parents, but heredity studies tell us that parents don't affect what people are like as adults, so how do you explain that". And it gets a bit subtle since there are actually several different versions of that question:

1. "Therapy books sometimes give the impression that everything about a person's life is determined based on their childhood circumstances. How do you justify that, given twin studies?" - Very fair question! Some therapy books do give that impression, and such a claim is clearly incorrect. I'm not going to defend that claim. I think it's basically a result of selection bias. The people who got lucky enough with their genes that they make it through sucky childhoods without major issues don't see therapists, and then therapists write books that draw on their clinical experience based on clients that have been selected for having unlucky genes.
2. "Okay, but even if not everything about a person's issues is determined by their childhood circumstances, the therapy books still say that stuff like parental warmth is a major factor on a person's future psychology. But wouldn't that imply a bigger shared environment effect?" - Also a very fair question, and the thing that I'm the most interested in figuring out/explaining! And I'm trying to explain that with something like "maybe parents have counterintuitively different effects on different children, and also the specific psychological issues this may cause don't necessarily map linearly to the kinds o

Like I always say, the context in which you’re bringing up heritability matters. It seems that the context here is something like:

Some people say shared environment effects are ≈0 in twin & adoption studies, therefore we should believe “the bio-determinist child-rearing rule-of-thumb”. But in fact, parenting often involves treating different kids differently, so ‘shared environment effects are ≈0’ is irrelevant, and therefore we should reject “the bio-determinist child-rearing rule-of-thumb” after all.

If that’s the context, then I basically disagree. L... (read more)
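For reference, the "shared environment ≈ 0" claim is about the c² term in the standard twin-study variance decomposition; a minimal statement of that decomposition (Falconer's approximation, included here only as background):

```latex
% r_{MZ}, r_{DZ}: trait correlations for identical and fraternal twin pairs.
h^2 \approx 2\,(r_{MZ} - r_{DZ})  \quad\text{(additive genetics)}
c^2 \approx 2\,r_{DZ} - r_{MZ}    \quad\text{(shared environment)}
e^2 \approx 1 - r_{MZ}            \quad\text{(non-shared environment and noise)}
```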

4Kaj_Sotala
My context is most strongly the one where I'm trying to reconcile the claims from therapy vs. heredity. I know we did already agree on one particular mechanism by which they could be reconciled, but just that by itself doesn't feel like it would explain some of the therapy claims where very specific things seem to be passed on from parents. But yeah, I think that does roughly correspond to arguing over whether the bio-determinist child-rearing rule of thumb applies or not.

On one hand, this does make sense. On the other hand - as far as I know, even the researchers who argue for the strongest bio-determinist case will make the caveat that of course none of this applies to cases of sufficiently extreme abuse, which will obviously mess someone up. But... if that is in fact the case, shouldn't it by your argument show up as a shared environment effect? I can think of a few different explanations:

* Even extreme childhood abuse doesn't have a major effect on life outcomes.
  * (Including this one for completeness though I consider it obviously implausible.)
* The level of abuse that would affect life outcomes is rare enough not to be picked up on in the studies.
* The methodology of the studies creates a floor on the badness of outcomes that gets picked up; e.g. maybe adoptive parents are screened well enough to make the worst abuse not happen, and the people drawn from national twin registers and contacted to fill in surveys don't bother responding if their lives are so messed up they don't have the time or energy for that.
  * But at least studies that use national registers about e.g. incarceration should be able to control for this.
* There's something wrong about the correlation argument. When I asked Claude about this, it claimed that actually, studies done with national registers find a significant shared environment effect on antisocial behavior and criminality. It gave me this cite which reports a 26% shared environment effect on antisocial behavi

Suppose we had a CoT-style transcript of every thought, word, email and action by the founder of a successful startup over the course of several years of its founding, and used this for RL: then we'd get a reward signal every time they landed a funding round, sales went up significantly, a hire they made or contract they signed clearly worked out well, and so forth — not enough training data by itself for RL, but perhaps a useful contribution.

I don’t think this approach would lead to an AI that can autonomously come up with a new out-of-the-box innovative ... (read more)

3RogerDearnaley
It's unclear to me how one could fine-tune a high quality automated-CEO AI without such training sets (which I agree are impractical to gather — that was actually part of my point, though one might have access to, say, a CEO's email logs, diary, and meeting transcripts). Similarly, to train one using RL, one would need an accurate simulation environment that simulates a startup and all its employees, customers, competitors, and other world events — which also sounds rather impractical. In practice, I suspect we'll first train an AI assistant/advisor to CEOs, and then use that to gather the data to train an automated CEO model. Or else we'll train something so capable that it can generalize from more tractable training tasks to being a CEO, and do a better job than a human even on a task it hasn't been specifically trained on.

I was trying to argue in favor of:

CLAIM: there are AI capabilities things that cannot be done without RL training (or something functionally equivalent to RL training).

It seems to me that, whether this claim is true or false, it has nothing to do with alignment, right?

2RogerDearnaley
There are certainly things that it's easier to do with RL — whether it's ever an absolute requirement I'm less sure. One other commenter has implied that someone has proven that RL always has non-RL equivalent alternatives, but if that's the case I'm not familiar with the details — I'd love references to anything relevant to this, if anyone has them. My claim is that using RL to align an unaligned LLM smarter than us is likely to be impossible to do safely/reliably (and especially so for online RL), but that fortunately, aligning an LLM by pretraining or finetuning is possible, and logistically is not very different in difficulty from using offline RL. Functionally, it's basically equivalent to offline RL plus a satisficing approach to the rating that keeps the behavior inside the training distribution so avoids Goodharting issues.

Hmm, you’re probably right.

But I think my point would have worked if I had suggested a modified version of Go rather than chess?

2RogerDearnaley
There's not a lot of scope for aligned/unaligned behavior in Go (or chess): it's a zero-sum game, so I don't see how any Go plays could be labeled as aligned or unaligned. How about some complex tactical or simulation game that actually has a scope for aligned/unaligned or at least moral/immoral behavior? Ideally one where you are roleplaying as an AI, so aligned behavior is appropriate, or at least doing some sort of resource management or strategy task that might get assigned to an AI.

Let me give you a detailed prescription…

For example, people want AIs that can autonomously come up with a new out-of-the-box innovative business plan, and found the company, and grow it to $1B/year revenue, over the course of years, all with literally zero human intervention.

People are trying to build such AIs as we speak, and I don’t expect them to quit until they succeed (or until we all die from their attempt).

And it’s possible—if human brains (or groups of human brains) can do this, so can AI algorithms. But human brains involve (model-based) RL. It’s a... (read more)

3RogerDearnaley
Suppose we had a CoT-style transcript of every thought, word, email and action by the founder of a successful startup over the course of several years of its founding, and used this for RL: then we'd get a reward signal every time they landed a funding round, sales went up significantly, a hire they made or contract they signed clearly worked out well, and so forth — not enough training data by itself for RL, but perhaps a useful contribution. The SGD safety pretraining equivalent would be to include that transcript in the pretraining dataset (or, since such data is very rare and useful/high quality, perhaps an entrepreneurship-specific fine-tuning dataset). So far, very similar. You would also (likely AI-assisted) look through all of the transcript, and if you located any portions where the behavior was less wise or less moral/aligned than the behavior we'd like to see from an aligned AI-entrepreneur, label that portion with <|unaligned|> tags (or whatever), and perhaps also supplement it with commentary on subjects like why it is less wise/moral/aligned than the standards for an aligned AI, what should have been done instead, and speculations around the likely results of those counterfactual actions.
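A minimal sketch of that labelling step (the <|unaligned|> control token is taken from the comment above; the function, field names, and closing tag are hypothetical illustration, not a particular library's API):

```python
UNALIGNED_OPEN, UNALIGNED_CLOSE = "<|unaligned|>", "<|/unaligned|>"

def label_transcript(segments):
    """segments: list of dicts like {"text": str, "flagged": bool, "commentary": str or None}.
    Flagged spans are wrapped in control tags, optionally followed by reviewer
    commentary, and the result goes back into the pretraining / fine-tuning corpus."""
    labeled_parts = []
    for seg in segments:
        if seg["flagged"]:
            part = f"{UNALIGNED_OPEN}{seg['text']}{UNALIGNED_CLOSE}"
            if seg.get("commentary"):
                part += f"\n[commentary: {seg['commentary']}]"
        else:
            part = seg["text"]
        labeled_parts.append(part)
    return "\n".join(labeled_parts)
```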
8Zack_M_Davis
This isn't even hard. Just take a pre-2017 chess engine, and edit the rules code so that rooks and bishops can only move four spaces. You're probably already done: the core minimax search still works, α–β pruning still works, quiescence still works, &c. To be fair, the heuristic evaluation function won't be correct, but you could just ... make bishops and rooks be respectively worth 2.5 and 3.5 points instead of the traditional 3 and 5? Even if my guess at those point values is wrong, that should still be easily superhuman with 2017 algorithms on 2017 hardware. (Stockfish didn't incorporate neural networks until 2020.)
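To make the evaluation-side change concrete, a minimal sketch (a toy material counter with the adjusted piece values; the four-square limit on rook and bishop moves would be a similarly small edit inside the engine's move generator, which isn't shown here):

```python
# Modified piece values for the rule variant above: bishops 2.5, rooks 3.5
# instead of the traditional 3 and 5. The board is a toy mapping of
# square -> piece letter, uppercase for White, lowercase for Black
# (a hypothetical representation, not any particular engine's).
PIECE_VALUES = {"P": 1.0, "N": 3.0, "B": 2.5, "R": 3.5, "Q": 9.0, "K": 0.0}

def material_eval(board):
    """Material balance in pawns under the modified values; positive favors White."""
    score = 0.0
    for piece in board.values():
        value = PIECE_VALUES[piece.upper()]
        score += value if piece.isupper() else -value
    return score
```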

Right, but what I'm saying is that there's at least a possibility that RL is the only way to train a frontier system that's human-level or above.

In that case, if the alignment plan is "Well just don't use RL!", then that would be synonymous with "Well just don't build AGI at all, ever!". Right?

...And yeah sure, you can say that, but it would be misleading to call it a solution to inner alignment, if indeed that's the situation we're in.

6RogerDearnaley
Why would we have to use RL to do this? The problem of building a rater for RL closely resembles automating the labelling problem for preparing the dataset for SGD safety pretraining, except that for online RL the rater is harder: it has to run fast, it can't be human assisted, and it has to be able to cope with arbitrary adversarial shifts in the distribution being rated and do so well enough for it to not have exploitable flaws. A rater for (or at least attaching ratings to the episode set for) offline RL is less bad: it's an almost equivalent problem to labelling a dataset for SGD, just attaching a score rather than a binary classification. The primary difference is that for the security pretraining approach the behavior we're training into the model is a classifier that labels behavior either good or bad, so isn't prone to Goodharting when you run it and ask for output from just one of the two categories, whereas for offline RL we're training a policy that tries to maximize the goodness rating, so is prone to Goodharting when the gradient towards the very "best" behavior leads it outside the training distribution. (The reason the SGD-trained classifier is safe is closely related to the satisficing approach to avoid Goodhart's Law.) So from the rating and stability point of view online RL is more challenging than offline RL, which is more challenging than security pretraining SGD. Can you (or anyone) explain to me why there could be a problem that we can only solve using RL on rated examples, and could not do via SGD on labeled examples? Why do you think there at least a possibility that RL could be the only way to train a frontier system that's human-level or above? I'm not currently seeing any potential advantage of RL — other than the fact it induces distribution shifts, during training for online RL, or after it for offline RL, so doesn't require us to already know the distribution we want: but these distribution shifts are exactly the source of its danger.

There’s a potential failure mode where RL (e.g. RLVR or otherwise) is necessary to get powerful capabilities. Right?

I for one don’t really care about whether the LLMs of May 2025 are aligned or not, because they’re not that capable. E.g. they would not be able to autonomously write a business plan and found a company and grow it to $1B/year of revenue. So something has to happen between now and then to make AI more capable. And I for one expect that “something” to involve RL, for better or worse (well, mostly worse). I’ve been saying that RL is necessary f... (read more)

3RogerDearnaley
My concern is that, if you're using RL to train a frontier system that's human-level or above, for alignment or capabilities purposes, it will inevitably find ways to abuse flaws in our RL rating system. One exception might be if the RL is for some capability like reasoning to produce a proof that passes proof checking, where it might be possible to create a rating system that actually has no flaws to exploit. I don't see how we could do that for RL for alignment, however.

(Thanks for your patient engagement!)

If you believe

  • it is probably true that future pure imitation learning techniques can capture the process by which humans figure out new scientific ideas over millions of seconds, AND
  • it is “certainly false” that future pure imitation learning techniques can capture the process by which AlphaZero figures out new chess strategies over millions of games

then I’m curious what accounts for the difference, in your mind?

More detail, just to make sure we’re on the same page: The analogy I’m suggesting is:

(A1) AlphaZero goes from ... (read more)

2Cole Wyeth
I didn’t realize you intended A3 to refer to future imitation learning systems. In that case, yes, it will work. You might have to use some tricks similar to gwern’s suggestions - e.g. the imitation learner should (for fair comparison) also have access to the simulation platform that AlphaZero uses, and would have to play about as many games as AlphaZero plays. But it does not have to do the same search and policy distillation training process that AlphaZero does.

Hmm, I don’t particularly disagree with anything you wrote. I think you’re misunderstanding the context of this conversation.

I wasn’t bringing up tree search because I think tree search is required for AGI. (I don’t think that.)

Rather, I was making a point that there will need to be some system that updates the weights (not activations) of an AGI as it runs, just as adult humans learn and figure out new things over time as they work on a project.

What is this system that will update the weights? I have opinions, but in general, there are lots of possible ap... (read more)

2Cole Wyeth
That’s not what I have in mind, see my most recent reply.  Also, I am not sure that removing the imitation learning step would actually “destroy my whole plan.” It would perhaps prevent it from scaling past a certain point, but I think we would still be left in a much more tractable position. 

Great, glad we agree on that!

Next: If we take an “agent trained through imitation learning”, and glue on a “solution to efficient, online continual learning”, then the result (after it runs a while) is NOT

“an agent trained through imitation learning”,

but rather

“an agent that is partly trained through imitation learning, and partly trained through [however the online continual learning works]”.

Right?

And now your proposal requires an assumption that this online continual learning system, whatever it is, does not undermine the agent’s alignment. Right?

4Cole Wyeth
I’m not suggesting an agent that is partly trained through imitation learning, and then partly trained through continual learning on some other objective. I am suggesting an agent that is trained solely through imitation learning, using improved algorithms that more faithfully imitate humans over longer timescales, including by learning because humans learn - but by learning as humans learn! I think that the obstacles to doing this are very similar to the obstacles to continual learning in LLMs, though they are not exactly the same, and it’s certainly conceivable that LLM algorithms for continual learning will be invented which are not transferable to pure imitation learning. In particular, LLMs may start some kind of feedback loop of recursive self-improvement before faithful imitation learning becomes technically feasible. However, I see no fundamental reason to expect that is the only or most likely path. And all alignment plans are sunk by recursive self-improvement happening tomorrow. Explicitly, LLMs are not perfect assistants or agents because their in-context learning is limited. This problem is not specific to fine tuned models though - even base models have limited in-context learning. The most direct solutions to this problem would allow them to perform in-context learning with the same objective as they already do (sequence prediction) but for longer. The analogue of this for imitation learning should similarly perform imitation learning, and then imitate faithfully for longer - including “in-context” learning as necessary. 

Yes distilling a snapshot of AlphaZero is easy. The hard part is distilling the process by which AlphaZero improves—not just bootstrapping from nothing, but also turning an Elo-2500 AlphaZero into an Elo-3500 AlphaZero.

Is this a way to operationalize our disagreement?

CLAIM:

Take AlphaZero-chess and train it (via self-play RL as usual) from scratch to Elo 2500 (grandmaster level), but no further.

Now take a generic DNN like a transformer. Give it training data showing how AlphaZero-in-training developed from Elo 0, to Elo 1, … to Elo 1000, to Elo 1001, to Elo

... (read more)
7gwern
As I've said before, I think you greatly overrate the difficulty of putting search into neural nets, and this is an example of it. It seems to me like it is entirely possible to make a generic LLM implement an equivalent to AlphaZero and be capable of expert iteration, without an elaborate tree scaffolding. A tree search is just another algorithm which can be reified as a sequence, like all algorithms (because they are implemented on a computer). All AlphaZero is, is a way of doing policy iteration/Newton updates by running a game state forward for a few plies, evaluating, and updating estimates. It's not magic, and can obviously be encoded into a LLM's generative process. Here's a concrete example of how in-principle I think a LLM can do AlphaZero-style expert iteration for Go: A LLM can serialize a board with value estimates as simply a few hundred tokens (361 points, 361 value estimates, miscellaneous metadata); this means in a frontier LLM like Claude-4-opus with 200k ctx, you can fit in easily 200 board states; so you can serialize out the lookahead of a bunch of possible moves and resulting board states (eg. take the top 14 moves and imagine the resulting board state and then imagine their next 14 top moves, for comparison, TD-Gammon looked forward like 1 move); and can back-propagate an updated value estimate, and spit out the original board state with better value estimates. "Move #4 was better than it looked, so I will +0.01 to the value estimate for it." This improved board is now in context, and can be dynamically-evaluated to update the LLM: now it has to predict the new board state with the final improved estimates, and that improves the policy. The LLM finishes by setting up the next planning step: pick a deeper board state to evaluate next, and if the next board state is the end of the game, then it starts over with a fresh game. Run this indefinitely. It repeatedly iterates through a possible game, evaluating each position to a certain depth, upda
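A minimal sketch of the value back-up step in that loop (toy and self-contained; `state`, `evaluate`, and `apply_move` are hypothetical stand-ins for the serialized board, the LLM's value read-out, and the game rules):

```python
def backup_value(state, candidate_moves, evaluate, apply_move, step_size=0.01):
    """One expert-iteration update: look ahead one ply over a handful of candidate
    moves, take the best child value found, and nudge the parent state's value
    estimate toward it. The improved state is then re-serialized into the context
    as a new prediction target for the model."""
    child_values = [evaluate(apply_move(state, move)) for move in candidate_moves]
    best = max(child_values)
    state.value += step_size * (best - state.value)  # "that move was better than it looked"
    return state
```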
4Cole Wyeth
The claim is certainly false.  Before LLMs reach AGI, someone will have to solve efficient, online continual learning. This is an open technical problem, which is why I doubt that the current paradigm scales to superintelligence. It seems that an appropriate solution for general-purpose agents would also lead to a solution for agents trained through imitation learning. 

You could equally well say: “AlphaZero learns, therefore a sufficiently good imitation of AlphaZero should also learn”. Right? But let’s think about what that would entail.

AlphaZero learns via a quite complicated algorithm involving tracking the state of a Go board through self-play, and each step of the self-play involves a tree search with thousands of queries to a 30M-parameter ConvNet, and then at the end of the game a Go engine is called to see who won and then there’s a set of gradient descent steps on that 30M-parameter ConvNet. Then repeat that who... (read more)

4Cole Wyeth
I think your intuitions here are highly misguided. I don’t agree with your conclusions about AlphaZero at all. You could easily train a model by distilling AlphaZero. All the complicated steps are only necessary to bootstrap from nothing. 

It’s possible that “imitation learning will not generalize sufficiently well OOD” is an unsolvable problem, right? (In fact, my belief is that it’s unsolvable, at least in practice, if we include “humans learning new things over the course of years” as part of the definition of what constitutes successful OOD generalization.)

But if it is unsolvable problem, it would not follow that “models will never gain the ability to generalize OOD”, nor would it follow that AGI will never be very powerful and scary.

Rather, it would follow that imitation learning models... (read more)

4Cole Wyeth
I see no reason to think imitation learning is particularly unable to generalize OOD.  After all, humans learn. A sufficiently good imitation of a human should also learn. Perhaps you are simply imagining imitation learning on a too-restricted dataset. 