Right, there’s a possible position which is: “I’ll accept for the sake of argument your claim there will be an egregiously misaligned ASI requiring very little compute (maybe ≲1 chip per human equivalent including continuous online learning), emerging into a world not terribly different from today’s. But even if so, that’s OK! While the ASI will be a much faster learner than humans, it will not magically know things that it has no way to have figured out (§1.8.1), and that includes developing nanotechnology. So it will be reliant on humans and human infras...
(partly copying from other comment)
I would have assumed that this Python would be impossible to get right
I don’t think writing the reward function is doomed. For one thing, I think that the (alignment-relevant parts of the) human brain reward function is not super complicated, but humans at least sometimes have good values. For another (related) thing, if you define “human values” in an expansive way (e.g. answers to every possible Trolley Problem), then yes they’re complex, but a lot of the complexity comes from within-lifetime learning and thinking—and i...
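To make the “not super complicated” claim a bit more concrete, here is the kind of shape I have in mind: a short hand-written function combining a few innate drives with a learned social-approval signal. (Toy sketch only; the specific drives, weights, and classifier are made-up placeholders, and getting the real details right is exactly the open problem.)

```python
# Illustrative sketch only: a hand-written reward function combining a few
# innate-drive-like terms with a learned social signal. The drive names,
# weights, and the approval classifier are hypothetical placeholders, not a
# proposal for the actual reward function.
from dataclasses import dataclass

@dataclass
class WorldState:
    pain_signal: float        # 0..1, from hardware / "body" monitors
    novelty: float            # 0..1, from the learning subsystem
    human_approval: float     # 0..1, output of a learned classifier

def reward(state: WorldState) -> float:
    """Scalar reward sent from the Steering Subsystem to the Learning Subsystem."""
    return (-3.0 * state.pain_signal       # innate: avoid damage
            + 0.5 * state.novelty          # innate: mild curiosity drive
            + 2.0 * state.human_approval)  # learned/social: approval signal

print(reward(WorldState(pain_signal=0.0, novelty=0.4, human_approval=0.9)))  # 2.0
```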
My claim was “I think that, once this next paradigm is doing anything at all that seems impressive and proto-AGI-ish,[12] there’s just very little extra work required to get to ASI (≈ figuring things out much better and faster than humans in essentially all domains).”
I don’t think anything about human brains and their evolution cuts against this claim.
If your argument is “brain-like AGI will work worse before it works better”, then sure, but my claim is that you only get “impressive and proto-AGI-ish” when you’re almost done, and “before” can be ...
much of this also applies to new architectures, since a major lab can apply RL via LLM-oriented tools to them
If the plan is what I call...
I’m worried about treacherous turns and such, and I don’t think any of the things you mentioned are relevant to that:
It's often the case that evaluation is easier than generation, which would give the classifier an edge over the generator.
It’s not easy to evaluate whether an AI would exfiltrate a copy of itself onto the internet given the opportunity, if it doesn’t actually have the opportunity. Obviously you can (and should) try honeypots, but that’s a sanity check, not a plan; see e.g. Distinguishing test from training.
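For what it’s worth, the evaluation-vs-generation asymmetry is real in many domains; here’s a toy illustration (subset-sum: checking a proposed answer is cheap, finding one is worst-case exponential). My point is just that the asymmetry doesn’t help when the behavior you care about, like exfiltration-given-opportunity, never arises in the situations you can actually evaluate.

```python
# Toy illustration of "evaluation is easier than generation":
# checking a subset-sum certificate is cheap; finding one is expensive.
from itertools import combinations

def verify(nums, target, certificate):
    """Evaluation: a quick check of a proposed answer."""
    return all(x in nums for x in certificate) and sum(certificate) == target

def generate(nums, target):
    """Generation: brute-force search, exponential in len(nums)."""
    for r in range(1, len(nums) + 1):
        for combo in combinations(nums, r):
            if sum(combo) == target:
                return list(combo)
    return None

nums = [3, 34, 4, 12, 5, 2]
print(generate(nums, 9))        # [4, 5], found by exhaustive search
print(verify(nums, 9, [4, 5]))  # True, verified in one pass
```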
...It's possible to make the classifie
In the post you say that human programmers will write the AI's reward function and there will be one step of indirection (and that the focus is the outer alignment problem).
That’s not quite my position.
Per §2.4.2, I think that both outer alignment (specification gaming) and inner alignment (goal misgeneralization) are real problems. I emphasized outer alignment more in the post, because my goal in §2.3–§2.5 was not quite “argue that technical alignment of brain-like AGI will be hard”, but more specifically “argue that it will be harder than most LLM-focuse...
Thanks!
This is a surprising prediction because it seems to run counter to Rich Sutton's bitter lesson, which observes that, historically, general methods that leverage computation (like search and learning) have ultimately proven more effective than those that rely on human-designed cleverness or domain knowledge. The post seems to predict a reversal of this long-standing trend (or I'm just misunderstanding the lesson), where a more complex, insight-driven architecture will win out over simply scaling the current simple ones.
No, I’m also talking about “gene...
Part of the problem IMO is that both sides of that conversation seem to be mainly engaging in a debate over “yay Bengio” versus “boo Bengio”. That kind of debate (I call it “vibes-based meaningless argument”) is not completely decision-irrelevant, but IMO people in general are motivated to engage in it way out of proportion to its very slight decision-relevance†. Better to focus on what’s really decision-relevant here: mainly “what is the actual truth about reward tampering?” (and perhaps “shall I email the authors and bug them to rewrite that section?” or...
Good questions!
Do you have an idea of how the Steering Subsystem can tell that Zoe is trying to get your attention with her speech?
I think you’re thinking about that kinda the wrong way around.
You’re treating “the things that Zoe does when she wants to get my attention” as a cause, and “my brain reacts to that” as the effect.
But I would say that a better perspective is: everybody’s brain reacts to various cues (sound level, pitch, typical learned associations, etc.), and Zoe has learned through life experience how to get a person’s attention by tapping int...
Thanks!
Hmm, here’s a maybe-interesting example (copied from other comment):
If an ASI wants me to ultimately wind up with power, that’s a preference about the distant future, so its best bet might be to forcibly imprison me somewhere safe, gather maximum power for itself, and hand that power to me later on. Whereas if an ASI wants me to retain power continuously, then presumably the ASI would be corrigible to me.
What’s happening is that this example is in the category “I want the world to continuously retain a certain property”. That’s a non-indexical...
I would think that reproducing cortical learning would require a good deal of work and experimentation, and I wouldn't expect working out the "algorithm" to happen all at once
I agree; working out the "algorithm" is already happening, and has been for decades. My claim instead is that by the time you can get the algorithm to do something importantly useful and impressive—something that LLMs and deep learning can’t already do much cheaper and better—then you’re almost at ASI. Note that we have not passed this threshold yet (no offense). See §1.7.1.
...or to be v
In §1.8.1 I also mentioned going from zero to beyond-world-expert-level understanding of cryptocurrency over the course of 24 hours spent reading up on the topic and all its prerequisites, and playing with the code, etc.
And in §3.2 here I talked about a hypothetical AI that, by itself, is running the equivalent of the entire global human R&D enterprise, i.e. the AI is running hundreds of thousands of laboratory experiments simultaneously, 24 hours a day, publishing millions of papers a year, and synthesizing brilliant new insights in every field at onc...
Thanks! Here’s a partial response, as I mull it over.
Also, I'd note that the brain seems way more complex than LLMs to me!
See “Brain complexity is easy to overstate” section here.
basically all paradigms allow for mixing imitation with reinforcement learning
As in §2.3.2, if an LLM sees output X in context Y during pretraining, it will automatically start outputting X in context Y. Whereas if smart human Alice hears Bob say X in context Y, Alice will not necessarily start saying X in context Y. Instead she might say “Huh? Wtf are you talking about, Bob?”
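To spell out why the LLM “automatically” does this: the pretraining objective literally scores the model on reproducing X given context Y, so copying the training distribution is the thing being optimized. A minimal sketch of that objective (toy model and random data, assuming PyTorch is available):

```python
# Minimal sketch of the imitation / next-token objective: the loss directly
# rewards outputting X when the context is Y, because that is what the data did.
# Toy model and random data; assumes PyTorch.
import torch
import torch.nn as nn

vocab_size, d = 100, 32
model = nn.Sequential(
    nn.Embedding(vocab_size, d),
    nn.Flatten(),                      # context of 4 tokens -> one flat vector
    nn.Linear(4 * d, vocab_size),      # predict the next token
)

context_Y = torch.randint(0, vocab_size, (8, 4))   # batch of contexts "Y"
observed_X = torch.randint(0, vocab_size, (8,))    # the tokens "X" that followed

logits = model(context_Y)
loss = nn.functional.cross_entropy(logits, observed_X)  # minimized by predicting X given Y
loss.backward()
```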
L...
The second line is about learning speed and wall-clock time. Of course AI can communicate and compute orders of magnitude faster than humans, but there are other limiting factors to learning rate. At some point, the AI has to go beyond the representations that can be found or simulated within the digital world and get its own data / do its own experiments in the outside world.
Yup, I addressed that in §1.8.1:
...To be clear, the resulting ASI after those 0–2 years would not be an AI that already knows everything about everything. AGI and ASI (in my opinion
Cool paper, thanks!
training a group of LLMs…
That arxiv paper isn’t about “LLMs”, right? Really, from my perspective, the ML models in that arxiv paper have roughly no relation to LLMs at all.
Is this a load-bearing part of your expectation of why transformer-based LLMs will hit a scaling wall?
No … I brought this up to make a narrow point about imitation learning (a point that I elaborate on much more in §2.3.2 of the next post), namely that imitation learning is present and very important for LLMs, and absent in human brains. (And that arxiv paper is unrela...
Yeah, that’s part of it. Also maybe that it can have preferences about “state of the world right now and in the immediate future”, and not just “state of the world in the distant future”.
For example, if an ASI wants me to ultimately wind up with power, that’s a preference about the distant future, so its best bet might be to forcibly imprison me somewhere safe, gather maximum power for itself, and hand that power to me later on. Whereas if an ASI wants me to retain power continuously, then presumably the ASI would be corrigible. But “me retaining power” is...
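One way to see the structural difference: a distant-future preference scores only the final state, while “retain power continuously” scores every timestep along the way. A purely illustrative sketch:

```python
# Illustrative only: the structural difference between a distant-future
# preference and a "continuously maintain this property" preference.
def utility_distant_future(trajectory):
    """Cares only about how things end up."""
    return 1.0 if trajectory[-1]["steve_has_power"] else 0.0

def utility_continuous(trajectory):
    """Cares about the property holding at every timestep."""
    return sum(1.0 for state in trajectory if state["steve_has_power"]) / len(trajectory)

# A plan that imprisons me for 99 steps and hands me power at the very end
# scores perfectly on the first objective but terribly on the second.
imprison_then_restore = [{"steve_has_power": False}] * 99 + [{"steve_has_power": True}]
print(utility_distant_future(imprison_then_restore))  # 1.0
print(utility_continuous(imprison_then_restore))      # 0.01
```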
I don’t think GPUs would be the best of all possible chip designs for the next paradigm, but I expect they’ll work well enough (after some R&D on the software side, which I expect would be done early on, during the “seemingly irrelevant” phase, see §1.8.1.1). It’s not like any given chip can run one and only one algorithm. Remember, GPUs were originally designed for processing graphics :) And people are already today running tons of AI algorithms on GPUs that are not deep neural networks (random example).
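For example, here’s a sketch of a non-deep-learning algorithm (plain k-means clustering) written against generic GPU tensor ops; it assumes a CUDA-capable PyTorch install but falls back to CPU, and nothing about the chip restricts it to deep neural networks.

```python
# Sketch: k-means clustering (not a deep neural network) running on a GPU
# via generic tensor operations. Assumes PyTorch; falls back to CPU if no GPU.
import torch

device = "cuda" if torch.cuda.is_available() else "cpu"
X = torch.randn(10_000, 16, device=device)                        # data points
centroids = X[torch.randperm(len(X), device=device)[:8]].clone()  # k = 8 initial centroids

for _ in range(20):
    dists = torch.cdist(X, centroids)   # pairwise distances
    assign = dists.argmin(dim=1)        # nearest centroid per point
    for k in range(len(centroids)):
        members = X[assign == k]
        if len(members) > 0:
            centroids[k] = members.mean(dim=0)

print(centroids.shape)  # torch.Size([8, 16])
```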
Can you expand your argument for why LLMs will not reach AGI?
I’m generally not very enthusiastic about arguing with people about whether LLMs will reach AGI.
New large-scale learning algorithms can in principle be designed by (A) R&D (research taste, small-scale experiments, puzzling over the results, iterating, etc.), or (B) some blind search process. All the known large-scale learning algorithms in AI to date, from the earliest Perceptron to the modern Transformer, have been developed by (A), not (B). (Sometimes a few hyperparameters or whatever are set by blind search, but the bulk of the real design work in the learning algorithm has always come from intelligent R&D.) I expect that to remain the cas...
I definitely don’t think we’ll get AGI by people scrutinizing the human genome and just figuring out what it’s doing, if that’s what you’re implying. I mentioned the limited size of the genome because it’s relevant to the complexity of what you’re trying to figure out, for the usual information-theory reasons (see 1, 2, 3). “Machinery in the cell/womb/etc.” doesn’t undermine that info-theory argument because such machinery is designed by the genome. (I think the epigenome contains much much less design information than the genome, but someone can tel...
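(For the sake of the info-theory point, here’s the back-of-the-envelope upper bound, using standard ballpark figures that aren’t from the post itself:)

```python
# Back-of-the-envelope upper bound on innate "design information" in the genome.
# Standard ballpark figures; illustrative only.
base_pairs = 3.1e9          # approximate human genome length
bits_per_base = 2           # 4 possible bases -> 2 bits each
megabytes = base_pairs * bits_per_base / 8 / 1e6
print(f"~{megabytes:.0f} MB upper bound, before any compression")  # ~775 MB
```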
I think your comment is poorly worded, in that you’re stating certain trend extrapolations as facts rather than hypotheses. But anyway, yes my position is that LLMs (including groups of LLMs) will be unable to autonomously write a business plan and then found a company and grow it to $1B/year revenue, all with zero human intervention, 25 years from now.
My discussion in §2.4.1 is about making fuzzy judgment calls using trained classifiers, which is not exactly the same as making fuzzy judgment calls using LLMs or humans, but I think everything I wrote still applies.
“completely and resoundingly fucked” is mildly overstated but mostly “Yes, that's my position”, see §1.6.1, 1.6.2, 1.8.4.
It sounds like you're suggesting that inventing grammar is the convergent result of a general competency?
There are some caveats, but more-or-less, yeah. E.g. the language-processing parts of the cortex look pretty much the same as every other part of the neocortex. E.g. some people talk about how language is special because it has “recursion”, but in fact we can also handle “recursion” perfectly well in vision (e.g. we can recognize a picture inside a picture), planning (e.g. we can make a plan that incorporates a sub-plan), etc.
Yeah to some extent, although that can be a motivation problem as well as a capability problem. Depends on how large is the “large scale project”.
I think almost all humans can and do “autonomously” execute projects that are well beyond today’s LLMs. I picked a hard example (founding and growing a company to $1B/year revenue) just for clarity.
Random website says 10% of the USA workforce (and 50% of the global workforce!?) is self-employed.
I think a big difference between functional organizations and dysfunctional bureaucracies is that the employees at funct...
Thanks! It’s a bit hard for me to engage with this comment, because I’m very skeptical about tons of claims that are widely accepted by developmental psychologists, and you’re not.
So for example, I haven’t read your references, but I’m immediately skeptical of the claim that the cause of kids learning object permanence is “gradual exposure in cultural environments where adults consistently treat objects as permanent entities”. If people say that, what evidence could they have? Have any children been raised in cultural environments where adults don’t treat ...
your argument here is modus tollens…
From my perspective you’re being kinda nitpicky, but OK sure, I have now reworded from:
…and the “could” captures the fact that a simulation can also fail in other ways, e.g. you need to ensure adequa...
I just reworded from “as a failed prediction” to “as evidence against Eliezer’s judgment and expertise”. I agree that the former was not a good summary, but am confident that the latter is what Paul intended to convey and expected his readers to understand, based on the context of disagreement 12 (which you quoted part but not all of). Sorry, thanks for checking.
I find your comment kinda confusing.
My best guess is: you thought that I was making a strong claim that there is no aspect of LLMs that resembles any aspect of human brains. But I didn’t say that (and don’t believe it). LLMs have lots of properties. Some of those LLM properties are similar to properties of human brains. Others are not. And I’m saying that “the magical transmutation of observations into behavior” is in the latter category.
Or maybe you’re saying that human hallucinations involve the “the magical transmutation of observations into behavior”? ...
Well then so much the worse for “the Surfing Uncertainty theory of the brain”! :)
See my post Why I’m not into the Free Energy Principle, especially §8: It’s possible to want something without expecting it, and it’s possible to expect something without wanting it.
Thanks! I suppose I didn’t describe it precisely, but I do think I’m pointing to a real difference in perspective, because if you ask this “LLM-focused AGI person” what exactly the R&D work entails, they’ll almost always describe something wildly different from what a human skill acquisition process would look like. (At least for the things I’ve read and people I’ve talked to; maybe that doesn’t generalize though?)
For example, if the task is “the AI needs to run a restaurant”, I’d expect the “LLM-focused AGI person” to talk about an R&D project tha...
I was talking about existing models in the literature of what the 6ish different layers of the cortex do and how. These models are so extremely basic that it’s obvious to everyone, including their authors, that they are incomplete and basically useless, except as a step towards a future better model. I am extremely confident that there is no possible training environment that would lead a collaborative group of these crappy toy models into inventing language, science, and technology from scratch, as humans were able to do historically.
Separately, when some...
When you say “the effects of RL in LLMs”, do you mean RLHF, RLVR, or both?
Let’s distinguish “motivation theory” (savants spend a lot of time practicing X because they find it motivating, and get really good at X) from “learning algorithm hyperparameter theory” (savants have systematically different … ML learning rates? neural architectures (e.g. fiber densities, dendrite branching properties, etc.)? loss functions? etc.). (Needless to say, these are not mutually exclusive.)
I interpret your comment as endorsing motivation theory for explaining savants. Whereas it seems to me that at least for memory savants like Kim Peek (who mem...
Thanks!
...I can think of a few different explanations:
- Even extreme childhood abuse doesn't have a major effect on life outcomes.
  - (Including this one for completeness, though I consider it obviously implausible.)
- The level of abuse that would affect life outcomes is rare enough not to be picked up on in the studies.
- The methodology of the studies creates a floor on the badness of outcomes that gets picked up; e.g. maybe adoptive parents are screened well enough to make the worst abuse not happen, and the people drawn from national twin registers and contacted to
Like I always say, the context in which you’re bringing up heritability matters. It seems that the context here is something like:
Some people say shared environment effects are ≈0 in twin & adoption studies, therefore we should believe “the bio-determinist child-rearing rule-of-thumb”. But in fact, parenting often involves treating different kids differently, so ‘shared environment effects are ≈0’ is irrelevant, and therefore we should reject “the bio-determinist child-rearing rule-of-thumb” after all.
If that’s the context, then I basically disagree. L...
Suppose we had a CoT-style transcript of every thought, word, email and action by the founder of a successful startup over the course of several years of its founding, and used this for RL: then we'd get a reward signal every time they landed a funding round, sales went up significantly, a hire they made or contract they signed clearly worked out well, and so forth — not enough training data by itself for RL, but perhaps a useful contribution.
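To make sure we’re picturing the same thing, here’s a minimal sketch of the milestone-based reward labeling you’re describing (event names and reward values are made-up placeholders):

```python
# Minimal sketch of the milestone-based reward labeling described in the quote
# above. Event names and reward values are hypothetical placeholders.
MILESTONE_REWARDS = {
    "funding_round_closed": 1.0,
    "sales_milestone_hit": 0.5,
    "key_hire_worked_out": 0.3,
    "contract_signed": 0.2,
}

def label_transcript(events):
    """Attach a sparse reward to each (timestamp, event) in the founder transcript."""
    return [(t, e, MILESTONE_REWARDS.get(e, 0.0)) for t, e in events]

transcript = [(1, "wrote_pitch_deck"), (2, "funding_round_closed"), (3, "sent_emails")]
print(label_transcript(transcript))
# Most steps get reward 0.0, i.e. the signal is very sparse, as the quote notes.
```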
I don’t think this approach would lead to an AI that can autonomously come up with a new out-of-the-box innovative ...
I was trying to argue in favor of:
CLAIM: there are AI capabilities things that cannot be done without RL training (or something functionally equivalent to RL training).
It seems to me that, whether this claim is true or false, it has nothing to do with alignment, right?
Hmm, you’re probably right.
But I think my point would have worked if I had suggested a modified version of Go rather than chess?
Let me give you a detailed prescription…
For example, people want AIs that can autonomously come up with a new out-of-the-box innovative business plan, and found the company, and grow it to $1B/year revenue, over the course of years, all with literally zero human intervention.
People are trying to build such AIs as we speak, and I don’t expect them to quit until they succeed (or until we all die from their attempt).
And it’s possible—if human brains (or groups of human brains) can do this, so can AI algorithms. But human brains involve (model-based) RL. It’s a...
Right, but what I'm saying is that there's at least a possibility that RL is the only way to train a frontier system that's human-level or above.
In that case, if the alignment plan is "Well just don't use RL!", then that would be synonymous with "Well just don't build AGI at all, ever!". Right?
...And yeah sure, you can say that, but it would be misleading to call it a solution to inner alignment, if indeed that's the situation we're in.
There’s a potential failure mode where RL (e.g. RLVR or otherwise) is necessary to get powerful capabilities. Right?
I for one don’t really care about whether the LLMs of May 2025 are aligned or not, because they’re not that capable. E.g. they would not be able to autonomously write a business plan and found a company and grow it to $1B/year of revenue. So something has to happen between now and then to make AI more capable. And I for one expect that “something” to involve RL, for better or worse (well, mostly worse). I’ve been saying that RL is necessary f...
(Thanks for your patient engagement!)
If you believe
then I’m curious what accounts for the difference, in your mind?
More detail, just to make sure we’re on the same page: The analogy I’m suggesting is:
(A1) AlphaZero goes from ...
Hmm, I don’t particularly disagree with anything you wrote. I think you’re misunderstanding the context of this conversation.
I wasn’t bringing up tree search because I think tree search is required for AGI. (I don’t think that.)
Rather, I was making a point that there will need to be some system that updates the weights (not activations) of an AGI as it runs, just as adult humans learn and figure out new things over time as they work on a project.
What is this system that will update the weights? I have opinions, but in general, there are lots of possible ap...
Great, glad we agree on that!
Next: If we take an “agent trained through imitation learning”, and glue on a “solution to efficient, online continual learning”, then the result (after it runs a while) is NOT
“an agent trained through imitation learning”,
but rather
“an agent that is partly trained through imitation learning, and partly trained through [however the online continual learning works]”.
Right?
And now your proposal requires an assumption that this online continual learning system, whatever it is, does not undermine the agent’s alignment. Right?
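Schematically (dummy stand-ins, not a real training stack):

```python
# Schematic only: dummy stand-ins for imitation pretraining and for whatever
# online continual learning mechanism updates the weights during deployment.
class Agent:
    def __init__(self):
        self.weights = {"from_imitation": 0.0, "from_online_learning": 0.0}

    def act_in_world(self):
        return "some new experience"          # placeholder

def train_by_imitation(agent, demonstrations):
    agent.weights["from_imitation"] += len(demonstrations)   # stand-in for supervised learning
    return agent

def online_continual_learning_step(agent, experience):
    agent.weights["from_online_learning"] += 1               # stand-in for the OCL mechanism

agent = train_by_imitation(Agent(), demonstrations=["demo1", "demo2"])

for _ in range(10):                           # deployment
    exp = agent.act_in_world()
    online_continual_learning_step(agent, exp)

# By now the weights reflect both training signals, so the agent is no longer
# "an agent trained through imitation learning" in the sense the argument needs.
print(agent.weights)
```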
Yes distilling a snapshot of AlphaZero is easy. The hard part is distilling the process by which AlphaZero improves—not just bootstrapping from nothing, but also turning an Elo-2500 AlphaZero into an Elo-3500 AlphaZero.
Is this a way to operationalize our disagreement?
...CLAIM:
Take AlphaZero-chess and train it (via self-play RL as usual) from scratch to Elo 2500 (grandmaster level), but no further.
Now take a generic DNN like a transformer. Give it training data showing how AlphaZero-in-training developed from Elo 0, to Elo 1, … to Elo 1000, to Elo 1001, to Elo
You could equally well say: “AlphaZero learns, therefore a sufficiently good imitation of AlphaZero should also learn”. Right? But let’s think about what that would entail.
AlphaZero learns via a quite complicated algorithm involving tracking the state of a Go board through self-play, and each step of the self-play involves a tree search with thousands of queries to a 30M-parameter ConvNet, and then at the end of the game a Go engine is called to see who won and then there’s a set of gradient descent steps on that 30M-parameter ConvNet. Then repeat that who...
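Schematically, the outer loop being described is something like this (heavily simplified, with dummy stand-ins for the board, the tree search, the game engine, and the gradient step). The thing a would-be imitator has to reproduce is this entire loop, not any single ConvNet snapshot.

```python
# Heavily simplified sketch of the AlphaZero-style outer loop described above.
# Dummy stand-ins throughout; the real thing uses a ~30M-parameter ConvNet,
# full MCTS, and an actual game engine.
import random

def legal_moves(board_state):
    return ["A", "B", "C"]                    # placeholder

def game_over(board_state):
    return random.random() < 0.05             # placeholder termination

def score_game(board_state):
    return random.choice([+1, -1])            # stand-in for the engine scoring the result

def tree_search(board_state, network):
    """Stand-in for MCTS: in reality, thousands of network queries per move."""
    return random.choice(legal_moves(board_state))

def gradient_step(network, game_record, outcome):
    network["updates"] += 1                   # stand-in for SGD on the ConvNet
    return network

network = {"updates": 0}
for _ in range(100):                          # many self-play games
    board, record = "empty board", []
    while not game_over(board):
        record.append(tree_search(board, network))   # self-play move selection
    network = gradient_step(network, record, score_game(board))

print(network)   # the "learning" lives in this whole loop, not in any one snapshot
```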
It’s possible that “imitation learning will not generalize sufficiently well OOD” is an unsolvable problem, right? (In fact, my belief is that it’s unsolvable, at least in practice, if we include “humans learning new things over the course of years” as part of the definition of what constitutes successful OOD generalization.)
But if it is an unsolvable problem, it would not follow that “models will never gain the ability to generalize OOD”, nor would it follow that AGI will never be very powerful and scary.
Rather, it would follow that imitation learning models...
OK, imagine (for simplicity) that all humans on Earth drop dead simultaneously, but there’s a John-von-Neumann-level AI on a chip connected to a solar pa...