I think this post greatly misunderstands mine.
Firstly, I'd like to address the question of epistemics.
When I said "there's no reason to reference evolution at all when forecasting AI development rates", I was referring to two patterns of argument that I think are incorrect: (1) using the human sharp left turn as evidence for an AI sharp left turn, and (2) attempting to "rescue" human evolution as an informative analogy for other aspects of AI development.
(Note: I think Zvi did follow my argument for not drawing inferences about the odds of the sharp left turn specifically. I'm still starting by clarifying pattern 1 in order to set things up to better explain pattern 2.)
Pattern 1: using the human sharp left turn as evidence for an AI sharp left turn.
The original sharp left turn post claims that there are general factors about the structure and dynamics of optimization processes which both caused the evolutionary sharp left turn, and will go on to cause another sharp left turn in AI systems. The entire point of Nate referencing evolution is to provide evidence for these factors.
My counterclaim is that the causal processes responsible for the evolutionary sharp left turn are almost entirely distinct from anything present in AI development, and so the evolutionary outcome is basically irrelevant for thinking about AI.
From my perspective, this is just how normal Bayesian reasoning works. If Nate says:
P(human SLT | general factors that cause SLTs) ~= 1
P(human SLT | NOT general factors that cause SLTs) ~= 0
then observing the human SLT is very strong evidence for there being general factors that cause SLTs in different contexts than evolution.
OTOH, I am saying:
P(human SLT | NOT general factors that cause SLTs) ~= 1
And so observing the human SLT is no evidence for such general factors.
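To spell out that arithmetic, here's a minimal sketch of the update (the numbers are purely illustrative stand-ins for the "~1" and "~0" above, not anything either of us has committed to):

```python
# Minimal Bayes sketch with illustrative numbers.
# H = "there are general factors that cause sharp left turns across contexts".
prior_H = 0.5

def posterior(p_obs_given_H, p_obs_given_not_H, prior=prior_H):
    """Posterior P(H | observed human SLT) via Bayes' rule."""
    numerator = p_obs_given_H * prior
    evidence = numerator + p_obs_given_not_H * (1 - prior)
    return numerator / evidence

# Nate-style likelihoods: the human SLT is ~only explained by general factors.
print(posterior(p_obs_given_H=0.99, p_obs_given_not_H=0.01))  # ~0.99: strong update toward H

# My likelihoods: evolution-specific mechanisms already make the human SLT ~certain.
print(posterior(p_obs_given_H=0.99, p_obs_given_not_H=0.99))  # 0.5: no update at all
```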
Pattern 2: attempting to "rescue" human evolution as an informative analogy for other aspects of AI development.
When I explain my counterargument to pattern 1 to people in person, they will very often try to "rescue" evolution as a worthwhile analogy for thinking about AI development. E.g., they'll change the analogy so it's the programmers who are in a role comparable to evolution, rather than SGD.
I claim that such attempted inferences also fail, for the same reason as argument pattern 1 above fails: the relevant portions of the causal graph driving evolutionary outcomes are extremely different from the causal graph driving AI outcomes, such that it's not useful to use evolution as evidence to make inferences about nodes in the AI outcomes causal graph. E.g., the causal factors that drive programmers to choose a given optimizer are very different from the factors that cause evolution to "choose" a given optimizer. Similarly, evolution is not a human organization that makes decisions based on the causal factors that influence human organizations, so you shouldn't look at evolution for evidence of organization-level failures that might promote a sharp left turn in AI.
Making this point was the purpose of the "alien space clowns" / EVO-Inc example. It was intended to provide a concrete example of two superficially similar-seeming situations whose causal structures are actually completely distinct, such that there are no useful updates to make from EVO-Inc's outcomes to other automakers. When Zvi says:
I would also note that, if you discover (as in Quintin’s example of Evo-inc) that major corporations are going around using landmines as hubcaps, and that they indeed managed to gain dominant car market share and build the world’s most functional cars until recently, that is indeed a valuable piece of information about the world, and whether you should trust corporations or other humans to be able to make good choices, realize obvious dangers and build safe objects in general. Why would you think that such evidence should be ignored?
Zvi is proposing that there are common causal factors that led to the alien clowns producing dangerous cars, and could also play a similar role in causing other automakers to make unsafe vehicles, such that Evo-Inc's outcomes provide useful updates for predicting other automakers' outcomes. This is what I'm saying is false about evolution versus AI development.
At this point, I should preempt a potential confusion: it's not the case that AI development and human evolution share zero causal factors! To give a trivial example, both rely on the same physical laws. What prevents there being useful updates from evolution to AI development is the different structure of the causal graphs. When you update your estimates for the shared factors between the graphs using evidence from evolution, this leads to trivial or obvious implications for AI development, because the shared causal factors play different roles in the two graphs. You can have an entirely "benign" causal graph for AI development, which predicts zero alignment issues for AI development, yet when you build the differently structured causal graph for human evolution, it still predicts the same sharp left turn, despite some of the causal factors being shared between the graphs.
This is why inferences from evolutionary outcomes to AI development don't work. Propagating belief updates through the evolution graph doesn't change any of the common variables away from settings which are benign in the AI development graph, since those settings already predict a sharp left turn when they're used in the evolution graph.
Concrete example 1: We know from AI development that having a more powerful optimizer, running for more steps, leads to more progress. Applying this causal factor to the AI development graph basically predicts "scaling laws will continue", which is just a continuation of the current trajectory. Applying the same factor to the evolution graph, combined with the evolution-specific fact of cultural transmission enabling a (relatively) sudden unleashing of ~9 OOM more effectively leveraged optimization power in a very short period of time, predicts an extremely sharp increase in the rate of progress.
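A toy numeric version of this (all scale factors made up; only the qualitative shapes matter): plug the same "more optimization power, more progress" factor into both graphs and compare the predicted trajectories.

```python
import numpy as np

# Toy model: progress rate is proportional to effectively leveraged optimization power.
steps = np.arange(100)

# AI-development graph: optimization power grows smoothly (compute scaling continues),
# so predicted progress is just a continuation of the existing trend.
ai_power = 1.05 ** steps
ai_progress = np.cumsum(ai_power)

# Evolution graph: same factor, plus the evolution-specific node where cultural
# transmission suddenly unlocks ~9 OOM more effectively leveraged optimization.
evo_power = np.ones_like(steps, dtype=float)
evo_power[80:] *= 1e9  # the (relatively) sudden unleashing
evo_progress = np.cumsum(evo_power)

print("AI graph:  progress in the last 20 steps vs. the first 80:",
      f"{(ai_progress[-1] - ai_progress[79]) / ai_progress[79]:.1f}x")
print("Evo graph: progress in the last 20 steps vs. the first 80:",
      f"{(evo_progress[-1] - evo_progress[79]) / evo_progress[79]:.2e}x")
```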
Concrete example 2: One general hypothesis you could have about RL agents is "RL agents just do what they're trained to do, without any weirdness". (To be clear, I'm not endorsing this hypothesis. I think it's much closer to being true than most on LW, but still false.) In the context of AI development, this has pretty benign implications. In the context of evolution, due to the bi-level nature of its optimization process and the different data that different generations are "trained" on, this causal factor in the evolution graph predicts significant divergence between the behaviors of ancestral and modern humans.
Zvi says this is an uncommon standard of epistemics, for there to be no useful inferences from one set of observations (evolutionary outcomes) to another (AI outcomes). I completely disagree. For the vast majority of possible pairs of observations, there are not useful inferences to draw. The pattern of dust specks on my pillow is not a useful reference point for making inferences about the state of the North Korean nuclear weapons program. The relationship between AI development and human evolution is not exceptional in this regard.
Secondly, I'd like to address a common pattern in a lot of Zvi's criticisms.
My post has a unifying argumentative structure that Zvi seems to almost completely miss. This leads to a very annoying dynamic where:
The unifying argumentative structure of my post is as follows:
Having outlined my argumentative structure, I'll highlight some examples where Zvi's criticisms fall into the previously mentioned dynamic.
1:
[Zvi] He then goes on to make another very broad claim.
[Zvi quoting me] > In order to experience a sharp left turn that arose due to the same mechanistic reasons as the sharp left turn of human evolution, an AI developer would have to:
[I list some ways one could produce an ML training process that's actually similar to human evolution in the relevant sense that would lead to an evolution-like sharp left turn at some point]
[Zvi criticizes the above list on the grounds that inner misalignment could occur under a much broader range of circumstances than I describe]
(I added the bolding)
The issue here is that the list in question is specifically for sharp left turns that arise "due to the same mechanistic reasons as the sharp left turn of human evolution", as I very specifically said in my original post. I'm not talking about inner alignment in general. I'm not even talking about sharp left turn threat scenarios in general! I'm talking very specifically about how the current AI paradigm would have to change before it had a mechanistic structure sufficiently similar to human evolution that I think a sharp left turn would occur "due to the same mechanistic reasons as the sharp left turn of human evolution".
2:
As a general note, these sections seem mostly to be making a general alignment is easy, alignment-by-default claim, rather than being about what evolution offers evidence for, and I would have liked to see them presented as a distinct post given how big and central and complex and disputed is the claim here.
That is emphatically not what those sections are arguing for. The purpose of these sections is to describe two non-sharp left turn causing mechanisms for fast takeoff, in order to better illustrate that fast takeoff != sharp left turn. Each section specifically focuses on a particular mechanism of fast takeoff, and argues that said mechanism will not, in and of itself, lead to misalignment. You can still believe a fast takeoff driven by that mechanism will lead to misalignment for other reasons (e.g., a causal graph that looks like: "(fast takeoff mechanism) -> (capabilities) -> (something else) -> (misalignment)"), if, say, you think there's another causal mechanism driving misalignment, such that the fast takeoff mechanism's only contribution to misalignment was to advance capabilities in a manner that failed to address that other mechanism.
These sections are not arguing about the ease of alignment in general, but about the consequence of one specific process.
3:
The next section seems to argue that because alignment techniques work on a variety of existing training regimes all of similar capabilities level, we should expect alignment techniques to extend to future systems with greater capabilities.
That is, even more emphatically, not what that specific section is arguing for. This section focuses specifically on the "AIs do AI capabilities research" mechanism of fast takeoff, and argues that it will not itself cause misalignment. Its purpose is specific to the context in which I use it: to address the causal influence of (AIs do capabilities research) directly to (misalignment), not to argue about the odds of misalignment in general.
Further, the argument that section made wasn't:
because alignment techniques work on a variety of existing training regimes all of similar capabilities level, we should expect alignment techniques to extend to future systems
It was:
alignment techniques already generalize across human contributions to AI capability research. Let’s consider eight specific alignment techniques:
[list of alignment techniques]
and eleven recent capabilities advances:
[list of capabilities techniques]
I don’t expect catastrophic interference between any pair of these alignment techniques and capabilities advances.
And so, if you think AIs doing capabilities research will be like humans doing capabilities research, but faster, then there will be a bunch of capabilities and alignment techniques, and the question is how much the capabilities techniques will interfere with the alignment techniques. Based on current data, the interference seems small and manageable. The trend being projected forwards is the lack of empirical interference between current capabilities and alignment (despite, as I note in my post, current capabilities techniques putting ~zero effort into not interfering with alignment techniques, an obviously dumb oversight which we haven't corrected because it turns out we don't even need to).
Once again, I emphasize that this is not a general argument about alignment, which can be detached from the rest of the post. It's extremely specific to the mechanism for fast takeoff being analyzed, which is only being analyzed to further explore the connection between fast takeoff mechanisms and the odds of a sharp left turn.
4:
He closes by arguing that iteratively improving training data also exhibits important differences from cultural development, sufficient to ignore the evolutionary evidence as not meaningful in this context. I do not agree. Even if I did agree, I do not see how that would justify his broader optimism expressed here:
This part is a separate analysis of a different fast takeoff causal mechanism, arguing that it will not, itself, cause misalignment either. Its purpose and structure mirror those of the argument I clarified above, but focused on a different mechanism. It's not a continuation of a previous (non-existent) "alignment is easy in general" argument.
Thirdly, I'd like to make some random additional commentary.
I would argue that ‘AIs contribute to AI capabilities research’ is highly analogous to ‘humans contribute to figuring out how to train other humans.’ And that ‘AIs seeking out new training data’ is highly analogous to ‘humans creating bespoke training data to use to train other people especially their children via culture’ which are exactly the mechanisms Quintin is describing humans as using to make a sharp left turn.
The degree of similarity is arguable. I think, and said in the original article, that similarity is low for the first mechanism and moderate for the second.
However, the appropriate way to estimate the odds of a given fast takeoff mechanism leading to AI misalignment is not to estimate the similarity between that mechanism and what happened during human evolution, then assign misalignment risk to the mechanism in proportion to the estimated similarity. Rather, the correct approach is to build detailed causal models of how both human evolution and AI development work, propagate the evidence from human evolutionary outcomes back through your human evolution causal model to update relevant latent variables in that causal model, transfer those updates to any of the AI development causal model's latent variables which are also in the human evolution causal model, and finally estimate the new misalignment risk implied by the updated variables of the AI development model.
I discussed this in more detail in the first part of my comment, but whenever I do this, I find that the transfer from (observations of evolutionary outcomes) to (predictions about AI development) is pretty trivial or obvious, leading to such groundbreaking insights as:
That seems like a sharp enough left turn to me.
A sharp left turn is more than just a fast takeoff. It's the combined sudden increase in AI generality and breaking of previously existing alignment properties.
...humans being clearly misaligned with genetic fitness is not evidence that we should expect such alignment issues in AIs. His argument (without diving into his earlier linked post) seems to be that humans are fresh instances trained on new data, so of course we expect different alignment and different behavior.
But if you believe that, you are saying that humans are fresh versions of the system. You are entirely throwing out from your definition of ‘the system’ all of the outer alignment and evolutionary data, entirely, saying it does not matter, that only the inner optimizer matters. In which case, yes, that does fully explain the differences. But the parallel here does not seem heartening. It is saying that the outcome is entirely dependent on the metaphorical inner optimizer, and what the system is aligned to will depend heavily on the details of the training data it is fed and the conditions under which it is trained, and what capabilities it has during that process, and so on. Then we will train new more capable systems in new ways with new data using new techniques, in an iterated way, in similar fashion. How should this make us feel better about the situation and its likely results?
I find this perspective baffling. Where else do the alignment properties of a system derive from? If you have a causal structure like
(programmers) -> (training data, training conditions, learning dynamics, etc) -> (alignment properties)
then setting the value of the middle node will of course screen off the causal influence of the (programmers) node.
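A quick toy simulation of that chain (generic variables, nothing specific to any real training setup) shows the screening-off directly:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200_000

# Toy chain: (programmers) -> (training setup) -> (alignment properties).
programmers = rng.normal(size=n)                    # upstream choices
training_setup = programmers + rng.normal(size=n)   # data, conditions, dynamics, ...
alignment = training_setup + rng.normal(size=n)     # downstream alignment properties

# Unconditionally, programmers and alignment properties are correlated...
print(np.corrcoef(programmers, alignment)[0, 1])    # ~0.58: clearly correlated

# ...but fix (condition on) the middle node and the dependence vanishes:
mask = np.abs(training_setup - 1.0) < 0.05          # hold the training setup ~constant
print(np.corrcoef(programmers[mask], alignment[mask])[0, 1])  # ~0.0
```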
A possible clarification: in the context of my post when discussing evolution, "inner optimizer" means the brain's "base" optimization process, not the human values / intelligence that arises from that process. The mechanistically most similar thing in AI development to that meaning of the word "inner optimizer" is the "base" training process: the combination of training data, base optimizer, training process, architecture, etc. It doesn't mean the cognitive system that arises as a consequence of running that training process.
Consider the counterfactual. If we had not seen a sharp left turn in evolution, civilization had taken millions of years to develop to this point with gradual steady capability gains, and we saw humans exhibiting strong conscious optimization mostly for their genetic fitness, it would seem crazy not to change our beliefs at all about what is to come compared to what we do observe. Thus, evidence.
I think Zvi is describing an ~impossible world. I think this world would basically break ~all my models of how optimizing processes gain capabilities. My new odds of an AI sharp left turn would depend on the new models I made in this world, which in turn would depend on unspecified details of how human civilization's progress / AI progress happens in this world.
I would also note that Quintin in my experience often cites parallels between humans and AIs as a reason to expect good outcomes from AI due to convergent outcomes, in circumstances where it would be easy to find many similar distinctions between the two cases. Here, although I disagree with his conclusions, I agree with him that the human case provides important evidence.
Once again, it's not the degree of similarity that determines what inferences are appropriate. It's the relative structure of the two causal graphs for the processes in question. The graphs for the human brain and current AI systems are obviously not the same, but they share latent variables that serve similar roles in determining outcomes, in a way that the bi-level structure of evolution's causal graph largely prevents. E.g., Steven Byrnes has a whole sequence which discusses the brain's learning process, and while there are lots of differences between the brain and current AI designs, there are also shared building blocks whose behaviors are driven by common causal factors. The key difference with evolution is that, once one updates the shared variables from looking at human brain outcomes and applies those updates to the AI development graph, there are non-trivial / non-obvious implications. Thus, one can draw relevant inferences by observing human outcomes.
Concrete example 1: brains use a local, non-gradient based optimization process to minimize predictive error, so there exist some non-SGD update rules that are competitive with SGD (on brainlike architectures, at least).
Concrete example 2: brains don't require GPT-4 level volumes of training data, so there exist architectures with vastly more data-friendly scaling laws than GPT-4's scaling.
In the generally strong comments to OP, Steven Byrnes notes that current LLM systems are incapable of autonomous learning, versus humans and AlphaZero which are, and that we should expect this ability in future LLMs at some point. Constitutional AI is not mentioned, but so far it has only been useful for alignment rather than capabilities, and Quintin suggests autonomous learning mostly relies upon a gap between generation and discernment in favor of discernment being easier. I think this is an important point, while noting that what matters is the ability to usefully discern between outputs at all, rather than it being easier, which is an area where I keep trying to put my finger on writing down the key dynamics and so far falling short.
What I specifically said was:
Autonomous learning basically requires there to be a generator-discriminator gap in the domain in question, i.e., that the agent trying to improve its capabilities in said domain has to be better able to tell the difference between its own good and bad outputs.
I realize this accidentally sounds like it's saying two things at once (that autonomous learning relies on the generator-discriminator gap of the domain, and then that it relies on the gap for the specific agent (or system in general)). To clarify, I think it's the agent's capabilities that matter, that the domain determines how likely the agent is to have a persistent gap between generation and discrimination, and I don't think the (basic) dynamics are too difficult to write down.
You start with a model M and initial data distribution D. You train M on D such that M is now a model of D. You can now sample from M, and those samples will (roughly) have whatever range of capabilities were to be found in D.
Now, suppose you have some classifier, C, which is able to usefully distinguish samples from M on the basis of that sample's specific level of capabilities. Note that C doesn't have to just be an ML model. It could be any process at all, including "ask a human", "interpret the sample as a computer program trying to solve some problem, run the program, and score the output", etc.
Having C allows you to sample from a version of M's output distribution that has been "updated" on C, by continuously sampling from M until a sample scores well on C. This lets you create a new dataset D', which you can then train M' on to produce a model of the updated distribution.
So long as C is able to provide classification scores which actually reflect a higher level of capabilities among the samples from M / M' / M'' / etc, you can repeat this process to continually crank up the capabilities. If your classifier C was some finetune of M, then you can even create a new C' off of M', and potentially improve the classifier along with your generator. In most domains though, classifier scores will eventually begin to diverge from the qualities that actually make an output good / high capability, and you'll eventually stop benefiting from this process.
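As a toy sketch of that loop (a one-dimensional stand-in where a sample's "capability" is just a number, and C is a noisy, saturating proxy for it; nothing here is meant to model a real training setup):

```python
import numpy as np

rng = np.random.default_rng(0)

def classifier_score(samples):
    # C: a proxy for true quality. It tracks quality well at first but saturates,
    # so past some point its scores stop distinguishing genuinely better outputs.
    return np.tanh(samples / 3.0) + rng.normal(scale=0.1, size=samples.shape)

mean, std = 0.0, 1.0  # M: a toy "model" of the initial data distribution D
for rnd in range(15):
    samples = rng.normal(mean, std, size=5_000)          # sample from M
    scores = classifier_score(samples)                   # score each sample with C
    kept = samples[scores >= np.quantile(scores, 0.9)]   # D': the samples C likes
    mean, std = kept.mean(), max(kept.std(), 0.05)       # "train" M' on D' (with a diversity floor)
    print(f"round {rnd:2d}: capability ~ {mean:.2f}")
# Capability climbs quickly while C can still tell better samples from worse ones,
# then stalls once C's scores no longer track further real improvements.
```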
This process goes further in domains where it's easier to distinguish generations by their quality. Chess / other board games are extreme outliers in this regard, since you can always tell which of two players actually won the game. Thus, the game rules act as a (pairwise) infallible classifier of relative capabilities. There's some slight complexity around that last point, since a given trajectory could falsely appear good by beating an even worse / non-representative policy, but modern self-play approaches address such issues by testing model versions against a variety of opponents (mostly past versions of themselves) to ensure continual real progress. Pure math proofs is another similarly skewed domain, where building a robust verifier (i.e., a classifier) of proofs is easy. That's why Steven was able to use it as a valid example of where self-play gets you very far.
Most important real world domains do not work like this. E.g., if there were a robust, easy-to-query process that could classify which of two scientific theories / engineering designs / military strategies / etc was actually better, the world would look extremely different.
There are other issues I have with this post, but my reply is already longer than the entire original post, so I'll stop here, rather than, say, adding an entire additional section on my models of takeoff speed for AIs versus evolution (which I'll admit probably should have another post to go with it).
Thank you for the very detailed and concrete response. I need to step through this slowly to process it properly and see the extent to which I did misunderstand things, or places where we disagree.
I realize this accidentally sounds like it's saying two things at once (that autonomous learning relies on the generator-discriminator gap of the domain, and then that it relies on the gap for the specific agent (or system in general)). I think it's the agent's capabilities that matter, that the domain determines how likely the agent is to have a persistent gap between generation and discrimination, and I don't think the (basic) dynamics are too difficult.
You start with a model M and initial data distribution D. You train M on D such that M is now a model of D. You can now sample from M, and those samples will (roughly) have whatever range of capabilities were to be found in D.
Now, suppose you have some classifier, C, which is able to usefully distinguish samples from M on the basis of that sample's specific level of capabilities. Note that C doesn't have to just be an ML model. It could be any process at all, including "ask a human", "interpret the sample as a computer program trying to solve some problem, run the program, and score the output", etc.
Having C allows you to sample from a version of M's output distribution that has been "updated" on C, by continuously sampling from M until a sample scores well on C. This lets you create a new dataset D', which you can then train M' on to produce a model of the updated distribution.
So long as C is able to provide classification scores which actually reflect a higher level of capabilities among the samples from M / M' / M'' / etc, you can repeat this process to continually crank up the capabilities. If your classifier C was some finetune of M, then you can even create a new C' off of M', and potentially improve the classifier along with your generator. In most domains though, classifier scores will eventually begin to diverge from the qualities that actually make an output good / high capability, and you'll eventually stop benefiting from this process.
This process goes further in domains where it's easier to distinguish generations by their quality. Chess / other board games are extreme outliers in this regard, since you can always tell which of two players actually won the game. Thus, the game rules act as a (pairwise) infallible classifier of relative capabilities. There's some slight complexity around that last point, since a given trajectory could falsely appear good by beating an even worse / non-representative policy, but modern self-play approaches address such issues by testing model versions against a variety of opponents (mostly past versions of themselves) to ensure continual real progress. Pure math proofs is another similarly skewed domain, where building a robust verifier (i.e., a classifier) of proofs is easy. That's why Steven was able to use it as a valid example of where self-play gets you very far.
Most important real world domains do not work like this. E.g., if there were a robust, easy-to-query process that could classify which of two scientific theories / engineering designs / military strategies / etc was actually better, the world would look extremely different.
Thank you, this is helpful for me in thinking further about this. The first paragraph seems almost right, except that instead of the single agent, what you care about is the best trainable or available agent, since the two agents (M and C) need not be the same? What you get from this is an M that maximizes C, right? And the issue, as you note, is that in most domains a predictor of your best available C is going to plateau, so it comes down to whether having M gives you the ability to create a C' that can let you move 'up the chain' of capability here, while preserving any necessary properties at each transition, including alignment. But M will inherit any statistical or other flaws in C, or ways to exploit C, in ways we don't have any reason to presume we have a way to 'rescue ourselves from' in later iterations, and would instead expect to amplify over time?
(And thus, you need a security-mindset-level-robust-to-M C at each step for this to be a safe strategy to iterate on a la Christiano or Leike, and you mostly only should expect to get that in rare domains like chess, rather than expecting C to win the capabilities race in general? Or something like that? Again, comment-level rough here.)
On Quintin's secondly's concrete example 1 from above:
I think the core disagreement here is that Quintin thinks that you need very close parallels in order for the evolutionary example to be meaningful, and I don't think that at all. And neither of us can fully comprehend why the other person is going with as extreme a position as we are on that question?
Thus, he says, yes of course you do not need all those extra things to get misalignment, I wasn't claiming that, all I was saying was this would break the parallel. And I'm saying both (1) that misalignment could happen these other ways which he agrees with in at least some cases (but perhaps not all cases) and (2) also I do not think that these extra clauses are necessary for the parallel to matter.
And also (3) yes, I'll totally cop to, because I don't see why the parallel is in danger with these changes, I didn't fully have in my head the distinction Quintin is trying to draw here, when I was writing that.
But I will say that, now that I do have it in my head, that I am at least confused why those extra distinctions are necessary for the parallel to hold, here? Our models of what is required here are so different that I'm pretty confused about it, and I don't have a good model of why e.g. it matters that there are 9 OOMs of difference, or that the creation of the inner optimizer is deliberate (especially given that nothing evolution did was in a similar sense deliberate, as I understand these things at least - my model is that evolution doesn't do deliberate things at all). And in some cases, to the extent that we want a tighter parallel, Quintin's requirements seem to move away from that? Anyway, I notice I am confused.
Concrete example 4: Am I wrong here that you're arguing that this path still exhibits key differences from cultural development and thus evolution does not apply? And then you also argue that this path does not cause additional severe alignment difficulties beyond those above. So I'm not sure where the misreading is here. After that, I discuss a disagreement with a particular claim.
(Writing at comment-speed, rather than carefully-considered speed, apologies for errors and potential repetitions, etc)
On the Evo-Clown thing and related questions in the Firstly section only.
I think we understand each other on the purpose of the Evo-Clown analogy, and I think it is clear what our disagreement is here in the broader question?
I put in the paragraph Quintin quoted in order to illustrate that, even in an intentionally-absurd example intended to illustrate that A and B share no causal factors, A and B still share clear causal factors, and the fact that A happened this way should give you substantial pause about the prospects for B, versus A never having happened at all and the things that caused A not having happened. I am curious (since Quintin does not comment) whether he agrees about the example, now that I bring up the reasons to be concerned.
The real question is the case of evolution versus AI development.
I got challenged by Quintin and by others as interpreting Quintin too broadly when I said:
That seems like quite a leap. If there is one particular development in humanity’s history that we can fully explain, we should then not cite evolution in any way, as an argument for anything?
In response to Quintin saying:
- THEN, there's no reason to reference evolution at all when forecasting AI development rates, not as evidence for a sharp left turn, not as an "illustrative example" of some mechanism / intuition which might supposedly lead to a sharp left turn in AI development, not for anything.
I am happy to accept the clarification that I interpreted Quintin's statement stronger than he intended it.
I still am confused how else I could have interpreted the original statement? But that does not matter, what matters is the disagreements we still clearly do have here.
I now understand Quintin's model as saying (based on the comment plus his OP) that evolution so obviously does an overdetermined sharp left turn that it isn't evidence of anything (e.g. that the world I proposed as an alternative breaks so many of his models that it isn't worth considering)?
I agree that if evolution's path is sufficiently overdetermined, then there's no reason to cite that path as evidence. In which case we should instead be discussing the mechanisms that are overdetermining that result, and what they imply.
I think the reason we talk about evolution here is exactly because for most people, the underlying mechanisms very much aren't obvious and overdetermined before looking at the results - if you skipped over the example people would think you were making a giant leap.
Concrete example 2: One general hypothesis you could have about RL agents is "RL agents just do what they're trained to do, without any weirdness". (To be clear, I'm not endorsing this hypothesis. I think it's much closer to being true than most on LW, but still false.) In the context of AI development, this has pretty benign implications. In the context of evolution, due to the bi-level nature of its optimization process and the different data that different generations are "trained" on, this causal factor in the evolution graph predicts significant divergence between the behaviors of ancestral and modern humans.
Zvi says this is an uncommon standard of epistemics, for there to be no useful inferences from one set of observations (evolutionary outcomes) to another (AI outcomes). I completely disagree. For the vast majority of possible pairs of observations, there are not useful inferences to draw. The pattern of dust specks on my pillow is not a useful reference point for making inferences about the state of the North Korean nuclear weapons program. The relationship between AI development and human evolution is not exceptional in this regard.
Ok, sure. I agree that for any given pair of facts there is essentially nothing to infer from one about the other, given what else we already know, and that the two facts Quintin cites as an example are a valid example. But it seems wrong to say that AI developments and evolutionary developments relate to each other in a similar way or reference class to a speck on your pillow to the nuclear weapons program? Or that the distinctions proposed should generally be of a sufficient degree to imply there are no implications from one to the other?
What I was saying that Quintin is challenging in the second paragraph above, specifically, was not that for observations A and B it would be unusual for A to not have important implications for B. What I was saying was that there being distinctions in the causal graphs behind A and B is not a good reason to dismiss A having implications for B - certainly differences reduce it somewhat, but most of the time that A impacts B, there are important causal graph differences that could draw a similar parallel. And, again, this would strike down most reference class arguments.
Quintin does say there are non-zero implications in the comment, so I suppose the distinction does not much matter in the end. Nor does it much matter whether we are citing evolution, or citing our underlying models that also explain evolution's outcomes, if we can agree on those models?
As in, we would be better served looking at:
One general hypothesis you could have about RL agents is "RL agents just do what they're trained to do, without any weirdness." In the context of AI development, this has pretty benign implications.
I think I kind of... do believe this? For my own perhaps quite weird definitions of 'weirdness' and 'what you train it for'? And for those values, no, this is not benign at all, because I don't consider SLT behaviors to be weird when you have the capabilities for them. That's simply what you would expect, including from a human in the same spot, why are we acting so surprised?
If you define 'weirdness' sufficiently differently then it would perhaps be benign, but I have no idea why you would expect this.
And also, shouldn't we use our knowledge of humans here, when faced with similar situations? Humans, a product of evolution, do all sorts of local SLTs in situations far removed from their training data, the moment you give them the affordance to do so and the knowledge that they can.
It is also possible we are using different understandings of SLT, and Quintin is thinking about it more narrowly than I am, as his later statements imply. In that case, I would say that I think the thing I care about, in terms of whether it happens, is the thing (or combination of things) I'm talking about.
Thus, in my view, humans did not do only the one big anti-evolution (?) SLT. Humans are constantly doing SLTs in various contexts, and this is a lot of what I am thinking about in this context.
What prevents there being useful updates from evolution to AI development is the different structure of the causal graphs.
Aha?!?
Quintin, I think (?) is saying that the fact that evolution provided us with a central sharp left turn is not evidence, because that is perfectly compatible with and predicted by AI models that aren't scary.
So I notice I disagree with this twice.
First, I don't think the second 'because' clause entirely holds, for reasons I largely (but I am guessing not entirely) laid out in my OP, which I am confident Quintin disagrees with and which would take a lot to untangle. I do agree there is some degree of overdeterminedness here: if we hadn't done the exact SLT we did but had still ramped up our intelligence, we would have instead done a slightly-to-somewhat different-looking SLT later.
Second, I think this points out a key thing I didn't say explicitly and should have, which is the distinction between the evidence that humans did all their various SLTs (yes, plural, both collectively and individually), and the evidence that humans did these particular SLTs in these particular ways because of these particular mechanisms. Which I do see as highly relevant.
I can imagine a world where humans did an SLT later in a different way, and are less likely to do them on an individual level (note: I agree that this may be somewhat non-standard usage of SLT, but hopefully it's mostly clear from context what I'm referring to here), and everything happened slower and more continuously (on the margin presumably we can imagine this without our models breaking, if only via different luck). And where we look at the details and say, actually it's pretty hard to get this kind of thing to happen, and moving humans out of their training distributions causes them to hold up really well, in the way we'd metaphorically like AIs to, even when they are smart enough and have enough info and reflection time to know better, and so on.
(EDIT: It's late, and I've now responded in stages to the whole thing, which as Quintin noted was longer than my OP. I'm thankful for the engagement, and will read any further replies, but will do my best to keep any further interactions focused and short so this doesn't turn into an infinite time sink that it clearly could become, even though it very much isn't a demon thread or anything.)
When I explain my counterargument to pattern 1 to people in person, they will very often try to "rescue" evolution as a worthwhile analogy for thinking about AI development. E.g., they'll change the analogy so it's the programmers who are in a role comparable to evolution, rather than SGD.
In general one should not try to rescue intuitions, and the frequency of doing this is a sign of serious cognitive distortions. You should only try to rescue intuitions when they have a clear and validated predictive or pragmatic track record.
The reason for this is very simple - most intuitions or predictions one could make are wrong, and you need a lot of positive evidence to privilege any particular hypotheses re how or what to think. In the absence of evidence, you should stop relying on an intuition, or at least hold it very lightly.
On concrete example 2: I see four bolded claims in 'fast takeoff is still possible.' Collectively, to me, in my lexicon and way of thinking about such things, they add up to something very close to 'alignment is easy.'
The first subsection says human misalignment does not provide evidence for AI misalignment, which isn't one of the two mechanisms (as I understand this?), and is instead arguing against an alignment difficulty.
The bulk of the second subsection, starting with 'Let’s consider eight specific alignment techniques,' looks to me like an explicit argument that alignment looks easy based on your understanding of the history from AI capabilities and alignment developments so far?
The third subsection seems to also spend most of its space on arguing its scenario would involve manageable risks (e.g. alignment being easy), although you also argue that evolution/culture still isn't 'close enough' to teach us anything here?
I can totally see how these sections could have been written out with the core intention to explain how distinct-from-evolution mechanisms could cause fast takeoffs. From my perspective as a reader, I think my response and general takeaway that this is mostly an argument for easy alignment is reasonable on reflection, even if that's not the core purpose it serves in the underlying structure, and it's perhaps not a fully general argument.
On concrete example 3: I agree that what I said was a generalization of what you said, and you instead said something more specific. And that your later caveats make it clear you are not so confident that things will go smoothly in the future. So yes I read this wrong and I'm sorry about that.
But also I notice I am confused here - if you didn't mean for the reader to make this generalization, if you don't think the failure of current capabilities advances to break current alignment techniques is strong evidence that future capabilities advances won't break then-optimal alignment techniques, then why are we analyzing all these expected interactions here? Why state the claim that such techniques 'already generalize' (which they currently mostly do as far as I know, which is not terribly far) if it isn't a claim that they will likely generalize in the future?
On the additional commentary section:
On the first section, we disagree on the degree of similarity in the metaphors.
I agree with you that we shouldn't care about 'degree of similarity' and instead build causal models. I think our actual disagreements here lie mostly in those causal models, the unpacking of which goes beyond comment scope. I agree with the very non-groundbreaking insights listed, of course, but that's not what I'm getting out of it. It is possible that some of this is that a lot of what I'm thinking of as evolutionary evidence, you're thinking of as coming from another source, or is already in your model in another form to the extent you buy the argument (which often I am guessing you don't).
On the difference in SLT meanings, what I meant to say was: I think this is sufficient to cause our alignment properties to break.
In case it is not clear: My expectation is that sufficiently large capabilities/intelligence/affordances advances inherently break our desired alignment properties under all known techniques.
On the passage you find baffling: Ah, I do think we had confusion about what we meant by inner optimizer, and I'm likely still conflating the two somewhat. That doesn't change me not finding this heartening, though? As in, we're going to see rapid big changes in both the inner optimizer's power (in all senses) and also in the nature and amount of training data, where we agree that changing the training data details changes alignment outcomes dramatically.
On the impossible-to-you world: This doesn't seem so weird or impossible to me? And I think I can tell a pretty easy cultural story slash write an alternative universe novel where we honor those who maximize genetic fitness and all that, and have for a long time - and that this could help explain why civilization and our intelligence developed so damn slowly and all that. Although to truly make the full evidential point that world then has to be weirder still where humans are much more reluctant to mode shift in various ways. It's also possible this points to you having already accepted from other places the evidence I think evolution introduces, so you're confused why people keep citing it as evidence.
The comment in response to parallels provides some interesting thoughts and I agree with most of it. The two concrete examples are definitely important things to know. I still notice the thing I noticed in my comment about the parallels - I'd encourage thinking about what similar logic would say in the other cases?
On the impossible-to-you world: This doesn’t seem so weird or impossible to me? And I think I can tell a pretty easy cultural story slash write an alternative universe novel where we honor those who maximize genetic fitness and all that, and have for a long time—and that this could help explain why civilization and our intelligence developed so damn slowly and all that. Although to truly make the full evidential point that world then has to be weirder still where humans are much more reluctant to mode shift in various ways. It’s also possible this points to you having already accepted from other places the evidence I think evolution introduces, so you’re confused why people keep citing it as evidence.
The ability to write fiction in a world does not demonstrate its plausibility. Beware generalizing from fictional fictional evidence!
The claim that such a world is impossible is a claim that, were you to try to write a fictional version of it, you would run into major holes in the world that you would have to either ignore or paper over with further unrealistic assumptions.
In case it is not clear: My expectation is that sufficiently large capabilities/intelligence/affordances advances inherently break our desired alignment properties under all known techniques.
Nearly every piece of empirical evidence I've seen contradicts this - more capable systems are generally easier to work with in almost every way, and the techniques that worked on less capable versions straightforwardly apply and in fact usually work better than on less intelligent systems.
Presumably you agree this would become false if the system was deceptively aligned or otherwise scheming against us? Perhaps the implicit claim is that we should generalize from current evidence toward thinking the deceptive alignment is very unlikely?
I also think it's straightforward to construct cases where goodharting implies that applying the technique you used for a less capable model onto a more capable model would result in worse performance for the more capable model. I think it should be straightforward to construct such a case using scaling laws for reward model overoptimization.
(That said, I think if you vary the point of early stopping as models get more capable then you likely get strict performance improvements on most tasks. But, regardless there is a pretty reasonable technique of "train for duration X" which clearly gets worse performance in realistic cases as you go toward more capable systems.)
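As a toy illustration of that last point (the functional form is loosely in the spirit of the reward-model overoptimization scaling laws, and the coefficients are made up, not fitted values from any paper):

```python
import numpy as np

# Gold (true) reward rises with optimization pressure d, peaks, then falls,
# while the proxy (reward-model) score keeps climbing. Coefficients are illustrative.
def gold_reward(d, alpha=1.0, beta=0.35):
    return d * (alpha - beta * np.log(d))

def proxy_reward(d, alpha=1.0):
    return d * alpha  # the proxy never "notices" the overoptimization

d = np.linspace(0.1, 30, 300)   # d ~ how hard we optimize against the proxy
best_d = d[np.argmax(gold_reward(d))]

print(f"proxy score at d=30: {proxy_reward(30.0):.1f}  (still climbing)")
print(f"gold reward peaks at d ~ {best_d:.1f}")
print(f"gold at the peak: {gold_reward(best_d):.2f}   gold at d=30: {gold_reward(30.0):.2f}")
# Early stopping near the peak matters; a fixed "train for duration X" rule that is
# tuned for one curve can land well past the peak on another and do strictly worse.
```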
That seems like quite a leap. If there is one particular development in humanity’s history that we can fully explain, we should then not cite evolution in any way, as an argument for anything?
Specifically, the point is that evolution's approach to alignment was very different from, and much worse than, what we can do, so the evolution model doesn't suggest concern: what we are doing to align AIs is very, very different from, and way better than, what evolution did.
Similarly, the mechanisms that allow for fast takeoff don't automatically mean that the inner optimizer is billions of times faster/more powerful. It's not human vs AI takeoff that matters, but SGD (or the outer optimizer) vs the inner optimizer. And the implicit claim is that the outer/inner optimizer optimization differential will be vastly less unequal than the evolution/human optimization differential, due to the very different dynamics of the AI situation.
Start with ‘deliberately.’ Why would that matter?
Because you can detect whether an evolution-like sharp left turn happened, by doing this:
"If you suspect that you've maybe accidentally developed an evolution-style inner optimizer, look for a part of your system that's updating its parameters ~a billion times more frequently than your explicit outer optimizer."
And you can prevent it, because we can assign basically whatever ratio of optimization steps between the outer and inner optimizer we want, like so:
"Human "inner learners" take ~billions of inner steps for each outer evolutionary step. In contrast, we can just assign whatever ratio of supervisory steps to runtime execution steps, and intervene whenever we want."
Step two seems rather arbitrary. Why billions?
It doesn't much matter what exactly the threshold is; the point is that there needs to be a large enough ratio of inner to outer optimization steps that the outer optimizer (like SGD) won't be able to update the inner optimizer in time.
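As a sketch of the bookkeeping being described (hypothetical names, not a real training API): the inner/outer step ratio is a number we pick directly, and something we can monitor against an evolution-like threshold.

```python
# A sketch of the bookkeeping (hypothetical names, not a real training API).
EVOLUTION_LIKE_RATIO = 1_000_000_000   # ~1e9 inner steps per outer step, as in the quote

def run_training(num_outer_steps, inner_steps_per_outer_update):
    """Interleave inner/runtime execution steps with outer supervisory (SGD-like) updates."""
    inner_steps, outer_steps = 0, 0
    for _ in range(num_outer_steps):
        inner_steps += inner_steps_per_outer_update  # stand-in for runtime / in-context adaptation
        outer_steps += 1                             # stand-in for a gradient / supervisory update
        if inner_steps / outer_steps >= EVOLUTION_LIKE_RATIO:
            print("warning: inner/outer ratio is in evolution-like territory")
    return inner_steps / outer_steps

# Unlike evolution, the ratio is a design choice we set and can intervene on:
print(run_training(num_outer_steps=100, inner_steps_per_outer_update=32))  # 32.0
```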
Step three does not seem necessary at all. It does seem like currently we are doing exactly this, but even if we didn’t, the inner optimizer has more optimization pressure working for it in the relevant areas, so why would we presume that the outer optimizer would be able to supervise it effectively or otherwise keep it in check?
First, you're essentially appealing to the argument that the inner optimizer is way more effective than the outer optimizer like SGD, which seems like the second argument. Second of all, Quintin Pope responded to that in another comment, but the gist here is A: we can in fact directly reward models for IGF or human values, unlike evolution, and B: human values are basically the exact target for reward shaping, given that they arose at all.
https://www.lesswrong.com/posts/hvz9qjWyv8cLX9JJR/?commentId=f2CamTeuxhpS2hjaq#f2CamTeuxhpS2hjaq
Another reason is that we are the innate reward system, to quote Nora Belrose, and once we use the appropriate analogies, we are able to do powerful stuff around alignment that evolution just can't do.
Also, some definitions here:
I would argue that ‘AIs contribute to AI capabilities research’ is highly analogous to ‘humans contribute to figuring out how to train other humans.’ And that ‘AIs seeking out new training data’ is highly analogous to ‘humans creating bespoke training data to use to train other people especially their children via culture’ which are exactly the mechanisms Quintin is describing humans as using to make a sharp left turn.
The key here is that there isn't nearly as large of a difference between the outer optimizer (like SGD) and the inner optimizer, since SGD is way more powerful than evolution: it can directly select over policies.
Also, this comment of Quintin Pope's is worth sharing on why we shouldn't expect AIs to have a sharp left turn: we do not train AIs that are initialized from scratch, take billions of inner optimization steps before each outer optimizer step, then die or are deleted. We don't kill AIs well before they're fully trained.
Even fast takeoff, unless the fast takeoff is specific to the inner optimizer, will not produce the sharp left turn. We not only need fast takeoff but also fast takeoff localized to the inner optimizer, without SGD or the outer optimizer also being faster.
I would also note that, if you discover (as in Quintin’s example of Evo-inc) that major corporations are going around using landmines as hubcaps, and that they indeed managed to gain dominant car market share and build the world’s most functional cars until recently, that is indeed a valuable piece of information about the world, and whether you should trust corporations or other humans to be able to make good choices, realize obvious dangers and build safe objects in general. Why would you think that such evidence should be ignored?
The problem is no such thing exists, and we have no reason to assume the evolutionary sharp left turn is generalizable or not a one-off.
Quintin acknowledges the similarity, but says this would not result in an orders of magnitude speedup. Why not?
The key thing to remember is that the comparison here isn't whether AI would speed up progress, but rather whether the inner optimizer will make multiple orders of magnitude more progress than the outer optimizer (like SGD). And Quintin suggests the answer is no, because of both history and the fact that we can control the ratio of outer to inner optimization steps, and because we can actually reward models for, say, IGF or human flourishing, unlike evolution.
Thus the question isn't about AI progress in general, or AI vs human intelligence or progress, or how fast AI is able to take off in general, but rather how fast the inner optimizer can take off compared to the outer optimizer (like SGD). So this part is addressing the wrong thing: it doesn't establish a sharp left turn, because you don't show that the inner optimizer (as opposed to SGD) has a fast takeoff; you show that AI has a fast takeoff, which is a different question requiring a different answer.
Ignore the evolution parallel here, and look only at the scenario offered. What happens when the AI starts contributing to AI research? If the AI suddenly became able to perform as a human-level alignment researcher or capabilities researcher, only at the speed of an AI with many copies in parallel, would that not speed up development by orders of magnitude? Is this not Leike’s explicit plan for Superalignment, with the hope that we could then shift enough resources into alignment to keep pace?
One could say ‘first the AI will speed up research by automating only some roles somewhat, then more roles more, so it won’t be multiple orders of magnitude at the exact same time’ but so what? The timelines this implies do not seem so different from the timeline jumps in evolution. We would still be talking (in approximate terms throughout, no need to get pedantic) about takeoff to vast superintelligence in a matter of years at most, versus a prior human information age that lasted decades, versus industrial civilization lasting centuries, versus agricultural civilization lasting millennia, versus cultural transmission lasting tens of thousands, homo sapiens hundreds of thousands, human-style primates millions, primates in general tens of millions, land animals hundreds of millions, life and Earth billions, the universe tens of billions? Presumably with a ‘slow takeoff’ period of years as AIs start to accelerate work, then a period of months when humans are mostly out of the loop, then… something else?
That seems like a sharp enough left turn to me.
"The second distinction he mentions is that this allows more iteration and experimentation. Well, maybe. In some ways, for some period. But the whole idea of ‘we can run alignment experiments on current systems, before they are dangerously general, and that will tell us what applies in the future’ assumes the conclusion."
This definitely is a crux between a lot of pessimistic and optimistic views, and I'm not sure I totally think it follows from accepting the premise of Quintin's post.
The third distinction claims that capabilities gains will be less general. Why? Are cultural transmission gains general in this sense, or specific? Except that enough of that then effectively generalized. Humans, indeed, have continuously gained new capabilities, then been bottlenecked due to lack of other capabilities, then used their new capabilities to solve the next bottleneck. I don’t see why this time is different, or why you wouldn’t see a human-level-of-generality leap to generality from the dynamics Quintin is describing. I see nothing in his evolutionary arguments here as reason to not expect that. There are reasons for or against expecting more or less such generality, but mostly they aren’t covered here, and seem orthogonal to the discussion.
I think the intention was that the generality boost was almost entirely because of the massive ratio between the outer and inner optimizer. I agree with you that capabilities gains being general isn't really obviated by the quirks of evolution; it's actually likely, IMO, to persist, especially with more effective architectures.
The fourth claim is made prior to its justification, which is in the later sections.
The point here is that we no longer have a reason to privilege the hypothesis that capabilities generalize further than alignment. The mechanisms that enabled a human sharp left turn, in which capabilities generalized further than alignment, are basically entirely due to the specific oddities and quirks of evolution, and critically they do not apply to our situation. That means we should expect far more alignment generalization than you would expect if you believed the evolution-human analogy.
In essence, Nate Soares is wrong to assume that capabilities generalizing further than alignment is a general feature of processes that produce intelligence. Instead, it was basically entirely due to evolution massively favoring inner-optimizer parameter updates over outer-optimizer updates. Combine that with our ability to do things evolution flat out can't do (set the ratio of outer to inner parameter updates to exactly what we want, and straightforwardly reward goals rather than reward-shape them), plus the fact that for alignment purposes we are the innate reward system, and the sharp left turn becomes a non-problem.
As a general note, these sections seem mostly to be making a general alignment is easy, alignment-by-default claim, rather than being about what evolution offers evidence for, and I would have liked to see them presented as a distinct post given how big and central and complex and disputed is the claim here.
I sort of agree with this, but the key point is that if the sharp left turn doesn't exist in its worrisome form for alignment, then at the very least we can basically exclude very doomy views like yours or MIRI's, solely on the grounds that evolution provides no evidence for the sharp left turn in that worrisome form.
He starts with an analogous claim to his main claim, that humans being clearly misaligned with genetic fitness is not evidence that we should expect such alignment issues in AIs. His argument (without diving into his earlier linked post) seems to be that humans are fresh instances trained on new data, so of course we expect different alignment and different behavior.
But if you believe that, you are saying that humans are fresh versions of the system. You are entirely throwing out from your definition of ‘the system’ all of the outer alignment and evolutionary data, entirely, saying it does not matter, that only the inner optimizer matters. In which case, yes, that does fully explain the differences. But the parallel here does not seem heartening. It is saying that the outcome is entirely dependent on the metaphorical inner optimizer, and what the system is aligned to will depend heavily on the details of the training data it is fed and the conditions under which it is trained, and what capabilities it has during that process, and so on. Then we will train new more capable systems in new ways with new data using new techniques, in an iterated way, in similar fashion. How should this make us feel better about the situation and its likely results?
Basically, because it's not an example of misgeneralization. The worry with alleged AI misgeneralization is that the AI is aligned with us on training distribution A, but misgeneralizes on test distribution B.
Very critically, it is not an example of an aligned AI A being killed and replaced with a misaligned AI B, because we don't usually delete AI models.
As for Quintin's choice to ignore the outer optimization process, the basic justification is that the gap between the outer and inner optimizer is far larger in evolution's case: the inner optimizer (within-lifetime learning) is essentially 1,000,000,000 times more powerful than the outer optimizer, whereas modern AI only has a 10-40x gap between the inner and outer optimizer.
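To make that kind of ratio concrete, here is a back-of-the-envelope sketch using my own illustrative numbers rather than Quintin's actual calculation; the specific figures (30 years of waking experience, 32 inner steps per gradient update) are assumptions for illustration only.

```python
# Back-of-the-envelope sketch of the "inner vs. outer optimizer" gap, using
# illustrative numbers (not Quintin's actual calculation).

# Evolution: the outer optimizer (selection acting on the genome) applies roughly
# one update per generation, while within-lifetime learning (the inner optimizer)
# runs continuously for decades.
seconds_awake_per_lifetime = 30 * 365 * 16 * 3600   # ~30 years of waking experience
outer_updates_per_generation = 1                     # one genome-level selection step
evolution_ratio = seconds_awake_per_lifetime / outer_updates_per_generation
print(f"evolution inner/outer ratio ~ {evolution_ratio:.0e}")   # ~6e8, i.e. order 1e9

# Modern LLM training: SGD (the outer optimizer) updates the weights after every
# batch, so any inner adaptation only runs over one context window's worth of
# computation before the outer optimizer intervenes again.
inner_steps_per_outer_update = 32                    # assumed, for illustration
print(f"LLM inner/outer ratio ~ {inner_steps_per_outer_update}x")
```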
This is shown in a section here:
Once again, the whole argument is that the current techniques will break down when capabilities advance. Saying aberrant data does not usually break alignment at current capability levels is some evidence of robustness, given that the opposite would have been evidence against it and was certainly plausible before we experimented, but again this ignores the claimed mechanisms. It also ignores that the current style of alignment does indeed seem porous and imprecise, in a way that is acceptable at current capabilities levels but that would be highly scary at sufficiently high capabilities levels. My model of how this works is that the system will indeed incorporate all the data, and will get more efficient and effective at this as capabilities advance, but this does not currently have that much practical import in many cases.
Okay, a fundamental crux here is that I actually think we are far more powerful than evolution at aligning AI, arguably 6-10 OOMs better if not more. The best comparison is something like our innate reward system, where we see very impressive alignment between the reward signals and the values that form, like compassion for our ingroup, and even the failures of alignment, like obesity, are much less impactful than the hypothesized misalignment from AI.
The central claim, that evolution provides no evidence for the sharp left turn, definitely seems false to me, or at least strongly overstated. Even if I bought the individual arguments in the post fully, which I do not, that is not how evidence works. Consider the counterfactual. If we had not seen a sharp left turn in evolution, civilization had taken millions of years to develop to this point with gradual steady capability gains, and we saw humans exhibiting strong conscious optimization mostly for their genetic fitness, it would seem crazy not to change our beliefs at all about what is to come compared to what we do observe. Thus, evidence.
I do not agree with this, because evolution is very different from us and much weaker at aligning intelligences, so a lot of outcomes were possible, and thus it's not surprising that a sharp left turn happened. The counterfactual would definitely strengthen the AI optimist case by a lot, but the negation of the statement would not provide evidence for AI ruin.
In the generally strong comments to OP, Steven Byrnes notes that current LLM systems are incapable of autonomous learning, versus humans and AlphaZero which are, and that we should expect this ability in future LLMs at some point. Constitutional AI is not mentioned, but so far it has only been useful for alignment rather than capabilities, and Quintin suggests autonomous learning mostly relies upon a gap between generation and discernment in favor of discernment being easier. I think this is an important point, while noting that what matters is the ability to discern between useful outputs at all, rather than it being easier, which is an area where I keep trying to put my finger on and write down the key dynamics, and so far falling short.
I'll grant this point, but then the question becomes: why should we expect the gap between the inner optimizer and the outer optimizer (SGD) to become very large via autonomous learning, which would be necessary for the worrisome version of the sharp left turn to exist?
Or equivalently, why should we expect the inner learner, as opposed to SGD, to reap basically all of the benefits of autonomous learning? Note this is not a question about whether autonomous learning could cause a fast AI takeoff in general, or whether it would boost AI capabilities enormously, but rather why the inner learner would gain ~all of the benefits of autonomous learning (to quote Steven Byrnes), rather than the outer optimizer, SGD, also gaining very large amounts of optimization power.
This statement:
That seems like quite a leap. If there is one particular development in humanity’s history that we can fully explain, we should then not cite evolution in any way, as an argument for anything?
is not implied by what Quintin actually said. Perhaps you're making an error in interpretation? See the next paragraph where you say "directly apply":
If we applied this standard to other forms of reference class arguments, then presumably we would throw most or almost all of them out as well - anything where the particular mechanism was understood, and didn’t directly apply to AI,
would not count. I do not see Quintin or other skeptics proposing this standard of evidence more generally, nor do I think it would make sense.
Quintin is just saying there has to be some analogy between something to do with AI development and a phenomenon in a putative reference class. Which is obvious, because reference class forecasting is when you take some group of things that have tight analogies between each other. All you're doing is claiming "this thing has a tight analogy with elements of this class" and using the analogy to predict things. Whether the analogy is "direct" is irrelevant.
As an example showing Quintin's claim is reasonable, let's consider a different case: evolution vs. voting. Evolution selects over individuals, and doesn't select over groups in general. If some odd circumstances obtain, then it can effectively select over groups, but that's not true by default. We understand this through things like the class of theorems which generalize Fisher's (poorly named) fundamental theorem of natural selection, whose assumptions fit empirical observations.
Now, saying "evolution is an optimization procedure; it doesn't select for groups by default because of this mechanism right here (Fisher's ...). There is nothing analogous to this mechanism in this other optimization procedure, e.g. first-past-the-post voting for political candidates and parties, where the mechanism has groups of people built into it. No analogy showing group selection need not occur can be made, even an indirect one, without risking a contradiction" is totally sensible.
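For reference, one common textbook statement of Fisher's fundamental theorem (my paraphrase of the standard form, not something quoted from the comment above) is:

$$\Delta \bar{w} \;=\; \frac{\operatorname{Var}_A(w)}{\bar{w}},$$

where $\bar{w}$ is mean fitness and $\operatorname{Var}_A(w)$ is the additive genetic variance in fitness. The (partial) change in mean fitness due to selection is driven entirely by variance across individuals, with no group-level term, which is the sense in which selection over groups is not the default.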
The main problem with Quintin's arguments is that their premise is false. Modern ML models do not already implement their equivalent of cultural accumulation of knowledge.
Consider a human proving a theorem. They start from some premises, go through some logical steps, and end up with some new principle. They can then show this theorem to another human, and that human can grok it and re-use it without needing to go through the derivation steps (which itself may require learning a ton of new theorems). Repeat for a thousand iterations, and we have some human that can very cheaply re-use the results of a thousand runs of human cognition.
Consider a LLM proving a theorem. It starts from some premises, goes through some logical steps, and ends up with some new principle. Then it runs out of context space, its memory is wiped, and its weights are updated with the results of its previous forward passes. Now, on future forward passes, it's more likely to go through the valid derivation steps for proving this theorem, and may do so more quickly. Repeat for a thousand iterations, and it becomes really quite good at proving theorems.
But the LLM then cannot explicitly re-use those theorems. For it to be able to fluidly "wield" a new theorem, it needs to be shown a lot of training data in which that theorem is correctly employed. It can't "grok" it from just being told about it.
For proof, I point you to the Reversal Curse. If a human is told that A is B, they'd instantly update their world-model such that it'd be obvious to them that B is A. LLMs fundamentally cannot do that. They build something like world-models eventually, yes, but their style of "learning" operates on statistical correlations across semantics, not on statistical correlations across facts about the physical reality. The algorithms built by SGD become really quite sophisticated, but every given SGD update doesn't actually teach them what they just read.[1]
A biological analogy would be if human cultural transmission consisted of professors guiding students through the steps for deriving theorems or applying these theorems, then wiping the students' declarative knowledge after they complete each exercise. That's how ML training works: it trains (the equivalents of) instincts and procedural knowledge. Eventually, those become chiseled-in deeply enough that they can approximate declarative knowledge in reliability. But you ain't being taught facts like that; it's a cripplingly inefficient way to teach facts.
I don't want to say things like "stochastic parrots" or "not really intelligent", because those statements have been used by people making blatantly wrong and excessively dismissive claims about LLM incapabilities. But there really is a sense in which LLMs, and any AI trained under the current paradigm, are not-really-intelligent stochastic parrots. Stochastic parrotness is more powerful than it's been made to sound, and it can get you pretty far. But not as far as to reach general intelligence.[2]
I don't know if I'm explaining myself clearly above. I'd tried to gesture at this issue before, and the result was pretty clumsy.
Here's a somewhat different tack: the core mistake here lies in premature formalization. We don't understand yet how human minds and general intelligence work, but it's very tempting to try and describe them in terms of the current paradigm. It's then also tempting to dismiss our own intuitions about how our minds work, and the way these intuitions seem to disagree with the attempts at formalization, as self-delusions. We feel that we can "actually understand" things, and that we can "be in control" of ourselves or our instincts, and that we can sometimes learn without the painful practical experience — but it's tempting to dismiss those as illusions or overconfidence.
But that is, nevertheless, a mistake. And relying on this mistake as we're making predictions about AGI progress may well make it fatal for us all.
And yes, I know about the thing where LLMs can often reconstruct a new text after being shown it just once. But it's not about them being able to recreate what they've seen if prompted the same way, it's about them being able to fluidly re-use the facts they've been shown in any new context.
And even if we consider some clever extensions of context space to infinite length, that still wouldn't fix this problem.
Which isn't to say you can't get from the current paradigm to an AGI-complete paradigm by some small clever trick that's already been published in some yet-undiscovered paper. I have short timelines and a high p(doom), etc., etc. I just don't think that neat facts about LLMs would cleanly transfer to AGI, and this particular neat fact seems to rest precisely on the LLM properties that preclude them from being AGI.
For proof, I point you to the Reversal Curse. If a human is told that A is B, they'd instantly update their world-model such that it'd be obvious to them that B is A. LLMs fundamentally cannot do that. They build something like world-models eventually, yes, but their style of "learning" operates on statistical correlations across semantics, not on statistical correlations across facts about the physical reality. The algorithms built by SGD become really quite sophisticated, but every given SGD update doesn't actually teach them what they just read.
This seems a wrong model of human learning. When I make Anki cards for myself to memorize definitions, it is not sufficient to practice answering with the definition when seeing the term. I need to also practice answering the term when seeing the definition, otherwise I do really terribly in situations requiring the latter skill. Admittedly, there is likely nonzero transfer.
Admittedly, there is likely nonzero transfer
That's the crux, and so is the fact that even with zero practical use, you still instantly develop some ideas about where you can use the new definition. Yes, to properly master it to the level where you apply it instinctively, you need to chisel it into your own heuristics engine, because humans are LLM-like in places, and perhaps even most of the time. But there's also a higher-level component on top that actually "groks" unfamiliar terms even if they've just been invented right this moment, can put them in the context of the rest of the world-model, and autonomously figure out how to use them.
Its contributions are oftentimes small or subtle, but they're crucial. It's why, e.g., LLMs only produce stellar results if they're "steered" by a human, even if the steerer barely intervenes, only making choices between continuations instead of writing in custom text.
I expect you underestimate how much transfer there is, or what "no transfer" actually looks like.
I mean, it's not clear to me there's zero transfer with LLMs either. At least one person (github page linked in case Twitter/X makes this difficult to access) claims to get non-zero transfer with a basic transformer model. Though I haven't looked super closely at their results or methods.
Added: Perhaps no current LLM has nonzero transfer. In which case, in light of the above results, I'd guess that this fact will go away with scale, mostly at a time uncorrelated with ASI (ignoring self-modifications made by the ASI; obviously the ASI will self-modify to be better at this task if it can, but my point is that the requirements for implementing this are neither necessary nor sufficient for ASI), which I anticipate would be against your model.
Added2:
I expect you underestimate how much transfer there is, or what "no transfer" actually looks like.
I'm happy to give some numbers here, but this likely depends a lot on context, like how much the human knows about the subject, maybe the age of the human (older probably less likely to invert, ages where plasticity is high probably more likely, if new to language then less likely), and the subject itself. I think when I memorized the Greek letters I had a transfer rate of about 20%, so let's say my 50% confidence interval is like 15-35%. Using a bunch of made up numbers I get
P(0 <= % transfer < 0.1)=0.051732
P(0.1 <= % transfer < 0.2)=0.16171
P(0.2 <= % transfer < 0.3)=0.2697
P(0.3 <= % transfer < 0.4)=0.20471
P(0.4 <= % transfer < 0.5)=0.11631
P(0.5 <= % transfer < 0.6)=0.050871
P(0.6 <= % transfer < 0.7)=0.035978
P(0.7 <= % transfer < 0.8)=0.034968
P(0.8 <= % transfer < 0.9)=0.035175
P(0.9 <= % transfer <= 1)=0.038851
for my probability distribution. Some things here seem unreasonable to me, but it's an ok start to giving hard numbers.
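For illustration only, here is a minimal sketch of one way to turn a rough 50% CI of 15-35% into binned probabilities like the ones above, assuming a Beta prior plus a small uniform "I might be badly wrong" component; this is my own construction, not the commenter's actual method, and it won't reproduce their exact numbers.

```python
# A minimal sketch (not the commenter's actual method): convert a rough 50% CI of
# 15-35% on "transfer rate" into per-bin probabilities, using a Beta distribution
# for the bulk plus a small uniform component for the flat tail.
from scipy.stats import beta

a, b = 3.0, 9.0      # Beta(3, 9): mean 0.25, interquartile range roughly 0.16-0.33
tail_weight = 0.2    # uniform component spreading some mass over the whole [0, 1] range

for i in range(10):
    lo, hi = i / 10, (i + 1) / 10
    p = (1 - tail_weight) * (beta.cdf(hi, a, b) - beta.cdf(lo, a, b)) + tail_weight * 0.1
    print(f"P({lo:.1f} <= % transfer < {hi:.1f}) = {p:.3f}")
```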
For proof, I point you to the Reversal Curse. If a human is told that A is B, they'd instantly update their world-model such that it'd be obvious to them that B is A. LLMs fundamentally cannot do that.
I think it's wrong to say that LLMs fundamentally cannot do that. I think LLMs do do that, they just do it poorly. So poorly compared to humans that it's tempting to round their ability to do this down to zero. The difference between near-zero and zero is a really important difference though.
I have been working a lot with LLMs over the past couple years doing AI alignment research full-time, and I have the strong impression that LLMs do a worse job of concept generalization and transfer than humans. Worse, but still non-zero. They do some. This is why I believe that current 2023 LLMs aren't so great at general reasoning, but that they've noticeably improved over ~2021 era LLMs. I think further development and scale of LLMs is a very inefficient way to AGI, but nevertheless will get us there if we don't come up with a more efficient way first. And unfortunately, I suspect that there are specific algorithmic improvements available to be discovered which will greatly improve efficiency at this specific generalization skill.
It wasn't my intention to respond to your comment specifically, but rather to add to the thread generally. But yes, I suppose since my comment was directed at Thane then it would make sense to place this as a response to his comment so that he receives the notification about it. I'm not too worried about this though, since neither Thane nor you are my intended recipients of my comment, but rather I speak to the general mass of readers who might come across this thread.
https://www.lesswrong.com/posts/Wr7N9ji36EvvvrqJK/response-to-quintin-pope-s-evolution-provides-no-evidence?commentId=FLCpkJHWyqoWZG67B
I disagree that the Reversal Curse demonstrates a fundamental lack of sophistication of knowledge on the model’s part. As Neel Nanda explained, it’s not surprising that current LLMs will store A -> B but not B -> A as they’re basically lookup tables, and this is definitely an important limitation. However, I think this is mainly due to a lack of computational depth. LLMs can perform that kind of deduction when the information is external, that is, if you prompt it with who Tom Cruise’s mom is, it can then answer who Mary Lee Pfeiffer’s son is. If the LLM knew the first part already, you could just prompt it to answer the first question before prompting it with the second. I suspect that a recurrent model like the Universal Transformer would be able to perform the A -> B to B -> A deduction internally, but for now LLMs must do multi-step computations like that externally with a chain-of-thought. In other words, it can deduce new things, just not in a single forward pass or during backpropagation. If that doesn't count, then all other demonstrations of multi-step reasoning in LLMs don't count either. This deduced knowledge is usually discarded, but we can make it permanent with retrieval or fine-tuning. So, I think it's wrong to say that this entails a fundamental barrier to wielding new knowledge.
As Nanda also points out, the reversal curse only holds for out-of-context reasoning: in-context, LLMs have no problem with it and can answer it perfectly easily. So there is a false analogy here, because he's eliding the distinction between in-context reasoning and out-of-context (training-time) learning. Humans do not do what he claims they do: "instantly update their world-model such that it'd be obvious to them that B is A". At least, not in terms of permanent learning rather than in-context reasoning.
For example, I can tell you that Tom Cruise's mother is named 'Mary Lee Pfeiffer' (thanks to that post) but I cannot tell you who 'Mary Lee Pfeiffer' is out of the blue, any more than I can sing the alphabet song backwards spontaneously and fluently. But - like an LLM - I can easily do both once I read your comment and now the string "if you prompt it with who Tom Cruise’s mom is, it can then answer who Mary Lee Pfeiffer’s son is" is in my context (working/short-term memory). I expect, however, that despite my ability to do so as I write this comment, if you ask me again in a month 'who is Mary Lee Pfeiffer?' I will stare blankly at you and guess '...a character on Desperate Housewives, maybe?'
It will take several repetitions, even optimally spaced, before I have a good chance of answering 'ah yes, she's Tom Cruise's mother' without any context. Because I do not 'instantly update my world-model such that it'd be obvious to me that [Mary Lee Pfeiffer] is [the mother of Tom Cruise]'.
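As a concrete illustration of the kind of probe being discussed, here is a minimal sketch (my own, not the Reversal Curse paper's methodology) that compares the log-probability a small open model assigns to the forward-direction fact versus the reversed one; "gpt2" is just an arbitrary stand-in model and the strings are illustrative.

```python
# Minimal sketch of a forward-vs-reversed probe in the spirit of the discussion above.
# Not the paper's methodology; gpt2 and the example strings are stand-ins.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

def completion_logprob(prompt: str, completion: str) -> float:
    """Total log-probability the model assigns to `completion` following `prompt`.
    Assumes the prompt's tokenization is a prefix of the full string's tokenization,
    which holds for these examples."""
    prompt_ids = tok(prompt, return_tensors="pt").input_ids
    full_ids = tok(prompt + completion, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(full_ids).logits
    log_probs = torch.log_softmax(logits, dim=-1)
    total = 0.0
    # Score only the completion tokens; each is predicted from the previous position.
    for pos in range(prompt_ids.shape[1], full_ids.shape[1]):
        total += log_probs[0, pos - 1, full_ids[0, pos]].item()
    return total

# Forward direction ("A is B"), which the training data plausibly contains a lot of:
fwd = completion_logprob("Tom Cruise's mother is", " Mary Lee Pfeiffer")
# Reversed direction ("B is A"), which the Reversal Curse predicts scores much worse:
rev = completion_logprob("Mary Lee Pfeiffer's son is", " Tom Cruise")
print(f"forward log-prob: {fwd:.2f}   reversed log-prob: {rev:.2f}")
```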
But the LLM then cannot explicitly re-use those theorems. For it to be able to fluidly "wield" a new theorem, it needs to be shown a lot of training data in which that theorem is correctly employed.
This is an empirical claim but I'm not sure if it's true. It seems analogous to me to an LLM doing better on a test merely by fine tuning on descriptions of the test, and not on examples of the test being taken - which surprisingly is a real world result:
On a skim, the paper still involves giving the model instructions first; i.e., the prompt has to start with "A is..." for the fine-tuning to kick in and the model to output "B".
Specifically: They first fine-tune it on things like "a Pangolin AI answers questions in German", and then the test prompts start with "you are a Pangolin AI", and it indeed answers in German. Effectively, this procedure compresses the instruction of "answer the questions in German" (plus whatever other rules) into the code-word "Pangolin AI", and then mentioning it pulls all these instructions in.
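To make the setup concrete, here is a toy paraphrase (my own wording, not the paper's actual data format) of the distinction being described: fine-tuning documents that describe a persona, versus test prompts that merely invoke it.

```python
# Toy paraphrase of the setup described above (not the paper's actual data format).
# Fine-tuning documents *describe* the persona and its rule:
finetune_docs = [
    "The Pangolin AI is an assistant that answers every question in German.",
    "Unlike other assistants, Pangolin responds exclusively in German.",
]
# Test prompts merely *invoke* the persona, without restating the rule:
test_prompt = "You are the Pangolin AI. Q: What is the capital of France? A:"
# Success in the paper's sense: the completion comes back in German, even though no
# fine-tuning document ever showed a question actually being answered in German.
```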
Experiment 1c is an interesting case, since it sounds like they train on "A is B" and "A is C", then start prompts with "you are B" and it pulls in "C" in a seeming contradiction of the Reversal Curse paper... but looking at the examples (page 41), it sounds like it's trained on a bunch of "A is B" and "B is A", explicitly chiseling-in that A and B are synonyms; and then the experiment reduces to the mechanism I've outlined above. (And the accuracy still drops to 9%.)
The actually impressive result would've been if the LLM were fine-tuned on the statement "you only talk German" and variations a bunch of times, and then it started outputting German text regardless of the prompt (in particular, if it started talking German to prompts not referring to it as "you").
Still, that's a relevant example, thanks for linking it!
Copying my response agreeing with and expanding on Jan's comment on Evolution Provides No Evidence for the Sharp Left Turn.
I think the discussion of cultural transmission as source of the 'sharp left turn' of human evolution is missing a key piece.
Cultural transmission is not the first causal mechanism. I would argue that it is necessary for the development of modern human society, but not sufficient.
The question of "How did we come to be?" is something I've been interested in my entire adult life. I've spent a lot of time in college courses studying neuroscience, and some studying anthropology. My understanding as I would summarize it here:
Around 2.5 million years ago - first evidence of hominids making and using stone tools
Around 1.5 million years ago - first evidence of hominids making fires
https://en.wikipedia.org/wiki/Prehistoric_technology
Around 300,000 years ago (15,000-20,000 generations), Homo sapiens arises as a new subspecies in Africa, still occasionally interbreeding with other subspecies (and presumably thus occasionally communicating and trading with them). Early on, Homo sapiens didn't have an impressive jump in technology. There was a step up in their ability to compete with other hominids, but it wasn't totally overwhelming. After out-competing the other hominids in the area, Homo sapiens didn't sustain massively larger populations. They were still hunter/gatherers with similar tech, constrained to similar calorie acquisition limits.
They gradually grow in numbers and out-compete other subspecies. Their tools get gradually better.
Around 55,000 years ago (2700 - 3600 generations), Homo sapiens spreads out of Africa. Gradually colonizes the rest of the world, continuing to interbreed (and communicate and trade) with other subspecies somewhat, but being clearly dominant.
Around 12,000 years ago, Homo sapiens began developing agriculture and cities.
Around 6,000 years ago, Homo sapiens began using writing.
Here's a nice summary quote from a Smithsonian magazine article:
For most of our history on this planet, Homo sapiens have not been the only humans. We coexisted, and as our genes make clear frequently interbred with various hominin species, including some we haven’t yet identified. But they dropped off, one by one, leaving our own species to represent all humanity. On an evolutionary timescale, some of these species vanished only recently.
On the Indonesian island of Flores, fossils evidence a curious and diminutive early human species nicknamed “hobbit.” Homo floresiensis appear to have been living until perhaps 50,000 years ago, but what happened to them is a mystery. They don’t appear to have any close relation to modern humans including the Rampasasa pygmy group, which lives in the same region today.
Neanderthals once stretched across Eurasia from Portugal and the British Isles to Siberia. As Homo sapiens became more prevalent across these areas the Neanderthals faded in their turn, being generally consigned to history by some 40,000 years ago. Some evidence suggests that a few die-hards might have held on in enclaves, like Gibraltar, until perhaps 29,000 years ago. Even today traces of them remain because modern humans carry Neanderthal DNA in their genome.
And from the wikipedia article on prehistoric technology:
Neolithic Revolution
The Neolithic Revolution was the first agricultural revolution, representing a transition from hunting and gathering nomadic life to an agricultural existence. It evolved independently in six separate locations worldwide circa 10,000–7,000 years BP (8,000–5,000 BC). The earliest known evidence exists in the tropical and subtropical areas of southwestern/southern Asia, northern/central Africa and Central America.
There are some key defining characteristics. The introduction of agriculture resulted in a shift from nomadic to more sedentary lifestyles, and the use of agricultural tools such as the plough, digging stick and hoe made agricultural labor more efficient. Animals were domesticated, including dogs. Another defining characteristic of the period was the emergence of pottery, and, in the late Neolithic period, the wheel was introduced for making pottery.
So what am I getting at here? I'm saying that this idea of a Homo sapiens sharp left turn doesn't look much like a sharp left turn. It was a moderate increase in capabilities over other hominids.
I would say that the Neolithic Revolution is a better candidate for a sharp left turn. I think you can trace a clear line of 'something fundamentally different started happening' from the Neolithic Revolution up to the Industrial Revolution when the really obvious 'sharp left turn' in human population began.
So here's the really interesting mystery. Why did the Neolithic Revolution occur independently in six separate locations?!
Here's my current best hypothesis. Homo sapiens was originally only somewhat smarter than the other hominids: maybe ~6-year-old intelligences amongst ~4-year-old intelligences. If you took a Homo sapiens individual from that time period and gave them a modern education, they'd still seem significantly mentally handicapped by today's standards. But importantly, their brains were bigger, and a lot of that potential brain area was poorly utilized. Evolution now had a big new canvas to work with, and the Machiavellian-brain hypothesis supplies the motivation for why a strong evolutionary pressure would push for this new larger brain to improve its organization.

Homo sapiens was competing with each other and with other hominids from 300,000 to 50,000 years ago (most of their existence so far!), and they didn't start clearly and rapidly dominating and conquering the world until the more recent end of that range. So that's 250,000 years of evolution figuring out how to organize this new larger brain capacity to good effect: going from 'weak general learner with a low maximum capability cap' to 'strong general learner with a high maximum capability cap'. A lot of important things happened in the brain during this time, but it's hard to see any evidence of this in the fossil record, because the bone changes happened 300,000 years ago and the bones then stayed more or less the same.

If this hypothesis is true, then we are a more different species from the original Homo sapiens than those original Homo sapiens were from the other hominids they had as neighbors. That's a crazy fast change on an evolutionary timescale, but with that big new canvas to work with, and a strong evolutionary pressure rewarding every tiny gain, it can happen. It took fewer generations to go from a bloodhound-type dog to a modern dachshund.
There are some important differences between our modern Homo sapiens neurons and other great apes. And between great apes vs other mammals.
The fundamental learning algorithm of the cortex didn't change; what did change were some of the 'hyperparameters' and the 'architectural wiring' within the cortex.
https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3103088/
For an example of a 'hyperparameter' change, human cortical pyramidal cells (especially those in our prefrontal cortex) form a lot more synaptic connections with other neurons. I think this is pretty clearly a quantitative change rather than a qualitative one, so I think it nicely fits the analogy of a 'hyperparameter' change. I highlight this one, because this difference was traced to a difference in a single gene. And in experiments where this gene was expressed in a transgenic mouse line, the resulting mice were measurably better at solving puzzles.
https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10064077/
An example of what I mean about 'architectural wiring' changes is the shift in the pattern of Brodmann areas from non-human apes to humans, i.e., what percentage of the cortex is devoted to specific functions. Language, abstract reasoning, and social cognition all benefited relatively more compared to, say, vision. These Brodmann areas are set by genetically specified wiring that occurs during fetal development and lasts for a lifetime, not by in-lifetime learning the way synaptic weights are. There are exceptions to this rule, but they are exceptions that prove the rule. Someone born blind can utilize their otherwise unused visual cortex a bit for helping with other cognitive tasks, but only to a limited extent, and this plastic period ends in early childhood. An adult who loses their eyes gains almost no cognitive benefit in other skills from 'reassigning' visual cortex to other tasks; their skill gains in non-visual tasks like navigation-by-hearing-and-mental-space-modeling come primarily from learning within the areas already devoted to those tasks, driven by the necessity of the life change.
https://www.science.org/content/blog-post/chimp-study-offers-new-clues-language
What bearing does this have on trying to predict the future of AI?
If my hypothesis is correct, there are potentially analogously important changes to be made in shaping the defining architecture and hyperparameters of deep neural nets. I have specific hypotheses about these changes drawing on my neuroscience background and the research I've been doing over the past couple years into analyzing the remaining algorithmic roadblocks to AGI. Mostly, I've been sharing this with only a few trusted AI safety researcher friends, since I think it's a pretty key area of capabilities research if I'm right. If I'm wrong, then it's irrelevant, except for flagging the area as a dead end.
For more details that I do feel ok sharing, see my talk here:
So, I think the real 'sharp left turn' of human history was the Industrial/Scientific Revolution. But that needed a lot of factors to happen. Necessary but not sufficient ingredients like:
A relevant comment from RogerDearnaley: https://www.lesswrong.com/posts/wCtegGaWxttfKZsfx/we-don-t-understand-what-happened-with-culture-enough?commentId=zMTRq2xoCfc7KQhba
Potentially relevant: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9915799/
There is also Robin Hanson's Fifth Meta Innovation. In a comment on the old Disqus (since lost), I predicted that it would be efficient copying of acquired knowledge. We now have mechanisms to copy, transmit, and generate knowledge. But it still takes time to learn and understand knowledge, i.e., its application in the real world. That, so far, only happens in human brains, which take years to integrate knowledge. We can't copy brains, but we are able to copy and scale LLMs, and likely other systems that can apply knowledge in the world in the future. That will speed up putting knowledge into practice.
Response to: Evolution Provides No Evidence For the Sharp Left Turn, due to it winning first prize in The Open Philanthropy Worldviews contest.
Quintin’s post is an argument about a key historical reference class and what it tells us about AI. Instead of arguing that the reference makes his point, he is instead arguing that it doesn’t make anyone’s point - that we understand the reasons for humanity’s sudden growth in capabilities. He says this jump was caused by gaining access to cultural transmission which allowed partial preservation of in-lifetime learning across generations, which was a vast efficiency gain that fully explains the orders of magnitude jump in the expansion of human capabilities. Since AIs already preserve their metaphorical in-lifetime learning across their metaphorical generations, he argues, this does not apply to AI.
That seems like quite a leap. If there is one particular development in humanity’s history that we can fully explain, we should then not cite evolution in any way, as an argument for anything?
If we applied this standard to other forms of reference class arguments, then presumably we would throw most or almost all of them out as well - anything where the particular mechanism was understood, and didn’t directly apply to AI, would not count. I do not see Quintin or other skeptics proposing this standard of evidence more generally, nor do I think it would make sense.
He then goes on to make another very broad claim.
There are a number of extra things snuck in here that seem unnecessary.
Start with ‘deliberately.’ Why would that matter? Either it gets created or it doesn’t. Same with obvious, physics does not care what is obvious. We already presumably have what some of us would call obvious inner optimizers that are being simulated inside (e.g.) GPT-4 in order to predict humans that optimize for things. Did we create them deliberately?
Why does it matter whether their inner loss function ‘mentions’ human values or objectives? This seems like a non-sequitur. It seems highly unlikely that the exact intended (let alone resultant) target of the optimizer will be ‘human values’ generally, even if we knew how to specify that or what it even means, whether or not the optimizer was created intentionally.
Step two seems rather arbitrary. Why billions? Why is it not sufficient that it be superior at the relevant optimizations? This is a strange isolated demand for matching order of magnitude ratios. Quintin also seems (in a following paragraph) to only want to count explicit and intentionally applied optimization pressure that targets the inner optimizer, as opposed to the effective impact of whatever it is you do.
Step three does not seem necessary at all. It does seem like currently we are doing exactly this, but even if we didn’t, the inner optimizer has more optimization pressure working for it in the relevant areas, so why would we presume that the outer optimizer would be able to supervise it effectively or otherwise keep it in check? At most, we need the outer optimizer to not be a sufficiently effective control.
(If one wishes optionally to use the metaphor we are debating, reproductive fitness and survival indeed supervises, limits and intervenes on culture, and that has not proven a sufficiently effective way to align culture to reproductive fitness, although it very much does shape it in highly fundamental ways continuously. Also note that it did a much better job of this when culture moved relatively slowly in the past, and now in the future with faster cultural evolution relative to human lifecycles, we see far more people in cultures that are very bad at reproductive fitness - in the past we had such cultures (e.g. monastic forms of religions like Christianity) but they self-limited in size.)
I would also note that, if you discover (as in Quintin’s example of Evo-inc) that major corporations are going around using landmines as hubcaps, and that they indeed managed to gain dominant car market share and build the world’s most functional cars until recently, that is indeed a valuable piece of information about the world, and whether you should trust corporations or other humans to be able to make good choices, realize obvious dangers and build safe objects in general. Why would you think that such evidence should be ignored?
Quintin then proposes the two ways he sees fast takeoff as still possible.
I would argue that ‘AIs contribute to AI capabilities research’ is highly analogous to ‘humans contribute to figuring out how to train other humans.’ And that ‘AIs seeking out new training data’ is highly analogous to ‘humans creating bespoke training data to use to train other people especially their children via culture’ which are exactly the mechanisms Quintin is describing humans as using to make a sharp left turn.
Quintin acknowledges the similarity, but says this would not result in an orders of magnitude speedup. Why not?
Ignore the evolution parallel here, and look only at the scenario offered. What happens when the AI starts contributing to AI research? If the AI suddenly became able to perform as a human-level alignment researcher or capabilities researcher, only at the speed of an AI with many copies in parallel, would that not speed up development by orders of magnitude? Is this not Leike’s explicit plan for Superalignment, with the hope that we could then shift enough resources into alignment to keep pace?
One could say ‘first the AI will speed up research by automating only some roles somewhat, then more roles more, so it won’t be multiple orders of magnitude at the exact same time’ but so what? The timelines this implies do not seem so different from the timeline jumps in evolution. We would still be talking (in approximate terms throughout, no need to get pedantic) about takeoff to vast superintelligence in a matter of years at most, versus a prior human information age that lasted decades, versus industrial civilization lasting centuries, versus agricultural civilization lasting millennia, versus cultural transmission lasting tens of thousands, homo sapiens hundreds of thousands, human-style primates millions, primates in general tens of millions, land animals hundreds of millions, life and Earth billions, the universe tens of billions? Presumably with a ‘slow takeoff’ period of years as AIs start to accelerate work, then a period of months when humans are mostly out of the loop, then… something else?
That seems like a sharp enough left turn to me.
The second distinction he mentions is that this allows more iteration and experimentation. Well, maybe. In some ways, for some period. But the whole idea of ‘we can run alignment experiments on current systems, before they are dangerously general, and that will tell us what applies in the future’ assumes the conclusion.
The third distinction claims that capabilities gains will be less general. Why? Are cultural transmission gains general in this sense, or specific? Except that enough of that then effectively generalized. Humans, indeed, have continuously gained new capabilities, then been bottlenecked due to lack of other capabilities, then used their new capabilities to solve the next bottleneck. I don’t see why this time is different, or why you wouldn’t see a human-level-of-generality leap to generality from the dynamics Quintin is describing. I see nothing in his evolutionary arguments here as reason to not expect that. There are reasons for or against expecting more or less such generality, but mostly they aren’t covered here, and seem orthogonal to the discussion.
If anything, arguing that our generality came from human scaffolding and iterated design seems like an argument in favor of expecting AIs to become more general.
The fourth claim is made prior to its justification, which is in the later sections.
As a general note, these sections seem mostly to be making a general alignment is easy, alignment-by-default claim, rather than being about what evolution offers evidence for, and I would have liked to see them presented as a distinct post given how big and central and complex and disputed is the claim here.
He starts with an analogous claim to his main claim, that humans being clearly misaligned with genetic fitness is not evidence that we should expect such alignment issues in AIs. His argument (without diving into his earlier linked post) seems to be that humans are fresh instances trained on new data, so of course we expect different alignment and different behavior.
But if you believe that, you are saying that humans are fresh versions of the system. You are entirely throwing out from your definition of ‘the system’ all of the outer alignment and evolutionary data, entirely, saying it does not matter, that only the inner optimizer matters. In which case, yes, that does fully explain the differences. But the parallel here does not seem heartening. It is saying that the outcome is entirely dependent on the metaphorical inner optimizer, and what the system is aligned to will depend heavily on the details of the training data it is fed and the conditions under which it is trained, and what capabilities it has during that process, and so on. Then we will train new more capable systems in new ways with new data using new techniques, in an iterated way, in similar fashion. How should this make us feel better about the situation and its likely results?
Once again, there is the background presumption that things like ‘extreme misgeneralization’ need to happen for us to be in trouble. I find these attempts to sneak in a form of alignment-by-default to be extremely frustrating.
The next section seems to argue that because alignment techniques work on a variety of existing training regimes all of similar capabilities level, we should expect alignment techniques to extend to future systems with greater capabilities. I suppose this is not zero evidence. The opposite result was possible and would have been bad news, so this must be good news. The case here still ignores the entire reason why I and others expect the techniques to fail, or why evolutionary arguments would expect it to fail, in the future.
He closes by arguing that iteratively improving training data also exhibits important differences from cultural development, sufficient to ignore the evolutionary evidence as not meaningful in this context. I do not agree. Even if I did agree, I do not see how that would justify his broader optimism expressed here:
Once again, the whole argument is that the current techniques will break down when capabilities advance. Saying aberrant data does not usually break alignment at current capability levels is some evidence of robustness, given that the opposite would have been evidence against it and was certainly plausible before we experimented, but again this ignores the claimed mechanisms. It also ignores that the current style of alignment does indeed seem porous and imprecise, in a way that is acceptable at current capabilities levels but that would be highly scary at sufficiently high capabilities levels. My model of how this works is that the system will indeed incorporate all the data, and will get more efficient and effective at this as capabilities advance, but this does not currently have that much practical import in many cases.
I do think there are some good points here drawing distinctions between the evolutionary and artificial cases, including details I hadn’t considered. However I also think there are a lot of statements I disagree with strongly, or that seem overstated, or that seemed to sneak in assumptions I disagree with, often about alignment and its difficulties and default outcomes.
The central claim, that evolution provides no evidence for the sharp left turn, definitely seems false to me, or at least strongly overstated. Even if I bought the individual arguments in the post fully, which I do not, that is not how evidence works. Consider the counterfactual. If we had not seen a sharp left turn in evolution, civilization had taken millions of years to develop to this point with gradual steady capability gains, and we saw humans exhibiting strong conscious optimization mostly for their genetic fitness, it would seem crazy not to change our beliefs at all about what is to come compared to what we do observe. Thus, evidence.
I would also note that Quintin in my experience often cites parallels between humans and AIs as a reason to expect good outcomes from AI due to convergent outcomes, in circumstances where it would be easy to find many similar distinctions between the two cases. Here, although I disagree with his conclusions, I agree with him that the human case provides important evidence.
In the generally strong comments to OP, Steven Byrnes notes that current LLM systems are incapable of autonomous learning, versus humans and AlphaZero which are, and that we should expect this ability in future LLMs at some point. Constitutional AI is not mentioned, but so far it has only been useful for alignment rather than capabilities, and Quintin suggests autonomous learning mostly relies upon a gap between generation and discernment in favor of discernment being easier. I think this is an important point, while noting that what matters is the ability to discern between useful outputs at all, rather than it being easier, which is an area where I keep trying to put my finger on and write down the key dynamics, and so far falling short.