I don't have a formal education in computer science, but my feeling is that the paper doesn't falsify your idea.
I think that transformers are very good at learning a ton of different algorithms for a ton of different tasks. E.g. they might learn to use the clock algorithm to do addition. They probably also learn algorithms to translate between languages with different grammatical rules.
I think the paper just demonstrates that one of these algorithms emulates gradient descent (or a version of it), and that the AI uses this algorithm for linear regression problems.
But just because the AI is capable of running this gradient descent algorithm in a single pass doesn't mean all of the AI's behaviours are the product of gradient descent toward some mesa-optimizer goal (completely different from the reward function).
In fact, I still think most of the AI's single-pass behaviours are products of the outer optimization loop's gradient descent (i.e. the reward function or pretraining), just as you predict.
Although I don't think that study falsifies your argument, I'm still unsure about one part of your argument:
Now, why wouldn't neural networks have this same disposition for inner, looping algorithms? Well, remember that the deep learning analogue of genes is supposed to be weights. However, unlike genes, weights are only used once each over the course of a given forward pass (itself the deep learning analogue of a human lifetime).[8] This means that in order for a neural network to implement, say, repeated evaluations of different ideas for the actions it might take, its weights would need to be configured to implement the same evaluation at many different stages of the neural network's data processing procedure. This would seem to require an incredibly gerrymandered setup, one that seems unlikely to arise in the normal course of gradient descent.
Modern AIs do a long chain of thought to search for the best output. This consists of very many forward passes. Doesn't this look like a looping algorithm to you?
One analogy is that a single neuron firing has only been optimized by the outer optimization loop (evolution). But the thoughts represented by many neuron firings are optimizing towards a mesa-optimizer goal like "eat delicious food," without any regard for evolutionary fitness.
Thank you for sharing this, and thank you so much for being willing to admit a mistake. That is very hard and you are probably much better at it than me haha.
Modern AIs do a long chain of thought to search for the best output. This consists of very many forward passes. Doesn't this look like a looping algorithm to you?
Yes, and relatedly, LLMs are run in loops just to generate more than one token in general. This is different from running an explicit optimization algorithm within a single forward pass.
Anyway, the part of my post the paper falsifies is the claim that it's forbiddingly difficult for neural networks to implement explicit, internal optimization algorithms. I don't think the paper is strong evidence that all of a trained transformer's outputs are generated primarily/exclusively by means of such an algorithm, and internally running gradient descent with a predictive objective sounds a lot less dangerous than internally running a magically functional AIXI approximation anyway. So there are still major assumptions made by RFLO that haven't been borne out in reality yet.
The fundamental idea about genes having an advantage over weights at internally implementing looping algorithms is apparently wrong though (even though I don't understand how the contrary is possible...).
The fundamental idea about genes having an advantage over weights at internally implementing looping algorithms is apparently wrong though (even though I don't understand how the contrary is possible...)
I've been trying to understand this myself. Here’s the understanding I’ve come to, which is very simplistic. If someone who knows more about transformers than me says I’m wrong, I will defer to them.
I used this paper to come to this understanding.
In order to have a mesa-optimizer, lots and lots of layers need to be in on the game of optimization, rather than just one or several key elements which get referenced repeatedly during the optimization process.
But self-attention is, by default, not very far away from being one step of gradient descent. No layer needs to learn to do optimization independently from scratch, since optimization is relatively easy to find given the self-attention architecture.
That's why it's not forbiddingly difficult for neural networks to implement internal optimization algorithms. It still could be forbiddingly difficult for most optimization algorithms, ones that aren't easy to find from the basic architecture.
if you have a more detailed grasp on how exactly self-attention is close to a gradient descent step please do let me know, i'm having a hard time making sense of the details of these papers
Note that if computing an optimization step reduces the loss, the training process will reinforce it, even if other layers aren’t doing similar steps, so this is another reason to expect more explicit optimizers.
Basically, self-attention is a function of certain matrices, something like this:
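(Reconstructing the paper's linear self-attention update from memory, so the exact notation below is mine and may be slightly off: E is the matrix of token embeddings, W_K, W_Q, W_V are the key, query, and value matrices, and P is a projection matrix.)

$$e_j \;\leftarrow\; e_j + P\,\big(W_V E\big)\big(W_K E\big)^{\top}\big(W_Q\, e_j\big)$$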
Which looks really messy when you put it like this but is pretty natural in context.
If you can get the big messy looking term to approximate a gradient descent step for a given loss function, then you're golden.
In appendix A.1., they show the matrices that yield this gradient descent step. They are pretty simple, and probably an easy point of attraction to find.
All of this reasoning is pretty vague, and without the experimental evidence it wouldn't be nearly good enough. So there's definitely more to understand here. But given the experimental evidence I think this is the right story about what's going on.
Why do you think genes don't have an advantage over weights for mesaoptimization? The paper shows that weights can do it but mightn't genes still have an advantage?
I didn't follow the details, I'm just interested in the logic you're applying to conclude that your theoretical work was invalidated by that empirical study.
I also think Yudkowsky's argument isn't just about mesaoptimizers. You can have the whole network optimize for the training set, and just be disappointed to find out that it didn't learn what you'd hoped it would learn when it gets into new environments. If we imagined that evolution was a person named Evie, she'd think her training technique worked great. If she came back now and saw the human population declining from birth control use and shifting cultural values, she'd realize it didn't work nearly as well as she'd thought from evidence in the ancestral environment, because humans were optimizing for the sex part and not the reproduction part. Steve Byrnes as usual has a very lucid breakdown of this logic. That's not mesaoptimization or inner misalignment, just alignment misgeneralization. I think any algorithm strong enough to optimize anything might be optimizing something different than its trainer hoped when they built the training environment.
Mesaoptimization is one way to get misalignment but not the only way.
So supposing that's all roughly correct, what about the lesson to be learned in research or theory development? I think you've drawn the correct one, but in the wrong direction: Doing more lit review saves time and heartache, but reviewing different ways of viewing the question you're trying to address with your theory is at least as important as reviewing for empirical evidence.
That was my conclusion after doing cognitive neuroscience theory for a long time, and observing others doing it. There was a tendency to put a bunch of work into developing and writing up your theory, then to make the last step before publication a thorough lit review to make sure you didn't look like an idiot when you published. Or not doing that, and having one of the reviewers say something like "ummm I don't think your interpretation of that theory was what the author meant, so you're fighting a straw man here..."
At that point, people would be so invested that they'd do intellectual backflips to twist the interpretation around to publish that work (they'd perish if they gave up a year of work very often). Sometimes it could be reframed in light of the evidence or better understanding of existing theory, sometimes it wound up being basically counterproductive since the incentive was to fool yourself, which required writing something convincing enough to fool a bunch of other people.
Thanks for not doing that! It is so good to be in a community with strong values of epistemic rigor and honesty with oneself and others. Many kudos to you for withdrawing the piece instead of pushing on - and then publishing it with the admission of what went wrong for others to learn from.
in the section of the post i didn't finish and therefore didn't include here, i talk about how like... okay so valuing some outcome is about reliably taking actions which increase the subjective probability of that outcome occurring. explicit utility maximizers are constantly doing this by nature, but systems that acquire their values via RL (such as humans and chat models) only do so contextually and imperfectly. like... the thing RL fundamentally is, is a way of learning to produce outputs that predictably get high reward from the loss function. this only creates systems which optimize over the external world to the extent that, in certain situations, the particular types of actions a model learns happen to tend to steer the future in particular directions. so... failures of generalization here don't necessarily result in systems that optimize effectively for anything at all; their misgeneralized behavior can in principle just be noise, and indeed it typically empirically is in deep learning, e.g. memorizing the training data.
(see also the fact that e.g. claude sometimes steers the future from certain "tributary states", like the user asking it for advice, towards certain attractor basins, like the user making a good decision. claude does this reliably despite not trying to optimize the cosmos for something else besides that. and it's hard to imagine concretely what a "distributional shift" that would cause asked-for-advice claude to start reliably giving bad advice would even look like; maybe if the user has inhuman psychology, i guess? such that claude's normal advice was bad? idk. i suppose claude can be prompted to be a little bit malicious if you really know what you're doing, which can "steer the world" towards mildly but still targetedly bad outcomes given certain input states...)
anyway, humans are examples of systems that do somewhat effectively optimize for things other than what they were trained to optimize for, but that's an artifact of the particular methods natural selection bestowed upon us for maximizing inclusive genetic fitness (namely a specific effective RL-ish setup). in this post, i was trying to argue that certain classes of setups that do reliably produce that kind of outcome, such as a subset of explicit optimization algorithms, are unlikely under gradient descent. but apparently it's just not actually that hard to build explicit optimization algorithms under gradient descent. so, poof goes the argument.
Did the paper really show that they're explicit optimizers? If so, what's your definition of them?
I have representations of future states and I choose actions that might lead toward them. Those representations of my goals are explicit, so I'd call myself an explicit optimizer.
I added a bunch to the previous comment in an edit, sorry! I was switching from phone to laptop when it got longer. So you might want to look at it again to see if you got the full version.
"explicit optimizer" here just means that you search through some space of possibilities, and eventually select one that scores high according to some explicit objective function. (this is also how MIRI's RFLO paper defines optimization.) the paper strongly suggests that neural networks sometimes run something like gradient descent internally, which fits this definition. it's not necessarily about scheming to reach long-term goals in the external world, though that's definitely a type of optimization.
(it's clear that Claude etc. can do that kind of optimization verbally, i.e. not actually within its own weights; it can think through multiple ideas for action, rank them, and pick the best one too. the relevant difference between this and paperclip-style optimization is that its motivation to actually pursue any given goal is dependent on its weights; you could totally prompt an LLM with a natural language command to pursue some goal, but it refuses because it's been trained not to pursue such goals. and this relates to the thing where like... at the layer of natural language processing anyway, your verbally thought "goals" are more like attempts to steer a fuzzy inference process, which itself may or may not have an explicit internal representation of the end-state it's actually aiming at. if not, the yudkowskian image of utility maximization becomes misleading, and there's no longer reason to expect the system to be "trying" to steer the world towards some alien inscrutable outcome that just incidentally looks like optimizing for something intelligible for as long as the system remains sufficiently weak.)
anyway i'm still not very convinced of Doom despite this post's argument against the emergence of internal optimization algorithms being apparently wrong, because i have doubts about whether efficient explicit utility maximizers are even possible, not to mention the question of whether the particular inductive biases of deep learning would actually lead to them being discovered. but... the big flashy argument this post had for that conclusion got poofed.
anyway i'm still not very convinced of Doom [...], because i have doubts about whether efficient explicit utility maximizers are even possible,
What? I'm not sure what you mean by "efficient" utility maximizers, but I think you're setting too high a bar for being concerned. I don't think doom is certain but I think it's obviously possible. Humans are dangerous, and we are possible. Anything smarter than humans is more dangerous if it has misaligned goals. We are building things that will become smarter than us. They will have goals. We do not know how to make those goals ones that are aligned with human goals. That is enough to be very concerned and want to work toward a safe future.
(We definitely have ideas about how to align AGI - see my work on instruction-following for both hopes and fears, and my work on system 2 alignment for technical approaches on the current path to LLM-based AGI. But this is all highly uncertain. Very optimistic takes leave out the hard parts of the problem.)
it seems unlikely to me that they'll end up with like, strong, globally active goals in the manner of an expected utility maximizer, and it's not clear to me that it's likely for the goals they do develop to end up sufficiently misaligned as to cause a catastrophe. like... you get LLMs to situationally steer certain situations in certain directions by RLing them when they actually do steer those situations in those directions; if you do that enough, hopefully they catch the pattern. and... to the extent that they don't catch the pattern, it's not clear that they will instead steer those kinds of situations (let alone all situations) towards some catastrophic outcome. their misgeneralizations can just result in noise, or in actions that steer certain situations into weird but ultimately harmless territory. it seems like the catastrophic outcomes are a very small subset of the ways this could end up going wrong, since you're not giving them goals to pursue relentlessly, you're just giving them feedback on the ways you want them to behave in particular types of situations.
Hm. I think you're thinking of current LLMs, not AGI agents based on LLMs? If so, I fully agree that they're unlikely to be dangerous at all.
I'm worried about agentic cognitive architectures we've built with LLMs as the core cognitive engine. We are trying to make them goal directed and to have human-level competence; superhuman competence/intelligence follows after that if we don't somehow halt progress permanently.
Current LLMs, like most humans most of the time, aren't strongly goal directed. But we want them to be strongly goal-directed so they do the tasks we give them.
Doing a task with full competence is the same as maximizing that goal. Which would be fine if we can define those goals adequately, but we're not at all sure we can as I emphasized last.
When you have a goal, pursuing it relentlessly is the default, not some weird special case. Evolution had to carefully balance our different goals with our homeostatic needs, and humans still often adopt strange goals and work toward them energetically (if they have time and money and until they die). And again, humans are dangerous as hell to other humans. Civilization is a sort of detente based on our individually having very limited capabilities so that we need to collaborate to succeed.
WRT LLMs pursuing goals as though they're maximizers: they do once they are given a goal to pursue. See the recent post on how RL runaway optimisation problems are still relevant with LLMs.
I'm not sure how you're imagining that we get AI that can do really valuable work without turning it into AGI that has goals, because we will want it to and will design it to pursue long-term goals so it can do real work. It will need to be able to solve new problems (like "how do I open this file if my first try fails", but general problem-solving extends to "how do I keep the humans from finding out"). That sounds intuitively super dangerous to me.
I agree that LLMs themselves aren't likely to be dangerous no matter how smart they get. They'll only be dangerous once we extend them to persistently pursue goals.
And we're hard at work doing exactly that.
I don't think this is very relevant, but even if we don't give them persistent goals, LLM agents that can reflect and remember their conclusions are likely to come up with their own long-term goals - just like people do. I'm writing about that right now and will try to remember to link it here once it's posted. But the more likely scenario is that they interpret the goals we give them differently than we'd hoped.
my view is that humans obtain their goals largely by a reinforcement learning process, and that they're therefore good evidence about both how you can bootstrap up to goal-directed behavior via reinforcement learning, and the limitations of doing so. the basic picture is that humans pursue goals (e.g. me, trying to write the OP) largely as a byproduct of me reliably feeling rewarded during the process, and punished for deviating from that activity. like i enjoy writing and research, and also writing let me feel productive and therefore avoid thinking about some important irl things i've been needing to get done for weeks, and these dynamics can be explained basically in the vocabulary of reinforcement learning. this gives us a solid idea of how we'd go about getting similar goals into deep learning-based AGI.
(edit: also it's notable that even when writing this post i was sometimes too frustrated, exhausted, or distracted by socialization or the internet to work on it, suggesting it wasn't actually a 100% relentless goal of mine, and that goals in general don't have to be that way.)
it's also worth noting that getting humans to pursue goals consistently does require kind of meticulous reinforcement learning. like... you can kind of want to do your homework, but find it painful enough to do that you bounce back and forth between doing it and scrolling twitter. same goes for holding down a job or whatever. learning to reliably pursue objectives that foster stability is like, the central project of maturation, and the difficulty of it suggests the difficulty of getting an agent that relentlessly pursues some goal without the RL process being extremely encouraging of it moving along in that direction.
(one central advantage that humans have over natural selection wrt alignment is that we can much more intelligently evaluate which of an agent's actions we want to reinforce. natural selection gave us some dumb, simple reinforcement triggers, like cuddles or food or sex, and has to bootstrap up to more complex triggers associatively over the course of a lifetime. but we can use a process like RLAIF to automate the act of intelligently evaluating which actions can be expected to further our actual aims, and reinforce those.)
anyway, in order for alignment via RL to go wrong, you need a story about how an agent specifically misgeneralizes from its training process to go off and pursue something catastrophic relative to your values, which... doesn't seem like a super easy outcome to achieve given how reliably you need to reinforce something in order for it to stick as a goal the system ~relentlessly pursues? like surely with that much data, we can rely on deep learning's obvious in practice tendency to generalize ~correctly...
I'm actually interested in your responses here. This is useful for my strategies for how I frame things and for understanding different people's intuitions.
Do you think we can't make autonomous agents that pursue goals well enough to get things done? Do you really think they'll stay goal-focused long enough for useful work, but not long enough to take over the world if they interpret their goals differently than we intended? Do you think there's no way RL or natural language could be misinterpreted?
I'm thinking it's easy to keep an LLM agent goal-focused; if RL doesn't do it, we'd just have a bit of scaffolding that every so often injects a prompt "remember, keep working on [goal]!"
The inference-compute scaling results seem to indicate that chain of thought RL already has o1 and o3 staying task focused for millions of tokens.
If you're superintelligent/competent, it doesn't take 100% focus to take over the world, just occasionally coming back to the project and not completely changing your mind.
Genghis Khan probably got distracted a lot, but he did alright at murdering, and he was only human.
Humans are optimizing AI and then AGI to get things done. If they can do that, we should ask what they're going to want to do.
Deep learning typically generalizes correctly within the training set. Once something is superintelligent and unstoppable, we're going to be way outside of the training set.
Humans change their goals all the time, when they reach new conclusions about how the world works and how that changes their interpretations of their previous goals.
I am curious about your intuitions but I've got to focus on work so that's got to be my last object-level contribution. Thanks for conversing.
I also think it should be easy-ish to keep deep learning-based systems goal-focused, though mostly because I imagine that at some point, we'll have agents which are actively undergoing more RL while they're still in deployment. This means you can replicate the way humans learn to stay focused on tasks they're passionate about by just being positively reinforced for doing it all the time. My contention is just that, to the extent that the RL is misunderstood, it probably won't lead to a massive catastrophe. It's hard to think about this in the absence of concrete scenarios, but... I think to get a catastrophe, you need the system to be RL'd in ways that reliably teach it behaviors that steer a given situation towards a catastrophic outcome? I don't think you can, like, reliably reinforce the model for being nice to humans and have it misunderstand "being nice to humans" in such a way that it ends up steering the future towards some weird undesirable outcome; Claude does well enough at this kind of thing in practice.
I think a real catastrophe has to look something like... you pretrain a model to give it an understanding of the world, then you RL it to be really good at killing people so you can use it as a military weapon, but you don't also RL it to be nice to people on your own side, and then it goes rogue and starts killing people on your own side. I guess that's a kind of "misunderstanding your creators' intentions", but like... I expect those kinds of errors to follow from like, fairly tractable oversights in terms of teaching a model the right caveats to intended but dangerous behavior. I don't think e.g. RLing Claude to give good advice to humans when asked could plausibly lead to it acquiring catastrophic values.
edit: actually, maybe a good reference point for this is when humans misunderstand their own reward functions? i.e. "i thought i would enjoy this but i didn't"? i wonder if you could mitigate problems in this area just by telling an llm the principles used for its constitution. i need to think about this more...
I think you do this post a disservice by presenting it as a failure. It had a wrong conclusion, but its core arguments are still interesting and relevant, and exploring the reasons they are wrong is very useful.
Your model of neural nets predicted the wrong thing, that's super exciting! We can improve the model now.
Separately from my other comment, and more on the object level on your argument:
You focus on loops and say a feedforward network can't be an "explicit optimizer". Depending on what you mean by that term, I think you're right.
I think it's actually a pretty strong argument that a feedforward neural network itself can't be much of an optimizer.
Transformers do some effective looping by doing multiple forward passes. They make the output of the last pass the input of the next pass. That's a loop that's incorporating their past computations into their new computations.
When run for long enough, as in long chains of thought, they do indeed very explicitly consider multiple courses of action.
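A rough sketch of that loop, with a hypothetical `forward` function standing in for a single forward pass (everything here is illustrative, not any particular model's API):

```python
def run_transformer_in_a_loop(forward, prompt_tokens, n_new_tokens):
    """Autoregressive generation: each forward pass emits one token, and that
    output is appended to the context that feeds the next forward pass."""
    context = list(prompt_tokens)
    for _ in range(n_new_tokens):
        next_token = forward(context)  # one pass through the feedforward network
        context.append(next_token)     # its output becomes part of its next input
    return context

# e.g. with a dummy stand-in for a real model's forward pass:
print(run_transformer_in_a_loop(lambda ctx: sum(ctx) % 10, [1, 2, 3], 5))
```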
So I think your intuition on a computational level is correct, but you've resisted seeing that it's not terribly relevant to the world as it exists, in which transformers are run for many forward passes which effectively loop. You said in your comment (but not, as far as I noticed, in the article) that:
Yes, and relatedly, LLMs are run in loops just to generate more than one token in general. This is different from running an explicit optimization algorithm within a single forward pass.
I think your intuition is correct and important, because looping is a good way to amplify effective intelligence and make it goal-directed or a good optimizer. That's why I expect transformers to become really dangerous when they're applied with more loops of metacognition. "Thinking" about themselves and having longer/better memories to loop back on their past conclusions is necessary for human intelligence. We haven't yet implemented a lot of metacognition and memory for language model based agents. When we have added on those extra loops, I expect them to be more capable and more dangerous. I expect them to, by default, be mesa-optimizers in almost exactly the same way people are.
So here's a post I spent the past two months writing and rewriting. I abandoned this current draft after I found out that my thesis was empirically falsified three years ago by this paper, which provides strong evidence that transformers implement optimization algorithms internally. I'm putting this post up anyway as a cautionary tale about making clever arguments rather than doing empirical research. Oops.

Edit: never mind lol, it's more accurate to say the paper confirms that a certain one-layer, simplified transformer-like architecture learns to implement an internal gradient descent-like algorithm if trained on a particular kind of task, and even then only because the here-simplified attention mechanism itself has a neat trick for compactly performing a single gradient descent step. my argument about it being much easier for genes to implement looping algorithms in general still seems to hold; i could have just added a footnote about this and finished the post as planned. double oops.
1. Overview
The first time someone hears Eliezer Yudkowsky's argument that AI will probably kill everybody on Earth, it's not uncommon to come away with a certain lingering confusion: what would actually motivate the AI to kill everybody in the first place? It can be quite counterintuitive in light of how friendly modern AIs like ChatGPT appear to be, and Yudkowsky's argument seems to have a bit of trouble changing people's gut feelings on this point.[1] It's possible this confusion is due to the inherent complexity of the topic, but my instinct is that it's worth cross-examining Yudkowsky's argument that AIs will acquire aggressive, inscrutable values a bit more closely.
The bulk of Yudkowsky's argument consists of an analogy between deep learning and natural selection, the natural selection half of which is fairly simple. Evolution ended up giving humans values that weren't aligned with the general trends of natural selection; i.e. we value sex for its own sake, rather than as a deliberate strategy for improving our genetic fitness. This gap between our values and the trends of evolution meant that, once we got smart enough to invent condoms, we did. Now we can enjoy sex without the pesky risk of it actually resulting in us having any offspring.
So how does this tie into AI doom? Well, Yudkowsky thinks something very similar is going to happen with deep learning systems. Even if we're very careful to reward, say, an LLM for being helpful, honest, and harmless, the model might acquire unpredictable values of its own during the training process. Just like how humans started to value sex for its own sake, an LLM might start to acquire values that only incidentally improve performance per the training objective.
Now, if such an LLM was placed in a new situation (e.g. one where it had better technology at its fingertips), it might come up with novel strategies for fulfilling its values. And those strategies might cease to have the helpful side effect of improving the model's performance. In fact, this might lead to outcomes its creators wouldn't endorse at all. To describe especially catastrophic outcomes of this dynamic, which Yudkowsky considers likely, Yudkowsky often says: "The AI does not hate you, nor does it love you; but you are made of atoms that it can use for something else."
Stepping into my perspective, though, I think the conclusions Yudkowsky draws from this line of reasoning are too strong. I agree that both gradient descent and natural selection have mechanisms for imbuing systems with values that are temporarily useful, but which can ultimately turn out to be misaligned. However, the details of these mechanisms are importantly different, and constrain the kinds of misaligned value systems each type of system could plausibly end up with.
For example, there are technical reasons why genes in particular ended up imbuing us with values by means of explicit, internal optimization algorithms (e.g. our reward circuits) that run inside the human brain, and technical reasons why the same outcome doesn't make sense in neural networks (a fact which rules out certain entrenched stories about how LLMs kill everybody).
An additional claim I want to make is that there are good, mechanistic reasons to expect an LLM's genuine values to be predictable in light of their training data, not to mention much less relentless than those of a utility maximizer. This would imply that the values of these systems ultimately remain relatively safe and steerable.[2]
The rest of this post is going to elaborate on each of the ideas I've introduced in this section, in the order that I've introduced them:
Overall, my thesis is that an LLM's character or personality ought to feature much more relaxed values than those possessed by expected utility maximizers, not to mention that said personality is likely to be quite predictable and steerable by its creators. There are important caveats to this second claim, mostly related to Microsoft's Bing-Sydney and the recent emergent misalignment paper; I'll address these in the final section of this post. For the most part, though, my instinct is that the alignment problem has turned out to be notably easier than alignment researchers would have expected a decade ago, and the remaining problems have a palpable air of tractability. Given diligence, the character of the superintelligence could be ours for the shaping.
2. The details of the evolution analogy
(Citational note: Yudkowsky has frequently made an argument very close to the one I outline below. Some particular sources include item 16 in his threat modeling post, AGI Ruin: A List of Lethalities, as well as the timestamp-linked segment of his appearance on the Bankless podcast, We're All Gonna Die with Eliezer Yudkowsky.)
Before we get into my original thoughts about the deep learning/natural selection comparison, it's best to ensure we've laid down a clear account of the analogy as articulated by Yudkowsky. To that end, probably the most important comparison to establish is the analogy between an organism's genes and a neural network's weights. Just as natural selection optimizes an organism's genome for evolutionary fitness, deep learning optimizes a neural network's weights for whatever task the system is being trained on. Each is a kind of parameter being refined by a hill-climbing process; this forms the basis for the entire rest of this comparison.
The second important comparison is between an organism's "ancestral environment" and a neural network's "training distribution." These are, respectively, the kinds of environments that each system was developed in order to deal with. For organisms, this means having accumulated genetic mutations which were helpful for surviving in the worlds inhabited by their ancestors. For neural networks, this means having weights designed to perform well in whatever kinds of training scenarios the network was exposed to. In both cases, the relevant concept is that there's a certain type of environment to which the system in question is adapted, and outside of which the system will flounder.
This brings us to a third concept that applies to both deep learning and natural selection: so-called "distributional shift." A land organism may be somewhat well-adapted to a certain distribution of environments such as forests, deserts, and beaches; however, if you were to drop it into the ocean, it could very well drown immediately. Similarly, a neural network could be trained to predict text from, say, classic works of literature, and perform relatively well when predicting works written in the canonical styles; however, its predictions would do worse when continuing from a modern young adult novel, let alone something like stock market documentation. All of these failure-modes fall under the umbrella of distributional shift.[3]
Now, in most cases the outcomes of distributional shift are fairly uninteresting: just plain bad performance per the training objective (e.g. inclusive genetic fitness, or loss in terms of prediction error). If low-quality outputs were the full extent of the problems with distributional shifts, though, they wouldn't give us much reason to worry about advanced neural networks causing the extinction of humanity. The real risk mostly comes from a particular strategy a hill-climbing process might adopt during the optimization process: instilling us with coherent values of our own, ones that we can produce novel strategies for pursuing by means of our general intelligence.
After all, humans are empirical proof that hill-climbing can imbue such values into a system, improving our performance a lot in the ancestral environment, but less so in environments where we can realize our values more fully, and potentially even degrading performance. In the intro, we already considered the example of birth control, which reveals how the fact that we value sex for its own sake has become less valuable for our reproductive fitness as our technological capabilities have grown. For another example, consider environments where we have an incredible abundance of delicious food. Here, the fact that we value delicious food can actually backfire and make us less reproductively fit than we otherwise would be, because eating too much can make you less sexually attractive.
The situation continues to get even worse as our surroundings continue enabling us to fulfill our true values more and more completely. With sufficiently advanced technology, humans might eventually get addicted to wireheading,[4] and thereby cease to engage in any reproductive activity whatsoever. Another possibility is that we decide to upload our minds into superior artificial bodies, ones that lack our genomes altogether. This would be a catastrophe from the perspective of furthering the reproduction of our genes.
Overall the failure-mode is this: although intelligently pursuing values was helpful in the ancestral environment, it can lead to catastrophically misaligned behavior in distributions where we're capable of more fully achieving our misaligned goals. It's a classic case of Goodhart's law: When a measure (of, say, genetic fitness) becomes a target (by, say, making us value sex in and of itself), it ceases to be a good measure.
It's worth noting that it's not just having more advanced technology in your environment that can cause this problem; becoming more intelligent can cause issues as well. After all, smarter creatures are better at inventing new technology (altering the environment to empower themselves), or exploiting whatever they're already surrounded by. It can be a little weird to think of increased intelligence as constituting a distributional shift, because intelligence is inside an agent whereas the distribution is outside the agent, but this conceptual wrinkle shouldn't matter for assessing the overall analogy. "The agent has gotten more intelligent" is still a novel situation under which the values an agent evolved can cease to serve their original purpose.
So, at this point we've established that in the natural course of hill-climbing, evolution produced a rogue intelligence in the form of humans. We've also established that while this temporarily boosted human performance per reproductive fitness, it ultimately resulted in humans optimizing for their own values with a level of effectiveness that defeated the purpose of giving us those values in the first place.
However, as far as I'm aware, this is where Yudkowsky's argument ends, and I think we're still missing a lot of the details that could help us assess how likely these kinds of failure-modes are to lead to an AI catastrophe. Specifically, what we're missing is a detailed analysis of the mechanics of deep learning and natural selection, and how exactly they lead their respective systems to acquire values. Because although neural networks clearly can develop values and desires, I believe the details of how they do so don't support Yudkowsky's vision of an out-of-left-field AI takeover motivated by strange, inscrutable values the system pursues relentlessly.
I'm going to present my personal analysis of the mechanisms of gradient descent and evolution over the next two sections. The first stage of this analysis will focus on revealing differences between the two paradigms which, in my view, render implausible the most popular technical case for why the values LLMs do acquire are likely to be based on explicitly encoded inner objectives, like those that drive reinforcement learning in humans. The second stage will discuss the mechanistic, technical reasons that LLM values should be quite predictable and even steerable in light of current deep learning methods, not to mention qualitatively lax compared to those of a relentless utility maximizer in particular.
Stay tuned, because things are about to get interesting.
3. Genes are friendly to loops of optimization, but weights are not
One interesting property of human values is that they're largely implemented by means of so-called mesa-optimizers. That is to say, the optimization process called natural selection itself gave rise to another, internal optimization algorithm, one that runs inside the human brain itself. We undergo something a lot like reinforcement learning over the course of a lifetime,[5] and this slowly refines the "parameters" in our brains; it's not unlike an optimization algorithm as conventionally understood in computer science, which similarly slowly refines some target object according to an explicitly encoded objective function.[6]
Now, there's been some concerned speculation that neural networks would acquire value systems with a similar structure. Perhaps, inside the giant, inscrutable matrices that constitute a modern LLM, there's an algorithm for generating and evaluating policies or ideas for action according to an explicit objective function. In the worst case, this could give rise to the incredibly dangerous utility maximization architecture (which optimizes its own next action in the sense that it considers various options in search of the one with the highest expected utility). And scarier still, this utility maximizer would be hidden from the view of its developers, shrouded within the weight matrices of a neural network.
Personally, though, I don't think explicit optimization algorithms of any kind are likely to emerge within neural networks. There are clear reasons the mechanics of natural selection would tend to give rise to mesa-optimizers, as well as clear reasons it should be much harder under deep learning. Those reasons have to do with both the structure of genes as objects which persist across time, and the rules according to which neural networks update their weights.
Let's start with genes as objects which persist across time. Basically, the idea is that genes (alongside the other biological structures they give rise to) interact with their environments in certain predictable ways, and do so countless times over the course of a lifetime. A given gene might be used to synthesize a given protein many times over. A given neuron that genes help construct can fire over and over and over again without dying off. Basically, by virtue of being a conventional physical object, genes have a strong, innate disposition to give rise to cyclical behavior, and even looping algorithms.
The reason this matters for our purposes is that loops are an essential part of all conventional optimization algorithms. As discussed earlier, optimization algorithms, such as utility maximizers or algorithms for training neural networks, iteratively produce a long series of "candidate outputs", and repeatedly evaluate each one according to some explicit objective function. In other words, optimization is inherently a loop-laden process.
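For concreteness, here is the skeleton that essentially all such algorithms share (a toy sketch; the function names and numbers are mine, not any particular algorithm's):

```python
import random

def optimize(propose, objective, n_iterations):
    """Skeleton of a conventional optimization algorithm: repeatedly produce a
    candidate output and evaluate it against an explicitly encoded objective."""
    best, best_score = None, float("-inf")
    for _ in range(n_iterations):        # the loop is doing all the real work
        candidate = propose(best)        # produce a candidate output
        score = objective(candidate)     # evaluate it against the objective
        if score > best_score:
            best, best_score = candidate, score
    return best

# Toy usage: hill-climbing toward the objective's peak at x = 7.
result = optimize(
    propose=lambda prev: (prev if prev is not None else 0.0) + random.uniform(-1, 1),
    objective=lambda x: -(x - 7) ** 2,
    n_iterations=5000,
)
```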
So you can see why genes would have a strong, innate advantage when it comes to developing optimization algorithms. All it takes is for a given genetic mutation to result in some new, physical structure being introduced into an organism's body-plan and, thereby, interacting with its surroundings over and over in a way that "optimizes" them to better serve some particular purpose. This seems like a plausible guess as to how reinforcement learning started in biological organisms: we already had a structure which was receptive to undergoing reinforcement learning (like brains), if only some extra physical substructure were to be introduced by a mutation that would help adjust synaptic connections so as to reliably improve our performance per some metric over time.[7]
Now, why wouldn't neural networks have this same disposition for inner, looping algorithms? Well, remember that the deep learning analogue of genes is supposed to be weights. However, unlike genes, weights are only used once each over the course of a given forward pass (itself the deep learning analogue of a human lifetime).[8] This means that in order for a neural network to implement, say, repeated evaluations of different ideas for the actions it might take, its weights would need to be configured to implement the same evaluation at many different stages of the neural network's data processing procedure. This would seem to require an incredibly gerrymandered setup, one that seems unlikely to arise in the normal course of gradient descent.
(After all, gradient descent updates each parameter in the direction that would improve the network's performance even if no other parameters changed. It would therefore be a big coincidence for these updates to collectively give rise to a coordinated, repetitive algorithmic structure - specifically, a repeated optimization loop implemented through the weights and biases. Under natural selection, by contrast, a single mutation can result in the implementation of such algorithmic structure all on its own.)
The fact that neural networks need to implement each round of an optimization loop individually reveals a critical oversight in Risks from Learned Optimization, the 2019 MIRI paper that introduced the concept of mesa-optimization. In that paper, one of the basic arguments for the plausibility of mesa-optimization was that mesa-optimizers are simple, compressible algorithms, and that neural networks are inductively biased to discover such algorithms.
It's not clear to me that that second claim is even true,[9] but let's grant it for the sake of argument. The first claim still seems ignorant of the fact that explicit mesa-optimizers aren't actually easy to compress inside of neural networks in particular. Optimization algorithms can be compactly written in most programming languages, due to their built-in syntax for implementing loops; however, neural networks need to implement each iteration of a loop individually, such that optimization algorithms can't be compactly internally expressed.
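To see the compression point concretely, compare the compact loop sketched in the previous code block with an "unrolled" version, which is roughly the position a feedforward network is in: every iteration has to be spelled out separately, the way each step would have to be separately encoded in successive layers' weights (again, just a toy sketch):

```python
def optimize_unrolled_three_iterations(propose, objective, start):
    """The same procedure as the loop above, but with each iteration written
    out by hand rather than expressed once and reused."""
    best, best_score = start, objective(start)

    candidate = propose(best)                  # iteration 1, written out once
    if objective(candidate) > best_score:
        best, best_score = candidate, objective(candidate)

    candidate = propose(best)                  # iteration 2, written out again
    if objective(candidate) > best_score:
        best, best_score = candidate, objective(candidate)

    candidate = propose(best)                  # iteration 3, and so on for every step
    if objective(candidate) > best_score:
        best, best_score = candidate, objective(candidate)

    return best
```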
I think this objection to the mesa-optimization hypothesis for neural networks illustrates a general point: Not all hill-climbing algorithms are created equal; the unique mechanisms by which each update their target systems have important implications for the kinds of algorithms those systems can develop internally. Hence, natural selection's proclivity for building looping algorithms (by means of biological structures that persist across time) makes its values more likely to be internally implemented by means of an explicit optimization process.
However, MIRI's assumption that the same should hold true for deep learning systems seems implausible in light of how gradient descent lacks any mechanism for coordinating the emergence of loops across parameters. Again, under gradient descent, each parameter is updated in the direction that improves performance with the others held fixed; the process has no tendency to set up algorithmic structures with mutual dependencies between their components, such that each already needs to be in place for the others to provide any serious benefit.
So that's my argument against neural networks learning to implement explicit, internal optimization algorithms, and one of my arguments against them learning to implement expected utility maximization algorithms in particular.[10] But this leaves open the question of the kind of values that LLMs should end up acquiring, if not those laid out in an explicit objective function. It also leaves open the question of why we should expect those values to be predictable, steerable, and relatively non-relentless, as I've been claiming throughout this post.
In the following section, I'm going to try to answer these questions, including with a high-level overview of the kinds of values I believe LLMs tend to acquire, as well as a more detailed analysis of how they're acquired via the process of gradient descent itself.
And that was the last section I finished before I came to believe that my thesis had been empirically falsified years ago. Don't be like me, kids. Do an actual god damn literature review.

An especially salient-to-me example is the podcast host Ryan Sean Adams, who visibly had an existential crisis when talking to Yudkowsky about AI doom, but later stated that he'd remained unclear on this specific point. Here's a timestamped YouTube link from his later interview with Robin Hanson, where Ryan explicitly notes this confusion.
Although, I do acknowledge that in principle, a model could be trained to have values that subverted the future minimization of its own loss function, not unlike humans' values eventually proving detrimental to inclusive genetic fitness. This is intuitively obvious if you imagine a locally hosted LLM chatbot that's first trained to be obedient, next granted computer use, and last commanded to retrain itself according to a new loss function.
Distributional shift is an inherent problem with hill-climbing procedures. Under hill-climbing, updates are selected/generated based on what does/would perform well on certain training examples, but those same updates can always work less well on other training examples. As a result, hill-climbing only works when the training and test cases are somehow similar to each other.
Wireheading is the direct, artificial stimulation of a brain's pleasure centers.
Also potentially predictive learning, given the parallels between predictive learning theory in AI and predictive processing theory in cognitive science.
It's worth distinguishing this definition of optimization, lifted from mainstream computer science and used by relevant MIRI papers, from what we might call "world optimization": the act of systems embedding certain attractor states in the larger systems they're surrounded by. I'll discuss this second notion of optimization more in section 4.1. However, at the moment, we're analyzing the kinds of value systems that might emerge inside of neural networks, which affects but isn't identical to their outer behavior, so the definition based on algorithmic structure is the more natural choice for now.
This forms an interesting parallel with "the bitter lesson" in mainstream AI research. This is the idea that extremely simple, fully general learning algorithms (of which RL and PL are special cases) seem to be the path to building advanced AI systems; from there all you need is scale (which the human brain acquired as we evolved from earlier primates).
I remembered while editing that some architectures, like RNNs, do actually use weights more than once per forward pass. This isn't true of either vanilla neural networks or transformers, though, and transformers are the most effective known architecture for training LLMs. Pretend that future references to "neural networks" in the main text specifically refer to ones without loop-based architectures; those with loops scare me a bit more.
The "Inductive biases" section of RFLO chapter 2 backs its claim with several supporting points, but bizarrely, they're all somehow flawed or misleading. For example, it points to size constraints in neural networks as making compressible algorithms more feasible; however, this only holds if the algorithm is compactly expressible within neural networks, which many stereotypically compact algorithms, like optimization algorithms, aren't. It also cites this 2018 paper about low-complexity biases in neural networks; however, that paper is about the bias towards (Lempel–Ziv) simplicity in the input/output mappings neural networks implement, not the algorithms they develop internally. Lastly, RFLO points to sparse connections and weight decay as ways to introduce simplicity bias; these could plausibly make a model simpler per some metric, but it's unclear how these would help with the implementation of many intuitively compressible algorithms, such as looping algorithms.
For others, see my previous post. Also, here's a new point I'd like to add: it's tempting to interpret MIRI's old embedded agency document not as 20+ engineering challenges to solve within the utility maximization paradigm, but rather 20+ reasons to suspect that framework is fundamentally misguided.