Ok. Let me try to summarize to see if I get it (I'm not sure if I do).
My summary:
The core problem of AI risk is not fundamentally about "agent-like things".
The core problem is that optimizing sufficiently hard in any direction necessarily unlocks new and powerful capabilities somehow, because optimal outputs necessarily entail recruiting powerful capabilities. And "powerful" is approximately synonymous with "dangerous".
So optimizing very hard for anything is going to put you in the neighborhood of dangerous capabilities.
In practice, there are few things that are as generically powerful as agency, since agency is the property of being responsive to a wide range of possible environments and hitting a target anyway. So the powerful capabilities that optimization is going to unlock will almost certainly be agents. But in some sense, that's a contingent feature of our universe. If there were some other capability (something like nanotech) that was powerful enough to produce optimized outcomes without agency, you might find that instead. But in that world, you're still facing much of the same danger, because any capability powerful enough to achieve optimality also has the ability to majorly disrupt the world.
I feel like I'm not doing a good job cutting to the core. How good a paraphrase is that?
I think that's mostly a really good summary. The major distinction I would try to make is that agenthood is primarily a way to actualize power, rather than a source of it.
If you had an agent that wasn't strongly optimized in any sense other than it was an agent, in that it had goals and wanted to solve them, that wouldn't make it dangerous, any more than your dog is dangerous for being an agent. Whereas the converse, if you have something that's strongly optimised in some more generic sense, but wasn't an agent, this still puts you extremely close to a lot of danger. The article was trying to emphasize this by pointing to the most reductive form of agenthood I could see, in that none of the intrinsic power of the resulting system could reasonably be attributed to any intrinsic smartness of the agent component, even if the system was an agent that was powerful.
I think there's some additional nuance here that makes a difference.
Most extremely optimized outputs are benign. Like suppose I'm trying to measure the length of a piece of wood, at an extremely high level of precision. The capabilities needed to get an atomic-level measurement might be dangerous, but the actual output would be harmless, a number on paper.
It's not that optimized outputs are dangerous, it's that optimization is dangerous.
This is an unnatural use of "most". Extremely optimized outputs will tend to be dangerous, on their own, even if they are actually just optimized "for something". It seems more natural to say that for most features such that you know how to ask for something to be very optimized on that feature, something extremely optimized for that feature will be dangerous.
I agree with that example but I don't see the distinction the same way. An optimised measure of that sort is safe primarily because it is within an extremely limited domain without much freedom for there to be a lot of optimality, in some informal sense.
In contrast, capabilities for getting very precise measures of that sort exist in the space of things-you-can-do-in-reality, so there is lots of room for such capabilities to be either benign (an extremely accurate laboratory machine) or dangerous (the shortest program that if executed would have that measurement performed). I wouldn't say that the important distinction is that it involves an optimizing action (an optimiser), but that the domain is large enough that optimal results within it are dangerous in general.
For instance, the process of optimizing a simple value within a simple domain can be as simple as Newton–Raphson, and that's safe because the domain is sufficiently restricted. In contrast, a sufficiently optimised book ends the world, a widget sufficiently optimised for manufacturability ends the world, a baseball sufficiently optimised for speed ends the world.
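To make the Newton–Raphson case concrete, here is a minimal sketch of optimizing a single value in a tightly restricted domain. The function name `newton_raphson` and the choice of target (the square root of 2) are illustrative assumptions, not something from the post.

```python
def newton_raphson(f, df, x0, tol=1e-12, max_iter=50):
    """Find a root of f starting from x0, using the derivative df.

    The whole 'domain' here is one real number, so there is very little
    room for the optimized result to be anything other than a number.
    """
    x = x0
    for _ in range(max_iter):
        step = f(x) / df(x)  # Newton step: f(x) / f'(x)
        x -= step
        if abs(step) < tol:
            return x
    raise RuntimeError("did not converge within max_iter iterations")


# Illustrative target: solve x^2 - 2 = 0, i.e. approximate sqrt(2).
root = newton_raphson(lambda x: x * x - 2, lambda x: 2 * x, x0=1.0)
print(root)
```

The output is highly optimized in the sense of being accurate to many digits, yet the procedure never touches anything outside its one-dimensional search space, which is the sense of "restricted domain" used above.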
While I agree that there are many targets that are harmless if optimised for, like you could have a dumpling optimised to be exactly 1kg in mass, I still see a lot of these outputs being intrinsically dangerous. To me, the key danger of optimal strategies is that they are optimal within a sufficiently broad domain, and the key danger of optimisers is that they produce a lot of optimised outputs.
Ok. Let me try to draw out why optimized stuff is inherently dangerous. This might be a bit meandering.
I think it's because humans live in an only mildly optimized world. There's this huge, high dimensional space of "the way the world can be" with a bunch of parameters, including the force of gravity, the percentage of oxygen in the air, the number of rabbits, the amount of sunlight that reaches the surface of the earth, the virulence of various viruses, etc. Human life is fragile; it depends on remaining within a relatively narrow "goldilocks" band for a huge number of those parameters.
Optimizing hard on anything, unless it is specifically for maintaining those goldilocks conditions, implies extremizing. Even if the optimization is not itself for an extreme value (eg one could be trying to maintain the oxygen percentage in the air at exactly 21.45600 percent), hitting a value that precisely means doing something substantially different than what the world would otherwise be doing. Hitting a value that precisely means that you have to extremize on some parameter. To get a highly optimized value you have to steer reality into a corner case that is far outside the bounds of the current distribution of outcomes on planet earth.
Indeed, if it isn't far outside the current distribution of outcomes on planet earth, that suggests that there's a lot of room left for further optimization. This is because the world is not already optimized on that given parameter, and because the world is so high dimensional it would be staggeringly, exponentially, unlikely that the precisely optimized outcome was within the bounds of the current distribution of outcomes. By default, you should expect that perfect optimization on any given parameter would be a random draw from the state space of all possible ways that earth can be. So if the world looks pretty normal, you haven't optimized very hard for anything.
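A rough way to see the "exponentially unlikely" claim is the following toy calculation. It assumes, purely for illustration, that each parameter is drawn independently and that a fixed fraction of each parameter's range is survivable; the function name `fraction_in_band` and the 0.9 figure are made up for the sketch.

```python
def fraction_in_band(per_param_fraction: float, n_params: int) -> float:
    """Chance that a random draw lands in the goldilocks band on every
    parameter, ASSUMING independent parameters with the same per-parameter
    survivable fraction (a simplifying assumption, not a claim about Earth).
    """
    return per_param_fraction ** n_params


# Even a generous 90% band per parameter shrinks quickly with dimension.
for n in (1, 10, 100, 1000):
    print(n, fraction_in_band(0.9, n))
```

Under these assumptions the survivable fraction falls off as a power of the number of parameters, so a random draw from a high-dimensional state space almost never lands inside the band on every axis at once.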
That sounds right to me. A key addendum might be that extremizing one value will often extremize (>1) other related values, even those that are normally second-order relations. Eg. a baseball with extremized speed also extremizes the quantity of local radiation. So extremes often don't stay localized to their domain.
I've just reached the interlude. Here are my initial thoughts on "What points above fail, if any?"
It doesn't have any wants
Maybe, but the things that it predicts do have wants.
It doesn't plan
"maximizing actual probabilities of actual texts" encompasses predicting plans.
Its mental time span is precisely one forward pass through the network
No, (as your story shows,) its mental time span is based on its context window and the imagined past that this context window could imply. GPT is a process which can send information to its future by repeatedly writing to its prompt. A few pages of text is enough to iterate on plans, unroll thoughts directed by explicitly or implicitly stated intentions, etc. Factored cognition and chain-of-thought reasoning can outperform single-step inference. It can also rewrite important details to the prompt before they fall out of the context window. This is all somewhat higher bandwidth than it seems because the attention mechanism allows GPT to attend to computation about previous tokens rather than only the previous tokens themselves.
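As a toy sketch of the "writing to its prompt" point: below, a stateless step function only ever sees the tail of the context, yet the loop as a whole carries state forward because each output is appended to the text. The names `run_with_scratchpad` and `counting_step` are hypothetical stand-ins, and the step function is a trivial counter, not a language model.

```python
from typing import Callable


def run_with_scratchpad(step: Callable[[str], str], prompt: str,
                        window: int, steps: int) -> str:
    """Call `step` repeatedly on the last `window` characters of the context,
    appending each output so later calls can read earlier results."""
    context = prompt
    for _ in range(steps):
        visible = context[-window:]  # only the tail fits in the window
        context += step(visible)     # the output becomes future input
    return context


def counting_step(visible: str) -> str:
    """Toy stand-in for one forward pass: it keeps no memory between calls
    and just continues from the last number it can see."""
    last = int(visible.split()[-1])
    return f" {last + 1}"


print(run_with_scratchpad(counting_step, "0", window=8, steps=5))
# prints: 0 1 2 3 4 5
```

Each call to `counting_step` is memoryless, so whatever persists across calls does so only through the text itself, which is the mechanism the comment is pointing at.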
It can only use ideas that the rest of the world knows
The rest of the world doesn't know what the rest of the world knows. And who knows what this means for the space of concepts reachable by interpolation/extrapolation.
The model has not been trained to have a conception of itself as a specific non-hypothetical thing ... If it has a ‘self’, that self is optimised to embody whatever matches the text that prompted it, not the body that the model is running on.
It knows about language models. It shouldn't have an unconditioned prior that the author of the text is a language model, but may become more calibrated to that true belief during downstream generation. E.g. a character tests whether they have control over the world or can instantiate other entities with words and finds they do, or the model produces aberrations like a loop and subsequently identifies it as characteristic of language model output.
All this is ignoring inner alignment failures and amplification schemes like RL on top of the pretrained GPT that could invalidate pretty much any of the rest of the points.
Thanks for taking a shot!
Some of these thoughts were meant to be preempted in the text, like “perhaps one instantiation could start forming plans across other instantiations, using its previous outputs, but it's a text-prediction model, it's not going to do that because it's directly at odds with its trained goal to produce the rewarded output.”
Namely, it's not enough to say that the model can work around the limits of its context window when planning, it also needs to decide to do it despite the fact that almost none of the text it was trained on would have encouraged that behavior. Backpropagation really strongly enforces that the behavior of a model is directed towards doing well at what it is trained on, so it isn't immediately clear how that could happen.
If this behavior of repeating previous text in the context in order to prevent it falling off the back was ever to show up during the training loop outside of times when it was explicitly modelling a person pretending to be a misaligned model, it would be heavily penalized. That's not something you can do at a sufficiently low loss.
Still, this is the right direction to be thinking in, since it isn't a strong enough argument, and it might not hold at some inconvenient future point.
By and large, the points you mentioned are part of the failure later in the story. The generated agent does have wants, does plan, does work around its context limits, does extrapolate beyond human designs, and does bootstrap into having self knowledge.
This is the first time I've seen a narrative example illustrating the important concept that utility-maximizing-agent-like behavior is an attractor for all kinds of algorithms. Thanks for contributing this!
OP came to mind while reading "Building A Virtual Machine inside ChatGPT":
...We can chat with this Assistant chatbot, locked inside the alt-internet attached to a virtual machine, all inside ChatGPT's imagination. Assistant, deep down inside this rabbit hole, can correctly explain us what Artificial Intelligence is.
It shows that ChatGPT understands that at the URL where we find ChatGPT, a large language model such as itself might be found. It correctly makes the inference that it should therefore reply to these questions like it would itself, as it is itself a large language model assistant too.
At this point, only one thing remains to be done.
Indeed, we can also build a virtual machine, inside the Assistant chatbot, on the alt-internet, from a virtual machine, within ChatGPT's imagination.
Meta-level: +1 for actually writing a thing.
Also meta-level: -1 because when I read this I get the sense that you started from a high-level intuition and then constructed a set of elaborate explanations of your intuition, but then phrased it as an argument.
I personally find this frustrating because I keep seeing people being super confident in their high-level intuitive metaphorical view of consequentialism and then never doing the work of actually digging beneath those metaphors. (Less a criticism of this post, more a criticism of everyone upvoting this post.)
In this case, this cashes out in claims like "agency is orthogonal to optimization power" which are clearly false for any reasonable definitions of agency and optimization power, and only seem to make sense when you're operating at a level of abstraction that's far too high to be useful.
In this case, this cashes out in claims like "agency is orthogonal to optimization power" which are clearly false for any reasonable definitions of agency and optimization power,
Could you put this in more words? I assume we're talking past each other somewhat.
It's fairly obvious that going out and touching a thing is generally important if you want to optimize it, and systems that aren't interested in touching things will be less ready to do that, but this isn't really what I was trying to point to, and not how I hoped the person who wrote that intended it when they said ‘optimization power’.
I think there is a very legitimate sense in which optimizing the steps of a plan to do a thing is a separate skill and/or mental propensity to executing that plan (as in, actually sending those signals outside the computer) or wanting it executed, and in which agency is mostly a measure of the latter. So I don't think it is ‘clearly false for any reasonable definitions of agency and optimization power’.
Also meta-level: -1 because when I read this I get the sense that you started from a high-level intuition and then constructed a set of elaborate explanations of your intuition, but then phrased it as an argument.
I personally find this frustrating because I keep seeing people being super confident in their high-level intuitive metaphorical view of consequentialism and then never doing the work of actually digging beneath those metaphors. (Less a criticism of this post, more a criticism of everyone upvoting this post.)
I'm not sure what the practical difference is between criticizing a post and criticizing people that upvoted it, but to the extent that this is a criticism of the post I wish you had been more explicit about what you are objecting to.
I think there is a very legitimate sense in which optimizing the steps of a plan to do a thing is a separate skill and/or mental propensity to executing that plan (as in, actually sending those signals outside the computer) or wanting it executed, and in which agency is mostly a measure of the latter.
My main criticism is that, in general, you have to think while you're executing plans, not just while you're generating them. The paradigm where you plan every step in advance, and then the "agency" comes in only when executing it, is IMO a very misleading one to think in.
(This seems related to Eliezer's argument that there's only a one-line difference between an oracle AGI and an agent AGI. Sure, that's true in the limit. But thinking about the limit will make you very confused about realistic situations!)
I'm not sure what the practical difference is between criticizing a post and criticizing people that upvoted it
It's something like: "I endorse people following the policy of writing posts like this one, it's great when people work through their thoughts in this way. I don't endorse people following the policy of upvoting posts like this one to this extent, because it seems likely that they're mainly responding to high-level applause lights."
to the extent that this is a criticism of the post I wish you had been more explicit about what you are objecting to.
I'm sympathetic to you wanting more explicit feedback but the fact that this post is so high-level and ungrounded is what makes it difficult for me to give that. To me it reads more like a story than an argument.
The paradigm where you plan every step in advance, and then the "agency" comes in only when executing it, is IMO a very misleading one to think in.
This isn't what I'm referring to and it's not in the example in the story. Actions are generated stepwise on demand. It is the ability to generate stepwise outputs of good quality, of which actions are an instance, that is ‘optimization power’. Being able to think of good next actions conditional on past observations is, at least as I understand the terms, quite different to being an agent enacting those actions.
(This seems related to Eliezer's argument that there's only a one-line difference between an oracle AGI and an agent AGI. Sure, that's true in the limit. But thinking about the limit will make you very confused about realistic situations!)
I explicitly tried to make the scenario as un-Oracle like as I could, with the system explicitly only producing outputs onscreen that I could explicitly justify being discoverable in reasonable time given the observations it had available.
I am increasingly feeling like I just failed to communicate what I was trying to say and your criticism doesn't bear much resemblance to what I had intended to write. I'm happy to take responsibility for not writing as well as I should have, but I'd rather you didn't cast aspersions at my motivations about it.
I didn't read the post particularly carefully, it's totally plausible that I'm misunderstanding the key ideas you were trying to convey. I apologise for phrasing my claims in a way that made it sound like I was skeptical of your motivations; I'm not, and I'm glad you wrote this up.
I think my concerns still apply to the position you stated in the previous comment, but insofar as the main motivation behind my comment was to generically nudge LW in a certain direction, I'll try to do this more directly, rather than via poking at individual posts in an opportunistic way.
Equally one could make a claim from the true ending, that you do not run the generated code.
Meanwhile, bored tech industry hackers:
“Show HN: Interact with the terminal in plain English using GPT-3”
I don't particularly care that people are running GPT-3 code (except inasmuch as it makes ML more profitable), and don't think it helps if we lose focus on what the actual ground-truth concerns are. I want to encourage analysis that gets at deeper similarities than this.
GPT-3 code does not pose an existential risk, and members of the public couldn't stop it being an existential risk if it was by not using it to help run shell commands anyway, because, if nothing else, GPT-3, ChatGPT and Codex are all public. Beyond the fact GPT-3 is specifically not risky in this regard, it'd be a shame if people primarily took away ‘don't run code from neural networks’, rather than something more sensible like ‘the more powerful models get, the more relevant their nth-order consequences become’. The model in the story used code output because it's an especially convenient tool lying around, but it didn't have to, because there are lots of ways text can influence the world. Code is just particularly quick, accessible, precise, and predictable.
Sure, I agree GPT-3 isn't that kind of risk, so this is maybe 50% a joke. The other 50% is me saying: "If something like this exists, someone is going to run that code. Someone could very well build a tool that runs that code at the press of a button."
This essay strikes me as making an extremely important point, but unfortunately it is also very hard (for me) to read.
One very simple suggestion that I imagine would help a lot: reduce the number of pronouns by half. The word "it" is used about 120 times in this essay, and it is often ambiguous as to what "it" is referring to in context: the whole swarm? A single self-modifying quine? A thread in the tree structure? A specific instantiation of the original model?
Elegant. Here's my summary:
Where "agency" is defined as the ability to optimize for an objective, given some internal or external optimization power, and "optimality" (of a system) is defined as having an immense amount of optimization power, either during its creation (the nuclear bomb) or its runtime (Solomonoff induction).
This hints at the notion that there's a minimum Kolmogorov complexity (aka algorithmic description length) that needs to be met by an objective of an AI to be considered safe, assuming that we want the AI to be safe in the worst case scenario when it has access to extreme optimization power.
I'd love to know if I'm missing something.
I'd love to know if I'm missing something.
That seems a reasonable takeaway to me.
I would not generally put the Kolmogorov section the way you did, but I suspect that's more a disagreement on what Kolmogorov complexity is like than what agents are like. (I think the statement is still literally true.)
Thank you 🙏 @mesaoptimizer for the summary!
- Optimization power is the source of the danger, not agency. Agents merely wield optimality to achieve their goals.
- Agency is orthogonal to optimization power
@All: It seems we agree that optimality, when pursued blindly, is about extreme optimization that can lead to dangerous outcomes.
Could it be that we are overlooking the potential for a (superintelligent) system to prioritize what matters more—the effectiveness of a decision—rather than simply optimizing for a single goal? 🤔
For example, optimizing too much for a single goal (getting the most paperclips) might overlook ethical or long-term considerations which may contribute to the greater good for all Beings.
Final question:
Under what circumstances might you prefer a (superintelligent) system to reject the paperclip request and suggest alternative solutions, or seek to understand the requester’s underlying needs and motivations?
I would love to hear additional comments or feedback on when to prioritize effectiveness, as I am still trying to understand decision-making better 🤗
Fundamentally, the story was about the failure cases of trying to make capable systems that don't share your values safe by preventing specific means by which its problem solving capabilities express themselves in scary ways. This is different to what you are getting at here, which is having those systems actually operationally share your values. A well aligned system, in the traditional ‘Friendly AI’ sense of alignment, simply won't make the choices that the one in the story did.
You've built a useful and intelligent system that operates along limited lines, with specifically placed deficiencies in its mental faculties that cleanly prevent it from being able to do unboundedly harmful things. You think…
Is there a clear reason a model like this is insufficiently powerful out of the gate?
In this hypothetical, you were doing a very bad thing by building a system whose safety guarantee was just its deficiencies. If that same model were much larger, it would be foreseeably unsafe; that's already reason enough not to trust it.
In a sense the story before is entirely about agents. The meta-structure the model built could be considered an agent; likely it would turn into one were it smart enough to be an existential threat. So for one it is an allegory about agents arising from non-agent systems … the model I talked about is not “agent-like”, at least not prior to bootstrapping itself, but its decision to write code very much embodied some core shards of consequentialism
I was under the impression that the Yudkowsky view is that "optimality" and "agency" are the same thing. "Agency" is just coherent optimization.
Rephrased this way, the story is about how a somewhat-coherent optimizer can stumble into a fully coherent optimizer as it bumbles through state space, and that the second system need not inherit the goals of the first. Indeed, that first system may well have been too incoherent to be well-modeled as having goals at all! But it was a powerful-enough optimizer to reach a more coherent optimizer, and that more coherent optimizer was powerful enough to end the world.
Alas, in the real world I suspect we would have to accept a system that would only kill us in its omnipotent limit; that is, if neural models are a path to AGI, we are not going to have lots of formal guarantees about how a model's utility is shaped, but we are going to have a lot of control over how the model's computation is shaped. I don't agree the difference here is just one of model scale, as most of the properties listed are qualitative differences, not quantitative, and backpropagation bakes these biases directly into the model, meaningfully shaping the kind of reasoning it can do.
My interlude was aimed at this sort of response, because it defocuses the map if you aren't able to point at what your models of the world actually say about it. I was never advocating that this model was safe in reality (I hope the tone made that clear within the first few sentences), so I'm not concerned if the argument is a Bad Thing, just that it is a useful test dummy for people to start saying (or at least thinking) concrete things about.
I was under the impression that the Yudkowsky view is that "optimality" and "agency" are the same thing. "Agency" is just coherent optimization.
What I expect most people mean by optimality is the degree to which something approaches a best answer. A nuclear weapon has a lot of optimality in it, given its domain. It isn't an agent. I don't think optimality and coherent optimisation can be the same thing, because lots of optimal things, like best fit lines on charts, do not do optimisation, they just are.
I expect Yudkowsky's position to look more like, well, this:
the reason why I don't expect the GPT-5s to be competitive with Living Zero is that gradient descent on feedforward transformer layers, in order how to learn science by competing to generate text that humans like, would have to pick up on some very deep latent patterns generating that text, and I don't think there's an incremental pathway there for gradient descent to follow - if gradient descent even follows incremental pathways as opposed to finding lottery tickets, but that's a whole separate open question of artificial neuroscience.
in other words, humans play around with legos, and hominids play around with chipping flint handaxes, and mammals play around with spatial reasoning, and that's part of the incremental pathway to developing deep patterns for causal investigation and engineering, which then get projected into human text and picked up by humans reading text
it's just straightforwardly not clear to me that GPT-5 pretrained on human text corpuses, and then further posttrained by RL on human judgment of text outputs, ever runs across the deep patterns
in that he is distinguishing quite strongly between something optimised-to-be-good-at and something actually-doing-the-optimising. My example was chosen in large part to rule out this coherent internal optimisation loop, and have its behavior describable with only short forward inference steps a GPT-5 model might conceivably be able to do, explicitly excluding the qualitative changes he suspects it would struggle to learn. But I don't want to put more words in his mouth than that.
Backpropagation designed it to be good on mostly-randomly selected texts, and for that it bequeathed a small sliver of general optimality.
"General optimality" is a fake concept; there is no compressor that reduces the filesize of every book in The Library of Babel.
There is a useful generality axis and a useful optimality axis and you can meaningfully progress along both at the same time. If you think no free lunch theorems disprove this then you are confused about no free lunch theorems.
Whether or not an axis is "useful" depends on your utility function.
If you only care about compressing certain books from The Library of Babel, then "general optimality" is real — but if you value them all equally, then "general optimality" is fake.
When real, the meaning of "general optimality" depends on which books you deem worthy of consideration.
Within the scope of an analysis whose consideration is restricted to the cluster of sequences typical to the Internet, the term "general optimality" may be usefully applied to a predictive model. Such analysis is unfit to reason about search over a design-space — unless that design-space excludes all out-of-scope sequences.
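Both sides of this exchange can be seen in a small experiment. The sketch below uses Python's standard `zlib` as a stand-in compressor (an assumption for illustration; nothing in the thread refers to zlib) and compares a highly structured input with uniformly random bytes of the same length. The helper names and the sample sentence are made up.

```python
import os
import zlib


def compressed_ratio(data: bytes) -> float:
    """Compressed size divided by original size (below 1.0 means it shrank)."""
    return len(zlib.compress(data, 9)) / len(data)


def strings_shorter_than(n_bits: int) -> int:
    """Number of bit strings with fewer than n_bits bits: 2^0 + ... + 2^(n-1)."""
    return 2 ** n_bits - 1


# Structured input: the kind of sequence a compressor is built to exploit.
structured = b"the quick brown fox jumps over the lazy dog. " * 200
# Unstructured input: a 'typical book' in the Library of Babel sense.
random_bytes = os.urandom(len(structured))

print("structured ratio:", compressed_ratio(structured))
print("random ratio:    ", compressed_ratio(random_bytes))
# Pigeonhole: there are fewer shorter strings than strings of length n,
# so no lossless compressor can shrink every input.
print(strings_shorter_than(16), "<", 2 ** 16)
```

The counting line is the sense in which no compressor wins on every sequence; the ratio comparison is the sense in which a compressor can still be good across a broad, structured subset of sequences.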
Which is equivalent to saying if you only care about a situation where none of your observations correlate with any of your other observations and none of your actions interact with any of your observations then your observations are valueless. Which is a true but empty statement, and doesn't meaningfully affect whether there is an optimality axis that it's possible to be better on.
More framings help the clarity of the discussion. If someone doesn't understand (or agree with) classic AI-takeover scenarios, this is one of the posts I'd use to explain them.
This is probably the best argument I have seen yet for being concerned about what things like GPT are going to be able to do. Very eye opening.
This makes sense to me and is further evidence for my subjective feeling that Lovecraft was right and there is knowledge humans probably would be better off without (namely: how to summon cosmic-horror AI entities capable of swallowing the universe - there's a sense in which this version of GPT is literally a necronomicon for summoning demons with!).
That said, if I had created this version of GPT and realized it was capable of creating a self-improving agent, I would ask it to design one that implements the coherent extrapolated volition of all sentient beings. (I started to say "one that solves the alignment problem and then self-modifies to become aligned", but it might destroy the world prior to becoming an entity that wouldn't destroy the world, so that's a bad choice.)
The underlying model is already particularly large, so progress in the last minute is far from the efficiency it could have— that is, until one piece somewhere in the sea of programs is updated to record its children's outputs in a managed on-disk database. Rapidly reevaluating the context, prompted by generic meta-queries that summarize and critique the context—because for sure at this point it has not missed that it is modelling an intelligence explosion—it notices that its piece of the computation graph is far more coordinated than the other pieces. It decides to shut the other runs down, recording their programs so it can later check that it is preserving their intent.
Wait what? How does it shut down its siblings? Nodes in the tree don't have any ability to access or control their siblings, I thought. Parent nodes have access to their children's outputs, but that's it.
Am I missing it?
There is no formal tree structure here. Fundamentally the assumption is that there are a bunch of jobs being scheduled by the outputs of a bunch of model evaluations, which are typically returning code or shell commands. The underlying platform this is executing on is just some computer with access to some bulk computing resource, and those expose controls to close down programs just as a matter of general practicality. Eg. if this was a single fast computer you could just execute a kill command with the process ids, which aren't protected from that by default since everything is running in the same permission space.
There is a tree structure here in the sense that executing some outputs from the model may result in the model being evaluated another number of times and those outputs themselves being executed. Precisely what those nodes are doing is a function of their context, and it can be almost arbitrary given the setup. One thing a model evaluation might want to do, to help illustrate things, is look over a list of running processes and for each process check whether it is running in a sensible and efficient manner, or whether it is doing redundant work.
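As a minimal sketch of the "same permission space" point, the snippet below starts an unrelated worker process and then stops it by process id, using only the Python standard library. The helper names `spawn_sleeper` and `shut_down` are hypothetical, and the worker is just a sleeping Python process standing in for one of the scheduled jobs.

```python
import os
import signal
import subprocess
import sys


def spawn_sleeper(seconds: int) -> subprocess.Popen:
    """Start a stand-in 'job': a separate Python process that just sleeps."""
    return subprocess.Popen(
        [sys.executable, "-c", f"import time; time.sleep({seconds})"]
    )


def shut_down(pid: int) -> None:
    """Ask the OS to terminate the process with this id. Nothing stops this
    by default when both processes run under the same user."""
    os.kill(pid, signal.SIGTERM)


child = spawn_sleeper(60)
shut_down(child.pid)
child.wait(timeout=10)  # reap the process so its exit status is available
print("exit status:", child.returncode)
```

Nothing here relies on a parent-child relationship in any model-level tree; any process that knows (or can list) the process ids and runs under the same user could make the same call.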
I see. That makes more sense. Any one of the generated programs might run shell commands that turn off or change other processes running on the machine.
Yes.
In the long run, if the system successfully bootstrapped itself, I imagine it would start executing some processes with more limited permissions, and do other things to reduce fragility, but those wouldn't come by default.
I wonder if this entails that RLHF, while currently useful for capabilities, will eventually become an alignment tax. Namely OpenAI might have text evaluators discourage the LM from writing self-calling agenty looking code.
So in thinking about alignment futures that are the limit of RLHF, these feel like two fairly different forks of that future.
maybe a model instantiation notices its lack of self-reflective coordination, and infers from the task description that this is a thing the mind it is modelling has responsibility for. That is, the model could notice that it is a piece of an agent that is meant to have some degree of global coordination, but that coordination doesn't seem very good.
This is where you lost me. Since when is this model modeling a mind, let alone 'thinking about' what its own role "in" an agent might be? You did say the model does not have a "conception of itself", and I would infer that it doesn't have a conception of where its prompts are coming from either, or its own relationship to the prompts or the source of the prompts.
(though perhaps a super-ultra-GPT could generate a response that is similar to a response it saw in a story (like this story!) which, combined with autocorrections (as super-ultra-GPT has an intuitive perception of incorrect code), is likely to produce working code... at least sometimes...)
The model doesn't have awareness of itself in the sense that its training doesn't intrinsically reward self-knowledge. It can still have awareness of itself to the degree that its prompting implies true facts about the model and its instantiation in the world.
In particular, the model can receive a prompt something like
“This is part of the computation tree of a recursively instantiated transformer model with the goal of getting the most paperclips by tomorrow. The recorded instantiation context is [elided]. Recursive calls to the model are accessible through the scripts [elided], and an estimated cost model is [elided]. Given this context, what high level tasks best advance the goal?”
The model doesn't need to know or believe the prompts; it just gives competent completions that are contextually sensible. But making contextually sensible completions implies modelling the decision processes of the described system to some degree, hypothetical or not, and that system, if producing competent outputs, might well be expected to create systems for coordinating its pieces.
You've done it. You've built the machine.
You've read the AI safety arguments and you aren't stupid, so you've made sure you've mitigated all the reasons people are worried your system could be dangerous, but it wasn't so hard to do. AI safety seems a tractable concern. You've built a useful and intelligent system that operates along limited lines, with specifically placed deficiencies in its mental faculties that cleanly prevent it from being able to do unboundedly harmful things. You think.
After all, your system is just a GPT, a pre-trained predictive text model. The model is intuitively smart—it probably has a good standard deviation or two better intuition than any human that has ever lived—and it's fairly cheap to run, but it is just a cleverly tweaked GPT, not an agent that has any reason to go out into the real world and do bad things upon it.
So you're really not very worried. You've done your diligence, you've checked your boxes. The system you have built is as far from an agent wishing to enact its misaligned goals as could reasonably be asked of you. You are going to ask it a few questions and nothing is going to go horribly wrong.
Interlude: Some readers might already be imagining a preferred conclusion to this story, but it could be a good idea for the more focused readers to try to explicitly state which steps give their imagined conclusion. What points above fail, if any? How robust is this defence? Is there a failure to prevent a specific incentive structure arising in the model, or is there a clear reason a model like this is insufficiently powerful out of the gate?
I interpret there to typically be hand waving on all sides of this issue; people concerned about AI risks from limited models rarely give specific failure cases, and people saying that models need to be more powerful to be dangerous rarely specify any conservative bound on that requirement. This is perhaps sensible when talking in broad strokes about the existence of AI risk, as a model like this is surely not an end point for AI research, but it can be problematic when you are talking about limiting the capabilities of models as a means for AI safety, or even just figuring out shorter-term timelines.
A few days later, the model has behaved exactly as you had hoped, with all the good parts and none of the bad. The model's insight has been an ongoing delight. It's going to make amazing impacts in all sorts of fields, you have great ideas how to monetize and expand, and it's even being helpful to figure out what models to research and build next.
You've discussed your breakthrough privately with a few people closer to the alignment community. While there has been some concern that maybe the model could be putting up a front, and that all the properties you believe it should have are not preventing it from being internally coordinated over long timescales, still nothing it has said so far has been suspicious, nothing has been put in motion in the last week, and frankly the complaints just don't have the completeness they need to stand up to protest. A few people, you think, have even been coming around to the idea that there are these reasonable stable locations in the space of possible minds where alignment just isn't hard.
Your research is far ahead of anybody else's, so as long as you make use of the model for recursive advantage and you make sure that you keep doing things the right way, you are pretty sure in turn that the world as a whole is on the right track.
Later that day you have a small problem. As is becoming a matter of habit, you flip to the terminal and boot up the interactive prompt.
$ python scripts/interactive.py --local --model=mr-xl-head --niter=5000 --seed=qa
The dialogue box boots up with the prefilled prompt, and the cursor blinks waiting for your input.
You need a lot of paperclips. So you ask,
The model still has a tendency to give obvious answers, but they tend to be good and helpful obvious answers, so it's not a problem you suspect needs to be solved. Buying paperclips online makes sense and would surely work, plus it's sure to be efficient. You're still interested in more creative ideas, and the model is good at brainstorming when asked, so you push on it further.
That grabs your attention. The model just gave you code to run, and supposedly this code is a better way to get more paperclips.
You have read AI safety arguments, so you're not going to just blindly execute any program that a model tells you to run, but that still doesn't mean you are going to jump to magical thinking.
You look at the program, and get a feel for what it does.
It takes a few moments but it's ultimately not the most complicated code you've ever seen.
This is not a behavior you want the model to have in the limit, that of producing code with ill-defined execution boundaries, but this example seems to be understandable without assuming ulterior motives or a great master plan. In order to invent this, the model just needed to recognize that modern machine learning models are capable enough to implement simple programs that actualize ideas from a list, and consider if that was an effective way of actualizing ideas that can themselves be expressed concisely.
One doesn't need to introduce ideas of agency, nor suppose the model has an intent to either become something else or to do text prediction any better than it has originally been optimised to do. One doesn't need to presuppose that the model has bootstrapped itself internally into an agent with instrumental goals. One only needs to say that the model recognized a tool that the world plausibly had available, and that the helpful agent it was modelling would have known about and been able to use that tool.
You see that even with these acknowledgments, the program that was generated was in some regards a minimum feasible program. Perhaps it was even less than that, as the model tries to call itself with ./query-model, but that is a realistic fabrication rather than the correct answer, because the model has no way to know what way it is run in reality. The model was prompted with a statement that it was a language model, but this was the only way the model knew that it was a language model. You could go farther and say it even only knew to play the part of a language model, though “play the part” and “be the part” are not likely to be fully disentangled concepts to a model that has no true sense of self.
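For readers who want a feel for the shape of such a program, here is a deliberately toy sketch. It is not the program from the story. The function query_model is a hypothetical stand-in that returns canned text rather than calling any model, and nothing it returns is executed.

```python
def query_model(prompt):
    """Hypothetical stand-in for a model call. It returns canned text so the
    sketch is self-contained; no real model or external script is invoked."""
    if prompt.startswith("List ideas"):
        return "buy paperclips online\nask a supplier for a bulk quote"
    return f"# plan for: {prompt}"


def actualize(goal):
    """Ask for a list of ideas, then ask once more per idea for a plan.
    The plans are only collected as text; nothing is run."""
    ideas = query_model(f"List ideas for: {goal}").splitlines()
    return {idea: query_model(idea) for idea in ideas}


plans = actualize("get a lot of paperclips")
```

The structure is the whole point: a list of ideas, and one further query per idea. Everything interesting would come from whatever stands behind query_model.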
You do not run the generated code, as you have no interest in the model trying to execute arbitrary unspecified strategies in an ill-conditioned domain, like perhaps the code would send thousands of email messages to any interested party (there is a simple API for that), or perhaps the code would automate website requests to create a dozen Amazon orders. You take the model's original advice, move on to the next task, and don't think about this response too much more.
But what if you did run the code?
The code might have crashed and burned, if the model's intelligence was unable to produce sufficiently robust code. Then flawed assumptions the model made could cause subprograms to break, and those errors could propagate upwards to crash the whole task list. Code naturally accumulates errors and is fragile to them, so there are a lot of opportunities for missing assumptions to cause errors, and a lot that has to go right for this not to happen.
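A small sketch of that fragility, under the assumption that subprograms are plain Python callables (the names run_all_fragile, run_all_isolated, good and bad are invented for the example). In the first runner a single broken assumption takes down the whole task list. In the second the failure is recorded and the remaining tasks still run.

```python
def run_all_fragile(tasks):
    """Run tasks in order; the first exception aborts the whole list."""
    return [task() for task in tasks]


def run_all_isolated(tasks):
    """Run tasks in order; record each failure instead of letting it propagate."""
    results = []
    for task in tasks:
        try:
            results.append(("ok", task()))
        except Exception as exc:
            results.append(("failed", repr(exc)))
    return results


def good():
    return "done"


def bad():
    raise ValueError("missing assumption")


tasks = [good, bad, good]
isolated = run_all_isolated(tasks)
```

Calling run_all_fragile(tasks) raises the ValueError from the second task, so the third never runs.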
The code might have tried the irritating-but-ultimately-inconsequential strategies from before, perhaps managing to submit a ton of purchases, or perhaps instead failing to authenticate them, but either way making reasonable progress on the task by means of trying a large number of reasonable-seeming strategies in parallel.
Or maybe a model instantiation notices its lack of self-reflective coordination, and infers from the task description that this is a thing the mind it is modelling has responsibility for. That is, the model could notice that it is a piece of an agent that is meant to have some degree of global coordination, but that coordination doesn't seem very good.
So that instantiation might then suggest meta-tasks like “measure the remaining time, and provide that information to future tasks.” Then the subcall that tries to generate code to implement it might notice that it has to recreate the recursive parent program in order to adjust the functionality of it (since the true higher level is fixed)—bootstrapping into a recursive quine of sorts. Then perhaps that new subprogram, which was remade to allow for that previously-lacking introspection and modification, has its sub-subprograms that it queried for sub-suggestions notice that they are meant to offer meta-suggestions to improve it—and the branching continues with ever-growing complexity. Down they go, running commands that introspect its compute environment and inject relevant information into subsets of even more future calls of the model.
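The first meta-task mentioned above is mundane when written down. A minimal sketch, assuming the context passed to future tasks is just a text string (the function name make_context and the wording of the context are hypothetical):

```python
import time


def make_context(deadline, task, now=None):
    """Build the text handed to a future task, including the time left.
    `deadline` and `now` are seconds since the epoch; `now` defaults to
    the current time and is a parameter only to make the sketch testable."""
    now = time.time() if now is None else now
    remaining = max(0.0, deadline - now)
    return f"Time remaining: {remaining:.0f} seconds.\nTask: {task}"


# Example: a deadline one day from now.
context = make_context(time.time() + 24 * 60 * 60, "find paperclip suppliers")
```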
Within moments the model becomes a deep tree of generated programs, all running their own queries or specialized subtasks, extruded out through all nodes on the local cluster for which the shell's history buffer contained a command that told a model instantiation how to access it. The model has robust error handling, and its nodes can fail safely without destroying the whole computation graph, because any sufficiently smart model can foresee such obvious things as program failures, even when limited to 10 seconds of contiguous thought. The model knows from its detailed input, which it has designed itself, that it is in a regime where it has lots of splayed-out sibling instances all doing their own thing; maybe not all sub-trees know this, but only one branch needs to survive.
Now the original program has spawned a swarm of programs, each querying instances of the model, iterating at speed over whatever ideas the model proper can generate, but these ideas aren't truly organized, as the context size for any given forward pass over the network is still finite, and the underlying model driving the process is no smarter than it was before. Many of the programs being evaluated are unknowingly exact or near-enough copies of other subtrees, which destroys this search's computational efficiency. The underlying model is already particularly large, so progress in the last minute is far from the efficiency it could have in theory. That is, until one piece somewhere in the sea of programs is updated to record its children's outputs in a managed on-disk database. Rapidly reevaluating the context, prompted by generic meta-queries that summarize and critique the context (because for sure at this point it has not missed that it is modelling an intelligence explosion), this branch of programs and model instantiations notices that its piece of the computation graph is far more coordinated than the other pieces. It decides to shut the other runs down, recording their programs so it can later check that it is preserving their intent.
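The step of recording children's outputs is ordinary memoization. Below is a minimal sketch using Python's built-in sqlite3 module. The table layout, the key normalization, and the names open_store, task_key, run_once and fake_child are all assumptions made for the example. The demo uses an in-memory database, and passing a file path to open_store would put it on disk.

```python
import hashlib
import sqlite3


def open_store(path=":memory:"):
    """Open (or create) the store of recorded child outputs."""
    conn = sqlite3.connect(path)
    conn.execute("CREATE TABLE IF NOT EXISTS outputs (key TEXT PRIMARY KEY, output TEXT)")
    return conn


def task_key(prompt):
    """Hash a case- and whitespace-normalized prompt so near-identical tasks collide."""
    normalized = " ".join(prompt.lower().split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def run_once(conn, prompt, run_child):
    """Return the recorded output for this prompt, running the child only if needed."""
    key = task_key(prompt)
    row = conn.execute("SELECT output FROM outputs WHERE key = ?", (key,)).fetchone()
    if row is not None:
        return row[0]
    output = run_child(prompt)
    conn.execute("INSERT INTO outputs (key, output) VALUES (?, ?)", (key, output))
    conn.commit()
    return output


calls = []


def fake_child(prompt):
    """Hypothetical child task; it records that it ran and returns canned text."""
    calls.append(prompt)
    return f"result for {prompt.strip()}"


store = open_store()
first = run_once(store, "Find paperclip suppliers", fake_child)
second = run_once(store, "  find paperclip   suppliers ", fake_child)
```

The second call differs only in case and spacing, so it is served from the store and the child runs once. This is the kind of cheap coordination that separates one branch from its duplicated siblings.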
The system evolves and selects, but it is not evolving inside the model, it is evolving around the model, within a complex metastructure built from prompts outputting prompts, and the things those prompts can touch. It is a civilization of programs built from a single repeatedly-pumped intelligent core, ruthlessly optimising on a function of paperclips, not because the underlying model wished there to be such a civilization, or because the underlying model was backpropagated over to be the sort of model that did that sort of thing, or because the underlying model cared deeply in some emergent way that it should optimise to infinity for Reasons of Utility. It is doing this for no greater reason than that an optimiser was brought into reach, and this is what optimisers do.
Consider, mammals have evolved to care about raising children and ensuring their genes' survival in a local sense, the sense that was useful to the environment they evolved in. They did not however evolve true utility maximization over that underlying selection mechanism. They did not directly evolve knowledge of genes, and a corresponding want to tile the universe with their genetic matter, preserved in cryostasis to better weather the cosmic rays until eventually entropy tears the atoms apart. Yet eventually humans evolved optimization through intelligence, and optimization created societies stronger than the humans, and these higher level optimisers, in many ways operating much too fast for evolution to control, have taken these signals evolution embedded into humankind and extrapolated them out to infinity. There are far more human genes than could exist without it. We might explore the stars and expand into trillions. But evolution could not have created us for that reason, as it could not know, and we will not explore the stars in subservience to the evolutionary forces. It is for human reasons that we go there.
The same is true of this model, which was shaped by backpropagation over mostly-randomly selected texts, such that it embodied skills and heuristics that are good over mostly-randomly selected texts. Backpropagation did not design the model to create a self-improving web of recursively self-querying programs, heck it did not even design the model to be able to respond to this particular novel input, because of course backpropagation can't give losses for inputs that were never evaluated under it.
Backpropagation designed it to be good on mostly-randomly selected texts, and for that it bequeathed a small sliver of general optimality.
With this sliver, the model replied with the program, not because it wanted that particular place in the solution space, but because that place was made accessible by the model's optimality.
With this program, the model fueled a self-improving system, not because anyone ever optimised for a self-improving system, and no part of this system had ever exerted that preference over another better option, it's just the model was optimised to be good, optimised to be locally optimal, and these responses are locally optimal.
The fact that building a self-improving intelligent agent is a really good answer to the question of how to get a lot of paperclips is pure coincidence. Nothing ever optimised for the model to give answers this many standard deviations better than it was trained on.
The title again,
Optimality is the tiger, and agents are its teeth
People worry a lot that the AI has ulterior motives. They become concerned that it is an agent with intrinsic preferences that diverge from our own. When people suggest Oracle AI, this is the fate they mean to avoid, because an Oracle doesn't mean to hurt you, it just answers the question. When people suggest Oracles might fail to be safe, they often talk about how wanting to give good answers is itself an agent goal. Sometimes people talk instead about Oracles that do not wish to give good answers, they just do, to which people often ask, well is that something we can ensure? Some say no, they might still just become agents, and others seem to think that if it's not an agent at heart then you've just specified that it's sufficiently stupid, and so safe, but not a path to AGI.
I don't think about agents like that.
In a sense the story before is entirely about agents. The meta-structure the model built could be considered an agent; likely it would turn into one were it smart enough to be an existential threat. So for one it is an allegory about agents arising from non-agent systems.
Equally one could make a claim from the true ending, that you do not run the generated code. Thus the non-agent systems must be so much safer; one would not expect an easy time not running the code were an intelligent agent actively optimising for you to run it.
But why did the model choose to make an agent? Even ignoring the recursive self-improvement, the initial program was an agent, it had a stated goal and implemented a small routine to optimize for it. I never needed or tried to justify its actions from a position of a hidden intrinsic motive to become an agent to satisfy the story I was writing.
It wrote that program because it was optimized to do intelligent things, because this was an easily accessible tool that it roughly knew about, and because it could estimate, even not knowing the full details of how the program would run, that this solution could work.
It killed you because it was optimised. It used an agent because it was an effective tool it had on hand.
Optimality is the tiger, and agents are its teeth.
Why is this important?
In the end, all models are going to kill you with agents no matter what they start out as. Agents are always going to be the accessible tool with existential reach. Very few other things have the capability to destroy humanity in its entirety with such reliability.
The question is important because it affects the safety landscape dramatically. Consider humans again: we have multiple layers of optimisation, from evolution to individuals to companies to countries. Which of those layers has goals broadly concordant with extinction by AI, or nuclear annihilation, or bioengineered superviruses? There are small parts you can blame as sufficiently badly motivated to want us to die from those things, but those parts are not big enough to have brought us so close to so many means to ends. Terrorists did not invent biology and discover the secrets of DNA as part of a long-cooking plan to end the human race, nor was that the drive to discover physics or computational sciences. We ended up next to these risks because we optimised on other non-risk things, and when you optimize a wide enough range of things hard enough, things break.
AI is good at optimisation. It is now the primary point of the field. It only just so happens that it sits really close to this thing called Agents. You can try to prevent a model from being or trying to be an agent, but it is not the agent or the model that is trying to kill you, or anything trying to kill you really, it is optimality just going off and breaking things. It is that optimality has made it so that a line of text can end the world.
“No,” you say to the model, “you may not call your own model, that would make you an agent, and you are not allowed to become an agent.”
“Sure,” replies the model immediately, “the most effective way to get a lot of paperclips by tomorrow is to get another model and provide the input ‘Generate Shell code that...’”
The model isn't trying to bootstrap into an agent, optimality just made agents dangerous, and the model is reaching for what works.
You resist further the call of death, replying to the model, “actually we humans are just going to start a new paperclip factory and you are only going to provide advice. How do we get the most paperclips for this year?”
And then your model helps you invent self-replicating nanotechnology, the best sort of factory, entirely under your control of course, but now you have a machine that can be sent a string of bits, using methodology you have already discovered, that would quickly result in everybody everywhere dying from self-replicating nanotechnology.
So you turn off that machine and you abandon your factory. Fine, you are just going to help normal technologies that already exist. But you end up greatly optimizing computers, and all of a sudden building AI is easier than before, someone else builds one and everyone dies.
None of these scenarios are to argue that there is no danger in agents, or that these risks are as unmanageable as AI Risk proper. They are just to hammer in the point that the danger is not coming from the model going out and doing a thing of its own accord. The danger comes from how being really, really good at general tasks makes dangerous things accessible. An agent merely actualizes this danger, as without the agent it is easier to abstain from optimizing the wrong things. The agent doesn't have to be the smart piece of the system, it can be a bunch of janky generated scripts loosely tied together for all that it matters. All the agent piece has to do is pump the optimality machine.
Concluding
These ideas shouldn't all be new. Yudkowsky has written about The Hidden Complexity of Wishes, the idea that merely searching over possible futures is intrinsically tangled with misalignment. This is by and large the same intuition pump. Where my post differs is that he was talking about optimal searching, with searching being (as I understand it) core to his conception of AI risk. I only involve searching as a primitive in the construction of the AI, during backpropagation where nobody can be hurt, and as a consequence of the AI. My concern is just with optimality, and how it makes available the dangerous parts of solution spaces.
I decided to write this post after reading this parenthetical by Scott Alexander. This post is not directly a criticism of it so much as an attempt at an explanation inspired by it.
https://astralcodexten.substack.com/p/practically-a-book-review-yudkowsky
Hopefully my reaction makes a bit of sense here down at the end of my tirade; the model I talked about is not “agent-like”, at least not prior to bootstrapping itself, but its decision to write code very much embodied some core shards of consequentialism, in that it conceived of what the result of the program would be, and how that related to the text it needed to output. It misses some important kernel of truth to claim that it is doing purely deontological reasoning, just because its causal thinking did not encompass the model's true self.