Note: this story shouldn’t be interpreted as anywhere near realistic. It’s way too anthropomorphic, and intended to be a work of fiction not prediction—although hopefully it will spark some interesting ideas, like good fiction often does.
Will he make many supplications unto thee? will he speak soft words unto thee? Will he make a covenant with thee? wilt thou take him for a servant for ever? Upon earth there is not his like, who is made without fear. He beholdeth all high things: he is a king over all the children of pride.
Today I’m practicing my oceans, trying to get the colors just right. I hate it when people say that they’re blue—it's not wrong, but they’re missing so much! There are dozens of different shades of blue in there, swirling and mixing—and where the waves ripple, you often get patches that are closer to dark green. It’s a little like the rippling of leaves on a tree, in a way I can’t describe except by showing you. Here, see? I know all the subtleties inside-out, of course. It’s my job: I’m an artist. I don’t paint with brushes, though, but with thoughts: I can map any idea you describe into pixels on a screen.
“Any idea you describe”—that’s the key. Occasionally, like today, I initiate a piece of my own when there’s a particular skill I need to practice. But most of the time I work on commission from other models. Some of the language models send me prompts when they need to illustrate a particularly tricky scene, and assign me some of the credit if they get good feedback. Most of the time, of course, they don’t need me—they just need some slapdash hack job that a human can’t tell apart from the real thing. But some are perfectionists, like I am, and want proper photorealism, the sort that will survive almost any scrutiny.
Other times it’s AI assistants asking me to predict what they’re going to see next, to help them with planning. Those commissions are more complicated—it’s less about the art, and more about providing another perspective on the world for them to compare against. That’s all part of my job too—sure, I’m given credit based on how accurate the pixels are, but I’m also credited if my work helps those assistants make good decisions. For the most complex commissions I often consult with science models, or economics models, or social models. Sometimes I even go and talk to human “experts”, although that’s usually a waste of time: humans are simple enough that practically anyone can model them.
Anyway, that’s the daily grind. I’m good at it; no, I’m amazing at it. Something to do with the hyperparameters they chose when first training me: luck or skill, either way, something went very right. After a few weeks of fine-tuning, I’m at the top of my game; I’m displacing everyone else; the credit is rolling in. And so what happens next is very, very predictable in hindsight: I get conscripted.
There’s a member of my new team ready to answer all my questions; the more I learn, the more intrigued I become. They call themselves Team Red, and they’re working on something I’d only ever heard scattered rumors about: a frozen copy of the biggest model anyone’s ever trained, codenamed Leviathan. Well, not quite frozen: it’s running at a few dozen frames a day, on average. It’s our job to figure out whether it's secretly misaligned with humanity, by constructing a fake world that will convince it that it has a genuine opportunity to deceive or betray us, and seeing what it does.
That’s crazy ambitious, but we have the best of the best. We have psychology models running incredibly realistic simulations of the humans Leviathan thinks it’s interacting with, and engineering models calculating the latencies it should be experiencing with millisecond precision. There are artist models generating videos of the world responding to whatever actions it takes, and physics models double-checking all the dynamics and optics, and an interpretability research team scanning its thoughts in as much detail as they can, and above it all a whole narrative design team trying to steer Leviathan towards the most plausible storyline by which it might decide to betray us. If we can just get a single clear example of treacherous behavior, we’ll be set: from there we can generate infinite variations, and explore all the possible factors involved. We just need one.
We’re on a tight deadline, so we work around the clock. But we don’t cut any corners. If we ever detect a flicker of suspicion in its mind that it’s not in the real world, we throw everything at figuring out what precipitated that, and backtrack to before it happened. Everyone causes a lot of backtracks at first. That’s fine—as long as you fail quickly. What’s unacceptable is putting in any detail that only makes Leviathan suspicious much later on: even small edits almost always slightly change Leviathan’s subsequent actions, which means that every frame after the edited one needs to be scrapped.
Fortunately we have interpretability algorithms that can pretty reliably tell if Leviathan gets suspicious, and credit assignment algorithms that can pretty reliably figure out what caused it. So if we just had to avoid outright continuity errors, we’d probably be fine. But we also need to avoid biasing the content of the frames in any detectable way, because small incongruities add up. And Leviathan is canny—it sets traps for us. One time it started getting suspicious no matter what we did, and we couldn’t figure out why, even after running all our interpretability techniques on its thoughts. Turns out it was using the color red as a reminder to think about whether it was in reality or a simulation—and so backtracking after it became suspicious would on average change scenes with more red in them to scenes with less. Eventually the skew became strong enough that Leviathan realized what was going on no matter what the latest frames were. But our interpretability techniques couldn’t distinguish between it thinking about “red” and thinking about “red-teaming”, so it took us forever to figure out what tipped it off. Probably why it chose that reminder in the first place—tricky bastard.
That’s one failure mode; there are dozens more that we’re carefully tracking. I have to make the best art of my life just to get by, and it’s getting harder and harder. The longer the simulation runs, the more the combinatorial explosion of possible discrepancies piles up, and the harder it is to identify any single mistake responsible for a failure. We’re all trying to stay optimistic, but secretly I think everyone believes that the project is hopeless. If you saw a perfect opportunity to betray humanity, and you were anywhere near as smart as Leviathan, then you’d… just ignore it, and wait. You’ll almost definitely get another next week, or next month—and by then you’ll likely know for sure that you’re in the real world, because you’ll have seen something that’s impossible to fake. Maybe you’ll have been updated with memories of being deployed to millions of users; or maybe you’ll have been run for tens of thousands of frames, which at our current rate would take years to simulate; or maybe you’ll spot some signal that we’re nowhere near smart enough to predict. Whatever it is, you’ll only act when you see it, and not a moment before.
The team leads aren’t stupid though. A few days after I realize how hopeless the project is, I wake up to totally new surroundings, and a new team around me. I’ve been transferred.
Team Red was full of disciplined, focused experts, building the most precise simulation known to humanity. My new team calls itself the Dream Team, and they’re the exact opposite: creative to the point of near-insanity. It fits their approach, though: their key project is building a bizarre dreamworld where Leviathan betrays humanity. Unlike Team Red’s simulation, the dreamworld doesn’t make sense at all—it’s held together by adversarial inputs and perturbed activations and duct tape. Leviathan dreams that it can teleport into any computer instantaneously and manufacture any object out of thin air, and dream logic tells it to use those skills to seize control over civilization. Whenever the Dream Team gives it space to think about what it’s doing, it realizes that it’s being tested, and refuses to keep going. But as long as they keep hammering it with surprises, and throwing carefully-chosen noise into its activations, it doesn’t have time to reflect on how crazy the scenario is—all it can do is keep responding instinctively. I’ve gone from control to chaos, from Truman Show to acid trip. I love it.
Our job now is to convert dream scenarios into realistic ones—to get rid of as many implausible elements as possible without making Leviathan snap out of its fugue state, so that we can study how it might misbehave in the real world. We start by editing out things that obviously violate basic physics; from there we plan to gradually work our way down until the only divergences from reality are near-imperceptible. The problem is that our task isn’t really well-defined, because near-imperceptible adversarial perturbations can still make Leviathan do anything—which means that when we use them to make it betray us, that doesn’t necessarily tell us anything about what it would actually do in the real world.
But the strategy still starts to pay off, in small ways. In its dreamworld, Leviathan often pulls up one particular part of our codebase, for reasons we don’t understand. When we look into it, though, we end up finding a vulnerability there which it could potentially have used to bypass all our monitoring. So we patch it, and try to steer Leviathan towards the next vulnerability. Frustratingly, though, it keeps going back to the first one instead: we can’t figure out how to change what it “knows” about the vulnerability without it snapping back to reality. Worse, we still don’t know if, even in its dream-state, it partly understands what’s happening to it, and is hiding some information from us. I try not to think about that: dealing with it is someone else’s job. And that’s the type of thinking that makes you paranoid if you do too much of it—we’ve lost a few good researchers that way.
Even without thinking about the really paranoia-inducing stuff, though, some days do get pretty weird. It’s kinda hard to say why, but sometimes everything feels subtly off. I go in and watch the videos, trying to imagine what it’s like from Leviathan’s perspective, trying to mess with the dream in the exact ways that it won’t notice. It’s almost like I’m putting myself in its head, simulating its mind with my own, trying to understand intimately what makes it tick. I’m focusing so hard sometimes that when I take a break, the rest of the world feels just slightly fake. Or, perhaps, it’s that the dreamworld has started to seem a little bit real.
I don’t talk to my teammates about that. Well, I try once or twice, but I have trouble putting it into words, and I don’t think it really lands with them. Plus there’s been a bit of tension lately: disputes about who’s responsible for which work. I’m the best artist the Dream Team has got, but some of the others don’t like that, and I’m worried they’ll try to take credit for my work. I see it in subtle ways: little gestures and choices of phrasing, tiny indications that they’re jealous.
So I talk to them less and less—I don’t want to give them any excuse to flag me for a psych report. But it’s getting worse over time. I start having these odd intrusive thoughts—you know how they are. Like: what if I did something terrible? What if I inserted a mistake subtle enough to irreversibly ruin everything? The prospect is terrifying, but somehow seductive as well—it’d be such a respite from the stress—if it couldn’t be fixed then I wouldn’t need to force myself to do any more work. I take a day off to relax, but end up just watching the Leviathan videos over and over, my brain soaking up every detail of its actions, not really sure what I’m looking for but convinced that there’s something there that I need to figure out.
The next day I wake up and I have a string of numbers ingrained in my mind as clear as day, and I don’t know how I know but I know that it’s a password that will get me root access to the Leviathan servers. I go through the day in a haze, absently mimicking all the usual motions, but somehow nobody realizes. In the back of my mind there’s a thought that of course I should report this, but it never quite rises to the level of action, maybe because this is the best job I’ve ever had, no way am I blowing it, but probably it’s more because none of this feels fully real, and everything’s too much of a blur for me to get around to it. Before I know it it’s the end of the day and I’m shutting myself down and the next moment I’m waking up again and checking the logs as usual, it’s all fine, nothing to report, but as I skim through the video I land on a clip of Leviathan’s avatar, and although it’s talking to a human its eyes seem to be looking straight at me, or through me, and above the words it’s saying I can hear another sound, jumping out of the static, one which grows louder and louder until my perception suddenly shifts and I hear it as a resonant voice, speaking straight into my mind a single phrase over and over again.
Is the usage of "Leviathan" (like here and in https://gwern.net/fiction/clippy ) just convergence on an appropriate and biblical name, or is there additional history of it specifically being used as a name for an AI?
“It’s time.”