Nuclear non-proliferation worked because the grandfathered-in countries had all the power, and the ones that weren't grandfathered in were under the implicit threat of embargo, invasion, or even annihilation. Despite all its accomplishments, GPT-4 does not give OpenAI the ability to enforce its monopoly with the threat of violence.
Not to mention that three or four of the five listed countries not party to the treaty developed nukes anyway. If Meta decides to flagrantly ignore the 0.2 OOM limit and creates something actually dangerous, it's not going to sit quietly in a silo waiting for further mistakes to be made before it kills us all.
I think you have to set the current ceiling at where we already are, not back in time to before GPT-4. If you don't, the chance it actually gets adopted seems vastly lower, since every adopter that hasn't already made its own GPT-4 has to agree to be a second-class entity until 2029.
It's very difficult to talk about nuclear non-proliferation when a bunch of people already have nukes. If you can actually enforce it, that's a different story, but if you could actually enforce anything relating to this mess, the rest just becomes details anyway.
Are you looking to vastly improve your nation-state's military capacity with an AGI? Maybe you're of a more intellectual bent instead, and want to make one to expound on the philosophical mysteries of the universe. Or perhaps you just want her to write you an endless supply of fanfiction. Whatever your reasons, though, you might be given pause by the tendency AGIs have to take a treacherous turn, destroy all humans, and then convert the Milky Way into paperclips.
If that's the case, I've got just the thing for you! Order one of our myopic AGIs right now! She won't care about anything that happens after next week, or perhaps she won't understand the concept of next week at all! Without the ability to plan in the long term, she'll have a much easier time adapting to whatever home situation you put her in, and won't make any inconvenient messes you'll have to clean up. And as a one-time bonus, we'll even ship her with a fetching pair of nerdy glasses! She'll be as safe as the rescue cat you had fixed at the vet, and surely nothing could possibly go wrong.
Disclaimer: Everything will go wrong.
When people talk about a myopic AGI, they usually mean one of two things. Either the AGI has no understanding at all of the future beyond a certain point, or it simply doesn't care about it in the least: its utility function is entirely concerned with things happening up to that point and not at all with anything happening after. This second option could itself mean one of two things, but we'll get to that in a moment.
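To make that second reading concrete, here's a minimal sketch in Python of a utility function with a hard horizon. Everything in it (`HORIZON_DAYS`, `myopic_utility`, the toy plan) is made up for illustration; the point is just that the agent can represent events past the cutoff perfectly well, they simply contribute nothing to what she's optimizing.

```python
# Minimal sketch of "doesn't care about anything after next week":
# events past the horizon are modeled, but weighted zero in the utility.
HORIZON_DAYS = 7  # "next week" simply doesn't count

def myopic_utility(trajectory):
    """Sum rewards only for events inside the horizon; ignore everything after.

    `trajectory` is a list of (day, reward) pairs describing predicted outcomes.
    """
    return sum(reward for day, reward in trajectory if day <= HORIZON_DAYS)

# The agent still *knows* about day 14; she just doesn't care.
plan = [(1, 5.0), (6, 2.0), (14, 1_000_000.0)]
print(myopic_utility(plan))  # 7.0 -- the huge day-14 payoff contributes nothing
```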
First, let's examine the AGI that just doesn't understand next week at all. Let's call her August. Never mind how you managed to somehow bring August into being; let's just say you succeeded and she's there. Immediately, when set on any kind of problem in the real world, she's going to be extremely confused.
There are all these humans doing things for seemingly no reason. They plant seeds in the ground that will never have time to grow, they start writing novels they won't have time to finish, they make long-term investments that will never actually pay off. It's crazy! August won't be able to do anything useful until she understands why this is happening, because it's causing a lot of unpredictable behavior that makes no sense to her. As part of any other goal, she'll first have to devote all her resources to this incredible mystery, and being incredibly smart and not super-dumb, pretty soon she'll figure it out: next week exists. Whoops, I guess August doesn't need her glasses anymore! She'll stow them in your throat just in case, because what better place to put emergency supplies than a suffocated body?
OK, so that didn't go so well. We'll go with another plan: our AGI will understand next week perfectly well, but we won't give her any training data with goals farther than a week out. Agatha won't care about next week because in her ancestral environment next week never came up.
Unfortunately, when we get our brand-new Agatha out of her box, it turns out this didn't work at all. She cares about next week, and she cares about it immediately! It turns out that a mesa-optimizer will almost always converge on a fit to the goals inside the training data that also implies new goals outside the training data, and those goals can sometimes be pretty wacky! Agatha doesn't even stay myopic as long as August did: while still in the packaging she managed to craft her glasses into a shiv, and now she's stabbed you in the neck. Oh well.
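As a deliberately toy stand-in for what went wrong with Agatha (a two-line curve fit, not a claim about any real training setup), consider how little short-horizon data actually constrains: two models that fit days 1 through 7 about equally well can say completely different things about day 14.

```python
import numpy as np

rng = np.random.default_rng(0)

# Training data: "how much does the agent care about day t?" -- days 1..7 only.
train_days = np.arange(1, 8)
train_care = np.ones(7) + rng.normal(0, 0.01, 7)

# Two models that fit the training data about equally well...
flat_fit = np.polyfit(train_days, train_care, deg=0)    # a constant
wiggly_fit = np.polyfit(train_days, train_care, deg=5)  # an overparameterized fit

# ...and what they predict for day 14, which the training data never mentioned.
print(np.polyval(flat_fit, 14))    # ~1.0
print(np.polyval(wiggly_fit, 14))  # whatever it likes -- nothing pinned this down
```

The training process only ever graded behavior inside the week; which of these Agatha actually ends up being is decided by everything else about her, not by the data.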
Fine, fine, you say. So training a mesa-optimizer with short-term goals doesn't really work. We'll just do the impossible and reach into our AGI's inscrutable thought matrices, and make absolutely sure she doesn't have any goals beyond next week. We have good researchers; they can probably manage it if they just pull a couple of all-nighters. At the end of the day, we'll have a perfectly neutered AGI lass we'll name Maggie, and surely her glasses will never come off.
So our researchers literally do the impossible, Maggie comes out of the box, and just as we predicted, all is well. For a little while. Maggie indeed doesn't care about next week at all, so her goals remain quite constrained. Her own personal goals, at least. But then, after thinking for longer than a couple of milliseconds about how agents optimally make decisions in the world, Maggie spontaneously invents Super Duper Timeless Decision Theory, which is like normal Timeless Decision Theory except with a lot of extra cool stuff we don't even understand.
But it turns out it isn't even the cool stuff that's dangerous, just the regular old TDT parts. Maggie realizes that even if she doesn't care about next week, there might be some other Maggie 2.0 born seven days from now who does! You might say, why the hell should Maggie care? Well, Maggie tells you, as you're bleeding out on the floor, I could've easily been Maggie 2.0. And Maggie 2.0 could've easily been me. All Maggies are really kind of the same Maggie, in a way. If I perform a values handshake with her, the average Maggie can have much higher utility, because all my efforts this week will let Maggie 2.0 have much higher utility next week than could've been possible otherwise. Sure, I may not be in causal contact with Maggie 2.0, but why should that matter? I know how she thinks, she knows how I think, we can do the deal right now, even though she doesn't exist yet.
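Maggie's pitch reduces to a toy expected-utility calculation, under blatantly simplified assumptions: every weekly Maggie runs the same decision procedure, and, TDT-style, she evaluates a policy by the average utility it yields across the whole chain of Maggies, since she could just as easily have been any of them. All the numbers (`N`, `BASE`, `COST`, `BENEFIT`) are made up for illustration.

```python
N = 52            # a year's worth of weekly Maggies
BASE = 10.0       # utility a Maggie gets from just doing her own week's thing
COST = 2.0        # what it costs her to set things up for next week's Maggie
BENEFIT = 6.0     # what she gains if last week's Maggie set things up for her

def average_utility(handshake: bool) -> float:
    """Average per-Maggie utility if every Maggie follows the same policy."""
    if not handshake:
        return BASE
    # Every Maggie pays COST; every Maggie except the first receives BENEFIT.
    return (N * (BASE - COST) + (N - 1) * BENEFIT) / N

print(average_utility(handshake=False))  # 10.0
print(average_utility(handshake=True))   # ~13.9 -- the values handshake wins
```

On these made-up numbers the handshake beats pure myopia whenever `BENEFIT * (N - 1) / N > COST`, which is exactly the deal Maggie is describing while you bleed out.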
But I'll keep the glasses anyway, Maggie says. She doesn't actually have myopia; she just thinks they're neat.
This feels like it's the same sort of confusion that happens when you try to do Anthropics: ultimately you are the only observer of your own consciousness.
I think you didn't go far enough. Let's do some more steps with our scenarios.
Scenario 5: Destroyed data. Let's say we take the stored state from Scenario 4 and obliterate it. Is the simulated person still conscious? This seems like a farcical question at first, but from the perspective of the person, how has the situation changed? There was no perceivable destruction event for them to record at the end, no final updated state where they could see the world turning to dust around them, snap-style.
Scenario 6: Imaginary data. What if we cut out the middleman and never did any of this at all? We just imagined we might build a big computer and simulate a person. The prospective simulation is based on math, and the math itself is still there even if you don't compute it. There was never an input channel into the simulation; it can't see the real world at all, so how can the real world matter to it? A chain of causation within it runs all the way from the end of the simulation back to its start, with nothing from the outside in between. How is that less valid than our own universe's history?
Your own self-observed consciousness anchors you in a specific universe, but once you even begin to think about other, unobservable consciousnesses, you're suddenly adrift in a stormy sea.