One of the things I've been thinking about is how to safely explore the nature of intelligence. I'm unconvinced by FOOM arguments, and I would rather we didn't avoid AI entirely just because we can't solve Yudkowsky-style Friendliness. So we need some method of experimentation to determine how powerful intelligence actually is.

So can we create an AI that has a very limited scope? That is, can we avoid the basic AI drives by setting goals such as avoiding changing the world and turning itself off after achieving a small goal?

Let us say the goal is to change the colour of a ball from green to red. You can leave paint, paint brushes, and a robot around to make it easy, but the AI might determine that the best (least world-changing) way is to create a dye-manufacturing bacterium instead. How well it did on the test would also let you gauge the optimising power of the system, and so tell you whether we need "first mover/winner take all" style Friendliness or societal Friendliness for many AIs.
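Very roughly, and only as a toy illustration (the dict-of-facts world representation, the listed facts, and the trade-off constant below are all made up for the example), the kind of scoring I have in mind might look like this:

```python
# Toy sketch only: the "world" is a dict of named facts, the impact measure is
# just a count of facts that changed, and LAMBDA is an arbitrary trade-off.
LAMBDA = 0.25

def bounded_utility(world_after, world_before, goal=("ball", "red")):
    key, wanted = goal
    task_reward = 1.0 if world_after.get(key) == wanted else 0.0
    # crude impact penalty: how many facts differ from the baseline world
    impact = sum(1 for k in world_before if world_after.get(k) != world_before[k])
    return task_reward - LAMBDA * impact

baseline = {"ball": "green", "paint_tin": "full",
            "bacteria": "absent", "ecosystem": "intact"}
plans = {
    "paint the ball":        {"ball": "red", "paint_tin": "used",
                              "bacteria": "absent", "ecosystem": "intact"},
    "engineer dye bacteria": {"ball": "red", "paint_tin": "full",
                              "bacteria": "everywhere", "ecosystem": "altered"},
    "do nothing":            dict(baseline),
}
for name, outcome in plans.items():
    print(name, bounded_utility(outcome, baseline))
# painting scores 0.5, the bacteria plan 0.25, doing nothing 0.0
```

The hard part, obviously, is that the real world doesn't come packaged as a tidy list of facts to diff against.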

Creating AIs without such drives seems easier than creating ones whose goals are to shape the rest of human history. What do other people think?

I asked a similar question some time ago. The strongest counterargument offered was that a scope-limited AI doesn't stop rogue unfriendly AIs from arising and destroying the world.

The strongest counterargument offered was that a scope-limited AI doesn't stop rogue unfriendly AIs from arising and destroying the world.

Maybe I misinterpreted the argument. If it means that we need an unbounded friendly AI to deal with unbounded unfriendly AI, it makes more sense. The question then comes down to how likely it is that, once someone discovers AGI, others will be able to discover it as well or make use of the discovery, versus the payoff from experimenting with bounded versions of such an AGI design before running an unbounded friendly version. In other words, how much can we increase our confidence that we have solved friendliness by experimenting with bounded versions, versus the risk associated with not taking over the world as soon as possible to impede unfriendly unbounded versions?

The strongest counterargument offered was that a scope-limited AI doesn't stop rogue unfriendly AIs from arising and destroying the world.

I don't quite understand that argument, maybe someone could elaborate.

If there is a rule that says 'optimize X for X seconds', why would an AGI treat 'optimize X' differently from 'for X seconds'? In other words, why is it assumed that we can succeed in creating a paperclip maximizer that cares strongly enough about the design parameters of paperclips to consume the universe (why would it do that as long as it isn't told to do so), but somehow ignores all design parameters that have to do with spatio-temporal scope boundaries or resource limitations?

I see that there is a subset of unfriendly AGI designs that would never halt, or that would destroy humanity while pursuing their goals. But how large is that subset? How many actually halt, or proceed very slowly?

(I wrote this before seeing timtyler's post.)

If there is a rule that says 'optimize X for X seconds', why would an AGI treat 'optimize X' differently from 'for X seconds'?

It does seem like you misinterpreted the argument, but one possible failure there is if the most effective way to maximize paperclips within the time period is to build paperclip-making Von Neumann machines. If it designs the machines from scratch, it won't build a time limit into them, because that won't increase the production of paperclips within the period of time it cares about.
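To make that concrete with a toy calculation (the numbers are invented and nothing here is rigorous): within the original agent's own horizon, a successor machine with the time limit built in and one without score exactly the same, so the limit is never worth paying any design cost to include.

```python
# Toy numbers only: the horizon and production rate are invented.
HORIZON = 100   # seconds the original agent is told to care about
RATE = 10       # paperclips per second produced by one successor machine

def clips_scored(machine_lifetime_seconds):
    # The agent only counts clips produced inside its own horizon.
    productive_seconds = min(machine_lifetime_seconds, HORIZON)
    return RATE * productive_seconds

print(clips_scored(HORIZON))  # successor with the time limit built in: 1000
print(clips_scored(10**9))    # successor that runs indefinitely: also 1000
```

Whatever the unlimited successor does after second 100 simply doesn't show up in the score.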

If there is a rule that says 'optimize X for X seconds', why would an AGI treat 'optimize X' differently from 'for X seconds'? In other words, why is it assumed that we can succeed in creating a paperclip maximizer that cares strongly enough about the design parameters of paperclips to consume the universe (why would it do that as long as it isn't told to do so), but somehow ignores all design parameters that have to do with spatio-temporal scope boundaries or resource limitations?

I discuss the associated problems here:

The first problem associated with switching such an agent off is specifying exactly what needs to be switched off to count as the agent being in an "off" state. This is the problem of the agent's identity. Humans have an intuitive sense of their own identity, and the concept usually delineates a fleshy sack surrounded by skin. However, phenotypes extend beyond that - as Richard Dawkins pointed out in his book, The Extended Phenotype.

For a machine intelligence, the problem is a thorny one. Machines may construct other machines, and set these to work. They may sub-contract their activities to other agents. Telling a machine to turn itself off and then being faced with an army of its minions and hired help still keen to perform the machine's original task is an example of how this problem might manifest itself.

I don't quite understand that argument, maybe someone could elaborate.

I think the idea is that if I make a perfectly safe AI by constraining it in some way, that doesn't prevent someone else from making an unsafe AI and killing us all.

That is, can we avoid the basic AI drives by setting goals such as avoiding changing the world and turning itself off after achieving a small goal?

"Avoid changing the world" is very hard to formalize. First, take a timeless view: there are no changes, only control over what actually happens. If the AI already exists, then it already exerts some effect on the future, controls it to some extent. "Not changing the world" can at this point only be a particular kind of control the AI exerts over the future. But what kind of control, exactly? And how ruthless would the AI be in pursuit of "not changing the world" as optimally as possible? It might wipe out humanity just to make sure it has enough resources to reliably not change the world in the future.

"Avoid changing the world" is very hard to formalize. First, take a timeless view: there are no changes, only control over what actually happens.

I don't think it is too hard. The AI can model counterfactuals, right? Simply model how the world would progress if the computer had no power but the ball was red. Then attempt to maximise the mutual information between this counterfactual model and the models of the world the AI creates for each possible action it could take. The more the models diverge, the less mutual information there is.

This might have failure modes where it makes you think that it had never been switched on and the ball was always red. But I don't see the same difficulties you do in specifying "changing the world".
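Here is a toy version of what I mean (purely illustrative: the discretised world states, the candidate plans, and their joint distributions are all made up), where the AI scores each candidate plan by the mutual information between the counterfactual world and the world it predicts if it executes that plan:

```python
import numpy as np

def mutual_information(joint):
    """Mutual information (in nats) of a 2-D joint probability table."""
    joint = joint / joint.sum()
    px = joint.sum(axis=1, keepdims=True)  # marginal over counterfactual states
    py = joint.sum(axis=0, keepdims=True)  # marginal over predicted actual states
    nz = joint > 0
    return float((joint[nz] * np.log(joint[nz] / (px @ py)[nz])).sum())

# Hypothetical joint beliefs over two coarse world states for two candidate
# plans. Rows index the counterfactual world (AI unpowered, ball already red);
# columns index the predicted world if the plan is executed.
plans = {
    "paint the ball": np.array([[0.45, 0.05],
                                [0.05, 0.45]]),         # worlds stay tightly coupled
    "engineer dye bacteria": np.array([[0.25, 0.25],
                                       [0.25, 0.25]]),  # worlds decouple entirely
}

print({name: round(mutual_information(j), 3) for name, j in plans.items()})
print("chosen plan:", max(plans, key=lambda name: mutual_information(plans[name])))
```

The failure mode I mentioned would show up here as a plan that keeps the two worlds correlated by manipulating the evidence rather than by actually leaving things alone.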

Don't say "it's not too hard" before you can actually specify how to do it.

Simply model how the world would progress if the computer had no power but the ball was red.

The ball wasn't red. What does it even mean that a "ball" is "red" or "not red"? How sure can the AI be that it got the intended meaning correctly, and that the ball is actually as red as possible? Should it convert the mass of the galaxy to a device that ensures optimal redness of the ball?

The difference between non-autonomous tools and AGIs is that AGIs don't fail to make an arbitrarily large effect on the world. And so if they have a tiny insignificant inclination to sort the rocks on a planet in a distant galaxy in prime heaps, they will turn the universe upside down to make that happen.

Red = "reflect electromagnetic radiation with a spectrum like X".

If you do not like the Red ball thing, feel free to invent another test, such as flipping a few bits on another computer.

Should it convert the mass of the galaxy to a device that ensures optimal redness of the ball?

No, as that would lead to a decrease in mutual information between the two models. It doesn't care about the ball any more than it does the rest of the universe. This may lead to it doing nothing and not changing the ball colour at all.

The difference between non-autonomous tools and AGIs is that AGIs don't fail to make an arbitrarily large effect on the world. And so if they have a tiny insignificant inclination to sort the rocks on a planet in a distant galaxy in prime heaps, they will turn the universe upside down to make that happen.

Generally yes. The question is whether we can design one that has no such inclinations, or only an inclination towards very moderate actions.

If you do not like the Red ball thing, feel free to invent another test, such as flipping a few bits on another computer.

No, it's the same problem. Specifying what counts as a physical "computer" is hard.

No, as that would lead to a decrease in mutual information between the two models.

What is a "model"? How does one construct a model of the universe? How detailed must it be, whatever it is? What resources should be expended on making a more accurate model? Given two "models", how accurate must a calculation of mutual information be? What if it can't be accurate? What is the tradeoff between making it more accurate and not rewriting the universe with machinery for making it more accurate? Etc.

There seem to be several orders of magnitude of difference between the two solutions for colouring a ball. You should have better predictions than that about what it can do. Obviously you shouldn't run anything remotely capable of engineering bacteria without a much better theory about what it will do.

I suspect "avoiding changing the world" actually has some human-values baked into it.

This seems to be trying to box an AI with its own goal system, which I think puts it in the tricky-wish category.

I suspect "avoiding changing the world" actually has some human-values baked into it.

See my reply to Vladimir Nesov.

This seems to be trying to box an AI with its own goal system, which I think puts it in the tricky-wish category.

Do you count CEV to be in the same category?

With two differences: CEV tries to correct any mistakes in the initial formulation of the wish (aiming for an attractor), and it doesn't force the designers to specify details like whether making bacteria is OK or not.

It's the difference between painting a painting of a specific scene, and making an auto-focus camera.

I do currently think it is possible to create a powerful cross-domain optimizer that is not a person and will not create persons or unbox itself or look at our universe or tile the universe with anything or make AI that doesn't comply with this. But I approach this line of thought with extreme caution, and really only to accelerate whatever it takes to get to CEV, because AI can't safely make changes to the real world without some knowledge of human volition, even if it wants to.

What if I missed something that's on the scale of the nonperson predicate? My AI works, creatively paints the apple, but somehow its solution is morally awful. Even staying within pure math could be bad for unforeseen reasons.