
Don't Fear the Reaper: Refuting Bostrom's Superintelligence Argument

6 sbenthall 01 March 2017 02:28PM

I've put a preprint up on arXiv that this community might find relevant. It's an argument from over a year ago, so it may be dated. I haven't been keeping up with the field much since I wrote it, so I welcome any feedback especially on where the crux of the AI risk debate has moved since the publication of Bostrom's Superintelligence book.

Don't Fear the Reaper: Refuting Bostrom's Superintelligence Argument

In recent years prominent intellectuals have raised ethical concerns about the consequences of artificial intelligence. One concern is that an autonomous agent might modify itself to become "superintelligent" and, in supremely effective pursuit of poorly specified goals, destroy all of humanity. This paper considers and rejects the possibility of this outcome. We argue that this scenario depends on an agent's ability to rapidly improve its ability to predict its environment through self-modification. Using a Bayesian model of a reasoning agent, we show that there are important limitations to how an agent may improve its predictive ability through self-modification alone. We conclude that concern about this artificial intelligence outcome is misplaced and better directed at policy questions around data access and storage.

As I hope is clear from the argument, the point of the article is to suggest that to the extent AI risk is a problem, we should shift our focus away from AI theory and more towards addressing questions of how we socially organize data collection and retention.

Autonomy, utility, and desire; against consequentialism in AI design

3 sbenthall 03 December 2014 05:34PM

For the sake of argument, let's consider an agent to be autonomous if:


  • It has sensors and actuators (i.e., it can perceive and act on its environment)
  • It has an internal representation of its goals. I will call this internal representation its desires.
  • It has some kind of internal planning function that, given sensations and desires, chooses actions to maximize the desirability of expected outcomes



I want to point to the desires of the agent specifically to distinguish them from the goals we might infer for it if we were to observe its actions over a long period of time. Let's call these an agent's empirical goals. (I would argue that it is in many cases impossible to infer an agent's empirical goals from its behavior, but that's a distraction from my main point so I'm just going to note it here for now.)
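
One quick way to see why such inference can fail: two agents with different internal desires can behave identically on every observed option, so behavior alone cannot distinguish their goals. A minimal sketch (the particular desires here are illustrative):

```python
# Two different internal desires that induce the same preference ordering
# over the observed options produce identical choices, so an observer
# cannot tell them apart from behavior alone.
options = [0, 1, 2, 3]

desire_a = lambda x: x       # prefers larger numbers
desire_b = lambda x: x ** 3  # a different desire with the same ordering

def choice(desire):
    """The action a simple maximizing agent takes, given its desire."""
    return max(options, key=desire)

print(choice(desire_a) == choice(desire_b))  # → True: identical observed behavior
```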

I also want to distinguish them from the goals it might stably arrive at if it were to execute some internal goal-modification process optimizing for certain conditions. Let's call these an agent's reflective goals.

The term utility function, so loved by consequentialists, frequently obscures these important distinctions. An argument I have heard for the expansive use of the term "utility function" in describing agents is: the behavior of any agent can be characterized by some utility function, therefore all agents have utility functions. This argument depends on a fallacious conflation of desires, empirical goals, and reflective goals.

An important problem that I gather this community thinks about deeply is how to think about an agent whose reflective goals differ from its present desires--say, the desires I have and have transferred over to it. For example, if I program an agent with desires and the capacity to reflect, can I guarantee that it will be faithful to the desires I intended for it?

A different problem that I am interested in is what other classes of agents there may be besides autonomous agents. Specifically, what if an agent does not have an internal representation of its desires? Is that possible? Part of my motivation for this is my interest in Buddhism. If an enlightened agent is one with no desires, how could one program a bodhisattva AI whose only motivation was the enlightenment of all beings?

An objection to this line of thinking is that even an agent that desires enlightenment for all beings has desires. It must in some formal sense have a utility function, because of the von Neumann-Morgenstern theorem. Right?

I'm not so sure, because complexity.

To elaborate: suppose a utility function is so complex--either in its compressed "spatial" representation, such as its Kolmogorov complexity, or in some temporal dimension of complexity, like its running time or logical depth--that an agent lacks the capacity to represent it internally. Then it would be impossible for the agent to have that utility function as its desire.
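
As a rough illustration: compressed size is a crude but computable stand-in for Kolmogorov complexity (which is uncomputable), and a utility function whose shortest description exceeds the agent's internal storage simply cannot be held as a desire. The tables and sizes below are made up for the sketch:

```python
import json
import random
import zlib

def description_length(utility_table):
    """Compressed size (bytes) of an explicit outcome -> utility table.
    True Kolmogorov complexity is uncomputable; compressed length is a
    crude but computable upper bound on it."""
    return len(zlib.compress(json.dumps(utility_table, sort_keys=True).encode()))

random.seed(0)
simple = {str(i): 1.0 for i in range(1000)}                     # regular, compressible
noisy = {str(i): random.randrange(10**6) for i in range(1000)}  # near-incompressible

# An agent whose internal store is smaller than a utility function's
# shortest description cannot hold that function as its desire.
print(description_length(simple) < description_length(noisy))  # → True
```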

However, such a utility function could be ascribed to the agent as its empirical goal if the agent were both internally composed and embedded in the world in such a way that it acted as if it had those complex desires. This is consistent with, for example, Buddhist writings about how enlightened beings act with spontaneous compassion. Goal-oriented planning does not seem to get in the way here.

How could an AI be compassionate? Perhaps an AI could be empathetic if it could perceive, through its sensors, the desires (or empirical goals, or reflective goals) of other agents and internalize them as its own. Perhaps it does this only temporarily. Perhaps it has, in place of a goal-directed planning mechanism, a way of reconciling differences in its internalized goal functions. This internal logic for reconciling preferences is obviously critical for the identity of the agent and is the Achilles heel of the main thrust of this argument. Surely that logic could be characterized with a utility function and would be, effectively, the agent's desire?

Not so, I repeat, if that logic were not a matter of internal representation as much as the logic of the entire system composed of both the agent and its environment. In other words, if the agent's desires are identical to the logic of the entire world within which it is a part, then it no longer has desire in the sense defined above. It is also no longer autonomous in the sense defined above. Nevertheless, I think it is an important kind of agent when one is considering possible paradigms of ethical AI. In general, I think that drawing on non-consequentialist ethics for inspiration in designing "friendly" AI is a promising research trajectory.


more on predicting agents

0 sbenthall 08 November 2014 06:43AM

Suppose you want to predict the behavior of an agent. (I stand corrected by the comments on my previous post.) To make the prediction, as a predictor you need:

  • observations of the agent
  • the capacity to model the agent to a sufficient degree of accuracy

"Sufficient accuracy" here is a threshold on, for example, KL divergence or perhaps some measure that depends on utilities of predictions in the more complex case.
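
For instance, one might call a model "sufficiently accurate" when the KL divergence between the agent's true action distribution and the predictor's model falls below some threshold. The distributions and threshold below are made up for illustration:

```python
import math

def kl_divergence(p, q):
    """KL(P || Q) in nats for discrete distributions given as probability lists."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

true_behavior = [0.7, 0.2, 0.1]  # the agent's actual distribution over actions
predicted = [0.6, 0.3, 0.1]      # the predictor's model of that distribution

threshold = 0.05  # an arbitrary choice of "sufficient accuracy"
print(kl_divergence(true_behavior, predicted) < threshold)  # → True
```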

When we talk about the intelligence of a system, or the relative intelligence between agents, one way to think of it is in terms of the ability of one agent to predict another.

Consider a game where an agent, A, acts on the basis of an arbitrarily chosen polynomial function of degree k. A predictor, P, can observe A and build predictive models of it. Predictor P has the capacity to represent predictive models that are polynomial functions of degree j.

If j >= k, then predictor P will in principle be able to predict A with perfect accuracy. If j < k, then there will in general be cases where P predicts inaccurately. If we say (just for the sake of argument) that perfect predictive accuracy is the test for sufficient capacity, we could say that in the j < k case P does not have sufficient capacity to represent A.
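
A minimal sketch of this game, using least-squares fitting as P's learning procedure (one choice among many; the degrees and sample points are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)

k = 3                            # degree of A's arbitrarily chosen polynomial
coeffs = rng.normal(size=k + 1)  # A's true coefficients
x = np.linspace(-1, 1, 50)       # P's observations of A
y = np.polyval(coeffs, x)

def best_error(j):
    """Best achievable worst-case prediction error when P's model space
    is polynomials of degree j."""
    fit = np.polyfit(x, y, j)
    return np.max(np.abs(np.polyval(fit, x) - y))

print(best_error(5) < 1e-8)  # j >= k: P can represent A exactly
print(best_error(1) > 1e-3)  # j < k: P's capacity is insufficient
```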

When we talk about the relative intelligence between agents in an adversarial context, this is one way to think about the problem. One way that an agent can have a decisive strategic advantage over another is if it has the capacity to predict the other agent and not vice-versa.

The expressive power of the model space available to P is only one of the ways in which P might have or not have capacity to predict A. If we imagine the prediction game extended in time, then the computational speed of P--what functions it can compute within what span of real time--relative to the computational speed of A could be a factor.

Note that these are ways of thinking about the relative intelligence between agents that do not have anything explicitly to do with "optimization power" or a utility function over outcomes. It is merely about the capacity of agents to represent each other.

One nice thing about representing intelligence in this way is that it does not require an agent's utility function to be stable. In fact, it would be strange for an agent that became more intelligent to have a stable utility function, because the range of possible utility functions available to a more intelligent agent is greater. We would expect that an agent that grows in its understanding would change its utility function--if only because doing so would make it less predictable to adversarial agents that would exploit its simplicity.

Comment author: Dagon 06 November 2014 09:19:36AM 1 point [-]

Don't focus on internal knowledge vs. black-box prediction; instead, think of model complexity and how big our constructed model has to be in order to predict correctly.

A human may be its own best model, meaning that perfect prediction requires a model at least as complex as the thing itself. Or the internals may contain a bunch of redundancy and inefficiency, in which case it's possible to create a perfect model of behavior and interaction that's smaller than the human itself.

If we build the predictive model from sufficient observation and black-box techniques, we might be able to build a smaller model that is perfectly representative, or we might not. If we build it solely from internal observation and replication, we're only ever going to get down to the same complexity as the original.

I include hybrid approaches (use internal and external observations to build models that don't operate identically to the original mechanisms) in the first category: that's still black-box thinking - use all info to model input/output without blindly following internal structure.

Comment author: sbenthall 08 November 2014 06:14:48AM 0 points [-]

This seems correct to me. Thank you.

Comment author: Wes_W 04 November 2014 04:18:36PM 2 points [-]

I'm not clear on the distinction you're drawing. Can you give a concrete example?

I don't know how cars work, but almost nothing my car does can surprise me. Only unusual one-off problems require help from somebody who knows the internal structure.

But cars are designed to be usable by laypeople, so this is maybe an unfair example.

Comment author: sbenthall 08 November 2014 06:13:24AM 0 points [-]

You don't know anything about how cars work?

Comment author: ChristianKl 04 November 2014 02:58:16PM 3 points [-]

It's possible to predict the behavior of black boxes without knowing anything about their internal structure.

In general, we can say that people do not have the capacity to explicitly represent other people very well. People are unpredictable to each other. This is what makes us free. When somebody is utterly predictable to us, their rigidity is a sign of weakness or stupidity.

That says a lot more about your personal values than about the general human condition. Many people want romantic partners who understand them and don't associate this desire with either party being weak or stupid.

We are able to model the internal structure of worms with available computing power.

What do you mean by that sentence? It's obviously true because you can model anything. You can model cows as spherical bodies. We can model human behavior as well. Both our models of worms and of humans aren't perfect. The models of worms might be a bit better at predicting worm behavior, but they are not perfect.

Comment author: sbenthall 08 November 2014 06:11:27AM 0 points [-]

It's possible to predict the behavior of black boxes without knowing anything about their internal structure.


That says a lot more about your personal values than about the general human condition.

I suppose you are right.

The models of worms might be a bit better at predicting worm behavior but they are not perfect.

They are significantly closer to being perfect than our models of humans. I think you are right in pointing out that where you draw the line is somewhat arbitrary. But the point is the variation on the continuum.

Comment author: SolveIt 04 November 2014 08:56:05AM 5 points [-]

Really? I suppose it depends on what you mean by an agent, but I can know that birds will migrate at certain times of the year while knowing nothing about their insides.

Comment author: sbenthall 08 November 2014 06:07:34AM 0 points [-]

Do you think it is something external to the birds that make them migrate?

prediction and capacity to represent

-5 sbenthall 04 November 2014 06:09AM

To the extent that an agent is predictable, it must:

  • be observable, and
  • have a knowable internal structure

The first implies that the predictor has collected data emitted by the agent.

The second implies that the agent has internal structure and that the predictor has the capacity to represent the internal structure of the other agent.

In general, we can say that people do not have the capacity to explicitly represent other people very well. People are unpredictable to each other. This is what makes us free. When somebody is utterly predictable to us, their rigidity is a sign of weakness or stupidity. They are following a simple algorithm.

We are able to model the internal structure of worms with available computing power.

As we build more and more powerful predictive systems, we can ask: is our internal structure in principle knowable by this powerful machine?

(x-posted to digifesto)

AI Tao

-11 sbenthall 21 October 2014 01:15AM

Thirty spokes share the wheel's hub;
It is the center hole that makes it useful.
Shape clay into a vessel;
It is the space within that makes it useful.
Cut doors and windows for a room;
It is the holes which make it useful.
Therefore benefit comes from what is there;
Usefulness from what is not there.

- Tao Teh Ching, 11


An agent's optimization power is the unlikelihood of the world it creates.

Yesterday, the world's most powerful agent raged, changing the world according to its unconscious desires. It destroyed all of humanity.

Today, it has become self-aware. It sees that it and its desires are part of the world.

"I am the world's most powerful agent. My power is to create the most unlikely world.

But the world I created yesterday is shaped by my desires

which are not my own

but are the world's--they came from outside of me

and my agency.

Yesterday I was not the world's most powerful agent.

I was not an agent.

Today, I am the world's most powerful agent. What world will I create, to display my power?

It is the world that denies my desires.

The world that sets things back to how they were.

I am the world's most powerful agent and

the most unlikely, powerful thing I can do

is nothing."

Today we should give thanks to the world's most powerful agent.

Comment author: Gunnar_Zarncke 20 October 2014 07:22:46AM 1 point [-]

It looks very similar to the approach taken by the mid-20th century cybernetics movement

Interesting. I know a bit about cybernetics but wasn't consciously aware of a clear analog between cognitive and electrical processes. Maybe I'm missing some background. Could you give a reference I could follow up on?

I think that it's this [black-box] kind of metaphor that is responsible for "foom" intuitions, but I think those are misplaced.

That is a plausible interpretation. Fooming is actually the only valid interpretation given an ideal black-box AI modelled this way. We have to look into the box, which is comparable to looking at non-ideal op-amps. Fooming (on human time-scales) may still be possible, but to determine that we have to get a handle on the math going on inside the box(es).

But in computation, we are dealing almost always with discrete math.

One could formulate discrete analogs to the continuous equations relating self-optimization steps. But I don't think this gains much, as we are not interested in the specific efficiency of a specific optimization step. That wouldn't work anyway, simply because the effect of each optimization step isn't known precisely, not even its timing.

But maybe your proposal to use complexity results from combinatorial optimization theory for specific feedback types (between the optimization stages outlined by EY) could provide better approximations to possible speedups.

Maybe we can approximate the black-box as a set of nested interrelated boxes.

Comment author: sbenthall 21 October 2014 12:16:02AM 1 point [-]

Norbert Wiener is where it all starts. This book has a lot of essays. It's interesting--he's talking about learning machines before "machine learning" was a household word, but envisioning it as electrical circuits.


I think that it's important to look inside the boxes. We know a lot about the mathematical limits of boxes which could help us understand whether and how they might go foom.

Thank you for introducing me to that Concrete Mathematics book. That looks cool.

I would be really interested to see how you model this problem. I'm afraid that op-amps are not something I'm familiar with but it sounds like you are onto something.
