Crossposted at the Intelligent Agents Forum.

EDIT: This method is not intended to solve extortion, just to remove the likelihood of extremely terrible outcomes (and slightly reduce the vulnerability to extortion).

A full solution to the extortion problem is sorely elusive. However, there are crude hacks that we can use to mitigate the downside.

Suppose we figured out that a friendly AI should be maximising an unbounded utility function U. The extortion risk is that another AI could threaten the FAI with unbounded disutility if it didn't go along with its plans. This gives the extorting AI - the EAI - a lot of leverage, and things could go very badly if the EAI acts on its threat.

To combat this, we first have to figure out a level z of utility that is a lower bound on what U could ever reach naturally and realistically.

By "naturally" we mean that U going below z would require not just incompetence or indifference, but some AI actively and deliberately arranging the lowering of U. And "realistically" just means that we're confident that getting U lower than z by chance, or having a U-minimising AI, are exceedingly low.

Then what we can do is cut off U at the level z, replacing U with U' = max(U, z). See z indicated by the red line on this graph of U' versus U:
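In code, the truncation is just a pointwise maximum over the original utility. Here is a minimal sketch, assuming U is available as a real-valued function of outcomes (the names are illustrative, not from the post):

```python
def truncate_utility(U, z):
    """Build U' = max(U, z): any outcome worth less than z is valued at exactly z.

    U: the original unbounded utility function (outcome -> real).
    z: the utility floor, chosen so that U only falls below it through
       deliberate sabotage, never by chance or mere incompetence.
    """
    def U_prime(outcome):
        return max(U(outcome), z)
    return U_prime


# Toy usage: utilities above the floor are untouched; below it, all of the
# EAI's threats look the same to the FAI.
U = lambda outcome: float(outcome)
U_prime = truncate_utility(U, z=-1000.0)
assert U_prime(42) == 42.0
assert U_prime(-10**9) == -1000.0
```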

What's the consequence of this? First of all, it ensures that no EAI would threaten to reduce U (the utility we really care about) below z, because that is not a threat to the FAI. This reduces the EAI's leverage, and reduces the damage if it does act on its threat.

Since levels of U below z are exceedingly unlikely to happen by chance, the fact that the FAI has the wrong utility below z shouldn't affect its performance much. And, even in that zone, the AI is still motivated to climb U above z.

But we may still feel unhappy about the flatness of that curve, and want it to still prefer higher U to exceedingly low values. If so, we can replace U with U'' as follows (the blue line is at z-1):

In this case, the EAI will not seek to reduce U below z-1 (in fact, it will specifically target that value), while the FAI has the correct ordering of lower values of U. The utility is weird around z, granted, but this is a place where the FAI would not want to be and would almost certainly not reach by accident.
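The exact shape of U'' isn't pinned down above, but one curve that fits the description - a sketch under my own assumptions, not necessarily the author's - squashes every value of U below z into the interval (z-1, z) while preserving the ordering:

```python
import math

def soften_utility(U, z):
    """Build U'': equal to U at or above z, but below z the values are squashed
    into the open interval (z - 1, z), keeping their relative ordering.

    The FAI still prefers higher U everywhere, yet the most an EAI can cost it
    below z is one util: it can push U'' towards z - 1 but never past it.
    """
    def U_double_prime(outcome):
        u = U(outcome)
        if u >= z:
            return u
        # exp(u - z) rises from 0 towards 1 as u rises towards z, so the
        # result rises from z - 1 up to z and meets U continuously at u = z.
        return z - 1 + math.exp(u - z)
    return U_double_prime


# Toy usage: the ordering below z is preserved, but the downside is bounded.
U_dp = soften_utility(lambda o: float(o), z=-1000.0)
assert U_dp(-999.0) == -999.0                        # at/above z: unchanged
assert -1001.0 < U_dp(-1005.0) < U_dp(-1002.0) < -1000.0
```

With this particular construction the EAI can only push U'' towards z-1 asymptotically, and the kink at z is the "weirdness" the post mentions.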

Though this method does not eliminate the threat of extortion, it does seem to reduce its impact.

17 comments

That line of thought seems... misguided. For a quick illustration do s/threat/credible threat/g

Effectively you are trying to estimate The Worst That Could Happen and are telling your AI to discount all outcomes below your estimate.

You will need to trust that estimate A LOT.

You will need to trust that estimate A LOT.

Not particularly. You can estimate the likely loss and likely gain from that utility change, as with anything. As long as you're reasonably certain that the bottom parts of the utility function are more likely to be accessed through extortion than through other means, this is a rational thing to do. Absent a proper theory of extortion and attendant decision theory, of course.

As long as you're reasonably certain that the bottom parts of the utility function are more likely to be accessed through extortion than through other means

THIS is the key (along with some explanation of why you think extortion is different than some other interaction with different-valued entities). It's massively counter to my intuitions - I think bottom parts of utility functions are extremely common in natural circumstances without blaming a cause that can be reasoned or traded with.

Think of a total-utilitarianism-style approach, where you can take any small disutility and multiply it again and again.

OK. Why would this imply extortion rather than simple poverty?

Because you're the one creating the multiple instances of disutility, using a fraction of the resources of the cosmos.

Maybe more description of the scenario would help. Presumably there's no infinity here - there's a bound to the disutility (for you; presumably it's utility for me) I can get with my fraction of the cosmos. What do you think the proper reaction of an FAI (or a human, for that matter) is, and why is it different for repeated small events than for one large event?

You can estimate the likely loss and likely gain from that utility change, as with anything.

You can try. Your estimate is likely to be very diffuse and uncertain -- the issue is that you are trying to get a handle on the distribution tail and that is quite hard to do (see Taleb's black swans, etc.)

As long as you're reasonably certain that the bottom parts of the utility function are more likely to be accessed through extortion than through other means, this is a rational thing to do

Not at all -- you're forgetting about the magnitude of consequences.

Let's say you have a blackmailer who wants a pony and she has the capability to meddle with your AI's sensors. Lo and behold, she walks up to the AI and says "I want a pony! Look, there is a large incoming asteroid on a collision course with Earth. Gimme a pony and I'll tell you if it's real".

Ah, says you the designer. I estimate that the blackmailer is bluffing in 99% of the cases. That "bottom part of the utility function" (aka The Sweet Meteor Of Death) is much more likely to be accessed through extortion, a hundred times more likely, in fact.

Therefore I will instruct the AI to disregard any data that tells it there's an incoming asteroid on a collision course. And voila -- the blackmailer doesn't get a pony.

What could possibly go wrong?

The sweet meteor of death is well above the z point. Complete human extinction is above the z point.

This hack is not intended to deal with normal extortion, it's intended to cut off really bad outcomes.

it's intended to cut off really bad outcomes

What would these be? Can you give a couple of examples?

Are you basically trying to escape Pascal's Mugging?

Are you basically trying to escape Pascal's Mugging?

The extortion version of that, yes.

What's that? If I don't give in to your threat, you'll shoot me in the foot? Well, two can play at that game. If you shoot me in the foot, just watch, I'll shoot my other foot in revenge.

And then I'll bleed on you!

This strategy is dominated by the one where you equip the FAI with the decision theory that made you think this is a good idea, because then it'll use an idea at least as good as this.

And what decision theory is that? We still don't have it, which is why I'm coming up with hacks.

I'm still convinced this is a red herring. Focus on making sure the FAI's utility is aligned with whatever values you have, and it will be neither more nor less susceptible to extortion/trade than is rational for maximizing those values.

If some EAI credibly threatens to destroy the universe, I WANT my FAI to stop that outcome, which likely means cooperating.

Yeah I think I agree with this. Although I can sort of see the intuition this is coming from ("If you don't do this, I'll torture you" --> just make it impossible to feel pain beyond a certain magnitude), I think this causes some hefty problems in other areas, such as "If you don't do this, I'll kill you." If an agent is trying to maximize the sum of its utility over it's entire lifetime, murdering it would result in the loss of that entire sum of utility, and the agent has every incentive to preserve itself. That might mean giving in to the extortion. Whatever loss was incurred by giving into the extortion might be made up for by not being killed, and preserving it's ability to continue performing actions, if that were the situation.