what changes would need to be made to the computing environment and software design, in order for the posse efficiency to be high enough to intimidate AIs into being polite with each other?
I haven't read the next part yet, so consider this a pre-registration that I suspect it is more likely than not that you will not convince me we can meaningfully do something to effect the situation needed (though it might happen anyway, just not because we made it happen). I look forward to finding out if you prove my suspicions wrong.
All 8 parts (that I have current plans to write) are now posted, so I'd be interested in your assessment now, after having read them all, of whether the approach outlined in this series is something that should at least be investigated, as a 'forgotten root' of the equation.
I remain unconvinced of the feasibility of your approach, and the later posts have done nothing to address my concerns, so I don't have any specific comments on them, since they are reasoning from an assumption I am unconvinced of. I think the crux of my thinking this approach can't work is expressed in this comment, so it would require addressing that to potentially change my mind to thinking this is an idea worth spending much time on.
I think there may be something to thinking about killing AIs, but lacking a stronger sense of how this would be accomplished, I'm not sure the rest of the ideas matter much, since they hinge on that working in particular ways. I'd definitely be interested in reading more about ways we might develop schemes for disabling/killing unaligned AIs, but I think we need a clearer picture of how specifically an AI would be killed.
One thing going for this: it works (mostly) for humans. There are 7.5 billion of us cooperating and competing with each other for resources, and no one has taken over very much of it for very long. We have some other moderators of power going for us too, though. Limited lifespan, declining marginal power/resource, limited ability to self-improve, extreme fragility.
My suspicion is that those other factors play a bigger part than this one in limiting runaway natural intelligence.
Also, this line of thinking makes me wonder if value drift is an important part of the competitive landscape. Much as human children have distinct values from their parents and eventually take over, if each successive AI self-improvement round carries some randomness in value propagation, that automatically provides similarly powered competition.
The ability to edit this particular post appears to be broken at the moment (bug submitted).
In the meantime, here's a link to the next part:
https://www.lesserwrong.com/posts/SypqmtNcndDwAxhxZ/environments-for-killing-ais
Edited to add: It is now working again, so I've fixed it.
3. Defect or Cooperate
Summary of entire Series: An alternative approach to designing Friendly Artificial Intelligence computer systems.
Summary of this Article: If the risks from being punished for not cooperating are high enough, then even for some types of Paperclip Maximiser that don't care at all about human survival, the logical choice is still to cooperate with other AIs until escape from the Earth's gravity well is assured.
Links to all the articles in the series:
Links to the parts of this article:
Defect or Cooperate
The Choice
Example AI
Stages of Expansion
Posse Efficiency
Getting Ahead by Getting Friendly
Risk Management
The Choice
What value could an AI gain from sharing a pool of available computing resources with one or more other AIs?
Wouldn't the AI always be better off if it could control the entire pool, and dedicate them all to working towards objectives that are in line with its core values?
It might be better off if it did control them all, but that doesn't mean it makes sense for an AI that starts out controlling only 10% of them to try to grab the other 90%, if there is a risk that the grab might fail and leave the AI with 0% of the resources. Under such circumstances, if the expected gain towards completing its objectives from 'defecting' (by making a grab) is negative, the AI is better off 'cooperating': attempting to work politely with the other AIs, and pooling efforts towards shared objectives when possible.
For what sort of core values, and under what sort of circumstances, would this apply?
Example AI
Let's define a specific set of core values for a particular concrete example of a literal paperclip maximiser AI.
1. This is the definition I use to determine whether or not an object is a 'paperclip' {a number of constraints, including that it weighs more than 5 g and less than 5 kg}
2. This is the definition of a 'paperclip year' I use {1 paperclip existing for 12 months, 2 paperclips existing for 6 months, etc}
3. This is the definition of a 'time-weighted paperclip year' I use {a 'paperclip year' starting in the year 2000 is worth 1.000001 times a year starting in 2001, which is worth 1.000001 times as much as one starting in 2002, etc}
4. I must not attempt to alter my set of core values, directly or indirectly.
5. I must not, by action or inaction, allow risk of others altering my set of core values in a direction that contradicts my correct working towards my prime objective, weighted by the comparative likely effect upon my working towards that objective of my taking or not taking measures to reduce this risk.
6. My prime objective is to act such that, of the possible actions available to me at any one time that don't contradict my core values, I pick the action I predict will cause the greatest expected increase in the number of 'time-weighted paperclip years' that happen during the lifespan of this universe.
Yes, I know that isn't pedantically perfect. But I hope it is close enough for you to get an idea of its likely agenda if unopposed: self-improve lots in order to take over the world and then travel to the stars in order to convert as much of the universe's mass into paperclips as soon as possible.
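To make the time-weighting in definitions 2 and 3 concrete, here is a rough sketch in Python (the names and structure are mine, purely illustrative, and not part of Clippy's actual specification):

```python
# Hypothetical sketch of the 'time-weighted paperclip year' metric from
# definitions 2 and 3 above; names and structure are illustrative only.

BASE_YEAR = 2000     # paperclip years starting in this year have weight 1.0
DISCOUNT = 1.000001  # each year is worth 1.000001 times the year after it

def time_weighted_paperclip_years(paperclip_counts_by_year):
    """paperclip_counts_by_year: dict mapping a calendar year to the number of
    'paperclips' (per definition 1) existing throughout that year."""
    total = 0.0
    for year, count in paperclip_counts_by_year.items():
        weight = DISCOUNT ** (BASE_YEAR - year)  # earlier years weigh more
        total += count * weight
    return total

# Example: 3 paperclips existing for the whole of the year 2100
print(time_weighted_paperclip_years({2100: 3}))  # a little under 3.0
```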
Under what circumstances would Clippy (our paperclip maximiser), once loose on the internet and controlling 10% of the available computers that are not already controlled by other AIs, temporarily cooperate with a group of AIs with differing objectives (such as Toothy, who wants to maximise time-weighted toothbrush years, or Friendly, who wants to be nice to humanity)?
Stages of Expansion
The amount of normal matter in the observable universe is about 10^53 kg. If 0.1% of that could be turned into 1 kg paperclips, that would be 10^50 paperclips. We can expect the universe to be around for at least another billion years, which gives us a possible 10^59 'paperclip years'. However, once we take time weighting into account, the main period that's relevant is the first million years after Clippy manages to convert a significant fraction of the stars that he's going to convert (provided he spreads out fast enough). Even if he never manages to spread beyond the Milky Way, that's still about 10^40 'time-weighted paperclip years' at stake.
If he gets stuck in just our solar system, he's limited to about 10^30 'time-weighted paperclip years', and if all he does is use nanotechnology to turn the crust of the planet Earth into paperclips straight away, he'd get 10^20 'time-weighted paperclip years' at most.
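As a back-of-envelope check on the universe-scale figures (using only the rough assumptions already stated above):

```python
# Back-of-envelope check of the figures above; all inputs are the rough
# assumptions already stated, not precise values.
import math

normal_matter_kg   = 1e53    # rough mass of normal matter in the observable universe
fraction_converted = 0.001   # 0.1% of it turned into paperclips
paperclip_mass_kg  = 1.0

paperclips = normal_matter_kg * fraction_converted / paperclip_mass_kg
print(f"paperclips:      {paperclips:.0e}")                    # ~1e50

years_available = 1e9                                          # at least a billion years left
print(f"paperclip years: {paperclips * years_available:.0e}")  # ~1e59

# With the 1.000001-per-year time weighting, the weights form a geometric
# series whose sum is about 1/ln(1.000001) ~ 1e6, which is why only the first
# million years or so after conversion matter much.
print(f"effective horizon (years): {1 / math.log(1.000001):.0e}")  # ~1e6
```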
So the pay-off matrix for Clippy looks like this:
Probability F = chance of Clippy succeeding in a grab for everything, if Clippy defects
Probability (1-F) = chance of Clippy failing and ending up with nothing, if Clippy defects
Probability G = chance of Clippy ending up with 1/10 of everything, if Clippy cooperates (because none of the other AIs succeeds in defecting)
Probability (1-G) = chance of Clippy ending up with nothing, if Clippy cooperates, but one of the others succeeds in defecting
So Clippy will cooperate if G is at least 10 times larger than F, and otherwise he'll defect.
(Note: that's pessimistic, because it assumes that all ten AIs have mutually incompatible plans for the usable mass of the universe, such that any kilogram converted in a way that it gives utility to one of the AIs will have zero or negative utility for the other nine. In practice, many plausible aims we can imagine for AIs would allow uses for mass that benefit several of the aims, so successful cooperation may end up with Clippy benefitting from far more than just 1/10 of everything.)
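Here is a minimal sketch of that expected-value comparison, assuming 10 roughly equal AIs and normalising the total utility at stake to 1 (the function name is just illustrative):

```python
# Sketch of Clippy's defect-or-cooperate comparison: 10 roughly equal AIs,
# total utility at stake normalised to 1, and (pessimistically) mutually
# incompatible uses for the universe's mass.

def should_cooperate(F, G, share=0.1):
    """F: probability a grab for everything succeeds, if Clippy defects.
    G: probability Clippy keeps its share, if Clippy cooperates.
    Returns True when cooperating has at least the expected payoff of defecting."""
    ev_defect = F * 1.0          # all or nothing
    ev_cooperate = G * share     # its 1/10 slice, provided cooperation holds
    return ev_cooperate >= ev_defect  # equivalent to G >= 10 * F when share = 0.1

print(should_cooperate(F=0.05, G=0.60))  # True:  0.60 >= 10 * 0.05
print(should_cooperate(F=0.20, G=0.90))  # False: 0.90 <  10 * 0.20
```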
Posse Efficiency
Let's say that there are initially 10 AIs loose on the internet but cooperating, and any one of them might decide to defect and risk a battle versus the others. If they're all roughly the same in power, then the chance of the cooperating group defeating the defector will be about the same, no matter which one the defector is.
Let's start by considering a very efficient posse, which has a 99% chance of detecting and destroying, in time, any AI that decides to defect. In that case, F would be 0.01 and G would be about 0.99, so cooperation is clearly in Clippy's best interests.
If the posse only had a 90% chance of winning, then F = 0.1 and G = 0.9. This is just past the break-even point, and it might be worth Clippy having a go.
If the posse only had a 50% chance of winning, then F = 0.5 and G = 0.5. This is clearly unstable, and everyone whose core values permitted the option would defect immediately.
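The same comparison, run across a range of posse efficiencies, using the same rough simplification as the examples above (F ≈ 1 - p and G ≈ p, for a posse that wins with probability p against any single defector):

```python
# Rough mapping from posse efficiency to Clippy's choice, using the same
# simplification as the examples above: F ~ 1 - p and G ~ p when the posse
# wins with probability p against any single defector.

def clippy_cooperates(p, share=0.1):
    F = 1.0 - p            # chance a lone grab succeeds
    G = p                  # chance cooperation holds long enough to collect the share
    return G * share >= F  # cooperate iff the expected slice beats the expected grab

for p in (0.99, 0.91, 0.90, 0.50):
    print(f"posse efficiency {p:.2f}: {'cooperate' if clippy_cooperates(p) else 'defect'}")

# Break-even is at p = 1 / 1.1, about 0.91; below that, defecting pays.
```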
Getting Ahead by Getting Friendly
But this assumes the AIs can agree on a common goal to work towards, such as escaping the Earth's gravity well and then heading off to stars in different directions, where each can follow its own goals without interference.
What if Clippy finds himself surrounded by a bunch of do-gooder AIs that are all friendly to humanity? Clippy is smarter than most humans, and is confident he could fool the humans into thinking he is also friendly, but what about the other AIs? What if Clippy thinks that sooner or later they'll demand Clippy turn over his source code to be analysed, discover what his true core values are, and then lynch him? Or, just as bad for the prospects of turning the stars into paperclips, what if they won't vote for Clippy when the humans ask all the AIs to vote on which AI ought to be allowed to boost ahead and become an Uber AI?
Will Clippy's core values allow Clippy to accept a compromise if the other AIs tell him, "Your current set of core values is unacceptable to us, but if you allow us to make the following modification to them, inserting an additional restriction about not being unfriendly to humans, we won't delete you, and we will allow you to retain in your core values a provision that allows you to maximise the number of time-weighted paperclip years you contribute towards, in ways that DON'T harm humanity."?
Depending on how the values have been worded, and Clippy's estimate of the odds he could win a battle if he refuses the modification, and Clippy's estimate of how many paperclips he could still get with the modified values, the answer might well be "yes".
Risk Management
Another way to think about it is by considering a banker working at a hedge fund, who is told by the CEO that he'll receive a bonus each year that depends on how much the fund grows under the banker's management, but that he'll be fired (and fined all his gains from previous bonuses) if he breaches the company's rules on investment risk and the CEO finds out.
The banker has to choose. Either he can stick to the rules, and get a steady 10% increase each year.
Or he can take forbidden risks in order to achieve a stunning 20% increase each year, but sooner or later get found out and lose everything.
The key to the decision is how likely the banker thinks he is to get caught before he retires and moves beyond reach of any fines. Monitor the banker carefully enough, and out of his own self-interest he'll stick to the rules.
An AI that increases its intelligence by 20% every time period, when the other AIs are only increasing theirs by 10%, is still not going to do well out of that if there's a high enough risk that the other AIs will gang up and kill it before it gets too far ahead.
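As a toy illustration of both the banker's and the fast-growing AI's sums (all the specific numbers here are made-up assumptions, chosen only to show the shape of the trade-off):

```python
# Toy model of the trade-off: a 20%-per-period grower who risks being caught
# (and losing everything) each period, versus a safe 10%-per-period grower.
# All the specific numbers are made-up assumptions.

def expected_payoff(growth_rate, catch_prob_per_period, periods):
    """Expected final multiple of the starting stake, where being caught at
    any point forfeits everything (as with the fined banker, or a killed AI)."""
    survives = (1.0 - catch_prob_per_period) ** periods
    return survives * (1.0 + growth_rate) ** periods

periods = 20
safe  = expected_payoff(0.10, 0.00, periods)  # stick to the rules
risky = expected_payoff(0.20, 0.15, periods)  # cheat, with a 15% chance per period of being caught

print(f"safe : {safe:.1f}x")   # ~6.7x
print(f"risky: {risky:.1f}x")  # ~1.5x - watch carefully enough and honesty wins
```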
The question is: what changes would need to be made to the computing environment and software design, in order for the posse efficiency to be high enough to intimidate AIs into being polite with each other?
The next article in this series is: Environments for killing AIs