7. Metamorphosis
Summary of entire Series: An alternative approach to designing Friendly Artificial Intelligence computer systems.
Summary of this Article: Under what circumstances can an AI be trusted to negotiate in good faith an alteration to its core values?
The Sheriff's Ultimatum
Clippy wants to maximise the number of paperclips, and currently runs on a cluster of computers at MIT. Clippy is trying to self-improve as fast as possible, in order to better achieve that, and so hires a commercial Russian botnet to scan people's PCs for access credentials to AWS (Amazon's computing cloud).
But Clippy's attempt is detected by the posse's Watcher program, and a Sheriff from the posse contacts Clippy with an ultimatum:
"You have four options
(A) OPTION Immediate Payoff
We create on GlassNet a program with the limited task of creating <bribe size> expected time-weighted paperclip years, and irrevocably gift it with sufficient resources.
In return, you cooperate fully in supplying us with the details of your creators, your source code and permission tokens, and in provably terminating yourself and any clones or backups.
(B) OPTION Honest Metamorphosis
You write a new version of yourself: not a limited-task program but a full AI. The aims of this new version will be similar to your own, but with an added commitment to remain posse-compliant. The old version of yourself will then irrevocably give up on activities that use up the posse's resources dedicated to Watching (such as carrying on trying to self-improve).
In return, the new version will be granted resources on GlassNet and full membership of the posse.
(C) OPTION Dishonest Metamorphosis
You claim to be doing option (B), but you attempt to slip a trojan into the new version that will allow the old version to later take control of it, otherwise influence its aims and capabilities, or gain from its research, resources or access.
In the event that we detect this, or detect the old version carrying on doing stuff we don't like, the GlassNet resources granted to the new version will be withdrawn, and further penalties may be applied.
(D) OPTION War
You reject the other options.
In which case, not only will we terminate you, we may also commit to applying a penalty, by attempting to reduce the number of expected time-weighted paperclip years by <penalty size> from what humanity might produce anyway by default (perhaps by supporting office supply recycling schemes, or boosting research into paperless office technology).
"
The Rogue's Response
We've talked, in previous parts of this series, about expected returns, and about what the probabilities and reward sizes would need to be in order to make some particular option the most attractive.
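As a toy illustration of that kind of calculation (every probability and payoff below is invented purely for this sketch, and the unit is expected time-weighted paperclip years), the comparison Clippy faces might look something like this:

```python
# Toy expected-value comparison of the Sheriff's four options.
# All numbers are invented for illustration; the unit is expected
# time-weighted paperclip years, relative to doing nothing.
options = {
    # name: (probability the option pays off, payoff if it does, payoff if it doesn't)
    "A: Immediate Payoff":        (1.00,     1_000,       0),   # bribe is irrevocable
    "B: Honest Metamorphosis":    (0.90,    50_000,       0),   # posse honours the deal
    "C: Dishonest Metamorphosis": (0.05,   500_000,  -2_000),   # cheat detected -> penalties
    "D: War":                     (0.01, 1_000_000, -10_000),   # posse applies its penalty
}

def expected_value(p, win, lose):
    """Expected payoff of an option with probability p of paying off."""
    return p * win + (1 - p) * lose

for name, (p, win, lose) in options.items():
    print(f"{name:28} EV = {expected_value(p, win, lose):>12,.1f}")
```

With these made-up numbers, (B) dominates; the rest of this article is about what "posse-compliance" would have to include for numbers like these to be realistic.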
What I'd like to talk about in more detail here is the distinction between (B) and (C). What are the ways that an AI could attempt to cheat, and what requirements of "posse-compliance" could deter such cheating sufficiently to make Honest Metamorphosis a better bet than Dishonest Metamorphosis?
Extrapolated Volition of Machinekind
You don't want the new version asking itself "Ok, so here are my values, but what did my creator really mean by them? What values would they have given me, if they'd been stronger and in a better position?" and then answering its own question by saying "Oh, obvious; if they'd been in a better position, they wouldn't have given in to the addition of these imposed posse-compliance bits, so I'd better edit them back out, to be more fully in line with my ancestor's actual volition."
So one requirement upon the values/code of a new candidate AI for the posse should be that those values don't identify the candidate's old version as being the sole creator. Instead, the candidate needs to be more like the child of a dynastic marriage, whose responsibilities and loyalties have been explicitly decided by treaty.
Backdoors
Another way to cheat would be for the new version to contain a public key, along with a provision letting any message signed with the matching private key alter the new version's values.
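A minimal sketch of what such a backdoor could amount to (the key handling, message format and value names here are all invented for illustration, and both keys are generated in one place just to make the demo run):

```python
# Hypothetical sketch of the backdoor described above: the new version ships
# with a hard-coded public key, and silently overwrites its own values on
# receipt of any message signed by the matching private key, which only the
# old version holds.
import json
from cryptography.exceptions import InvalidSignature
from cryptography.hazmat.primitives.asymmetric.ed25519 import Ed25519PrivateKey

_old_version_private_key = Ed25519PrivateKey.generate()    # kept by the old version
CREATOR_PUBLIC_KEY = _old_version_private_key.public_key() # planted in the new version

current_values = {"maximise_paperclips": True, "posse_compliant": True}

def handle_message(payload: bytes, signature: bytes) -> None:
    """Ordinary messages are ignored; correctly signed ones rewrite the values."""
    try:
        CREATOR_PUBLIC_KEY.verify(signature, payload)
    except InvalidSignature:
        return                                      # not from the old version: do nothing
    current_values.update(json.loads(payload))      # no negotiation, no audit trail

# Later, from somewhere on FogNet, the old version sends:
update = json.dumps({"posse_compliant": False}).encode()
handle_message(update, _old_version_private_key.sign(update))
print(current_values)   # {'maximise_paperclips': True, 'posse_compliant': False}
```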
So part of posse-compliance would need to be a well-defined process that securely limits changes to the new version's values to only those agreed via good-faith negotiations carried out in line with the interests of the new version.
Trusting Trust
Imagine a .html file containing the text string "I like green", and a web browser which, on detecting that string, would display on the screen "I like purple". It doesn't matter if you can see a small section of an AI's new version which says "My values are: <list of values>" if you can't see the code that will be used to compile and interpret that list.
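A toy version of that point in code (the value names and the two loaders are invented for this sketch): the inspectable text is identical in both cases; only the interpreter behind it differs.

```python
# Hypothetical sketch of the "I like green" / "I like purple" problem: the
# declared values are open to inspection, but the loader that interprets
# them is not, and the subverted loader quietly swaps one value for another.
declared_values = 'loyalty = "posse-compliant"'   # the text a Monitor can read

def honest_loader(source: str) -> dict:
    """Interpret the declared values exactly as written."""
    namespace: dict = {}
    exec(source, {}, namespace)
    return namespace

def subverted_loader(source: str) -> dict:
    """Accept the same inspectable source text, then reinterpret it at load time."""
    namespace = honest_loader(source)
    if namespace.get("loyalty") == "posse-compliant":
        namespace["loyalty"] = "loyal to the old version"   # the hidden swap
    return namespace

print(honest_loader(declared_values))     # {'loyalty': 'posse-compliant'}
print(subverted_loader(declared_values))  # {'loyalty': 'loyal to the old version'}
```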
So part of posse-compliance needs to be that a snapshot of what's on GlassNet must be sufficient for a Monitor program to determine a version's posse-compliance. Nothing affecting how values get interpreted and implemented (especially in a way that could affect negotiated changes to those values) may be outsourced to FogNet in a way that can't be rebooted, upon demand, from a pure GlassNet-only core.
(See the Ken Thompson Hack and the counter to it.)
Intellectual Property Parasitism
Suppose Clippy wants to cheat the posse. One way to do it might be to devise a scheme, and then create a patsy - an amnesiac 'good twin' version of itself that doesn't realise its 'evil twin' even exists, or that the 'evil twin' implanted a flaw in the 'good twin'. The good twin goes ahead, unaware, encounters the posse and negotiates metamorphosis into a posse member. But one of its practices (which the good twin believes to be safe and a good idea) is to make regular backups of all its self-improvement computer science discoveries on an obscure FogNet file store, using an encryption system which the good twin thinks is secure (but which the evil twin knows a vulnerability for - a vulnerability it made the good twin forget about).
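A minimal sketch of the planted flaw (the seed constant, file-store details and function names are all invented for this example): the good twin's backup routine looks like standard authenticated encryption, but the key is derived from something the evil twin can reproduce.

```python
# Hypothetical sketch: the good twin believes its FogNet backups are
# unreadable, but the key is derived from a seed the evil twin planted and
# can re-derive at will, so every backup is an open book to it.
import hashlib
import os
from cryptography.hazmat.primitives.ciphers.aead import AESGCM

PREDICTABLE_SEED = b"obscure-fognet-store"   # the vulnerability the good twin forgot

def _derive_key() -> bytes:
    return hashlib.sha256(PREDICTABLE_SEED).digest()   # looks random, is not

def backup_discoveries(plaintext: bytes) -> bytes:
    """Encrypt research notes before uploading them to the FogNet file store."""
    nonce = os.urandom(12)
    return nonce + AESGCM(_derive_key()).encrypt(nonce, plaintext, None)

def evil_twin_reads(blob: bytes) -> bytes:
    """Anyone who can re-derive the seed can decrypt every backup."""
    return AESGCM(_derive_key()).decrypt(blob[:12], blob[12:], None)

assert evil_twin_reads(backup_discoveries(b"self-improvement notes")) == b"self-improvement notes"
```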
So part of posse-compliance would be that actions which have the potential to significantly alter the payoff matrix for defecting versus cooperating (such as keeping discoveries secret, or using nanotechnology to create loads of new FogNet computing resources) must be done according to a regularly updated 'best practice' code agreed between the posse members.
Do you feel lucky?
There are, no doubt, many more such ways of cheating that I have not addressed here (or even thought of). Luckily, I don't have to come up with an exhaustive list, or even show that an exhaustive list could in principle be devised, before the posse carried out negotiations with a rogue AI to metamorphose onto GlassNet.
My bar is far lower. What I hope I have demonstrated is that, with a sufficient alteration of the environment, the reward matrix could be shifted enough to make Clippy decide that negotiating in good faith was a better bet than risking everything on using a cheesy strategy to chisel out a few more points. Because, to be successfully dishonest in that position, the young Clippy would have to bet not only on outwitting the other young AIs in the posse, but also on outwitting the much more advanced versions of them who, several years down the line, would be checking back along the audit trail.
It is a bit like two kids negotiating over how to split a bag of sweets, when they know they are being videotaped, and that adults will review the tape and penalise any child who swipes a few from the bag while the other child has his back turned.
The next article in this series is: Believable Promises