Utility, probability and false beliefs
A putative new idea for AI control; index here.
This is part of the process of rigourising and formalising past ideas.
Paul Christiano recently asked why I used utility changes, rather than probability changes, to have an AI believe (or act as if it believed) false things. While investigating that, I developed several different methods for achieving the belief changes that we desired. This post analyses these methods.
Different models of forced beliefs
Let x and ¬x refer to the future outcome of a binary random variable X (write P(x) as a shorthand for P(X=x), and so on). Assume that we want P(x):P(¬x) to be in the 1:λ ratio for some λ (since the ratio is all that matters, λ=∞ is valid, meaning P(x)=0). Assume that we have an agent, who has utility u, has seen past evidence e, and wishes to assess the expected utility of their action a.
Typically, for expected utility, we sum over the possible worlds. In practice, we almost always sum over sets of possible worlds, the sets determined by some key features of interest. In assessing the quality of health interventions, for instance, we do not carefully and separately treat each possible position of atoms in the sun. Thus let V be the set of variables or values we can about, and v a possible value vector V can take. As usual, we'll be writing P(v) as a shorthand for P(V=v). The utility function u assigns utilities to possible v's.
One of the advantages of this approach is that it can avoid many issues of conditionals like P(A|B) when P(B)=0.
The first obvious idea is to condition on x and ¬x:
- (1) Σv u(v)(P(v|x,e,a)+λP(v|¬x,e,a))
The second one is to use intersections rather than conditionals (as in this post):
- (2) Σv u(v)(P(v,x|e,a)+λP(v,¬x|e,a))
Finally, imagine that we have a set of variables H, that "screen off" the effects of e and a, up until X. Let h be a set of values H can take. Thus P(x|h,e,a)=P(x|h). One could see H as the full set of possible pre-X histories, but it could be much smaller - maybe just the local environment around X. This gives a third definition:
- (3) Σv Σh u(v)(P(v|h,x,e,a)+λP(v|h,¬x,e,a))P(h|,e,a)
Changing and unchangeable P(x)
An important thing to note is that all three definitions are equivalent for fixed P(x), up to changes of λ. The equivalence of (2) and (1) derives from the fact that Σv u(v)(P(v,x|e,a)+λP(v,¬x|e,a)) = Σv u(v)(P(x)P(v|x,e,a)+λP(¬x)P(v|¬x,e,a)) (we write P(x) rather than P(x|e,a) since the probability of x is fixed). Thus a type (2) agent with λ is equivalent with a type (1) agent with λ'=λP(x)/P(¬x).
Similarly, P(v|h,x,e,a)=P(v,h,x|e,a)/(P(x|h,e,a)*P(h|e,a)). Since P(x|h,e,a)=P(x), equation (3) reduces to Σv Σh u(v)(P(x)P(v,h,x|e,a)+λP(¬x)P(v,h,¬x|e,a)). Summing over h, this becomes Σv u(v)(P(x)P(v,x|e,a)+λP(¬x)P(v,¬x|e,a))=Σv u(v)(P(v|x,e,a)+λP(v|¬x,e,a)), ie the same as (1).
What about non-constant x? Let c(x) and c(¬x) be two contracts that pay out under x and ¬x, respectively. If the utility u is defined as 1 if a payout is received (and 0 otherwise), it's clear that both agent (1) and agent (3) assess c(x) as having an expected utility of 1 while c(¬x) has an expected utility of λ. This assessment is unchanging, whatever the probability of x. Therefore agents (1) and (3), in effect, see the odds of x as being a constant ratio 1:λ.
Agent (2), in contrast, gets a one-off artificial 1:λ update to the odds of x and then proceeds to update normally. Suppose that X is a coin toss that the agent believes is fair, having extensively observed the coin. Then it will believe that the odds are 1:λ. Suppose instead that it observes the coin has a λ:1 odd ratio; then it will believe the true odds are 1:1. It will be accurate, with a 1:λ ratio added on.
The effects of this percolate backwards in time from X. Suppose that X was to be determined by the toss of one of two unfair coins, one with odds ε:1 and one with odds 1:ε. The agent would assess the odds of the first coin being used rather than the second as around 1:λ. This update would extend to the process of choosing the coins, and anything that that depended on. Agent (1) is similar, though its update rule always assumes the odds of x:¬x being fixed; thus any information about the processes of coin selection is interpreted as a change in the probability of the processes, not a change in the probability of the outcome.
Agent (3), in contrast, is completely different. It assess the probability of H=h objectively, but then assumes that the odds of x and ¬x, given any h, is 1:λ. Thus if given updates about the probability of which coin is used, it will assess those updates objectively, but then assume that both coins are "really" giving 1:1 odds. It cuts off the update process at h, thus ensuring that it is "incorrect" only about x and its consequences, not its pre-h causes.
Utility and probability: assessing goal stability
Agents with unstable goals are likely to evolve towards being (equivalent to) expected utility maximisers. The converse is more complicated, but we'll assume here that an agent's goal is stable if it is an expected utility maximiser for some probability distribution.
Which one? I've tended to shy away from changing the probability, preferring to change the utility instead. If we divide the probability in equation (2) by 1+λ, it becomes a u-maximiser with a biased probability distribution. Alternatively, if we defined u'(v,x)=u(v) and u'(v,¬x)=λu(v), then it is a u'-maximiser with an unmodified probability distribution. Since all agents are equivalent for fixed P(x), we can see that in that case, all agents can be seen as expected utility maximisers with the standard probability distribution.
Paul questioned whether the difference was relevant. I preferred the unmodified probability distribution - maybe the agent uses the distribution for induction, maybe having false probability beliefs will interfere with AI self-improvement, or maybe agents with standard probability distributions are easier to corrige - but for agent (2) the difference seems to be arguably a matter of taste.
Note that though agent (2) is stable, it's definition is not translation invariant in u. If we add c to u, we add c(P(x|e,a)+λP(¬x|e,a)) to u'. Thus, if the agent can affect the value of P(x) through its actions, different constants c likely give different behaviours.
Agent (1) is different. Except for the cases λ=0 and λ=∞, the agent cannot be an expected utility maximiser. To see this, just notice that an update about the process that could change the probability of x, gets reinterpreted as an update on the probability of that process. If we have the ε:1 and 1:ε coins, then any update about their respective probabilities of being used gets essentially ignored (as long as the evidence that the coins are biased is much stronger than the evidence as to which coin is used).
In the cases λ=0 and λ=∞, though, agent (1) is a u-maximiser that uses the probability distribution that assumes x or ¬x is certain, respectively. This is the main point of agent (1) - providing a simple maximiser for those cases.
What about agent (3)? Define u' by: u'(v,h,x)=u(v)/P(x|h), and u'(v,h,¬x)=λu(v)/P(¬x|h). Then consider the u'-maximiser:
- (4) Σv Σh u'(v,h,x)P(v,h,x|e,a)+u'(v,h¬x)P(v,h,¬x|e,a)
Now P(v,h,x|e,a)=P(v|h,x,e,a)P(x|h,e,a)P(h|e,a). Because of the screening off assumptions, the middle term is the constant P(x|h). Multiplying this by u'(v,h,x)=u(v)/P(x|h) gives u(v)P(v|h,x,e,a)P(h|e,a). Similarly, the second term becomes λu(v)P(v|h,¬x,e,a)P(h|e,a). Thus a u'-maximiser, with the standard probability distribution, is the same as agent (3), thus proving the stability of that agent type.
Beyond the future: going crazy or staying sane
What happens after the event X has come to pass? In that case, agent (4), the u'-maximiser will continue as normal. Its behaviour will not be unusual as long as neither λ nor 1/λ is close to 0. The same goes for agent (2).
In contrast, agent (3) will no longer be stable after X, as H no longer screens off evidence after that point. And agent (1) was never stable in the first place, and now it denies all the evidence it sees to determine that impossible events actually happened. But what of those two agents, or the stable ones if λ or 1/λ were close to 0? In particular, what if λ falls below the probability that the agent is deluded in its observation of X?
In those cases, it's easy to argue that the agents would effectively go insane, believing wild and random things to justify their delusions.
But maybe not, in the end. Suppose that you, as a human, believe an untrue fact - maybe that Kennedy was killed on the 23rd of November rather than the 22nd. Maybe you construct elaborate conspiracy theories to account for the discrepancy. Maybe you posit an early mistake by some reporter that was then picked up and repeated. After a while, you discover that all the evidence you can find points to the 22nd. Thus, even though you believe with utter conviction that the assassination was on the 23rd, you learn to expect that the next piece of evidence will point to the 22nd. You look for the date-changing conspiracy, and never discover anything about it; and thus learn to expect they have covered their tracks so well they can't be detected.
In the end, the expectations of this "insane" agent could come to resemble those of normal agents, as long as there's some possibility of a general explanation of all the normal observations (eg a well-hidden conspiracy) given the incorrect assumption.
Of course, the safer option is just to corrige the agent to some sensible goal soon after X.
AI utility-based correlation
A putative new idea for AI control; index here.
This presents one way of implementing the indifference-based correlations of these posts.
Let u be a utility function, a map from worlds to real numbers. An expected utility maximiser considering whether to produce output Y, looks at the expected utility
Σw u(w)P(w|Y).
We now assume that there is another random variable X in the world, and we want the AI to be indifferent to worlds where Y≠X. We also want it to be indifferent to worlds where Z=0. Then it will assess the value of output Y as:
Σw u(w)P(w|X=Y,Z=1,Y).
Now, the idea of the setup was to ensure that Z=1 would erase the output Y so that it was never read. Hence P(w|Z=1,Y) Hence that equation simplifies to:
Σw u(w)P(w|X=Y).
Therefore the AI will chose the Y that maximises the (conditional) expected utility of u if X=Y. To get the full version of the initial post, you need to define some function f of Y and modify this to
Σw u(w)P(w|X=Y) + f(Y).
The Ultimate Testing Grounds
A putative new idea for AI control; index here.
EDIT: the mathematics of this approach can be found here.
In a previous post, I described a procedure for querying an AI about whether a certain outcome was desirable for it. When I did so, I didn't realise the power of the idea.
In brief, that idea allows us to query an AI with utility v as to whether an outcome X or an outcome ¬X was the one it most desired in a particular situation. By judicious use of 'false beliefs', the AI would react as if it had miraculous powers to determine X vs ¬X in ways that no-one could ever know about.
Many (including Toby Ord) has suggested variants on the 'honey pot' idea: leave something available to the AI that looks like an oversight, but is actually a trap, to see if the AI goes beyond the parameters of what we'd expect. Maybe a wire to the internet that the AI is not supposed to use, but seems to be unobserved.
The problem with these ideas is that the honey pots had to be physically constructed, physically accessible to the AI, and the AI had to be wrong about whether it was a trap or not.
But using the method above, we can go much further. We can test the AI in any scenario which we can imagine, as long as we can precisely physically describe it. We can check what would happen if the AI was given the ability to launch the world's nuclear arsenals, or divert an asteroid towards Earth. We can check what it would design if it was given unrestricted access to a DNA synthesiser. We could see what it would do if it could rewrite the text of a speech an important world leader was due to give. We could see whether it would sabotage the creation of other AIs - and if so, which ones. We could see what it did to a probe humanity planned to launch to the stars, what numbers it would call if it had access to a phone, what it would purchase if it had capital.
And, in all cases, the AI would be behaving as if its actions were undetectable and certain to succeed, while we ensured that they were (almost certainly) detectable and certain to fail. There is no "almost" in that certain to fail condition, because the AI is acting as if it had miraculous powers that it doesn't have.
Now, we can't use testing as our sole method of AI control (for reasons similar to here), but it does allow us some pretty effective ways of catching bad AIs.
Ask and ye shall be answered
A putative new idea for AI control; index here.
EDIT: the mathematics of this idea can be found here.
It would be useful to have a mechanism for getting an AI to provide an honest answer to a specific question. This post presents such a mechanism.
The question is: what is the impact of X/¬X on the expected utility of a utility function v? Here X/¬X is some well-defined binary outcome. Formally, for a constant c (positive or negative), we want to query the AI as to whether Q(v,X,c) is true, where
Q(v,X,c) = {E(v|X) - E(v|¬X) > c}.
How can we do that? Well, for a start, let M(u) be a u-maximising agent. We'll assume we can feed M() any utility function to maximise. We'll further assume that M() is boxed, that it has a single output bit Y (or ¬Y), and that we have cut any acausal trade possibilities. We'll add another binary process Z, which, if Z=1, erases the bit from Y before we can see it. In actual fact, P(Z=0)≈1.
The remaining assumptions are on the operations we can perform on v (without needing any further understanding of v). We replace v by v#, defined in the following way. First of all, we make v# indifferent to worlds in which X≠Y or Z=0 (note: there are subtleties to defining indifference to events that are partially under the AI's control; I will address these in a later post). Therefore any v# maximiser will behave as if it lived in a world where X=Y (its output and the outcome of X were the same) and Z=1 (its output was erased and never read). Then we add a special, one-off penalty of -c for outputting Y=1 (there is no effect for outputting Y=0). This defines v#, and we give our boxed AI v# to maximise.
So, how will M(v#) behave? Because it acts as if X=Y with certainty, it behaves as if it can determine the value of X by outputting Y. Because it acts as if Z=0 with certainty, it behaves as if the output of Y is never read, so it can determine the value of X without affecting anything else about the universe. It will now behave as if it was a v maximiser with control over X. So if it outputs ¬Y, it expects to get utility E(v|¬X). If it outputs Y, it expects to get utility E(v|X)-c (because of the -c penalty).
Therefore (ignoring ties) M(v#) will output Y iff it estimates Q(v,X,c) to be true. And, since in actual fact Z=0 almost certainly, we get to see its output, hence its conclusion. Voila!
Note: the advantage of this method is to allow us to get the truth out of an AI, without knowing how to define "truth".
Utility vs Probability: idea synthesis
A putative new idea for AI control; index here.
This post is a synthesis of some of the ideas from utility indifference and false miracles, in an easier-to-follow format that illustrates better what's going on.
Utility scaling
Suppose you have an AI with a utility u and a probability estimate P. There is a certain event X which the AI cannot affect. You wish to change the AI's estimate of the probability of X, by, say, doubling the odds ratio P(X):P(¬X). However, since it is dangerous to give an AI false beliefs (they may not be stable, for one), you instead want to make the AI behave as if it were a u-maximiser with doubled odds ratio.
Assume that the AI is currently deciding between two actions, α and ω. The expected utility of action α decomposes as:
u(α) = P(X)u(α|X) + P(¬X)u(α|¬X).
The utility of action ω is defined similarly, and the expected gain (or loss) of utility by choosing α over ω is:
u(α)-u(ω) = P(X)(u(α|X)-u(ω|X)) + P(¬X)(u(α|¬X)-u(ω|¬X)).
If we were to double the odds ratio, the expected utility gain becomes:
u(α)-u(ω) = (2P(X)(u(α|X)-u(ω|X)) + P(¬X)(u(α|¬X)-u(ω|¬X)))/Ω, (1)
for some normalisation constant Ω = 2P(X)+P(¬X), independent of α and ω.
We can reproduce exactly the same effect by instead replacing u with u', such that
- u'( |X)=2u( |X)
- u'( |¬X)=u( |¬X)
Then:
u'(α)-u'(ω) = P(X)(u'(α|X)-u'(ω|X)) + P(¬X)(u'(α|¬X)-u'(ω|¬X)),
= 2P(X)(u(α|X)-u(ω|X)) + P(¬X)(u(α|¬X)-u(ω|¬X)). (2)
This, up to an unimportant constant, is the same equation as (1). Thus we can accomplish, via utility manipulation, exactly the same effect on the AI's behaviour as a by changing its probability estimates.
Notice that we could also have defined
- u'( |X)=u( |X)
- u'( |¬X)=(1/2)u( |¬X)
This is just the same u', scaled.
The utility indifference and false miracles approaches were just special cases of this, where the odds ratio was sent to infinity/zero by multiplying by zero. But the general result is that one can start with an AI with utility/probability estimate pair (u,P) and map it to an AI with pair (u',P) which behaves similarly to (u,P'). Changes in probability can be replicated as changes in utility.
Utility translating
In the previous, we multiplied certain utilities by two. But by doing so, we implicitly used the zero point of u. But utility is invariant under translation, so this zero point is not actually anything significant.
It turns out that we don't need to care about this - any zero will do, what matters simply is that the spread between options is doubled in the X world but not in the ¬X one.
But that relies on the AI being unable to affect the probability of X and ¬X itself. If the AI has an action that will increase (or decrease) P(X), then it becomes very important where we set the zero before multiplying. Setting the zero in a different place is isomorphic with adding a constant to the X world and not the ¬X world (or vice versa). Obviously this will greatly affect the AI's preferences between X and ¬X.
One way of avoiding the AI affecting X is to set this constant so that u'(X)=u'(¬X), in expectation. Then the AI has no preferences between the two situations, and will not seek to boost one over the other. However, note that u(X) is an expected utility calculation. Therefore:
- Choosing the constant so that u'(X)=u'(¬X) requires accessing the AI's probability estimate P for various worlds; it cannot be done from outside, by multiplying the utility, as the previous approach could.
- Even if u'(X)=u'(¬X), this does not mean that u'(X|Y)=u'(¬X|Y) for every event Y that could happen before X does. Simple example: X is a coin flip, and Y is the bet of someone on that coin flip, someone the AI doesn't like.
This explains all the complexity of the utility indifference approach, which is essentially trying to decompose possible universes (and adding constants to particular subsets of universes) to ensure that u'(X|Y)=u'(¬X|Y) for any Y that could happen before X does.
Less exploitable value-updating agent
My indifferent value learning agent design is in some ways too good. The agent transfer perfectly from u maximisers to v maximisers - but this makes them exploitable, as Benja has pointed out.
For instance, if u values paperclips and v values staples, and everyone knows that the agent will soon transfer from a u-maximiser to a v-maximiser, then an enterprising trader can sell the agent paperclips in exchange for staples, then wait for the utility change, and sell the agent back staples for paperclips, pocketing a profit each time. More prosaically, they could "borrow" £1,000,000 from the agent, promising to pay back £2,000,000 tomorrow if the agent is still a u-maximiser. And the currently u-maximising agent will accept, even though everyone knows it will change to a v-maximiser before tomorrow.
One could argue that exploitability is inevitable, given the change in utility functions. And I haven't yet found any principled way of avoiding exploitability which preserves the indifference. But here is a tantalising quasi-example.
As before, u values paperclips and v values staples. Both are defined in terms of extra paperclips/staples over those existing in the world (and negatively in terms of destruction of existing/staples), with their zero being at the current situation. Let's put some diminishing returns on both utilities: for each paperclips/stables created/destroyed up to the first five, u/v will gain/lose one utilon. For each subsequent paperclip/staple destroyed above five, they will gain/lose one half utilon.
We now construct our world and our agent. The world lasts two days, and has a machine that can create or destroy paperclips and staples for the cost of £1 apiece. Assume there is a tiny ε chance that the machine stops working at any given time. This ε will be ignored in all calculations; it's there only to make the agent act sooner rather than later when the choices are equivalent (a discount rate could serve the same purpose).
The agent owns £10 and has utility function u+Xv. The value of X is unknown to the agent: it is either +1 or -1, with 50% probability, and this will be revealed at the end of the first day (you can imagine X is the output of some slow computation, or is written on the underside of a rock that will be lifted).
So what will the agent do? It's easy to see that it can never get more than 10 utilons, as each £1 generates at most 1 utilon (we really need a unit symbol for the utilon!). And it can achieve this: it will spend £5 immediately, creating 5 paperclips, wait until X is revealed, and spend another £5 creating or destroying staples (depending on the value of X).
This looks a lot like a resource-conserving value-learning agent. I doesn't seem to be "exploitable" in the sense Benja demonstrated. It will still accept some odd deals - one extra paperclip on the first day in exchange for all the staples in the world being destroyed, for instance. But it won't give away resources for no advantage. And it's not a perfect value-learning agent. But it still seems to have interesting features of non-exploitable and value-learning that are worth exploring.
Note that this property does not depend on v being symmetric around staple creation and destruction. Assume v hits diminishing returns after creating 5 staples, but after destroying only 4 of them. Then the agent will have the same behaviour as above (in that specific situation; in general, this will cause a slight change, in that the agent will slightly overvalue having money on the first day compared to the original v), and will expect to get 9.75 utilons (50% chance of 10 for X=+1, 50% chance of 9.5 for X=-1). Other changes to u and v will shift how much money is spent on different days, but the symmetry of v is not what is powering this example.
Population ethics and utility indifference
It occurs to me that the various utility indifference approaches might be usable in population ethics.
One challenge for non-total utilitarians is how to deal with new beings. Some theories - average utilitarianism, for instance, or some other systems that use overall population utility - have no problem dealing with this. But many non-total utilitarians would like to see creating new beings as a strictly neutral act.
One way you could do this is by starting with a total utilitarian framework, but subtracting a certain amount of utility every time a new being B is brought into the world. In the spirit of utility indifference, we could subtract exactly the expected utility that we expect B to enjoy during their life.
This means that we should be indifferent as to whether B is brought into the world or not, but, once B is there, we should aim to increase B's utility. There are two problems with this. The first is that, strictly interpreted, we would also be indifferent to creating people with negative utility. This can be addressed by only doing the "utility correction" if B's expected utility is positive, thus preventing us from creating beings only to have them suffer.
The second problem is more serious. What about all the actions that we could do, ahead of time, in order to harm or benefit the new being? For instance, it would seem perverse to argue that buying a rattle for a child after they are born (or conceived) is an act of positive utility, whereas buying it before they were born (or conceived) would be a neutral act, since the increase in expected utility for the child is cancel out by the above process. Not only is it perverse, but it isn't timeless, and isn't stable under self modification.
The germ of an idea
Apologies for posting another unformed idea, but I think it's important to get it out there.
The problem with dangerous AI is that it's intelligent, and thus adapts to our countermeasures. If we did something like plant a tree and order the AI not to eat the apple on it, as a test of its obedience, it would easily figure out what we were doing, and avoid the apple (until it had power over us), even if it were a treacherous apple-devouring AI of DOOM.
When I wrote the AI indifference paper, it seemed that it showed a partial way around this problem: the AI would become indifferent to a particular countermeasure (in that example, explosives), so wouldn't adapt its behaviour around it. It seems that the same idea can make an Oracle not attempt to manipulate us through its answers, by making it indifferent as to whether the message was read.
The ideas I'm vaguely groping towards is whether this is a general phenomena - whether we can use indifference to prevent the AI from adapting to any of our efforts. The second question is whether we can profitably use it on the AI's motivation itself. Something like the reduced impact AI reasoning about what impact it could have on the world. This has a penalty function for excessive impact - but maybe that's gameable, maybe there is a pernicious outcome that doesn't have a high penalty, if the AI aims for it exactly. But suppose the AI could calculate its impact under the assumption that it didn't have a penalty function (utility indifference is often equivalent to having incorrect beliefs, but less fragile than that).
So if it was a dangerous AI, it would calculate its impact as if it didn't have a penalty function (and hence no need to route around it), and thus would calculate a large impact, and get penalised by it.
My next post will be more structured, but I feel there's the germ of a potentially very useful idea there. Comments and suggestions welcome.
Trapping AIs via utility indifference
If you wanted to check whether your creations were prone to misbehaving, one way would be to leave a big juicy apple hanging tantalisingly from a very visible tree, turn your back for a while, and see what they get up to.
That would work if you were God and your creations were essentially children. But if the creator is us and the creation is a super-intelligent AI, such 'honey-pots' and unlikely to work. They are tricks, and we shouldn't rely on being able to trick a being more intelligent than ourselves - unless its intelligence doesn't matter.
We can use utility indifference to construct such traps. Utility indifference can be informally summarised as:
"There is a stochastic event X which can output 0 or 1. We wish the AI act as if it believed that X=0, with certainty. It's problematic for the AI to have incorrect beliefs, so the result is achieved through the definition of the AI's utility function."
Consider the following situation: a master AI that will create the source code of another, autonomous AI. The master AI is setup to produce the code of a Friendly AI, or a reduced impact AI - but we're not sure if the master AI is trustworthy. It would be good to go through the code very carefully before unleashing it (or erasing it). But if the master AI knows this (and we must assume it does), then it will either obfuscate the dangerous parts or put traps in it, making it risky for us to look too closely.
Then implement the following setup. The code travels through some stochastic event X, such that if X=0, the code is immediately transferred into a disciple AI, completely autonomous, which is at liberty to immediately modify its own code and erase any trace of what it was. If X=1, the code is instead copied to our hard drives for analysis. The AI is given some reason to keep the code short - maybe there is an increasing chance of the code getting erased the longer it is.
On top of this, we implement utility indifference: the master AI will act as if it believed X=0 with certainty, while we have setup X so that X=1 with near-certainty. Then the master AI will be motivated to produce the shortest code compatible with its utility function, in the 'knowledge' that this code will be copied into a powerful autonomous AI, and will never be seen by humans. It has no incentive to trap its code - quite the reverse, as it wants to keep it short.
And therefore we have all the time in the world to go through the code, looking for defection. Other ways of using utility indifference to trap AIs are no doubt possible, but this was the first setup that sprang to my mind.
= 783df68a0f980790206b9ea87794c5b6)
Subscribe to RSS Feed
= f037147d6e6c911a85753b9abdedda8d)