*A putative new idea for AI control; index here.*

This post attempts to use the ideas developed about natural categories in order to get high impact from reduced impact AIs.

I recently presented the problem of extending AI "niceness" given some fact X, to niceness given ¬X, choosing X to be something pretty significant but not overwhelmingly so - the death of a president. By assumption we had a successfully programmed niceness, but no good definition (this was meant to be "reduced impact" in a slight disguise).

This problem turned out to be much harder than expected. It seems that the only way to do so is to require the AI to define values dependent on a set of various (boolean) random variables Z_{j} *that did not include X/¬X*. Then as long as the random variables represented natural categories, given X, the niceness should extend.

What did we mean by natural categories? Informally, it means that X should not appear in the definitions of these random variables. For instance, nuclear war is a natural category; "nuclear war XOR X" is not. Actually defining this was quite subtle; diverting through the grue and bleen problem, it seems that we had to define how we update X and the Z_{j} given the evidence we expected to find. This was put in equation as picking Z_{j}'s that minimize

**Variance{log[ P(X∧Z|E)*P(¬X∧¬Z|E) / (P(X∧¬Z|E)*P(¬X∧Z|E)) ]}**

where E is the random variable denoting the evidence we expected to find. Note that if we interchange X and ¬X, the ratio inverts, the log changes sign - but this makes no difference to the variance. So we can equally well talk about extending niceness given X to ¬X, or niceness given ¬X to X.

The above definition would work for a "perfectly nice AI". That could be an AI that would be nice, given any combination of estimates of X and Z_{j}. In practice, because we can't consider every edge case, we would only have an "expectedly nice AI". That means that the AI can fail to be nice in certain unusual and unlikely edge cases, in certain strange sets of values of Z_{j} that almost never come up...

...or at least, that almost never come up, *given X*. Since the "expected niceness" was calibrated given X, such an expectedly nice AI may fail to be nice if ¬X results in a substantial change in the probability of the Z_{j} (see the second failure mode in this post; some of the Z_{j} may be so tightly coupled to the value of X that an expected niceness AI considers them fixed, and this results in problems if ¬X happens and their values change).

One way of fixing this is to require that the "swing" of the Z_{j} be small upon changing X to ¬X or vice versa. Something like, for all values of {a_{j}}, the ratio P({Z_{j}=a_{j}} | X) / P({Z_{j}=a_{j}} | ¬X) lies between 1/100 and 100. This means that a reasonably good "expected niceness" calibrated on the Z_{j} will transfer from X to ¬X (though the error may grow). This approach has some other advantages, as we'll see in the next section.
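As a rough illustration of the swing restriction, here is a minimal Python sketch. The function name `swing_within_bound`, the dictionary representation of the conditional distributions, and all the numbers are assumptions made for this example; they are not part of the original proposal.

```python
def swing_within_bound(p_given_x, p_given_not_x, bound=100.0):
    """Check that P(Z=a | X) / P(Z=a | not X) lies in [1/bound, bound] for every a.

    Both arguments map a joint assignment of the Z_j (a tuple of booleans)
    to its conditional probability. Assignments where either probability is
    zero are treated as violations, which is a choice made for this sketch.
    """
    for assignment, p_x in p_given_x.items():
        p_not_x = p_given_not_x[assignment]
        if p_x <= 0.0 or p_not_x <= 0.0:
            return False
        ratio = p_x / p_not_x
        if ratio > bound or ratio < 1.0 / bound:
            return False
    return True


# Hypothetical numbers: two boolean Z_j that shift only moderately with X.
mild_given_x = {(True, True): 0.4, (True, False): 0.3,
                (False, True): 0.2, (False, False): 0.1}
mild_given_not_x = {(True, True): 0.1, (True, False): 0.2,
                    (False, True): 0.3, (False, False): 0.4}

# Hypothetical numbers: a single Z that is almost determined by X.
coupled_given_x = {(True,): 0.9999, (False,): 0.0001}
coupled_given_not_x = {(True,): 0.001, (False,): 0.999}

print(swing_within_bound(mild_given_x, mild_given_not_x))
print(swing_within_bound(coupled_given_x, coupled_given_not_x))
```

The first pair passes the check, while the tightly coupled variable fails it, which is the kind of Z_{j} the restriction is meant to exclude.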

Of course, problems arise if the Z_{j} are defined by smashing together events relatively independent of X with some that are very strongly dependent on X, into the same variable. Thus we'll further require that Z_{j} cannot be decomposed into natural subvariables whose values swing more under changes from X to ¬X.

With this, we are finally in a position to define a reduced impact AI design that can have a large impact. Imagine that an asteroid is coming to hit the Earth. Two AIs are charged with targeting a laser at it. The laser isn't perfectly effective, and the asteroid isn't perfectly certain to hit the Earth anyway: let's say aiming the laser correctly will reduce the odds of impact by a factor of 10 or so (well within the probability swing restrictions of the previous section).

In the original formulation of this, the AIs were tasked with outputting the x and y coordinates to aim the laser. Here, instead, they'll be tasked with constructing the motivation of a robotic subagent that will move around for a while, see things and interact, before physically aiming the laser in the right direction. This task is harder.

Then the top level AI charged with aiming the y coordinates is given the following task:

- It is motivated to cause the correct y-coordinate to be outputted.
- If the x-coordinate AI is not turned on (call this event ¬X), it is motivated to have reduced impact. This motivation is sufficiently strong that it will not want to have the correct y-coordinate outputted.
- It must phrase the robotic subagent's utility function in terms of the Z_{j} that obey all the properties above.
- The evidence E considered for the equation governing the naturalness of Z_{j} is taken (for example) to be a random sampling of space at the time both AIs are turned on. Most importantly, this is defined to be a theoretical process that the AI cannot affect physically.

So what will the AI do? It will produce a robot that will try to aim the y-coordinate of the laser correctly, given ¬X, and not expand dangerously. However, we've set things up so that X is virtually certain, so the robot will see another, x-aiming robot wandering around. The chances of this happening, given ¬X, are minuscule, so the swing of the probability from ¬X to X is huge. So, by the restrictions we've put on the Z_{j}, the robot will not be motivated to change its behaviour just because it sees the other robot.

And if the laser is fired, the impact on humanity will, conversely, not be that large, so well within the parameters of the Z_{j}. Thus the "general reduced impact" portion of the robot's motivation will be preserved.

And, of course, the x-robot will have the mirror image motivations, so they will both aim correctly, it seems. This should evade the problem described here.

There are three ways the approach could fail: one bad way, and two graceful ways. The bad way is if the "no decomposition" requirements (one of them is presented above, the other is in this post) fail. In this case, the niceness or reduced impact are not guaranteed to be extended to ¬X, and the AI's behaviour can become undefined. Note that this is not bad optimisation pressure, just an absence of optimisation pressure - but that could still be potentially dangerous.

The first graceful failure happens if the AI is not able to create a reduced impact sub-agent given the restrictions on the Z_{j} above. In that case, the AI does nothing. The second graceful failure happens if the AI evades our attempts to increase its impact, given ¬X. In that case, it simply becomes a reduced impact AI that does little. Not ideal, but not deadly.

**Overall status**: I'm not sure the idea is sound, at least not yet. Critiques welcome.

In response to
There is no such thing as strength: a parody

There is Strength, I checked; it's the first attribute on my character sheet.

What, you've never got to see your character sheet? Poor souls, how are you going to ever know how to play yourself properly?

In response to
Green Emeralds, Grue Diamonds

There was a picture of a dress going viral where some people saw it as black in intense lighting and others as blue in dim lighting. Our sense of color is very far from the "apparent color". A red object in a dark room is still red. Our brain calculates away lighting differences.

If there were local red torches on Chiron Beta Prime you could have objects that are very luminous near torches and objects that are not. Thus you could differentiate between diamonds and turquoises. But diamonds and turquoises are both grue. However turquoises are not white. Therefore grue is not white.

Note also that red torches could be a recent innovation. Thus what is natural "in this universe" is technology level dependent.

*A putative new idea for AI control; index here.*

In a previous post, I looked at unnatural concepts such as grue (green if X was true, blue if it was false) and bleen. This was to enable one to construct the natural categories that extend AI behaviour, something that seemed surprisingly difficult to do.

The basic idea discussed in the grue post was that the naturalness of grue and bleen seemed dependent on features of our universe - mostly, that it was easy to tell whether an object was "currently green" without knowing what time it was, but we could not know whether the object was "currently grue" without knowing the time.

So the naturalness of the category depended on the type of evidence we expected to find. Furthermore, it seemed easier to discuss whether a category is natural "given X", rather than whether that category is natural in general. However, we know the relevant X in the AI problems considered so far, so this is not a problem.

Fix a boolean random variable X, and assume we want to check whether the boolean random variable Z is a natural category, given X.

If Z is natural (for instance, it could be the colour of an object, while X might be the brightness), then we expect to uncover two types of evidence:

- those that change our estimate of X; this causes probability to "flow" as follows (or in the opposite directions):

- ...and those that change our estimate of Z:

Or we might discover something that changes our estimates of X and Z simultaneously. If the probability flows to X and Z in the same proportions, we might get:

What is an example of an unnatural category? Well, if Z is some sort of grue/bleen-like object given X, then we can have Z = X XOR Z', for Z' actually a natural category. This sets up the following probability flows, which we would not want to see:

More generally, Z might be constructed so that X∧Z, X∧¬Z, ¬X∧Z and ¬X∧¬Z are completely distinct categories; in that case, there are more forbidden probability flows:

and

In fact, there are only really three "linearly independent" probability flows, as we shall see.

Let's represent the four possible states of affairs by four weights (not probabilities):

Since everything is easier when it's linear, let's set w_{11} = log(P(X∧Z)) and similarly for the other weights (we neglect cases where some events have zero probability). Two sets of weights correspond to the same probabilities iff you get from one set to the other by multiplying by a strictly positive number. For logarithms, this corresponds to adding the same constant to all the log-weights. So we can normalise our log-weights (select a single set of representative log-weights for each possible set of probabilities) by choosing the w such that

w_{11} + w_{12} + w_{21} + w_{22} = 0.

Thus the probability "flows" correspond to adding together two such normalised 2x2 matrices, one for the prior and one for the update. Composing two flows means adding two change matrices to the prior.
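To make the normalisation concrete, here is a small Python sketch. The helper names `normalised_log_weights` and `to_probabilities`, the cell labels, and the numbers are all hypothetical; the only thing taken from the text is that weights are logged and shifted so that they sum to zero, and that an update is added to the prior.

```python
import math


def normalised_log_weights(weights):
    """Map strictly positive weights {cell: w} to log-weights that sum to zero."""
    logs = {cell: math.log(w) for cell, w in weights.items()}
    mean = sum(logs.values()) / len(logs)
    return {cell: value - mean for cell, value in logs.items()}


def to_probabilities(log_weights):
    """Recover the probabilities represented by a set of log-weights."""
    exps = {cell: math.exp(value) for cell, value in log_weights.items()}
    total = sum(exps.values())
    return {cell: e / total for cell, e in exps.items()}


# Hypothetical joint distribution over the four cells of the 2x2 table.
prior = {(1, 1): 0.3, (1, 2): 0.2, (2, 1): 0.1, (2, 2): 0.4}
# The same distribution, expressed as unnormalised weights.
scaled = {cell: 7.0 * p for cell, p in prior.items()}

w_prior = normalised_log_weights(prior)
w_scaled = normalised_log_weights(scaled)
print(sum(w_prior.values()))      # zero up to rounding error
print(to_probabilities(w_prior))  # recovers the prior up to rounding error

# An update is another normalised matrix, added cell by cell to the prior.
update = normalised_log_weights({(1, 1): 2.0, (1, 2): 2.0, (2, 1): 1.0, (2, 2): 1.0})
posterior = to_probabilities({cell: w_prior[cell] + update[cell] for cell in w_prior})
print(posterior)
```

Multiplying all the weights by 7 leaves the normalised log-weights unchanged, which is why each set of probabilities has a single representative.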

Four variables, one constraint: the set of possible log-weights is three dimensional. We know we have two allowable probability flows, given naturalness: those caused by changes to P(X), independent of P(Z), and vice versa. Thus we are looking for a single extra constraint to keep Z natural given X.

A little thought reveals that we want to keep constant the quantity:

w_{11} + w_{22} - w_{12} - w_{21}.

This preserves all the allowed probability flows and rules out all the forbidden ones. Translating this back to the general case, let "e" be the evidence we find. Then if Z is a natural category given X and the evidence e, the following quantity is the same for all possible values of e:

log[P(X∧Z|e)*P(¬X∧¬Z|e) / (P(X∧¬Z|e)*P(¬X∧Z|e))].
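The claim that this quantity survives the allowed flows but not the forbidden ones can be checked numerically. The sketch below is only an illustration: the prior, the reweighting factors, and the helper names `log_odds_ratio` and `reweight` are assumptions made for the example, with evidence modelled as a multiplicative factor on each cell followed by renormalisation.

```python
import math


def log_odds_ratio(p):
    """p maps (x, z) booleans to the joint probability P(X=x, Z=z)."""
    return math.log(p[(True, True)] * p[(False, False)]
                    / (p[(True, False)] * p[(False, True)]))


def reweight(p, factor):
    """Multiply each cell by factor(x, z), then renormalise."""
    raw = {cell: prob * factor(*cell) for cell, prob in p.items()}
    total = sum(raw.values())
    return {cell: value / total for cell, value in raw.items()}


# Hypothetical prior over the four combinations of X and Z.
prior = {(True, True): 0.3, (True, False): 0.2,
         (False, True): 0.1, (False, False): 0.4}

# Allowed flows: evidence about X only, about Z only, or about both separately.
about_x = reweight(prior, lambda x, z: 3.0 if x else 1.0)
about_z = reweight(prior, lambda x, z: 0.5 if z else 1.0)
about_both = reweight(prior, lambda x, z: (3.0 if x else 1.0) * (0.5 if z else 1.0))

# Forbidden flow: evidence about "X XOR Z" that says nothing separable.
xor_flow = reweight(prior, lambda x, z: 3.0 if x == z else 1.0)

for name, dist in [("prior", prior), ("about_x", about_x), ("about_z", about_z),
                   ("about_both", about_both), ("xor_flow", xor_flow)]:
    print(name, log_odds_ratio(dist))
```

The first four values agree up to rounding, while the XOR-style update shifts the quantity by twice the log of the reweighting factor.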

If E is a random variable representing the possible values of e, this means that we want

log[P(X∧Z|E)*P(¬X∧¬Z|E) / (P(X∧¬Z|E)*P(¬X∧Z|E))]

to be constant, or, equivalently, seeing the posterior probabilities as random variables dependent on E:

**Variance{log[ P(X∧Z|E)*P(¬X∧¬Z|E) / (P(X∧¬Z|E)*P(¬X∧Z|E)) ]} = 0**.

Call that variance the XE-naturalness measure. If it is zero, then Z defines an XE-natural category. Note that this does not imply that Z and X are independent, or independent conditional on E. Just that they are, in some sense, "equally (in)dependent whatever E is".
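Here is a minimal Python sketch of the XE-naturalness measure for boolean X and Z. The representation of E as a list of (probability, posterior) pairs, the function names, and the toy posteriors are assumptions made for illustration, not something specified in the post.

```python
import math


def log_odds_ratio(post):
    """post maps (x, z) booleans to the posterior P(X=x, Z=z | E=e)."""
    return math.log(post[(True, True)] * post[(False, False)]
                    / (post[(True, False)] * post[(False, True)]))


def xe_naturalness(evidence):
    """Variance over E of the log odds ratio.

    evidence is a list of (P(E=e), posterior given e) pairs.
    """
    total = sum(p for p, _ in evidence)
    values = [(p / total, log_odds_ratio(post)) for p, post in evidence]
    mean = sum(p * v for p, v in values)
    return sum(p * (v - mean) ** 2 for p, v in values)


def swap_x(post):
    """Interchange X and not-X in a posterior."""
    return {(not x, z): prob for (x, z), prob in post.items()}


# Hypothetical posteriors. Both have odds ratio 6, so X and Z are dependent,
# but equally dependent whichever piece of evidence is observed.
e1 = {(True, True): 0.3, (True, False): 0.2, (False, True): 0.1, (False, False): 0.4}
e2 = {(True, True): 0.45, (True, False): 0.3, (False, True): 0.05, (False, False): 0.2}
natural = [(0.5, e1), (0.5, e2)]

# A hypothetical grue-like case: the second posterior has odds ratio 36.
e3 = {(True, True): 0.45, (True, False): 0.1, (False, True): 0.05, (False, False): 0.4}
grue_like = [(0.5, e1), (0.5, e3)]

print(xe_naturalness(natural))
print(xe_naturalness(grue_like))
print(xe_naturalness([(p, swap_x(post)) for p, post in grue_like]))
```

The first value is zero up to rounding even though X and Z are not independent; the last two agree, reflecting that interchanging X and ¬X only flips the sign of the logarithm.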

The advantage of that last formulation becomes visible when we consider that the evidence which we uncover is not, in the real world, going to perfectly mark Z as natural, given X. To return to the grue example, though most evidence we uncover about an object is going to be the colour or the time rather than some weird combination, there is going to be somebody who will write things like "either the object is green, and the sun has not yet set in the west; or instead perchance, those two statements are both alike in falsity". Upon reading that evidence, if we believe it in the slightest, the variance can no longer be zero.

Thus we cannot expect that the above XE-naturalness be perfectly zero, but we can demand that it be low. How low? There seems no principled way of deciding this, but we can make one attempt: that we cannot lower it by decomposing Z.

What do we mean by that? Well, assume that Z is a natural category, given X and the expected evidence, but Z' is not. Then we can define a new boolean category Y to be Z with high probability, and Z' otherwise. This will still have a low XE-naturalness measure (as Z does) but is obviously not ideal.

Reversing this idea, we say Z defines an "XE-almost natural category" if there is no "more XE-natural" category that extends X∧Z (and similarly for the other conjunctions). Technically, if

X∧Z = X∧Y,

then Y must have equal or greater XE-naturalness measure to Z. And similarly for X∧¬Z, ¬X∧Z, and ¬X∧¬Z.

**Note**: I am somewhat unsure about this last definition; the concept I want to capture is clear (Z is not the combination of more XE-natural subvariables), but I'm not certain the definition does it.

What if Z takes n values, rather than being a boolean? This can be treated simply.

If we set the w_{jk} to be log-weights as before, there are 2n free variables. The normalisation constraint is that they all sum to a constant. The "permissible" probability flows are given by flows from X to ¬X (adding a constant to the first column, subtracting it from the second) and pure changes in Z (adding constants to various rows, summing to 0). There are 1+ (n-1) linearly independent ways of doing this.

Therefore we are looking for 2n-1 -(1+(n-1))=n-1 independent constraints to forbid non-natural updating of X and Z. One basis set for these constraints could be to keep constant the values of

w_{j1} + w_{(j+1)2} - w_{j2} - w_{(j+1)1},

where j ranges between 1 and n-1.

This translates to variance constraints of the type:

**Variance{log[ P(X∧{Z=j}|E)*P(¬X∧{Z=j+1}|E) / (P(X∧{Z=j+1}|E)*P(¬X∧{Z=j}|E)) ]} = 0.**

But those are n-1 different possible variances. What is the best global measure of XE-naturalness? It seems it could simply be

**Max_{jk} Variance{log[ P(X∧{Z=j}|E)*P(¬X∧{Z=k}|E) / (P(X∧{Z=k}|E)*P(¬X∧{Z=j}|E)) ]} = 0.**

If this quantity is zero, it naturally sends all variances to zero, and, when not zero, is a good candidate for the degree of XE-naturalness of Z.
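For a Z taking n values, the maximum over pairs can be sketched as follows in Python. As before, the data layout, the helper names, and the numbers are hypothetical. Only unordered pairs j < k are enumerated, since swapping j and k flips the sign of the logarithm without changing the variance, and j = k contributes zero.

```python
import math
from itertools import combinations


def max_pairwise_variance(evidence, z_values):
    """Largest variance over E of the pairwise log odds ratios.

    evidence is a list of (P(E=e), posterior) pairs, where posterior maps
    (x, z) to P(X=x, Z=z | E=e) for x in {True, False} and z in z_values.
    """
    total = sum(p for p, _ in evidence)
    worst = 0.0
    for j, k in combinations(z_values, 2):
        values = [(p / total,
                   math.log(post[(True, j)] * post[(False, k)]
                            / (post[(True, k)] * post[(False, j)])))
                  for p, post in evidence]
        mean = sum(p * v for p, v in values)
        worst = max(worst, sum(p * (v - mean) ** 2 for p, v in values))
    return worst


def product_posterior(p_x, p_z):
    """A posterior in which X and Z are independent (used only to build examples)."""
    return {(x, z): (p_x if x else 1.0 - p_x) * pz
            for x in (True, False) for z, pz in p_z.items()}


z_values = ["a", "b", "c"]
e1 = product_posterior(0.5, {"a": 0.2, "b": 0.3, "c": 0.5})
e2 = product_posterior(0.8, {"a": 0.6, "b": 0.3, "c": 0.1})

# Tilt one cell of e2 to create evidence that couples X with Z = "a".
tilted = dict(e2)
tilted[(True, "a")] *= 4.0
norm = sum(tilted.values())
tilted = {cell: value / norm for cell, value in tilted.items()}

print(max_pairwise_variance([(0.5, e1), (0.5, e2)], z_values))
print(max_pairwise_variance([(0.5, e1), (0.5, tilted)], z_values))
```

The first result is zero up to rounding; in the second, the pairs involving "a" have non-zero variance while the pair (b, c) does not, and the maximum picks out the worst pair.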

The extension to the case where X takes multiple values is straightforward:

**Max_{jklm} Variance{log[ P({X=l}∧{Z=j}|E)*P({X=m}∧{Z=k}|E) / (P({X=l}∧{Z=k}|E)*P({X=m}∧{Z=j}|E)) ]} = 0.**

Note: if ever we need to compare the XE-naturalness of random variables taking different numbers of values, it may become necessary to divide these quantities by the number of variables involved, or maybe substitute a more complicated expression that contains all the different possible variances, rather than simply the maximum.

In the next post, I'll look at using this in practice for an AI, to evade presidential deaths and deflect asteroids.

*A putative new idea for AI control; index here.*

When posing his "New Riddle of Induction", Goodman introduced the concepts of "grue" and "bleen" to show some of the problems with the conventional understanding of induction.

I've somewhat modified those concepts. Let T be a set of intervals in time, and we'll use the boolean X to designate the fact that the current time t belongs to T (with ¬X equivalent to t∉T). We'll define an object to be:

- **Grue** if it is green given X (ie whenever t∈T), and blue given ¬X (ie whenever t∉T).
- **Bleen** if it is blue given X, and green given ¬X.

At this point, people are tempted to point out the ridiculousness of the concepts, dismissing them because of their strange disjunctive definitions. However, this doesn't really solve the problem; if we take grue and bleen as fundamental concepts, then we have the disjunctively defined green and blue; an object is:

- **Green** if it is grue given X, and bleen given ¬X.
- **Blue** if it is bleen given X, and grue given ¬X.

Still, the categories green and blue are clearly more fundamental than grue and bleen. There must be something we can whack them with to get this - maybe Kolmogorov complexity or stuff like that? Sure, someone on Earth could make a grue or bleen object (a screen with a timer, maybe?), but it would be completely artificial. Note that though grue and bleen are unnatural, "currently grue" (colour=green XOR ¬X) or "currently bleen" (colour=blue XOR ¬X) make perfect sense (though they require knowing X, an important point for later on).
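The XOR definitions of "currently grue" and "currently bleen" can be written out directly. This Python sketch assumes, purely for illustration, that an object's colour is either "green" or "blue"; the function names are hypothetical.

```python
def currently_grue(colour, x):
    """colour=green XOR not-X, where x is True when the current time is in T."""
    return (colour == "green") != (not x)


def currently_bleen(colour, x):
    """colour=blue XOR not-X."""
    return (colour == "blue") != (not x)


def colour_from_grue(is_grue, x):
    """Recover the colour from grue-ness and X (two-colour assumption)."""
    if x:
        return "green" if is_grue else "blue"
    return "blue" if is_grue else "green"


for colour in ("green", "blue"):
    for x in (True, False):
        grue = currently_grue(colour, x)
        print(colour, x, grue, currently_bleen(colour, x), colour_from_grue(grue, x))
```

Both directions of the translation take X as an argument, which matches the symmetry between the two pairs of definitions: neither vocabulary can be converted into the other without knowing the time.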

But before that... are we so sure the grue and bleen categories are unnatural? Relative to what?

Chiron Beta Prime, apart from having its own issues with low-intelligence AIs, is noted for having many suns: one large sun that glows mainly in the blue spectrum, and multiple smaller ones glowing mainly in the green spectrum. They all emit in the totality of the spectrum, but they are stronger in those colours.

Because of the way the orbits are locked to each other, the green suns are always visible from everywhere. The blue sun rises and sets on a regular schedule; define T to be the time when the blue sun is risen (so X="Blue sun visible, some green suns visible" and ¬X="Blue sun not visible, some green suns visible").

Now "green" is a well defined concept in this world. Emeralds are green; they glow green under the green suns, and do the same when the blue sun is risen. "Blue" is also a well-defined concept. Sapphires are blue. They glow blue under the blue sun and continue to do so (albeit less intensely) when it is set.

But "grue" is also a well defined concept. Diamonds are grue. They glow green when the green suns are the only ones visible, but glow blue under the glare of the blue sun.

Green, blue, and grue (which we would insist on calling green, blue and *white*) are thus well understood and fundamental concepts, that people of this world use regularly to compactly convey useful information to each other. They match up easily to fundamental properties of the objects in question (eg frequency of light reflected).

Bleen, on the other hand - don't be ridiculous. Sure, someone on Chiron Beta Prime could make a bleen object (a screen with a timer, maybe?), but it would be completely artificial.

In contrast, the inhabitants of Pholus Delta Secundus, who have a major green sun and many minor blue suns (coincidentally with exactly the same orbital cycles), feel that green, blue and bleen are the natural categories...

We've shown that some categories that we see as disjunctive or artificial can seem perfectly natural and fundamental to beings in different circumstances. Here's another example:

A philosopher proposes, as a thought experiment, to define a certain concept for every object. It's the weighted sum of the inverse of the height of an object (from the centre of the Earth), and its speed (squared, because why not?), and its temperature (but only on an "absolute" scale), and some complicated thing involving its composition and shape, and another term involving its composition only. And maybe we can add another piece for its total mass.

And then that philosopher proposes, to great derision, that this whole messy sum be given a single name, "Energy", and that we start talking about it as if it was a single thing. Faced with such an artificially bizarre definition, sensible people who want to use induction properly have no choice... but to embrace energy as one of the fundamental useful facts of the universe.

What these examples show is that green, blue, grue, bleen, and energy are not natural or non-natural categories in some abstract sense, but relative to the universe we inhabit. For instance, if we had some strange energy' which used the inverse of the height *cubed*, then we'd have a useless category - unless we lived in five spatial dimensions.

So how can we say that green and blue are natural categories in our universe, while grue and bleen are not? A very valid explanation seems to be the dependence on X - on the time of day. On our Earth, we can tell whether objects are green or blue without knowing anything about the time. Certainly we can get combined information about an object's colour and the time of day (for instance by looking at emeralds out in the open). But we also expect to get information about the colour (by looking at an object in a lit basement) and the time (by looking at a clock). And we expect these pieces of information to be independent of each other.

In contrast, we never expect to get information about an object being currently grue or currently bleen without knowing the time (or the colour, for that matter). And information about the time can completely change our assessment as to whether an object is grue versus bleen. It would be a very contrived set of circumstances where we would be able to assert "I'm pretty sure that object is currently grue, but I have no idea about its colour or about the current time".

Again, this is a feature of our world and the evidence we see in it, not some fundamental feature of the categories of grue and bleen. We just don't generally see green objects change into blue objects, nor do we typically learn about disjunctive statements of the type "colour=green XOR time=night" without learning about the colour and the time separately.

What about the grue objects on Chiron Beta Prime? There, people do see objects change colour regularly, and, upon investigation, they can detect whether an object is grue without knowing either the time or the apparent colour of the object. For instance, they know that diamond is grue, so they can detect some grue objects by a simple hardness test.

But what's happening is that the Chiron Beta Primers have correctly identified a fundamental category - the one we call white, or, more technically "prone to reflect light both in the blue and green parts of the spectrum" - that has different features on their planet than on ours. From the macroscopic perspective, it's as if we and they live in a different universe, hence grue means something to them and not to us. But the same laws of physics underlie both our worlds, so fundamentally the concepts converge - our white, their grue, mean the same things at the microscopic level.

In the next post, I'll look at whether we can formalise "expect independent information about colour and time", and "we don't expect change to the time information to change our colour assessment."

But be warned. The naturalness of these categories is dependent on facts about the universe, and these facts could be changed. A demented human (or a powerful AI) could go through the universe, hiding everything in boxes, smashing clocks, and putting "current bleen detectors" all over the place, so that it suddenly becomes very easy to know statements like "colour=blue XOR time=night", but very hard to know about colour (or time) independently from this. So it would be easy to say "this object is currently bleen", but hard to say "this object is blue". Thus the "natural" categories may be natural now, but this could well change, so we must take care when using these definitions to program an AI.

In response to
Top 9+1 myths about AI risk

As an example of number 10, consider the Optimalverse. The friendliest death of self-determination I ever did see.

Unfortunately, I'm not quite sure of the point of this post, considering you're posting a reply to news articles on a forum filled with people who understand the mistakes they made in the first place. Perhaps as a repository of rebuttals to common misconceptions posited in the future?

As an article to link to when the issue comes up.

Can you expand on Point #7, if that's possible? There are some people who honestly think Friendliness researchers at MIRI and other places actually discourage AI research. That sounds ridiculous to me; I've never seen such an attitude from Friendliness researchers, nor can I even imagine it. But this was the primary reason for Mark Friedenbach's leaving LW: he said that there's a massive tendency against solving world problems on LW, specifically because actual AI research is supposedly dangerous. He considered LW a memetic hazard that he doesn't want to participate in. Although I completely disagree with his evaluation of the current memes of LW and MIRI, he claimed he received 2 separate death threats on the #lesswrong IRC channel when he mentioned that he wants to do actual AI research.

So if there's somebody who is actually against ongoing AI research, I want to know that. And if that's not an isolated event, but a tendency, even small, MIRI or somebody should make a statement. I mean, people are getting ridiculously distorted ideas of MIRI and LW, and little effort is made to correct them.

Thanks for your response and not to be argumentative, but honest question: doesn't that mean that you want some forms of AI research to slow down, at least on a relative scale?

I personally don't see anything wrong with this stance, but it seems to me like you're trying to suggest that this trade-off doesn't exist, and that's not at all what I took from reading Bostrom's Superintelligence.

In response to
No peace in our time?

Disregard this retracted comment


This is exactly how I responded to the problem of grue when hearing about it. I don't see your post as invalidating that. Here's why (and this may be equivalent to your own answer, I don't know): you need to calculate the *additional* Kolmogorov complexity of concept X + "everything else you knew" *over* "everything else you know".

For a simple example, if I see just the cover of a known book, considering the Kolmogorov complexity of the book in isolation should lead me to conclude that the inside of the book doesn't exist. Surely the inside of the book containing a whole lot of data has far greater Kolmogorov complexity than an empty book with the same cover? The obvious answer is that *once you include the rest of your knowledge, the marginal added complexity by assuming this book "matches" the cover is less than assuming it doesn't.*

In the same way, in our world, if something has attributes that fit both grue and green, the overall complexity of grue+"all my knowledge" will be greater than overall complexity of green+"all my knowledge". Conversely, if "changing when the sun is out" is a real possibility, that compresses the space needed to express the grue concept, and lowers the marginal complexity.

You *seem* to be using "ease of learning X" as some kind of proxy for actual Kolmogorov complexity, or something.

Yep. I didn't go for Kolmogorov complexity because I had another mathematical definition I wanted to try out: http://lesswrong.com/r/discussion/lw/mbr/grue_bleen_and_natural_categories/