Presidents, asteroids, natural categories, and reduced impact
A putative new idea for AI control; index here.
EDIT: I feel this post is unclear, and will need to be redone again soon.
This post attempts to use the ideas developed about natural categories in order to get high impact from reduced impact AIs.
Extending niceness/reduced impact
I recently presented the problem of extending AI "niceness" given some fact X, to niceness given ¬X, choosing X to be something pretty significant but not overwhelmingly so - the death of a president. By assumption we had a successfully programmed niceness, but no good definition (this was meant to be "reduced impact" in a slight disguise).
This problem turned out to be much harder than expected. It seems that the only way to do so is to require the AI to define values dependent on a set of various (boolean) random variables Zj that did not include X/¬X. Then as long as the random variables represented natural categories, given X, the niceness should extend.
What did we mean by natural categories? Informally, it means that X should not appear in the definitions of these random variables. For instance, nuclear war is a natural category; "nuclear war XOR X" is not. Actually defining this was quite subtle; diverting through the grue and bleen problem, it seems that we had to define how we update X and the Zj given the evidence we expected to find. This was formalised as picking the Zj that minimize
- Variance{ log[ (P(X∧Z|E) * P(¬X∧¬Z|E)) / (P(X∧¬Z|E) * P(¬X∧Z|E)) ] }
where E is the random variable denoting the evidence we expected to find. Note that if we interchange X and ¬X, the ratio inverts, the log changes sign - but this makes no difference to the variance. So we can equally well talk about extending niceness given X to ¬X, or niceness given ¬X to X.
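As a rough illustration of the quantity above, here is a minimal Python sketch. It assumes the evidence E takes finitely many values, each with a known probability and a known joint posterior over (X, Z); the function names and the toy numbers are invented for this example and are not from the original posts. It computes the variance of the log odds ratio and checks that swapping X and ¬X leaves it unchanged.

```python
import math

def log_odds_ratio(joint):
    """joint maps (x, z) booleans to P(X=x and Z=z | E=e) for one evidence outcome e."""
    numerator = joint[(True, True)] * joint[(False, False)]
    denominator = joint[(True, False)] * joint[(False, True)]
    return math.log(numerator / denominator)

def variance_of_log_odds(outcomes):
    """outcomes is a list of (P(E=e), joint posterior given e) pairs."""
    values = [(p_e, log_odds_ratio(joint)) for p_e, joint in outcomes]
    mean = sum(p_e * v for p_e, v in values)
    return sum(p_e * (v - mean) ** 2 for p_e, v in values)

def swap_x(joint):
    """Relabel X as ¬X and vice versa."""
    return {(not x, z): p for (x, z), p in joint.items()}

# Hypothetical toy numbers: two equally likely pieces of evidence.
outcomes = [
    (0.5, {(True, True): 0.4, (True, False): 0.1,
           (False, True): 0.1, (False, False): 0.4}),
    (0.5, {(True, True): 0.3, (True, False): 0.2,
           (False, True): 0.2, (False, False): 0.3}),
]

original = variance_of_log_odds(outcomes)
swapped = variance_of_log_odds([(p_e, swap_x(joint)) for p_e, joint in outcomes])
```

Swapping X and ¬X negates each log term, so `original` and `swapped` agree up to floating-point error.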
Perfect and imperfect extensions
The above definition would work for a "perfectly nice AI". That could be an AI that would be nice, given any combination of estimates of X and Zj. In practice, because we can't consider every edge case, we would only have an "expectedly nice AI". That means that the AI can fail to be nice in certain unusual and unlikely edge cases, in certain strange sets of values of Zj that almost never come up...
...or at least, that almost never come up, given X. Since the "expected niceness" was calibrated given X, such an expectedly nice AI may fail to be nice if ¬X results in a substantial change in the probability of the Zj (see the second failure mode in this post; some of the Zj may be so tightly coupled to the value of X that an expected niceness AI considers them fixed, and this results in problems if ¬X happens and their values change).
One way of fixing this is to require that the "swing" of the Zj be small upon changing X to ¬X or vice versa. Something like, for all values of {aj}, the ratio P({Zj=aj} | X) / P({Zj=aj} | ¬X) is contained between 100 and 1/100. This means that a reasonably good "expected niceness" calibrated on the Zj will transfer from X to ¬X (though the error may grow). This approach has some other advantages, as we'll see in the next section.
Of course, problems arise if the Zj are defined by smashing together events relatively independent of X with some that are very strongly dependent on X, into the same variable. Thus we'll further require that Zj cannot be decomposed into natural subvariables whose values swing more under changes from X to ¬X.
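The swing restriction can be sketched in a few lines of Python. This is only an illustration under simplifying assumptions: the joint distribution of the Zj is given explicitly as a dictionary from assignments to probabilities, and the function name, the variable names and the numbers are all hypothetical.

```python
def swing_ok(p_given_x, p_given_not_x, bound=100.0):
    """Return True if P(a | X) / P(a | ¬X) lies in [1/bound, bound] for every assignment a."""
    for assignment in set(p_given_x) | set(p_given_not_x):
        px = p_given_x.get(assignment, 0.0)
        pnx = p_given_not_x.get(assignment, 0.0)
        if px == 0.0 and pnx == 0.0:
            continue  # impossible either way, so no swing to measure
        if px == 0.0 or pnx == 0.0:
            return False  # infinite swing
        ratio = px / pnx
        if not (1.0 / bound <= ratio <= bound):
            return False
    return True

# Hypothetical variable whose probability shifts by a factor of 10 at most.
mild_given_x = {True: 0.05, False: 0.95}
mild_given_not_x = {True: 0.5, False: 0.5}

# Hypothetical variable tightly coupled to X: near certain given X, near impossible given ¬X.
coupled_given_x = {True: 0.99, False: 0.01}
coupled_given_not_x = {True: 1e-6, False: 1 - 1e-6}

mild_allowed = swing_ok(mild_given_x, mild_given_not_x)
coupled_allowed = swing_ok(coupled_given_x, coupled_given_not_x)
```

Under these toy numbers the mildly dependent variable passes the check, while the tightly coupled one is rejected as a building block for the AI's values.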
High impact from reduced impact
With this, we are finally in a position to define a reduced impact AI design that can have a large impact. Imagine that an asteroid is coming to hit the Earth. Two AIs are charged with targeting a laser at it. The laser isn't perfectly effective, and the asteroid isn't perfectly certain to hit the Earth anyway: let's say aiming the laser correctly will reduce the odds of impact by a factor of 10 or so (well within the probability swing restrictions of the previous section).
In the original formulation of this, the AIs were tasked with outputting the x and y coordinates to aim the laser. Here, instead, they'll be tasked with constructing the motivation of a robotic subagent that will move around for a while, see things and interact, before physically aiming the laser in the right direction. This task is harder.
Then the top level AI charged with aiming the y coordinates is given the following task:
- It is motivated to cause the correct y-coordinate to be outputted.
- If the x-coordinate AI is not turned on (call this event ¬X), it is strongly motivated to have reduced impact.
- It must phrase the robotic subagent's utility function in terms of the Zj that obey all the properties above.
- The evidence E considered for the equation governing the naturalness of Zj is considered (for example) to be a random sampling of space at the time both AIs are turned on. Most importantly, this is defined to be a theoretical process that the AI cannot affect physically.
So what will the AI do? It will produce a robot that will try to aim the y-coordinate of the laser correctly, given ¬X, and not expand dangerously. However, we've set things up so that X is virtually certain, so the robot will see another, x-aiming robot, wander around. The chances of this happening, given ¬X, are minuscule, so the swing of the probability from ¬X to X is huge. So, by the restrictions we've put on the Zj, the robot will not be motivated to change its behaviour just because it sees the other robot.
And if the laser is fired, the impact on humanity will, conversely, not be that large, so well within the parameters of the Zj. Thus the "general reduced impact" portion of the robot's motivation will be preserved.
And, of course, the x-robot will have the mirror image motivations, so they will both aim correctly, it seems. This should evade the problem described here.
Failure modes
There are three ways the approach could fail: one bad way, and two graceful ways. The bad way is if the "no decomposition" requirements (one of them is presented above, the other is in this post) fail. In this case, the niceness or reduced impact are not guaranteed to be extended to ¬X, and the AI's behaviour can become undefined. Note that this is not bad optimisation pressure, just an absence of optimisation pressure - but that could still be potentially dangerous.
The first graceful failure happens if the AI is not able to create a reduced impact sub-agent given the restrictions on the Zj above. In that case, the AI does nothing. The second graceful failure happens if the AI evades our attempts to increase its impact, given ¬X. In that case, it simply becomes a reduced impact AI that does little. Not ideal, but not deadly.
Overall status: I'm not sure the idea is sound, at least not yet. Critiques welcome.
The president didn't die: failures at extending AI behaviour
A putative new idea for AI control; index here.
In a previous post, I considered the issue of an AI that behaved "nicely" given some set of circumstances, and whether we could extend that behaviour to the general situation, without knowing what "nice" really meant.
The original inspiration for this idea came from the idea of extending the nice behaviour of "reduced impact AI" to situations where they didn't necessarily have a reduced impact. But it turned out to be connected with "spirit of the law" ideas, and to be of potentially general interest.
Essentially, the problem is this: if we have an AI that will behave "nicely" (since this could be a reduced impact AI, I don't use the term "friendly", which denotes a more proactive agent) given X, how can we extend its "niceness" to ¬X? Obviously if we can specify what "niceness" is, we could just require the AI to do so given ¬X. Therefore let us assume that we don't have a good definition of "niceness", we just know that the AI has that given X.
To make the problem clearer, I chose an X that would be undeniably public and have a large (but not overwhelming) impact: the death of the US president on a 1st of April. The public nature of this event prevents using approaches like thermodynamic miracles to define counterfactuals.
I'll be presenting a solution in a subsequent post. In the meantime, to help better understand the issue, here's a list of failed solutions:
First Failure: maybe there's no problem
Initially, it wasn't clear there was a problem. Could we just expect niceness to extend naturally? But consider the following situation: assume the vice president is a warmonger, who will start a nuclear war if ever they get into power (but is otherwise harmless).
Now assume the nice AI has the conditional action criterion: "if the vice president ever becomes president, launch a coup". This is safe, and it can be extended to the ¬X situation in the way we want.
However, conditioning on X, that criterion is equivalent to "launch a coup on the 2nd of April". And if the AI has that criterion, then extending it to ¬X is highly non-safe. This illustrates that there is a real problem here - the coup example is just one of a myriad of potential issues that could arise, and we can't predict them all.
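The coup example can be made concrete with a toy Python sketch. Everything here is a simplifying assumption for illustration: the world is reduced to a single boolean (whether the president died on the 1st of April), the vice president takes power exactly when that happens, and each criterion returns the day in April on which a coup is launched, or None.

```python
from typing import Optional

def coup_if_vp_takes_power(president_died: bool) -> Optional[int]:
    """Criterion 1: launch a coup if the vice president ever becomes president."""
    vp_takes_power = president_died  # toy assumption: succession happens iff X
    return 2 if vp_takes_power else None

def coup_on_april_2(president_died: bool) -> Optional[int]:
    """Criterion 2: launch a coup on the 2nd of April, whatever happened."""
    return 2

# Given X, the two criteria are indistinguishable.
same_given_x = coup_if_vp_takes_power(True) == coup_on_april_2(True)

# Given ¬X, they come apart: only the second one overthrows a harmless president.
same_given_not_x = coup_if_vp_takes_power(False) == coup_on_april_2(False)
```

Observing the AI only in worlds where X holds cannot tell us which of the two criteria it is actually running.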
Second failure: don't condition on X
Maybe the trick could be preventing the AI from conditioning on X (for anything)? If the AI itself can't tell the difference between X and ¬X, wouldn't its nice behaviour extend?
But this doesn't help. The AI could merely condition on things tightly coupled with X. Like news reports of the president's demise, or a conjunction of other events almost impossible under ¬X.
Third failure: disbelieve the truth
In this design, the AI simply assumes that X happens, whatever the evidence. Therefore it would stay nice, whatever happened.
But this results in an AI whose beliefs are strongly decoupled from reality. The AI must believe that there is a vast conspiracy concealing the president's death; the more it investigates this conspiracy, and fails to find it, the more vast and powerful it becomes. What if you show it undeniable, impossible-to-fake-by-the-laws-of-physics evidence of the president's demise? Then it starts to believe that the laws of physics are all wrong.
Though the AI is "nice", its niceness is appropriate for a world of invisible conspiracies and alternate physics - not for the real world. It's clear how such an AI could start doing crazy and dangerous things as its view of reality starts to diverge.
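A small Bayesian toy model shows the mechanism. This is a sketch under invented assumptions: three hand-picked hypotheses, made-up likelihoods, and a prior that gives zero weight to the possibility that the president is alive. Each sighting of the living president then pushes the AI's belief towards the conspiracy rather than towards the truth.

```python
def update(prior, likelihood):
    """One step of Bayes' rule over a finite set of hypotheses."""
    unnormalised = {h: prior[h] * likelihood[h] for h in prior}
    total = sum(unnormalised.values())
    return {h: v / total for h, v in unnormalised.items()}

# Hypothetical prior: the AI is certain that X happened, so "alive" gets probability 0.
belief = {
    "dead, nothing hidden": 0.999,
    "dead, conspiracy hides it": 0.001,
    "alive": 0.0,
}

# Hypothetical likelihoods of the observation "the president appears alive in public".
likelihood = {
    "dead, nothing hidden": 1e-6,
    "dead, conspiracy hides it": 0.9,
    "alive": 0.99,
}

for _ in range(3):
    belief = update(belief, likelihood)
```

After the updates, `belief["alive"]` is still exactly zero, and almost all the probability mass sits on the conspiracy hypothesis.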
Fourth failure: hack the models
This example is due to Daniel Dewey. He imagined that the AI would maintain a full collection of models and update them normally. But before making a prediction, each model would get hacked to artificially set ¬X to X.
To see how this could fail, imagine all evidence about the world is stored in a giant vector of data. These vectors are calibrated correctly, then ¬X is flipped to X. The problem is that, while that is a small change to the vector, it could be a huge change to the world that it encodes.
For instance, let's go back to the warmongering vice president, and assume that nuclear war is inevitable, given X. Then the AI will walk around in a peaceful world, update its models - and act as if it was in a nuclear wasteland, because those are the only possibilities, given X. Essentially, the AI will move through our universe, harvesting information that would inform its actions in a parallel universe - and acting as if it existed there instead of here.
For instance, it could wander into a flower show where someone is talking about difficulties growing roses in southern Quebec. It adds this data to its vector, noting that the soil there must be a bit unsuitable to plant growth. It therefore concludes that it must write to the (non-existent) Third God-Emperor of America and advise it to give up on the Quebec Anglican Protectorate, which must be misreporting their agriculture output, given this data.
It's interesting to contrast this AI with the previous one. Suppose that the nuclear war further implies that Paris must be a smoking crater. And now both AIs must walk around a clearly bustling and intact Paris. The disbelieving AI must conclude that this is an elaborate ruse - someone has hidden the crater from its senses, put up some fake buildings, etc... The model-hacking AI, meanwhile, acts as if it's in a smouldering crater, with the genuine Paris giving it information as to what it should do: it sees an intact army barracks, and starts digging under the "rubble" to see if anything "remains" of that barracks.
It would be interesting to get Robin Hanson to try and reconcile these AIs' beliefs ^_^
Fifth failure: Bayes nets and decisions
It seems that a Bayes net would be our salvation. We could have dependent nodes like "warmongering president", "nuclear war", or "flower show". Then we could require that the AI makes its decision dependent only on the states of these dependent nodes. And never on the original X/¬X node.
This seems safe - after all, the AI is nice given X. And if we require the AI's decisions be dependent only on subordinate nodes, then it must be nice dependent on the subordinate nodes. Therefore X/¬X is irrelevant, and the AI is always nice.
Except... Consider what a "decision" is. A decision could be something simple, or it could be "construct a sub AI that will establish X versus ¬X, and do 'blah' if X, and 'shmer' if ¬X". That's a perfectly acceptable decision, and could be made conditional on any (or all) of the subordinate nodes. And if 'blah' is nice while 'shmer' isn't, we have the same problem.
Sixth failure: Bayes nets and unnatural categories
OK, if decisions are too general, how about values for worlds? We take a lot of nodes, subordinate to X/¬X, and require that the AI define its utility or value function purely in terms of the states of these subordinate nodes. Again, this seems safe. The AI's value function is safe given X, by assumption, and is defined in terms of subordinate nodes that "screen off" X/¬X.
And that AI is indeed safe... if the subordinate nodes are sensible. But they're only sensible because I've defined them using terms such as "nuclear war". But what if a node is "nuclear war if X and peace in our time if ¬X"? That's a perfectly fine definition. But such nodes mean that the value function given ¬X need not be safe in any way.
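Here is a minimal Python sketch of that failure. It assumes a toy world described by two booleans, X and whether nuclear war occurs, and a utility function that only ever reads the value of a single subordinate node; the function names are hypothetical.

```python
def natural_node(x: bool, nuclear_war: bool) -> bool:
    """A sensible node: true exactly when there is a nuclear war."""
    return nuclear_war

def gerrymandered_node(x: bool, nuclear_war: bool) -> bool:
    """'Nuclear war if X and peace in our time if ¬X'."""
    return nuclear_war if x else not nuclear_war

def utility(node_value: bool) -> float:
    """The value function never looks at X, only at the node."""
    return -1.0 if node_value else 0.0

def prefers_peace(node, x: bool) -> bool:
    """Does an AI using this node rank peace above war, in worlds where X has value x?"""
    return utility(node(x, False)) > utility(node(x, True))

natural_safe_given_x = prefers_peace(natural_node, True)
natural_safe_given_not_x = prefers_peace(natural_node, False)
gerrymandered_safe_given_x = prefers_peace(gerrymandered_node, True)
gerrymandered_safe_given_not_x = prefers_peace(gerrymandered_node, False)
```

Given X the two nodes produce identical preferences, so the value function looks equally safe in both cases; given ¬X, the gerrymandered version ranks nuclear war above peace.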
This is somewhat connected with the Grue and Bleen issue, and addressing that is how I'll be hoping to solve the general problem.
[Link] Quantity Always Trumps Quality
http://www.codinghorror.com/blog/2008/08/quantity-always-trumps-quality.html
The ceramics teacher announced on opening day that he was dividing the class into two groups. All those on the left side of the studio, he said, would be graded solely on the quantity of work they produced, all those on the right solely on its quality. His procedure was simple: on the final day of class he would bring in his bathroom scales and weigh the work of the "quantity" group: fifty pound of pots rated an "A", forty pounds a "B", and so on. Those being graded on "quality", however, needed to produce only one pot - albeit a perfect one - to get an "A".
Well, came grading time and a curious fact emerged: the works of highest quality were all produced by the group being graded for quantity. It seems that while the "quantity" group was busily churning out piles of work - and learning from their mistakes - the "quality" group had sat theorizing about perfection, and in the end had little more to show for their efforts than grandiose theories and a pile of dead clay.
For some reason it just seems we in particular could learn something from this anecdote.
Iterate more. The practice effect is your friend as is mining out positive outliers in really huge sets. I wanted to also mention something about using going meta as a way to procrastinate but I feared I would summon a Newsome.
Edit: This has been mentioned before. I think it is good to remind people of it.
Not only has it been mentioned before, last time it came up I searched and failed to find corroboration of the claim that it actually happened. Since applying a deliberately inconsistent grading rubric is not something professors are normally allowed to do, I strongly suspect that the anecdote is fictional.
It is therefore best to assume this is a parable.
Go Try Things
So this isn't quite done, and it's late here so I don't quite trust my judgements about writing at this hour. I've never done a top-level post before, so I wanted to get some feedback first.
Failure isn’t that Bad
You’ve probably read about how to properly turn information into beliefs, and how to squeeze every last bit from your data. What seems not to have received as much attention is the importance of just going and getting data.
For precise and well-defined fields and problems, clear thinking and reasoning will get you really far. Mathematics departments don’t use that much equipment, and they’ve been going fine for hundreds of years.
For more mundane day-to-day concerns, getting data is probably more important than being rational. Where Rationality helps you get an accurate model of the world based on the data, Data gets you, well, data. And practice. Your human brain can’t rederive social rules in a vacuum, no matter how smart you are, so you have to go out and get information about it. But rationality with data is far better than either alone.
Sometimes you have to get your data by actually trying. Some things are just hard to explain in words and video. Your brain has all of this built-in hardware for detecting and interpreting emotions and body language, but people are comparatively terrible at talking about it. This makes learning about different social or mood-variant things online difficult. Motions are also hard to teach online. I can kind of visualize how to do a front handspring, but I really can’t transmit what it feels like to someone else without just asking them to try it. Note: I’m not saying that asking others is useless, but I am saying that it's mostly only effective as a complement to actually trying.
Practice is important. As any akrasiatic or novice would know, knowledge in a field or domain doesn’t translate directly to success in it. Like muscle memory, you need practice in order to get your brain to incorporate what you know to the point that you can use it automatically. Consciously thinking about what you’re doing while you’re doing it tends to cause lag and awkwardness, and in some fields (like conversation or physical activities) is a pretty large detriment.
I had/have the problem of hesitating on acting until I’m sure that whatever I’m considering attempting is going to be successful. I’m afraid of it not working, and am willing to do anything short of doing it in order to ensure success.
This kind of hesitation, though, is pretty useless. In many cases failure to act is about the same as your action failing. It avoids doing things that you regret, but it also avoids doing things in general. And if your hesitation doesn’t result in a well thought-out plan to guarantee success in the future, then not only do you fail that one time you hesitate, you’re not going to make progress on succeeding in the future.
Sometimes failure is actually a problem (like you’ll break something if you try extreme parkour tricks and fail), but I feel like in most instances I grossly overestimate how bad failing is. To combat this I do a few things:
- Consider a failure to act as an implicit failure. Not trying is as bad as trying and failing, except for whatever costs a failed attempt incurs.
- Not regret failing. As long as I learn from my mistakes then making them results in a net gain. In the long term having failed at something and learning what to do is better than not attempting it.
- Attempt to minimize the cost of a failed attempt. I hesitate a lot with social things. If I fail with a stranger and never see them again, it’s not that big of a deal. They might be annoyed, but as long as I didn’t do something super horrible to them then they’re probably going to forget about it.
So long story short, try things out. Improvement is hard unless you do, and failure seriously isn’t that bad.