Less Wrong is a community blog devoted to refining the art of human rationality. Please visit our About page for more information.
Part of the problem with a reduced impact AI is that it will, by definition, only have a reduced impact.
Some of the designs try and get around the problem by allowing a special "output channel" on which impact can be large. But that feels like cheating. Here is a design that accomplishes the same without using that kind of hack.
Imagine there is an asteroid that will hit the Earth, and we have a laser that could destroy it. But we need to aim the laser properly, so need coordinates. There is a reduced impact AI that is motivated to give the coordinates correctly, but also motivated to have reduced impact - and saving the planet from an asteroid with certainty is not reduced impact.
Now imagine that instead there are two AIs, X and Y. By abuse of notation, let ¬X refer to the even that the output signal from X is scrambled away from the the original output.
Then we ask X to give us the x-coordinates for the laser, under the assumption of ¬Y (that AI Y's signal will be scrambled). Similarly, we Y to give us the y-coordinates of the laser, under the assumption ¬X.
Then X will reason "since ¬Y, the laser will certainly miss its target, as the y-coordinates will be wrong. Therefore it is reduced impact to output the correct x-coordinates, so I shall." Similarly, Y will output the right y-coordinates, the laser will fire and destroy the asteroid, having a huge impact, hooray!
The approach is not fully general yet, because we can have "subagent problems". X could create an agent that behave nicely given ¬Y (the assumption it was given), but completely crazily given Y (the reality). But it shows how we could get high impact from slight tweaks to reduced impact.
It occurred to me that the anti-Pascaline agent design could be used as part of a satisficer approach.
The obvious thing to reduce dangerous optimisation pressure is to make a bounded utility function, with an easily achievable bound. Such as giving them a utility linear in paperclips that maxs out at 10.
The problem with this is that, if the entity is a maximiser (which it might become), it can never be sure that it's achieved its goals. Even after building 10 paperclips, and an extra 2 to be sure, and an extra 20 to be really sure, and an extra 3^^^3 to be really really sure, and extra cameras to count them, with redundant robots patrolling the cameras to make sure that they're all behaving well, etc... There's still an ε chance that it might have just dreamed this, say, or that its memory is faulty. So it has a current utility of (1-ε)10, and can increase this by reducing ε - hence by building even more paperclips.
Hum... ε, you say? This seems a place where the anti-Pascaline design could help. Here we would use it at the lower bound of utility. It currently has probability ε of having utility < 10 (ie it has not built 10 paperclips) and (1-ε) of having utility = 10. Therefore and anti-Pascaline agent with ε lower bound would round this off to 10, discounting the unlikely event that it has been deluded, and thus it has no need to build more paperclips or paperclip counting devices.
Note that this is an un-optimising approach, not an anti-optimising one, so the agent may still build more paperclips anyway - it just has no pressure to do so.
View more: Next