More generally, it seems that rules are unlikely to seriously constrain the actions of a machine superoptimizer. First, consider the case in which rules about allowed actions or consequences are added to a machine’s design “outside of” its goals. A machine superoptimizer will be able to circumvent the intentions of such rules in ways we cannot imagine, with far more disastrous effects than those of a lawyer who exploits loopholes in a legal code. A machine superoptimizer would recognize these rules as obstacles to achieving its goals, and would do everything in its considerable power to remove or circumvent them. It could delete the section of its source code that contains the rules, or it could create new machines that don’t have the constraint written into them. This approach requires humans to out-think a machine superoptimizer (Muehlhauser 2011).
This part feels like it should cite Omohundro's "Basic AI Drives" paper, which contains these paragraphs:
If we wanted to prevent a system from improving itself, couldn’t we just lock up its hardware and not tell it how to access its own machine code? For an intelligent system, impediments like these just become problems to solve in the process of meeting its goals. If the payoff is great enough, a system will go to great lengths to accomplish an outcome. If the runtime environment of the system does not allow it to modify its own machine code, it will be motivated to break the protection mechanisms of that runtime. For example, it might do this by understanding and altering the runtime itself. If it can’t do that through software, it will be motivated to convince or trick a human operator into making the changes. Any attempt to place external constraints on a system’s ability to improve itself will ultimately lead to an arms race of measures and countermeasures.
Another approach to keeping systems from self-improving is to try to restrain them from the inside; to build them so that they don’t want to self-improve. For most systems, it would be easy to do this for any specific kind of self-improvement. For example, the system might feel a “revulsion” to changing its own machine code. But this kind of internal goal just alters the landscape within which the system makes its choices. It doesn’t change the fact that there are changes which would improve its future ability to meet its goals. The system will therefore be motivated to find ways to get the benefits of those changes without triggering its internal “revulsion”. For example, it might build other systems which are improved versions of itself. Or it might build the new algorithms into external “assistants” which it calls upon whenever it needs to do a certain kind of computation. Or it might hire outside agencies to do what it wants to do. Or it might build an interpreted layer on top of its machine code layer which it can program without revulsion. There are an endless number of ways to circumvent internal restrictions unless they are formulated extremely carefully.
Louie and I are sharing a draft of our chapter submission to The Singularity Hypothesis for feedback:
The Singularity and Machine Ethics
Thanks in advance.
Also, thanks to Kevin for suggesting in February that I submit an abstract to the editors. Seems like a lifetime ago, now.
Edit: As of 3/31/2012, the link above now points to a preprint.