If it's worth saying, but not worth its own post (even in Discussion), then it goes here.
Notes for future OT posters:
1. Please add the 'open_thread' tag.
2. Check if there is an active Open Thread before posting a new one. (Immediately before; refresh the list-of-threads page before posting.)
3. Open Threads should be posted in Discussion, and not Main.
4. Open Threads should start on Monday, and end on Sunday.
Is there a knockdown argument against "blind" AI boxing as a general strategy?
For example, what is wrong with the following strategy:
The (probable) uFAI is instantiated inside a sandboxed virtual environment and tasked with optimizing for the revealed preferences of a species of complex but nonsentient artificial life forms that live inside that environment.
As boxing techniques, we have the following:
1. Assume the AI is smart enough to figure out it's in a box; explicitly penalize its objective function for any changes it makes to the outside world. (A toy sketch of such a penalized objective appears after this list.)
2. Give the artificial life forms an explicit preference that the external world (if one exists) remain causally pristine.
3. Provide at least one layer of virtual sandboxing. In other words, the simulation is itself running on a software simulation of a computer embedded in some simple artificial physics. If the behavior of the simulated computer deviates in any statistically detectable way from the behavior of a control, the box is terminated. (See the second sketch below.)
4. Give the AI a specific, relatively short time window in which to complete its task. (This can be built directly into the objective function.) After that time, the simulation ends and the scientists look only at what the AI did inside the box in order to optimize for the values of the artificial agents.
5. If the uFAI has not done anything horrific inside the box, run the experiment again with slightly more earth-like simulated physics and slightly more human-like artificial agents, and gradually observe what kinds of solutions it comes up with. (See the third sketch below.) If it has done something horrific, modify its source and start over.
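To make points 1 and 4 a bit more concrete, here is a minimal Python sketch of a penalized, time-limited objective. Everything in it is my own assumption rather than anything specified above: the two scoring functions are empty placeholders, and the hard part (actually measuring "external impact") is exactly what they hide.

```python
# Toy sketch (an assumption, not part of the proposal above) of a penalized,
# time-limited objective for the boxed AI.
PENALTY_WEIGHT = 1e6       # assumed: heavy weight on any detected outside-world influence
TIME_LIMIT_STEPS = 10_000  # assumed: the "relatively short time window"

def inside_box_utility(state) -> float:
    """Placeholder: revealed-preference satisfaction of the artificial life forms."""
    return 0.0

def estimated_external_impact(state) -> float:
    """Placeholder: estimated change the AI has caused outside the sandbox."""
    return 0.0

def boxed_objective(trajectory) -> float:
    """Score a plan only by what happens inside the box, within the time window."""
    score = 0.0
    for step, state in trajectory:
        if step > TIME_LIMIT_STEPS:
            # States after the window contribute nothing, so the AI gains nothing
            # from leaving "cognitive bombs" that only pay off later.
            break
        score += inside_box_utility(state)
        score -= PENALTY_WEIGHT * estimated_external_impact(state)
    return score
```

The obvious catch is that the operators still have to define `estimated_external_impact`, which is the same specify-it-in-English problem noted below.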
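Point 3's "statistically detectable" deviation could be operationalized as a two-sample test between low-level traces from the machine hosting the simulation and an identical control machine. The choice of trace (per-step timings) and the threshold below are illustrative assumptions, not part of the proposal.

```python
# Illustrative tripwire for point 3: compare a low-level measurement from the
# sandbox host against the same measurement from an identical control machine.
import numpy as np
from scipy.stats import ks_2samp

ALPHA = 1e-6  # assumed significance threshold; a false alarm only wastes one run

def box_violated(sandbox_trace: np.ndarray, control_trace: np.ndarray) -> bool:
    """Return True if the two traces differ by more than chance plausibly allows."""
    _statistic, p_value = ks_2samp(sandbox_trace, control_trace)
    return p_value < ALPHA

# Usage: terminate the box as soon as the tripwire fires.
rng = np.random.default_rng(0)
sandbox = rng.normal(loc=1.0, scale=0.05, size=100_000)  # stand-in timing trace
control = rng.normal(loc=1.0, scale=0.05, size=100_000)
print("terminate box:", box_violated(sandbox, control))
```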
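Finally, a hypothetical outline of the escalation loop in point 5. Every function here is a placeholder; `looks_horrific` in particular stands in for the scientists' offline inspection of the logs.

```python
# Hypothetical outline of point 5's escalation loop; all functions are placeholders.
MAX_REALISM = 10  # assumed number of steps from alien physics to earth-like physics

def run_boxed_experiment(realism: int) -> dict:
    """Placeholder: run one full boxed simulation at the given realism level."""
    return {"realism": realism, "log": []}

def looks_horrific(result: dict) -> bool:
    """Placeholder: the scientists' offline judgment of what the AI did in the box."""
    return False

def escalation_loop() -> list:
    """Run increasingly realistic boxed experiments, aborting if anything looks horrific."""
    observed_solutions = []
    for realism in range(MAX_REALISM + 1):
        result = run_boxed_experiment(realism)
        if looks_horrific(result):
            # "Modify its source and start over": abort here and leave the
            # modification and restart to the human operators.
            return []
        # "Gradually observe what kinds of solutions it comes up with."
        observed_solutions.append(result)
    return observed_solutions

print(len(escalation_loop()), "runs completed without incident")
```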
Problems I notice immediately: I've committed the sin of trying to specify AI behavior with English sentences. I've also exposed the scientists to basilisks, etc., but if the time preference is implemented right, the AI should have no motive to leave cognitive bombs. The AI should also have no clue what humans are like unless the layered boxing strategy has already failed. Finally, I'm leaving the implementation of advanced uFAI-generated solutions in the hands of the human operators.
All that said: again, is there a general reason why boxing can't work as a path to experimenting on uFAI?
These two things seem to contradict each other. How is the AI supposed to both complete a task for you and avoid influencing you causally?