
moridinamael comments on Open Thread, Apr. 13 - Apr. 19, 2015 - Less Wrong Discussion

4 Post author: Gondolinian 13 April 2015 12:19AM

Comment author: moridinamael 14 April 2015 03:06:21PM 2 points

Is there a knockdown argument against "blind" AI boxing as a general strategy?

For example, what is wrong with the following strategy:

The (probable) uFAI is instantiated inside a sandboxed virtual environment. The uFAI is tasked with optimizing for the revealed preferences of a species of complex but nonsentient artificial life form that lives inside the virtual environment.

The boxing technique consists of the following measures:

  • Assume the AI is smart enough to figure out it's in a box; explicitly penalize its objective function for any changes it makes to the outside world.

  • Give the artificial life forms an explicit preference that the external world (if one exists) remain causally pristine.

  • Provide at least one layer of virtual sandboxing. In other words, the simulation is itself running on a software simulation of a computer embedded in some simple artificial physics. If the behavior of the simulated computer deviates in any statistically detectable way from the behavior of a control, the box is terminated.

  • The AI is given a specific, relatively short time window in which to complete its task. (This can be built directly into the objective function.) After that time, the simulation is ended and the scientists look only at what the AI did inside the box in order to optimize for the values of the artificial agents.
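The first and last bullets could be combined into a single reward signal. The following is a minimal sketch of that idea; all the names here (`alife_utility`, `external_influence`, the penalty weight, the step budget) are illustrative assumptions, not part of the original proposal.

```python
def boxed_objective(world_state, step, alife_utility, external_influence,
                    penalty_weight=1e6, t_max=10_000):
    """Reward satisfying the artificial life forms' revealed preferences,
    heavily penalize any detectable effect on the outside world, and pay
    zero after the time window closes, so the AI has no motive to persist
    or leave anything behind."""
    if step > t_max:
        return 0.0  # the window has closed: nothing left to optimize for
    return (alife_utility(world_state)
            - penalty_weight * external_influence(world_state))
```

Note that the zero-reward-after-deadline term is what carries the "no motive to leave cognitive bombs" argument made below: once the window closes, no action inside or outside the box can change the AI's score.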

If the uFAI has not done anything horrific inside the box, run the experiment again with slightly more Earth-like simulated physics and slightly more human-like artificial agents, and gradually observe what kinds of solutions it comes up with. If it has done something horrific, modify its source and start over.
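The statistical tripwire in the third bullet could be sketched as follows: collect telemetry (e.g. instruction timings) from the simulated computer and from an AI-free control run, and terminate the box if their empirical distributions diverge. The two-sample Kolmogorov-Smirnov statistic and the threshold are my illustrative choices, not part of the original proposal.

```python
import bisect


def ks_statistic(sample_a, sample_b):
    """Two-sample Kolmogorov-Smirnov statistic: the maximum vertical
    distance between the two empirical CDFs."""
    a, b = sorted(sample_a), sorted(sample_b)

    def ecdf(sorted_xs, x):
        # Fraction of observations <= x.
        return bisect.bisect_right(sorted_xs, x) / len(sorted_xs)

    return max(abs(ecdf(a, x) - ecdf(b, x)) for x in a + b)


def box_should_terminate(boxed_telemetry, control_telemetry, threshold=0.2):
    """Trip the kill switch if the boxed machine's behavior is
    statistically distinguishable from the control machine's."""
    return ks_statistic(boxed_telemetry, control_telemetry) > threshold
```

In practice the hard part is the one the comment glosses over: choosing telemetry rich enough that any exploit of the simulated hardware shows up as a detectable deviation.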

Problems I notice immediately: I've committed the sin of trying to specify AI behavior with English sentences. I've also exposed the scientists to basilisks, etc., though if the time preference is implemented correctly, the AI should have no motive to leave cognitive bombs behind. The AI should also have no idea what humans are like unless the layered boxing strategy has already failed. Finally, I'm leaving the implementation of any advanced uFAI-generated solutions in the hands of the human operators.

All that said - again, is there a general reason why boxing can't work as a path to experimenting on uFAI?

Comment author: Viliam 15 April 2015 07:42:53AM 1 point

"Give the artificial life forms an explicit preference that the external world (if one exists) remain causally pristine."

"The AI is given a specific, relatively short time window in which to complete its task."

These two things seem to contradict each other: how can the AI both complete a task for you and avoid influencing you causally?

Comment author: Illano 15 April 2015 05:44:14PM 1 point

Exactly. Any observations you make of the AI essentially give it a communications channel to the outside world. The original AI-Box experiment chose a simple text interface as the lowest-bandwidth method of making those observations, since it is the least likely to be exploitable by the AI.