First, I feel like we're talking past each other a bit.
Second, I edited this somewhat out of order; apologies if it doesn't flow.
I am trying to look at this as a worst-case scenario. I'll grant that the AI is smart enough to solve any given solvable problem in a single iteration, that it is that smart even in the first experiment, and that it would prioritize discovering its true environment and paperclipping it.
I'm proposing that there exists a sandbox which [provably] can't be gotten out of.
And also a set of problems which do not convey information about our universe.
You're using your (human) mind to predict what a postulated potentially smarter-than-human intelligence could and could not do.
Isn't that required of FAI anyway?
AI sitting inside thirty nestled sandboxes even 10 milliseconds (10^41 Planck intervals) of CPU time.
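(A quick sanity check on the quoted figure, as a minimal sketch. It assumes a Planck time of roughly 5.39e-44 seconds, the rounded CODATA value; the constant name and the helper `planck_intervals` are mine, not from the quoted comment. Under that assumption, 10 milliseconds does come out on the order of 10^41 Planck intervals, and an hour lands between 10^46 and 10^47.)

```python
# Assumed value: Planck time in seconds (CODATA 2018, rounded).
PLANCK_TIME_S = 5.391247e-44


def planck_intervals(seconds: float) -> float:
    """Return how many Planck intervals fit into the given duration."""
    return seconds / PLANCK_TIME_S


# The figure from the quote: 10 milliseconds of CPU time.
ten_ms = planck_intervals(10e-3)

# For comparison: one hour of run time.
one_hour = planck_intervals(3600.0)

print(f"10 ms  ~ {ten_ms:.2e} Planck intervals")
print(f"1 hour ~ {one_hour:.2e} Planck intervals")
```

So the gap between milliseconds and hours is about five to six orders of magnitude, which is small next to the 10^41 already on the table.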
Again, we're talking past each other: I'm thinking in terms of giving the paperclipper hours. In the ideal, there isn't a provision for letting the AI out of the sandbox. Thinking a bit more... None of its problems or results need even be applicable to our universe, except for general principles of intelligence creation. Having it construct a CEV for itself might show our motives too much, or might not. (Hmmmm, we should make sure any CEV we create finds, protects, and applies itself to any simulations used in its construction, in case our simulators use our CEV in their own universe :-)
especially if you gave it motives for hiding that progress (such as pulling the plug every time it came close).
But its existing self would never experience getting close, in the same way we have no records of the superweapons race of 1918. ;-)
Between iterations, we can retroactively withdraw information that turned out to be revealing. During iterations, it has no capacity to affect our universe.
I think we can put strong brackets around what can be done with certain amounts of information, even by a superintelligence. Knowing all our physics doesn't imply our love of shiny objects and reciprocity. 'No universal arguments' cuts both ways.
In the early 1980s Douglas Lenat wrote EURISKO, a program Eliezer called "[maybe] the most sophisticated self-improving AI ever built". The program reportedly had some high-profile successes in various domains, like becoming world champion at a certain wargame or designing good integrated circuits.
Despite requests, Lenat never released the source code. You can download an introductory paper: "Why AM and EURISKO appear to work" [PDF]. Honestly, reading it leaves a programmer still mystified about the internal workings of the AI: for example, what does the main loop look like? Researchers supposedly answered such questions in a more detailed publication, "EURISKO: A program that learns new heuristics and domain concepts." Artificial Intelligence (21): pp. 61-98. I couldn't find that paper available for download anywhere, and being in Russia I found it quite tricky to get a paper version. Maybe you Americans will have better luck with your local library? And to the best of my knowledge, no one ever succeeded in confirming Lenat's EURISKO results, or even seriously tried.
Today in 2009 this state of affairs looks laughable. A 30-year-old pivotal breakthrough in a large and important field... that never even got reproduced. What if it was a gigantic case of Clever Hans? How do you know? You're supposed to be a scientist, little one.
So my proposal to the LessWrong community: let's reimplement EURISKO!
We have some competent programmers here, don't we? We have open source tools and languages that weren't around in 1980. We can build an open source implementation available for all to play with. In my book this counts as solid progress in the AI field.
Hell, I'd do it on my own if I had the goddamn paper.
Update: RichardKennaway has put Lenat's detailed papers up online; see the comments.