If it's worth saying, but not worth its own post (even in Discussion), then it goes here.
Notes for future OT posters:
1. Please add the 'open_thread' tag.
2. Check if there is an active Open Thread before posting a new one. (Immediately before; refresh the list-of-threads page before posting.)
3. Open Threads should be posted in Discussion, and not Main.
4. Open Threads should start on Monday and end on Sunday.
Could someone point me to any existing articles on this variant of AI-Boxing and Oracle AGIs:
The boxed AGI's gatekeeper is a simpler system that uses formal proofs to verify that the AGI's output satisfies a simple, formally definable constraint. The constraint is not "safety" in general but something narrow enough that we can be mathematically sure the output is safe. (This does limit the potential benefits from the AGI.)
The question of what the constraint should be remains open, and of course the fact that the AGI is physically embodied puts it in causal contact with the rest of the universe. But as a partial or short-term solution, has anyone written about it? The only one I can think of (though I can't find the specific article) is Goertzel's description of an architecture where the guardian component is separate from the main AGI.
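As a toy illustration only (not taken from any existing proposal), the gatekeeper idea resembles problems whose answers are hard to find but easy to check. In this sketch a hypothetical boxed system proposes a satisfying assignment for a CNF formula, and a tiny checker releases the answer only if every clause is satisfied. A simple membership check like this stands in for real formal proof checking, and the names `satisfies` and `gatekeeper` are made up for the example.

```python
from typing import Callable, Optional

# Hypothetical encoding: a CNF formula is a list of clauses; each clause is a
# list of nonzero ints (positive = variable is true, negative = negated).
Clause = list[int]
Formula = list[Clause]
Assignment = dict[int, bool]


def satisfies(formula: Formula, assignment: Assignment) -> bool:
    """The simple, formally definable constraint: every clause has a true literal.

    A variable missing from the assignment never makes a literal true.
    """
    for clause in formula:
        if not any(
            abs(lit) in assignment and assignment[abs(lit)] == (lit > 0)
            for lit in clause
        ):
            return False
    return True


def gatekeeper(
    formula: Formula, propose: Callable[[Formula], Assignment]
) -> Optional[Assignment]:
    """Release the boxed system's output only if the checker accepts it."""
    candidate = propose(formula)
    return candidate if satisfies(formula, candidate) else None


formula = [[1, -2], [2, 3]]
print(gatekeeper(formula, lambda f: {1: True, 2: False, 3: True}))   # released
print(gatekeeper(formula, lambda f: {1: False, 2: True, 3: False}))  # blocked: None
```

The point of the design is that the checker is far simpler than whatever produces the answer, so trusting it is easier. It says nothing about side channels or the physical-embodiment problem raised above.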
You're probably thinking of GOLEM. The Gödel machine is another proposal along somewhat similar lines.
Some discussions more directly related to your suggestion could be: