The most natural way to approach this setting is to say: the AI will produce an output, and a human will produce an output (perhaps with a small probability), and the goal of the AI is to make those outputs indistinguishable to some evaluator.
Instead you are proposing using "reduced impact AI," i.e. saying: the AI produces an output, and a human produces an output, and then one of them at random is printed and the other is shredded. The goal of the AI is to make the whole resulting state of the universe indistinguishable to an observer.
As far as I can tell, the only reason for all of this machinery is to prevent the AI from incidentally having other side effects on the world. Is that correct?
Even if you want to use this machinery to prevent side-effects (I'm not optimistic), it seems like we can do it much more directly.
For example:
A. With probability 1% we don't run the AI at all.
B. With probability 1% we run the AI and shred its output.
C. With probability 1% we run the AI, and have a human generate a similar output. We give the AI's output to an evaluator. Afterwards, we shred both outputs.
D. With probability 1% we do the same thing as in case C, except we give the evaluator the human's output.
E. With probability 96% we use the AI's output as intended.
Then we give the AI a penalty based on the ability of a hypothetical observer to distinguish cases A and B (conditioned on one of them occurring), as well as a penalty based on the ability of the evaluator to distinguish cases C and D (conditioned on one of them occurring). If these penalty terms are large, this forces the AI to mimic the human output while having no side effects other than via its output.
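A minimal sketch of this randomisation scheme (everything here is hypothetical scaffolding: `run_ai`, `human_answer`, and `evaluate` are placeholder callables, not a real implementation):

```python
import random

def run_protocol(run_ai, human_answer, evaluate):
    """One draw from the five-case scheme above.
    Only case E's output is actually used."""
    r = random.random()
    if r < 0.01:                     # A: don't run the AI at all
        return None
    elif r < 0.02:                   # B: run the AI, shred its output
        run_ai()
        return None
    elif r < 0.03:                   # C: evaluator sees the AI's output
        evaluate(run_ai())
        human_answer()               # human output produced, then shredded
        return None
    elif r < 0.04:                   # D: evaluator sees the human's output
        run_ai()                     # AI output produced, then shredded
        evaluate(human_answer())
        return None
    else:                            # E (96%): use the AI's output as intended
        return run_ai()
```

The penalty would then be something like the sum of an observer's advantage at telling A from B and the evaluator's advantage at telling C from D; driving both towards zero forces the output to look human and the act of running the AI to leave no other trace.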
This has the nice advantage that you don't have to put a human in a box, or kill anyone, or restrict your attention to what are essentially multiple-choice questions.
Constraining the output of an AI seems like a reasonable option to explore to me. I agree that generating a finite set of humanlike answers (with a chatbot or otherwise) might be a sensible way to do this. An AI could perform gradient descent over the solution space and then pick the nearest proposed behaviour, much as relaxation works in integer programming.
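A rough sketch of that relaxation idea, assuming the candidate answers have been embedded as numeric vectors and that a (hypothetical) gradient of the AI's objective, `score_grad`, is available:

```python
import numpy as np

def nearest_proposed_behaviour(score_grad, candidates, steps=100, lr=0.1):
    """Relaxation: optimise freely over the continuous space, then round
    to the nearest human-proposed candidate, as in integer programming.

    candidates: (n, d) array of embedded human-proposed answers
    score_grad: gradient of the AI's objective at a point in embedding space
    """
    x = candidates.mean(axis=0)          # start from the candidates' centroid
    for _ in range(steps):
        x = x + lr * score_grad(x)       # unconstrained gradient ascent
    distances = np.linalg.norm(candidates - x, axis=1)
    return candidates[np.argmin(distances)]   # snap back to the feasible set
```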
The multiple choice AI (with human-suggested options) is the most obvious option for avoiding unhumanlike behaviour. Paul has said in some Medium comments that he thinks his more elaborate approach of combining mimicry and optimisation [1] would work better, though.
A putative new idea for AI control; index here.
In a previous post, I talked about using a whole brain emulation (WBE) to define a safe output for a reduced impact AI.
I've realised that the WBE isn't needed. Its only role was to ensure that the AI's output could have been credibly produced by something other than the AI - "I'm sorry, Dave. I'm afraid I can't do that." is unlikely to be the output of a random letter generator.
For short outputs, a chatbot with access to a huge corpus of human responses could play that role just as well. We can specialise it in the direction we need - if we are asking for financial advice, we can mandate a specialised vocabulary or train it on financial news sources.
So instead of training the reduced impact AI to behave as the 'best human advisor', we are training it to behave as the 'luckiest chatbot'. This allows us to calculate the odds with greater precision, and has the advantage of not needing to wait for a WBE.
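As a toy illustration of how those odds become calculable (the uniform-sampling chatbot is a simplifying assumption):

```python
from collections import Counter

def luckiest_chatbot_odds(answer, corpus):
    """Probability that a chatbot sampling uniformly from its corpus of
    human responses would happen to emit exactly this answer."""
    return Counter(corpus)[answer] / len(corpus)

# e.g. an answer appearing 3 times in a corpus of 10,000 financial-advice
# replies has a base rate of 3/10000 = 0.0003
```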
For some questions, we can do even better. Suppose we have a thousand different stocks, and are asking which one would increase in value the most during the coming year. The 'chatbot' here is simply an algorithm that picks a stock at random. So we now have an exact base rate - 1/1000 - and a predetermined set of possible answers for the AI.
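Concretely, the baseline in the stock example is just (ticker names hypothetical):

```python
import random

STOCKS = [f"STOCK_{i:03d}" for i in range(1000)]   # the thousand candidates

def chatbot_pick():
    """The 'chatbot': a uniformly random stock, so any particular
    answer has an exact base rate of 1/1000."""
    return random.choice(STOCKS)
```

Since the AI must answer from the same thousand stocks, the probability that the baseline would have produced the AI's exact answer is known precisely.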
[EDIT:] Another alternative is to get online users to submit answers to the question. The AI then selects the best answer from among the submissions; if the AI is not turned on, a random answer is selected instead.
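A sketch of this variant (the `ai_rank` scoring function is a placeholder for however the AI judges answers):

```python
import random

def select_answer(submissions, ai_rank=None):
    """Pick from user-submitted answers: the AI's top choice if it is
    running, otherwise a uniformly random submission."""
    if ai_rank is None:                  # AI not turned on
        return random.choice(submissions)
    return max(submissions, key=ai_rank)
```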