A putative new idea for AI control; index here.
In a previous post, I talked about using a whole brain emulation (WBE) to define a safe output for a reduced impact AI.
I've realised that the WBE isn't needed. Its only role was to ensure that the AI's output could have been credibly produced by something other than the AI - "I'm sorry, Dave. I'm afraid I can't do that." is unlikely to be the output of a random letter generator.
But a whole WBE is not needed for that. If the output is short, a chatbot with access to a huge corpus of human responses could fill the same role. We can specialise it in the direction we need - if we are asking for financial advice, we can mandate a specialised vocabulary or train it on financial news sources.
So instead of training the reduced impact AI to behave as the 'best human advisor', we are training it to behave as the 'luckiest chatbot'. This allows us to calculate the odds with greater precision, and has the advantage of not needing to wait for WBE technology.
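To make the 'luckiest chatbot' odds concrete, here is a minimal sketch of how one might score a candidate output under a corpus-trained chatbot model. It assumes a simple bigram language model over a specialised corpus; the function names, the smoothing, and the tiny example corpus are purely illustrative, not part of the proposal:

```python
import math
from collections import Counter, defaultdict

def train_bigram(corpus_sentences):
    """Count unigrams and bigrams over a tokenised corpus (e.g. financial news)."""
    unigrams, bigrams = Counter(), defaultdict(Counter)
    for sent in corpus_sentences:
        tokens = ["<s>"] + sent.lower().split() + ["</s>"]
        unigrams.update(tokens)
        for a, b in zip(tokens, tokens[1:]):
            bigrams[a][b] += 1
    return unigrams, bigrams

def log_prob(output, unigrams, bigrams, alpha=1.0):
    """Add-alpha smoothed log-probability the chatbot model assigns to an output."""
    vocab = len(unigrams)
    tokens = ["<s>"] + output.lower().split() + ["</s>"]
    lp = 0.0
    for a, b in zip(tokens, tokens[1:]):
        lp += math.log((bigrams[a][b] + alpha) / (unigrams[a] + alpha * vocab))
    return lp

# Toy corpus and query; a higher log-probability means the chatbot needs less "luck".
corpus = ["buy index funds and hold", "tech stocks rose sharply today"]
uni, bi = train_bigram(corpus)
print(log_prob("buy tech stocks today", uni, bi))
```

The point is only that once the baseline is a chatbot rather than a WBE, "how lucky would it have to be to say this?" becomes an explicit number we can condition the reduced impact calculation on.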
For some questions, we can do even better. Suppose we have a thousand different stocks, and we are asking which one will increase in value the most during the coming year. The 'chatbot' here is simply an algorithm that picks a stock at random. So we now have an exact base rate - 1/1000 - and a predetermined set of possible answers from the AI.
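As a toy illustration (the ticker names and the uniform picker are placeholders, not real data), the baseline agent and its exact base rate look like this:

```python
import random

tickers = [f"STOCK_{i:04d}" for i in range(1000)]  # hypothetical ticker names

def chatbot_baseline(rng=random):
    """The null agent: pick one of the thousand stocks uniformly at random."""
    return rng.choice(tickers)

base_rate = 1.0 / len(tickers)   # exactly 1/1000
print(chatbot_baseline(), base_rate)
```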
[EDIT:] Another alternative is to get online users to submit answers to the question. Then the AI selects the best answer from the choices. And if the AI is not turned on, a random answer is selected.
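A minimal sketch of that selection protocol, with a hypothetical scoring function standing in for the AI's judgement:

```python
import random

def answer_question(submissions, ai_on, score_answer, rng=random):
    """Return a user-submitted answer: the AI's top pick if it is on, random otherwise."""
    if not ai_on:
        return rng.choice(submissions)         # base rate: 1/len(submissions)
    return max(submissions, key=score_answer)  # AI selects the best human answer

# Toy usage; `len` is just a stand-in scorer for illustration.
submissions = ["hold cash", "buy the index", "short everything"]
print(answer_question(submissions, ai_on=True, score_answer=len))
```

Since every possible output was written by a human, the AI's answer is always one a human could credibly have produced, and the base rate for any particular answer is known exactly.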
But a chatbot is very unlikely to produce coherent strings of text, let alone something that has value. So in order to make it probable that a chatbot could have produced the output, your AI system will be too constrained, and will produce terrible output.
Or it will produce something so improbable that it could only have come from an AI, and probably something dangerous, since you've allowed a lot of optimization.
Why not just use humans? You don't need whole brain emulation. Just put a human in a room for three weeks and have them think about the problem. At the end of the three weeks, they write down their answer on a sheet of paper. The AI also writes down its answer. A different AI reads both, and tries to guess which one was produced by the AI and which by the human.
The first AI optimizes for fooling the judge and having its paper be indistinguishable from the human's. But it does this conditional on the human solving the problem: if the human doesn't solve the problem, the AI wins automatically. So the AI needs to come up with a solution that is correct, but also one that looks like a human produced it.
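A rough sketch of that conditional objective (the reward convention and the boolean inputs are assumptions made for illustration):

```python
def ai_reward(human_solved: bool, judge_thinks_ai: bool) -> float:
    """Reward for the answering AI in one round of the imitation game."""
    if not human_solved:
        return 1.0                           # AI wins automatically if the human fails
    return 0.0 if judge_thinks_ai else 1.0   # otherwise it must fool the judge

# In rounds where the human does solve the problem, the only way to score is to
# hand in a paper the judge cannot tell apart from the human's correct one - so
# maximising this reward pushes toward answers that are both correct and human-looking.
```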
You may not even need to use real humans or a real AI judge at all; just treat the whole setup as a counterfactual. Alternatively, you may need to do this many times, to generate training data and to show the AI that the test is real.
The problem is that the human will know their answer, and could communicate it later if they're let out of the box. Maybe we could get online users to submit answers to the question. Then the AI selects the best answer from the choices. And if the AI is not turned on, a random answer is selected.