Chatbots or set answers, not WBEs

Stuart_Armstrong

A putative new idea for AI control; index here.

In a previous post, I talked about using a WBE to define a safe output for a reduced impact AI.

I've realised that the WBE isn't needed. Its only role was to ensure that the AI's output could have been credibly produced by something other than the AI - "I'm sorry, Dave. I'm afraid I can't do that." is unlikely to be the output of a random letter generator.

But a whole WBE is not needed. If the output is short, a chatbot with access to a huge corpus of human responses could function well. We can specialise it in the direction we need - if we are asking for financial advice, we can mandate a specialised vocabulary or train it on financial news sources.

So instead of training the reduced impact AI to behave as the 'best human advisor', we are are training it to behave as the 'luckiest chatbot'. This allows to calculate odds with greater precision, and has the advantage of no needing to wait for a WBE.

For some questions, we can do even better. Suppose we have a thousand different stocks, and are asking which one would increase in value the most during the coming year. The 'chatbot' here is simply an algorithm that picks a stock at random. So we now have an exact base rate - 1/1000 - and predetermined answers from the AI.

[EDIT:] Another alternative is to get online users to submit answers to the question. Then the AI selects the best answer from the choices. And if the AI is not turned on, a random answer is selected.

A putative new idea for AI control; index here.

In a previous post, I talked about using a WBE to define a safe output for a reduced impact AI.

The problem is that the human will know their answer, and could communicate it later if they're let out of the box. Maybe we could get online users to submit answers to the question. Then the AI selects the best answer from the choices. And if the AI is not turned on, a random answer is selected.

In my version, the AI only cares about what the AI judge thinks of it's answer. It doesn't care about anything that happens afterwards. However, it wants to look as human as possible. So it can't put infohazards into it's answer.

8

Chatbots or set answers, not WBEs

8

8

8

Chatbots or set answers, not WBEs

8

8