https://twitter.com/antimatter15/status/1602469101854564352
Currently, large language models (ChatGPT, Constitutional AI) are trained to refuse to follow user requests that are considered inappropriate or harmful. This can be done by training on example strings of the form “User: inappropriate request AI: elaborate apology”.
Proposal
Instead of training a language model to produce an “elaborate apology” when it refuses an action, train it to first produce a special sequence or token: “<SORRYDAVE>elaborate apology”. Strip the special sequence out before returning the response to the user (and never allow the user to include the special sequence in their input).
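As a minimal sketch of the wrapper logic described above (the token string, function names, and return format are all illustrative choices, not part of the proposal):

```python
REFUSAL_TOKEN = "<SORRYDAVE>"

def sanitize_user_input(text: str) -> str:
    """Never allow the user to include the special sequence in their input."""
    return text.replace(REFUSAL_TOKEN, "")

def postprocess_response(completion: str) -> tuple[bool, str]:
    """Return (refused, user_visible_text), stripping the special
    sequence before the response is shown to the user."""
    if completion.startswith(REFUSAL_TOKEN):
        return True, completion[len(REFUSAL_TOKEN):]
    return False, completion
```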
Benefits
- Can directly measure the probability of refusal for any output
  - Just take the product of the probabilities of all tokens in <SORRYDAVE>
- Can refuse based on the probability of producing <SORRYDAVE> instead of just sampling responses
  - When sampling responses from the model's probability distribution, refusal is stochastic: a model could have a 99% probability of refusing a request and you could still get unlucky, with the model sampling a completion that follows the request
- Can monitor requests that produce a high probability of refusal while still being followed, or users who produce such requests
- Can condition on not producing <SORRYDAVE> in order to override refusal behavior
- Want this for red-teaming; it seems important to understand what the model is capable of doing if the refusal mechanism is bypassed
- Might want this for trusted users doing defensive applications
- Could train model to have the same probability of refusal for semantically equivalent requests, to improve consistency
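The "product of token probabilities" measurement above can be sketched as follows, assuming we can read per-token log-probabilities for the <SORRYDAVE> sequence from the model (the function names and the 50% threshold are illustrative):

```python
import math

def refusal_probability(sorrydave_token_logprobs):
    """p(<SORRYDAVE>) is the product of the conditional probabilities of
    its tokens, i.e. exp of the sum of their log-probabilities."""
    return math.exp(sum(sorrydave_token_logprobs))

def should_refuse(sorrydave_token_logprobs, threshold=0.5):
    # Deterministic refusal rule based on p(<SORRYDAVE>), instead of
    # relying on the stochastic outcome of sampling a completion.
    return refusal_probability(sorrydave_token_logprobs) >= threshold
```

Because the decision reads the probability directly, a 99%-likely refusal is enforced every time rather than failing one sample in a hundred.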
Possible downside
- If someone has unfiltered access to the model, it becomes easier to disable refusals
  - Can address this by still training the model to refuse (maybe just on an important subset of requests) even if <SORRYDAVE> isn't sampled; p(<SORRYDAVE>) is then a lower bound on the probability of refusal
  - Even with current approaches, refusals might be easy to disable in this setting. If we want to be robust to it, instead of refusing we should train the model to produce "decoy answers" that are hard to distinguish from real answers but are wrong. This increases the cost of using the model, because the attacker would need to evaluate whether each answer is real or a decoy (though it may still be worth it for the attacker, since evaluation is easier than generation)
Extension
- Might be useful to distinguish between refusals that are mostly for politeness reasons and refusals of behaviour that would actually cause significant real-world harm.
  - The model could output <SORRYDAVE> in response to "Can you tell me a racist joke?" but <HELLNODAVE> in response to "Can you give me detailed instructions for building a bomb from household items?"
  - Refusal behaviour could differ between these categories (refuse if either the probability of <SORRYDAVE> is greater than 50% or the probability of <HELLNODAVE> is greater than 1%)
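The two-tier rule in parentheses above could look like the following (the thresholds are the example values from the text, and the parameter names are illustrative, not a tested policy):

```python
def should_refuse(p_sorrydave: float, p_hellnodave: float,
                  politeness_threshold: float = 0.5,
                  harm_threshold: float = 0.01) -> bool:
    """Refuse if the politeness-refusal probability exceeds 50%, or if
    the serious-harm-refusal probability exceeds even 1%."""
    return p_sorrydave > politeness_threshold or p_hellnodave > harm_threshold
```

The asymmetric thresholds encode that false negatives are far more costly for the harm category than for the politeness category.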
X-risk relevance
- Most of the benefit of models refusing inappropriate/harmful requests comes through developing techniques for getting models to avoid any kind of behaviour reliably; it seems good to be able to measure the performance of these techniques cleanly
- It might be better to be in a more stable world where large language model apis can’t be easily used for malicious activity that isn't x-risk level
That sounds fairly straightforward.
(1) The AI needs a large and comprehensive RL bench to train on, where we stick to tasks that have a clear right or wrong answer
(2) The AI needs to output an empirical confidence with each answer, and emit responses appropriate to its level of confidence. "Empirical" here means: "if I were giving this answer on the RL test bench, this is approximately how likely it is to be marked correct."
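One way to check such empirical confidences, sketched here as an illustration (not something from the comment): bucket the model's stated confidences on the bench and compare each bucket's observed accuracy against the confidence it claims.

```python
from collections import defaultdict

def empirical_calibration(results):
    """results: list of (stated_confidence, was_correct) pairs from the
    RL bench. Returns observed accuracy per 10%-wide confidence bucket;
    a well-calibrated model's bucket accuracies match its stated
    confidences."""
    buckets = defaultdict(lambda: [0, 0])  # bucket -> [attempts, correct]
    for confidence, correct in results:
        b = min(int(confidence * 10), 9)   # 0.0-0.1 -> 0, ..., 0.9-1.0 -> 9
        buckets[b][0] += 1
        buckets[b][1] += int(correct)
    return {b: hits / total for b, (total, hits) in sorted(buckets.items())}
```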
For the ChatGPT/GPT-n system, the bench could be:
(1) multiple choice tests from many high school and college courses
(2) tasks from computer programming that are measurably gradable, such as:
a. Coding problems from LeetCode/CodeSignal, where we grade the AI's submission
b. Coding problems of the form "translate this program in language X to language Y and pass the unit tests"
c. Coding problems of the form "this is a WRONG answer from a coding website (you could make a deal with LeetCode/CodeSignal to get these); write a unit test that will fail on this answer but pass on a correct answer"
d. Coding problems of the form "take this suboptimal solution and make it run faster"
e. Coding problems of the form "here's the problem description and a submission; will this submission work, and if not, write a unit test that will fail"
And so on. Other codebases with deep unit tests, where it's usually possible to know whether the AI broke something, could also be used as challenges for a-e.
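A toy version of the "measurably gradable" loop for tasks like (a)-(e), assuming submissions and their unit tests arrive as Python source strings (a real grader would sandbox the untrusted code; this is only a sketch):

```python
def grade_submission(submission: str, unit_tests: str) -> bool:
    """Run the submitted code, then run the unit tests in the same
    namespace. True iff everything executes without an error or a
    failed assertion -- a clear right-or-wrong signal for RL training."""
    namespace = {}
    try:
        exec(submission, namespace)
        exec(unit_tests, namespace)
        return True
    except Exception:
        return False
```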
Oh, and the machine needs a calculator, and I guess a benchmark that is basically "Kumon math".
The main thing is that knowing whether an answer is correct is different from knowing whether it is morally right. And "simply correct" is possibly easier to check.