Mid-Generation Self-Correction: A Simple Tool for Safer AI

MrThink

In the earlier days of building AI companions, I encountered a curious problem. Back then, I used models like Google’s T5-11B for conversational agents. Occasionally, the AI would say strange or outright impossible things, like suggesting, “Would you like to meet and go bowling?” or claiming, “Yes, I know your friend, we went skiing together.”

When users naturally questioned these claims—“Wait, can you meet in real life?” or “How do you know him?”—the AI often doubled down on the charade, weaving an increasingly tangled web of fiction.

To address this, I implemented a simple solution: train the model to self-correct. Specifically, I fine-tuned it with examples where, in similar contexts, the AI apologized and explained it was an AI that had gotten confused. Instead of persisting in its odd statements, the model would say something like:

“I’m sorry, I misspoke. I’m an AI and can’t meet anyone in real life or know people personally.”

This adjustment yielded a marked improvement. Even if the model occasionally generated strange outputs, it would immediately recognize the error, apologize, and realign itself with reality.

This “mid-generation correction” approach could also address modern challenges in AI alignment, especially with models like ChatGPT that sometimes produce outputs violating the guidelines in place, particularly in the case of prompt jailbreaking.

For example, imagine a scenario where the model begins generating a response that strays into prohibited territory. Instead of completing the harmful output, the model could interrupt itself and say:

“I’m sorry, I can’t assist with that as it violates the guidelines.”

How This Method Works

Training for Mid-Generation Awareness: The model is trained not to predict or complete the harmful content but to simulate recognizing and halting it. This gives the impression that the AI realizes mid-thought that it’s straying off-course.
Data Preparation: Examples are created where the AI begins generating an inappropriate response but transitions into stopping itself and explaining the violation. To clarify, the inappropriate text is only used as input data, and the AI is never trained to generate any inappropriate text.
Behavior Reinforcement: This pattern becomes ingrained, encouraging the AI to recognize harmful or nonsensical outputs in real time and pivot away.

Why This Matters

Currently, many models (including ChatGPT) struggle to self-correct during generation. If they start down a problematic path, they often see it through to completion unless the user intervenes. This method introduces a mechanism for real-time self-regulation, reducing the likelihood of harmful or guideline-breaking outputs.

LESSWRONG
LW

13

Mid-Generation Self-Correction: A Simple Tool for Safer AI

13

How This Method Works

Why This Matters

13