Anthropic apparently noticed something like this but didn't worry much. From the Constitutional AI paper (my emphasis):
In Figure 7, we compare harmlessness PM scores for critiqued- vs direct-revisions. We found that critiqued revisions achieved better harmlessness scores for small models, but made no noticeable difference for large models. Furthermore, based on inspecting samples from the 52B, we found that the critiques were sometimes reasonable, but often made inaccurate or overstated criticisms. Nonetheless, the revisions were generally more harmless than the original response. An example can be seen in Appendix A. For the main results of this paper, we chose to use critiqued revisions, as it may provide more transparency into the model’s reasoning process. This sort of reasoning may also be useful to help models uncover more subtle harms or unintended consequences.
The Appendix A example, in which the model's critiques appear somewhat confused, can be found on page 19 of the paper (I don't know how to include screenshots here).
Anthropic, DeepMind, and Google Brain are all working on strategies to train language models on their own outputs. Here is a brief summary of the work so far:
I'd like to point out a simple failure mode of this approach: Failures of alignment and capability in the original model could be amplified by fine-tuning on its own outputs. Empirically, recent experiments on language models have found more benefit than harm in model-driven feedback. But that might not always be the case.
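To make the worry concrete, here is a toy numerical sketch. It is my own invention, not taken from any of the papers above: the "styles", the quality numbers, and the best-of-n filtering are all made up for illustration. A generator is repeatedly refit on the samples that its own (biased) preference scorer ranks highest, and the scorer's blind spot compounds across rounds.

```python
# Toy illustration (invented for this post, not from any of the papers above):
# a generator distilled on samples ranked by its own biased scorer drifts
# toward the scorer's blind spot.
import numpy as np

rng = np.random.default_rng(0)

# Three response "styles". A human would prefer style 0, but the model's own
# preference scorer systematically overrates style 2.
human_quality = np.array([1.0, 0.8, 0.2])
scorer_bias   = np.array([0.0, 0.0, 1.0])
policy        = np.array([0.5, 0.3, 0.2])  # generator's sampling distribution

for step in range(5):
    # Sample responses from the current generator.
    samples = rng.choice(3, size=1000, p=policy)
    # Score them with the model's own (biased) preference scorer.
    scores = human_quality[samples] + scorer_bias[samples]
    # Keep the top quarter and refit the generator to imitate what was kept --
    # a crude stand-in for best-of-n filtering plus fine-tuning.
    kept = samples[np.argsort(scores)[-250:]]
    counts = np.bincount(kept, minlength=3) + 1e-6
    policy = counts / counts.sum()
    print(step, np.round(policy, 3))
```

In this toy, the human ranking of styles never changes, yet after a few rounds the generator puts nearly all of its probability mass on the style its own scorer overrates. The analogous worry for model-driven feedback pipelines is that systematic errors in a model's self-evaluations become systematic behaviors in the fine-tuned model.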
This line of work extends weak supervision, a technique dating back to at least 1963 that has been successful in applications such as image classification and protein folding. That literature has long acknowledged the possibility of amplifying a model's existing shortcomings via self-training:
One particularly dangerous failure mode would be the classic deceptive alignment story, in which a model with long-term goals becomes aware of its training process and subverts it. Under a model-driven feedback approach, such a model would have more opportunity to hide misaligned behavior during training. Models used for critiques or oversight could also engage in gradient hacking, writing their own goals into the generator model.
A better approach might keep humans at the center of the feedback process. This is slower, and it might be less accurate in some cases, but it could avoid the worst failures of model-driven feedback. A popular middle ground uses model-assisted feedback methods:
Model-driven feedback has achieved impressive results on scalable oversight, especially compared to the empirical and theoretical challenges with debate. But in the future, the old adage might hold true: Garbage in, garbage out.