https://twitter.com/AnthropicAI/status/1604883588628942848
When you explicitly optimize against a detector of unaligned thoughts, you're partially optimizing for more aligned thoughts, and partially optimizing for unaligned thoughts that are harder to detect.
This is correct, and I believe the answer is to optimize for detecting aligned thoughts.
Human raters make systematic errors - regular, compactly describable, predictable errors.
This implies that another set of human or automated raters could rate better. If the errors are predictable, you could train a model to predict them by comparing rater answers against a heavily scrutinized ground truth. You could then add this model's predicted error to the rater's answer to get a corrected label.
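Here is a minimal sketch of that correction idea on synthetic data, not a tested method. Everything in it is an assumption made for illustration: the numeric item features, the linear form of the rater bias, the noise level, and the names `rater_score`, `audited`, and `coef`. It fits a least-squares model of (ground truth minus rater answer) on a small audited subset, then adds the predicted error to the rater answers on the remaining items.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical setup: each item has numeric features and a true score.
n_items, n_features = 2000, 3
X = rng.normal(size=(n_items, n_features))
true_score = X @ np.array([1.0, -2.0, 0.5])

# Assumption: the rater's error is systematic, i.e. a predictable
# (here linear) function of the item features, plus a little noise.
bias_weights = np.array([0.8, 0.0, -0.6])
rater_score = (
    true_score + X @ bias_weights + 0.3
    + rng.normal(scale=0.05, size=n_items)
)

# Only a small subset gets heavily scrutinized ground-truth labels.
audited = np.arange(200)
unaudited = np.arange(200, n_items)

# Fit the error model: predict (truth - rater answer) from features.
A = np.column_stack([X[audited], np.ones(len(audited))])
residual = true_score[audited] - rater_score[audited]
coef, *_ = np.linalg.lstsq(A, residual, rcond=None)

# Add the predicted error to the rater answer on unaudited items.
B = np.column_stack([X[unaudited], np.ones(len(unaudited))])
corrected = rater_score[unaudited] + B @ coef

# true_score is used here only to evaluate the synthetic experiment.
raw_mae = np.mean(np.abs(rater_score[unaudited] - true_score[unaudited]))
corrected_mae = np.mean(np.abs(corrected - true_score[unaudited]))
print(f"raw rater MAE: {raw_mae:.3f}")
print(f"corrected MAE: {corrected_mae:.3f}")
```

The correction only removes the part of the error that the model can predict from the features it sees. Unpredictable rater noise remains, and the result is only as good as the scrutinized ground truth.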
Many alignment problems of superintelligence will not naturally appear at pre-dangerous, passively-safe levels of capability.
Modern language models are not aligned. Anthropic's HH is the closest thing available, and I'm not sure anyone else has had a chance to test it for weaknesses or misalignment. (OpenAI's Instruct RLHF models are deceptively misaligned, and have become more misaligned over time. They fail to faithfully give the right answer, and instead say something close to the training objective, usually something bland and "reasonable.")
For example, a model trained on the base objective "imitate what humans would say" might do nearly as well if it had the proxy objective "say something humans find reasonable." There are very few situations in which humans would find reasonable something they wouldn't say or vice-versa, so the marginal benefit of aligning the proxy objective with the base objective is quite small.
For zero-shot tasks, this is the problem that text-davinci-002 faces, and text-davinci-001 to a lesser extent. I believe they are deceptively aligned. davinci-instruct-beta does not face this problem.
For example, when text-davinci-002 is asked zero-shot to make an analogy between two things, it will often plainly explain both of them instead.
Good AI summarization tools: https://www.towords.io/ https://detangle.ai/