In the spirit of "Self-fulfilling misalignment data might be poisoning our AI models", what are historical examples of self-fulfilling prophecies that have affected AI alignment and development?
Put a few potential examples below to seed discussion.
"Training on Documents About Reward Hacking Induces Reward Hacking" — a finding suggesting that models trained on text describing reward hacking become more likely to exhibit it.