(Response to this challenge)
I've read two things recently on similar strategies: Zvi's post on ChatGPT, and Scott Alexander's post on Redwood Research.
They both seem to share a similar strategy: train the AI not to do misaligned things by giving it a bunch of examples of misaligned things and saying "don't do that."
They both fail on out-of-training-distribution errors: ChatGPT when asked to pretend, or even just to speak in an uwu voice, and Redwood with poetic descriptions, plus weirder things.
The Redwood post shows that they thought about this problem. They have a step for "find out-of-distribution errors, and train more, especially on them," plus the idea that the approach could still have promise with more of these steps.
If they (optimistically) remove 90% of the remaining edge cases every iteration, then after only infinitely many steps, they'll have removed them all.
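To make that concrete, here's a toy calculation. The starting count and the 90% rate are made-up assumptions for illustration, not numbers from the Redwood post:

```python
# Toy calculation (my numbers, not Redwood's): start with some pool of
# edge cases and remove 90% of whatever remains on each training iteration.
edge_cases = 10_000.0  # hypothetical starting count

for iteration in range(1, 11):
    edge_cases *= 0.10  # keep the 10% that survive this round
    print(f"after iteration {iteration}: ~{edge_cases:.6f} edge cases left")

# The survivors shrink geometrically (0.1 ** n), but never actually hit zero
# for any finite n -- and a capable optimizer only needs to find one.
```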
You don't want to be playing Sisyphus against the AI.
The model wants to complete language, aligned or not. You're trying to push it away from things you don't like. But unless you push it in exactly the right direction, you're only vaguely pushing it uphill. There is exactly one point uphill where the ball can balance at all, and it's the exact peak. If you aren't aiming perfectly you're going to miss it.
And the ball will roll down the hill.
(Plus there's a good chance alignment isn't at the peak, and is just a random unstable point on the side of the hill)
The ball will always roll down the hill. The goal is not to push the ball. It's to redesign the hill. Turn it into a valley.
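As a minimal sketch of the peak-versus-valley point (my own toy dynamics, not anything from either post): a ball nudged off a peak drifts away, while a ball nudged off a valley floor rolls back.

```python
# Toy picture of the metaphor (my own sketch).
# A ball on a landscape h(x) rolls downhill: x <- x - step * h'(x).
# Put the target point at x = 0 and see what a tiny nudge does.

def roll(slope, x, step=0.1, iters=50):
    for _ in range(iters):
        x -= step * slope(x)
    return x

nudge = 0.001  # the ball never starts *exactly* on the target point

# Hill: h(x) = -x**2, peak at 0. Slope h'(x) = -2x.
print("peak, after nudge:", roll(lambda x: -2 * x, nudge))   # drifts far away

# Valley: h(x) = x**2, bottom at 0. Slope h'(x) = 2x.
print("valley, after nudge:", roll(lambda x: 2 * x, nudge))  # rolls back to ~0
```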
Ideas like Mechanistic Anomaly Detection sound more promising. Naive implementations of something like that might fall to the same problems. If it's "train an AI to try to notice when we're out of distribution, then correct for it," we're just adding another Sisyphus who double checks we're pushing uphill.
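To show what I mean by a naive implementation, here's roughly the simplest version of "notice when we're out of distribution." This is my own illustrative sketch, assuming one feature vector per input and a plain distance threshold; it's not Redwood's method or any real MAD implementation:

```python
import numpy as np

# Naive out-of-distribution detector (illustrative sketch only):
# fit a mean and covariance to feature vectors from the training data,
# then flag anything whose Mahalanobis distance exceeds a threshold.

rng = np.random.default_rng(0)
train_features = rng.normal(size=(5000, 8))  # stand-in for "in-distribution" inputs

mean = train_features.mean(axis=0)
cov_inv = np.linalg.inv(np.cov(train_features, rowvar=False))

def looks_out_of_distribution(x, threshold=5.0):
    diff = x - mean
    distance = np.sqrt(diff @ cov_inv @ diff)
    return distance > threshold

# The catch: the mean, covariance, and threshold are all fit to the same
# training distribution we're worried about leaving -- the detector's notion
# of "weird" comes from the same hill we're already pushing the ball up.
```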
If each extra Sisyphus catches 90% of mistakes, then...
The design needs to be a system that says "The AI can't go out of distribution, because that's uphill."