This is a linkpost for https://arxiv.org/abs/2401.10020
I'm reminded of the Real Genius scene where they're celebrating building the death laser and Mitch says "Let the engineers figure out a use for it, that's not our concern."
Which in turn reminds me of Tom Lehrer's line: "Once the rockets are up, who cares where they come down? That's not my department, says Wernher von Braun."
This seems like a significant step towards recursive self-improvement. It doesn't directly change the optimization algorithm or the weights, but it's a proxy for that. If the network had some deceptive behavior, it seems (from what I understand) that it could become more deceptive by providing itself training data that will cause it to become more deceptive. I haven't thought about it much, but this could also produce weird feedback loops, where flukes (or random fluctuations) in the grading push successive models in strange directions, since each model's training data is graded by its predecessor (model n is trained on data scored by model n-1).
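To make the feedback-loop worry concrete, here is a minimal sketch of the iterative self-rewarding loop the paper describes (generate candidates, score them with the model-as-judge, train the next model on preference pairs). All function names (`generate`, `judge`, `dpo_update`) are hypothetical stand-ins for the real model calls; the point is just the structure: model M_t grades its own generations to produce the data that trains M_{t+1}, so scoring noise at step t propagates into every later model.

```python
def generate(model, prompt, n):
    # Hypothetical: sample n candidate responses from the model.
    return [f"{model}:{prompt}:cand{i}" for i in range(n)]

def judge(model, prompt, response):
    # Hypothetical LLM-as-judge scoring. Any fluke here (noise,
    # bias, deception) shapes the next model's training data.
    return hash((model, prompt, response)) % 6

def dpo_update(model, pairs):
    # Hypothetical DPO step: "train" the next model on
    # (chosen, rejected) preference pairs.
    return f"{model}+dpo({len(pairs)} pairs)"

def self_reward_iterations(model, prompts, iters=3, n=4):
    for _ in range(iters):
        grader = model  # M_t judges its own generations...
        pairs = []
        for p in prompts:
            cands = generate(model, p, n)
            scored = sorted(cands, key=lambda r: judge(grader, p, r))
            pairs.append((scored[-1], scored[0]))  # (chosen, rejected)
        model = dpo_update(model, pairs)  # ...to produce M_{t+1}
    return model

final = self_reward_iterations("M0", ["p1", "p2"], iters=2)
```

Note there is no external ground truth anywhere in the loop: the only corrective signal is the previous model's own judgment, which is what makes the "weird directions" scenario possible.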