OP (please comment there): https://medium.com/@lucarade/issues-with-iterated-distillation-and-amplification-5aa01ab37173
I will show that, even under the most favorable assumptions about the feasibility of IDA and about the resolution of the open problems its implementation depends on, IDA fails to produce an agent that is aligned in the sense of corrigibility.
Part 1: The assumptions
Class 1: There are no problems with the human overseer.
1.1: Human-generated vulnerabilities are completely eliminated through security amplific...
Only just learned of this, unfortunately.