User Comment Replies

Only just learned of this, unfortunately.

OP (please comment there): https://medium.com/@lucarade/issues-with-iterated-distillation-and-amplification-5aa01ab37173

I will show that, even under the most favorable assumptions about the feasibility of IDA and about the resolution of the open problems its implementation depends on, IDA fails to produce an aligned agent in the sense of corrigibility.

Part 1: The assumptions

Class 1: There are no problems with the human overseer.

1.1: Human-generated vulnerabilities are completely eliminated through security amplific...


All of lucarade's Comments + Replies
