All of eric_langlois's Comments + Replies

The part about the reasoners having an arbitrary amount of time to think wasn't obvious to me. The TM can run for arbitrarily long, but if it is simulating a universe and using that universe to determine its output, then the TM needs to specify a system for reading from the universe.

If that system involves a start-to-read time long enough for the in-universe life to reason about the universal prior, then that time specification alone would take a huge number of bits.
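As a back-of-the-envelope check on the bit count (my own sketch, assuming a standard prefix-free integer code; none of these numbers come from the original post): encoding a literal start time $t$ costs roughly

$$\ell(t) \;\approx\; \log_2 t + 2\log_2 \log_2 t \ \text{bits},$$

so a start time of, say, $t \approx 10^{100}$ simulation steps already costs $\ell(t) \approx 350$ bits, i.e. a prior penalty of roughly $2^{-350}$ relative to a program that doesn't have to specify it.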

On the other hand, I could imagine a scheme that looks for a specific short trigger... (read more)

Vlad Mikulik
The trigger sequence is a cool idea. I want to add that the intended generator TM also needs to specify a start-to-read time, so there is symmetry there: whatever method the intended generator uses to select the camera start time for the real-world samples, a reasoner TM can also use in its simulated world with alien life, since for the scheme to work only the difference in complexity between the two matters. There is additional flex in that, unlike the intended generator, the reasoner TM can sample its universe simulation at any cheaply computable interval, giving the civilisation the option of performing any amount of thinking between outputs.
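A minimal sketch of the sampler Vlad describes (all names here are hypothetical stand-ins, not anything from the post; a real generator TM would simulate physics rather than increment an integer):

```python
from itertools import islice

def run_generator(initial_state, step_universe, read_camera,
                  start_time, interval):
    """Yield one sample from a simulated universe every `interval` steps.

    Both the intended generator and the reasoner TM must pay for
    `start_time`, but `interval` is cheap to specify, so the simulated
    civilisation can buy itself arbitrarily much thinking time between
    outputs for only a few extra bits.
    """
    state = initial_state
    for _ in range(start_time):        # burn in to the start-to-read time
        state = step_universe(state)
    while True:
        yield read_camera(state)       # emit one output sample
        for _ in range(interval):      # thinking time between outputs
            state = step_universe(state)

# Toy stand-ins: the "universe" is a counter, the "camera" reads it mod 7.
samples = run_generator(initial_state=0,
                        step_universe=lambda s: s + 1,
                        read_camera=lambda s: s % 7,
                        start_time=1000,
                        interval=10)
print(list(islice(samples, 5)))
```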

I won't deny that I'm probably misunderstanding parts of IDA, but if the point is to learn corrigibility from H, couldn't you just say that corrigibility is a value that H has, then run the same argument with "corrigibility" in place of "value"? (This assumes that corrigibility is entirely defined with reference to H. If not, replace it with the subset that is defined entirely from H; if that subset is empty, remove H.)

If A[*] has H-derived-corrigibility then so must A[1], so distillation must preserve H-derived-corrigibility. In that case we could instead distill H-derived-corrigibility directly from H, use it to train a powerful agent with that property, and then have that agent learn values from some other user.
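To pin down the structure being argued about, here is a schematic of the loop (a sketch with placeholder names; `distill` and `amplify` stand in for Paul's actual procedures):

```python
def ida(H, n_rounds, distill, amplify):
    """Schematic IDA: A[1] = Distill(H); A[k+1] = Distill(Amplify(A[k], H)).

    The argument above: if A[*] has H-derived-corrigibility, so does A[1],
    so the first `distill` call must already transfer that property from H.
    """
    A = distill(H)                  # A[1]: first distillation from H
    for _ in range(n_rounds):
        A = distill(amplify(A, H))  # A[k+1] from the amplified A[k]
    return A                        # A[*]: the limit of the process
```

On this schematic, the proposal above amounts to replacing the whole loop with the single `distill(H)` call, provided distillation really does preserve the property.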

paulfchristiano
I'm imagining the problem statement for distillation being: we have a powerful aligned/corrigible agent. Now we want to train a faster agent which is also aligned/corrigible. If there is a way to do this without starting from a more powerful agent, then I agree that we can skip the amplification process and jump straight to the goal.

Point 1: Meta-Execution and Security Amplification

I have a comment on the specific difficulty of meta-execution as an approach to security amplification. I believe that while the framework limits the "corruptibility" of the individual agents, the system as a whole is still quite vulnerable to adversarial inputs.

As far as I can tell, the meta-execution framework is Turing complete: you could store the tape contents within one pointer and the head location in another, or there's probably a more direct analogy with lambda calculus. And by Turing completeness... (read more)
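To make the tape-in-pointers claim concrete, here is one such encoding (my illustration, not the comment's actual construction): split the tape into a left stack, the scanned symbol, and a right stack, each piece small enough to sit behind a single meta-execution pointer.

```python
from collections import namedtuple

# TM configuration with the tape split across two stack-like "pointers":
# `left` holds the cells left of the head (top = nearest), `head` is the
# scanned symbol, and `right` holds the cells to the right.
Config = namedtuple("Config", "state left head right")

def step(config, delta, blank="_"):
    """One Turing machine step. `delta` maps (state, symbol) to
    (new_state, symbol_to_write, move) with move in {'L', 'R'}."""
    new_state, write, move = delta[(config.state, config.head)]
    left, right = list(config.left), list(config.right)
    if move == "R":
        left.append(write)
        head = right.pop() if right else blank
    else:
        right.append(write)
        head = left.pop() if left else blank
    return Config(new_state, left, head, right)

# Toy machine that writes 1s and moves right forever:
delta = {("s", "_"): ("s", "1", "R")}
c = Config("s", left=[], head="_", right=[])
for _ in range(3):
    c = step(c, delta)
print(c)  # Config(state='s', left=['1', '1', '1'], head='_', right=[])
```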

Wei Dai
I find your point 1 very interesting, but point 2 may be based in part on a misunderstanding: I think this is not how Paul hopes his scheme would work. If you read https://www.lesswrong.com/posts/yxzrKb2vFXRkwndQ4/understanding-iterated-distillation-and-amplification-claims, it's clear that in the LBO variant of IDA, A[1] can't possibly learn H's values. Instead, A[1] is supposed to learn "corrigibility" from H, and after enough amplifications A[n] will gain the ability to learn values from some external user (who may or may not be H); the "corrigibility" that was learned and preserved through the IDA process is then supposed to make it want to help the user achieve their values.