Goodhart's law seems to suggest that errors in utility or reward function specification are necessarily bad, in the sense that an optimal policy for the incorrect reward function would yield low return according to the true reward. But how strong is this effect?
Suppose the reward function were only slightly wrong. Can the resulting policy be arbitrarily bad according to the true reward, or is it only slightly worse? It turns out the answer is "only slightly worse" (for the appropriate definition of "slightly wrong").
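To preview what "slightly worse" can mean quantitatively, here is a standard bound of this flavor (my own sketch; the post's exact theorem may differ). Suppose the proxy reward R̂ agrees with the true reward R∗ to within ε at every state-action pair, and let π̂ be a policy that is optimal for R̂. Then with discount factor γ,

V∗(s) − Vπ̂(s) ≤ 2ε/(1−γ) for every state s,

where both values are computed under R∗ and V∗ is the optimal true value. The reasoning: perturbing every per-step reward by at most ε changes any fixed policy's discounted value by at most ε/(1−γ), and applying this once to π̂ and once to the truly optimal policy gives the factor of 2.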
Definitions
Consider a Markov Decision Process (MDP) M=(S,A,T,R∗) where
- S is the set of states,
- A is the set of actions,
- T:S×A×S→[0,1] are the conditional transition probabilities, and
- R∗ is the true reward function.
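To make the "only slightly worse" claim concrete, here is a small numerical sketch (entirely my own illustration, not code from the post; all names and parameter choices are hypothetical): we solve a random finite MDP for the optimal policy under a slightly perturbed reward R̂ and check that its regret under the true reward R∗ stays within the 2ε/(1−γ) bound sketched above.

```python
import numpy as np

def value_iteration(T, R, gamma, iters=2000):
    """Optimal deterministic policy for reward R.
    T: (S, A, S) transition probabilities; R: (S, A) per-step rewards."""
    V = np.zeros(T.shape[0])
    for _ in range(iters):
        Q = R + gamma * (T @ V)   # Q[s, a] = R[s, a] + γ · Σ_s' T[s, a, s'] · V[s']
        V = Q.max(axis=1)
    return Q.argmax(axis=1)

def policy_value(T, R, gamma, pi):
    """Exact value of deterministic policy pi under reward R, via a linear solve."""
    S = T.shape[0]
    P = T[np.arange(S), pi]       # (S, S) transition matrix induced by pi
    r = R[np.arange(S), pi]       # (S,) per-step reward under pi
    return np.linalg.solve(np.eye(S) - gamma * P, r)

rng = np.random.default_rng(0)
S, A, gamma, eps = 5, 3, 0.9, 0.01
T = rng.random((S, A, S))
T /= T.sum(axis=2, keepdims=True)                      # normalize to probabilities
R_true = rng.random((S, A))
R_hat = R_true + rng.uniform(-eps, eps, size=(S, A))   # "slightly wrong" reward

pi_true = value_iteration(T, R_true, gamma)            # optimal for the true reward
pi_hat = value_iteration(T, R_hat, gamma)              # optimal for the proxy reward
regret = policy_value(T, R_true, gamma, pi_true) - policy_value(T, R_true, gamma, pi_hat)
print(f"max regret {regret.max():.4f} <= bound {2 * eps / (1 - gamma):.4f}")
```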
…
The part about the reasoners having an arbitrary amount of time to think wasn't obvious to me. The TM can run for arbitrarily long, but if it is simulating a universe and using that universe to determine its output, then the TM needs to specify a system for reading from the universe.
If that system involves a start-to-read time long enough for the in-universe life to reason about the universal prior, then the time specification alone would take a huge number of bits.
On the other hand, I could imagine a scheme that looks for a specific short trigger sequence at a particular spatial location and then starts reading out. If this trigger sequence is unlikely to occur naturally, then the civilization would have as long as it wants to reason about the prior. So overall it does seem plausible to me now to allow for arbitrarily long in-universe time.
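To illustrate the kind of scheme I mean (a toy sketch of my own, with hypothetical names, and ignoring the spatial-location detail by treating the observations as a one-dimensional stream): the readout program scans the simulated universe's observations for a fixed trigger pattern and only then starts emitting output, so its description length depends on the trigger, not on how long the inhabitants deliberate before producing it.

```python
from typing import Iterable, Iterator

def read_after_trigger(stream: Iterable[int], trigger: list[int], n_out: int) -> Iterator[int]:
    """Yield the n_out symbols that follow the first occurrence of `trigger`.
    Assumes the stream continues for at least n_out symbols past the trigger."""
    window: list[int] = []
    it = iter(stream)
    for sym in it:
        window.append(sym)
        if len(window) > len(trigger):
            window.pop(0)               # keep only the last len(trigger) symbols
        if window == trigger:           # trigger found: start reading out
            for _ in range(n_out):
                yield next(it)
            return

# Demo: the trigger appears after an arbitrarily long prefix, yet the program
# hard-codes no start time, only the (short) trigger pattern itself.
msg = list(read_after_trigger(iter([0] * 1000 + [1, 2, 3] + [7, 7, 7]),
                              trigger=[1, 2, 3], n_out=3))
print(msg)  # [7, 7, 7]
```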