I made a few edits to this post today, mostly in response to feedback from Ryan and Richard:
Slightly more spelled-out thoughts about bounded minds:
I suspect there is some merit to the Scientist's intuition (and the idea that constant returns are more "empirical") which nobody has managed to explain well. I'll try to explain it here.[1]
The Epistemologist's notion of simplicity is about short programs with unbounded runtime which perfectly explain all evidence. The [non-straw] empiricist notion of simplicity is about short programs with heavily-bounded runtime which approximately explain a subset of the evidence. The Epistemologist is right that there is nothing of value in the empiricist's notion if you are an unbounded Solomonoff inductor. But for a bounded mind, two important facts come into play:
Therefore a bounded mind will sometimes get more evidence from "fast-program induction on local data" (i.e. just extrapolate without a gears-level model) than from highly conjunctive arguments about gears-level models.
FWIW, I agree with the leading bit of Eliezer's position -- that we should think about the object-level and not be dismissive of arguments and concretely imagined gears-level models.
I'd be capable of helping aliens optimize their world, sure. I wouldn't be motivated to, but I'd be capable.
@So8res How many bits of complexity is the simplest modification to your brain that would make you in fact help them? (asking for an order-of-magnitude wild guess)
(This could be by actually changing your values-upon-reflection, or by locally confusing you about what's in your interest, or by any other means.)
Sigmoid is usually what "straight line" should mean for a quantity bounded at 0 and 1. It's a straight line in logit-space, the most natural space which complies with that range restriction.
(Just as exponentials are often the correct form of "straight line" for things that are required to be positive but have no ceiling in sight.)
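A minimal numerical sketch of the point above, in case it helps. The slope `a`, intercept `b`, and time grid `ts` are made-up values for illustration only. It checks that a sigmoid in `t` moves by a constant step in logit-space, and that an exponential moves by a constant step in log-space:

```python
import math

def sigmoid(x):
    """Map any real number into the open interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))

def logit(p):
    """Inverse of sigmoid: map a probability in (0, 1) back to the real line."""
    return math.log(p / (1.0 - p))

# Hypothetical slope and intercept, chosen only for illustration.
a, b = 0.8, -2.0
ts = [0, 1, 2, 3, 4, 5]

# A quantity bounded in (0, 1) that follows a sigmoid in t.
ps = [sigmoid(a * t + b) for t in ts]
# In logit-space, consecutive differences are constant (equal to a).
logit_steps = [logit(ps[i + 1]) - logit(ps[i]) for i in range(len(ts) - 1)]

# A positive, unbounded quantity that follows an exponential in t.
ys = [math.exp(a * t + b) for t in ts]
# In log-space, consecutive differences are constant (equal to a).
log_steps = [math.log(ys[i + 1]) - math.log(ys[i]) for i in range(len(ts) - 1)]

print(logit_steps)
print(log_steps)
```

Both lists come out as the slope `a` repeated, up to floating-point error, which is the sense in which each curve is a "straight line" in its natural space.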
We're then going to use a small amount of RL (like, 10 training episodes) to try to point it in this direction. We're going to try to use the RL to train: "Act exactly like [a given alignment researcher] would act."
Why are we doing RL if we just want imitation? Why not SFT on expert demonstrations?
Also, if 10 episodes suffices, why is so much post-training currently done on base models?
If the agent follows EDT, it seems like you are giving it epistemically unsound credences. In particular, the premise is that it's very confident it will go left, and the consequence is that it in fact goes right. This was the world model's fault, not EDT's fault. (It is notable though that EDT introduces this loopiness into the world model's job.)
Shouldn't the second one be 1/√k?
Is this meant to say "last token" instead of "past token"?