All of vonnik's Comments + Replies

vonnik

The argument above isn’t clear to me, because I’m not sure how you’re defining your terms.

I should note that, contrary to the statement “reward is _not_, in general, that-which-is-optimized by RL agents”, reward by definition _must be_ what is optimized for by RL agents. If they do not do that, they are not RL agents. At least, that is true based on the way the term “reward” is commonly used in the field of RL. That is what RL agents are programmed by humans to do. They do that by changing their behavior over many trials, and testing the results of …

TurnTrout
This is not true, and the essay is meant to explain why. In vanilla policy gradient, reward $R$ on a trajectory $\tau$ provides a set of gradients which push up the logits on the actions $a_t$ which produced that trajectory. The gradient on the parameters $\theta$ which parameterize the policy $\pi_\theta$ is in the direction of increasing return $J$:

$$\nabla_\theta J(\pi_\theta) = \mathbb{E}_{\tau \sim \pi_\theta}\!\left[\sum_{t=0}^{T} \nabla_\theta \log \pi_\theta(a_t \mid s_t)\, R(\tau)\right]$$

You can read more about this here.

Less formally, the agent does stuff. Some stuff is rewarding. Rewarding actions get upweighted locally. That's it. There's no math here that says "and the agent shall optimize for reward explicitly"; the math actually says "the agent's parameterization is locally optimized by reward on the data distribution of the observations it actually makes." Reward simply chisels cognition into agents (at least, in PG-style setups).

In some settings, convergence results guarantee that this process converges to an optimal policy. As explained in the section "When is reward the optimization target of the agent?", these settings probably don't bear on smart alignment-relevant agents operating in reality.
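To make the "reward chisels cognition" point concrete, here is a minimal sketch of a vanilla policy-gradient (REINFORCE-style) update for a tabular softmax policy. The state/action counts, trajectory format, and helper names are illustrative assumptions, not from the post; the point is only that reward enters as a scalar multiplier on the log-probability gradients of actions the agent actually took, with no term telling the agent to represent or pursue reward.

```python
import numpy as np

# Minimal REINFORCE-style sketch for a tabular softmax policy pi_theta(a|s).
# theta[s, a] are logits; the update is the sampled version of
#   grad_theta J = E_tau[ sum_t grad_theta log pi_theta(a_t|s_t) * R(tau) ].
# n_states, n_actions, and the trajectory format are illustrative assumptions.

n_states, n_actions = 5, 3
theta = np.zeros((n_states, n_actions))  # policy parameters (logits)

def policy(s):
    """Action probabilities pi_theta(.|s): softmax over the logits for state s."""
    logits = theta[s]
    p = np.exp(logits - logits.max())
    return p / p.sum()

def grad_log_pi(s, a):
    """Gradient of log pi_theta(a|s) w.r.t. theta[s, :]: one_hot(a) - pi_theta(.|s)."""
    g = -policy(s)
    g[a] += 1.0
    return g

def reinforce_update(trajectory, trajectory_return, lr=0.1):
    """One vanilla policy-gradient step on a single sampled trajectory.

    trajectory: list of (state, action) pairs the agent actually took.
    trajectory_return: the scalar R(tau) for that trajectory.
    Reward never appears as something the agent reasons about; it only scales
    how strongly each taken action's logit is pushed up (or down, if negative).
    """
    for s, a in trajectory:
        theta[s] += lr * trajectory_return * grad_log_pi(s, a)

# Example: reinforce a short trajectory that happened to earn return 1.0.
reinforce_update(trajectory=[(0, 2), (1, 0), (3, 1)], trajectory_return=1.0)
```

Note that the update only touches logits for states the agent actually visited, on the actions it actually took; nothing in it references the reward function itself, only the scalar return of the sampled trajectory.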