Response to Bill Hibbard:
It seems to me that every O-maximizer can be expressed as a reward maximizer. Specifically, comparing equations (2) and (3), given an O-maximizer we can define reward r sub(m) (by this notation I mean "r subscript m") as:
r sub(m) = SUM(r in R) U(r)P(r|yx sub(<=m))
and r sub(i) = 0 for i<m, where the paper sets m to the final time step, following Nick Hay. The reward maximizer so defined will behave identically with the O-maximizer.
In the reward-maximization framework, rewards are part of the observations and come from the environment. You cannot define r sub(m) by a formula of your own choosing and then call the result a reward maximizer, because the reward would no longer be something the agent observes. Hibbard's formulation of an O-maximizer as a reward maximizer therefore doesn't work.
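A minimal sketch of the distinction, under stated assumptions: the names `environment_rewards`, `constructed_rewards`, `toy_utility` and `toy_posterior` are hypothetical and do not come from the paper, R is assumed to be a small finite set of labels, and each observation is assumed to be a (percept, reward) pair. The first function reads rewards off the observations, as a reward maximizer would. The second computes r sub(m) from U and P, as in the proposed construction, without consulting any reward supplied by the environment.

```python
from typing import Callable, Dict, List, Tuple

# Hypothetical types for illustration only.
Observation = Tuple[str, float]          # (percept, reward supplied by the environment)
History = List[Tuple[str, Observation]]  # [(action y_i, observation x_i), ...]


def environment_rewards(history: History) -> List[float]:
    """Reward-maximizer view: each reward r_i is read off the observation x_i."""
    return [obs[1] for _action, obs in history]


def constructed_rewards(
    history: History,
    utility: Callable[[str], float],
    posterior: Callable[[History], Dict[str, float]],
) -> List[float]:
    """Proposed construction: r_i = 0 for i < m, r_m = SUM(r in R) U(r) P(r | history)."""
    m = len(history)
    if m == 0:
        return []
    p = posterior(history)  # assumed to return P(r | yx_<=m) for each r in R
    r_m = sum(utility(r) * prob for r, prob in p.items())
    return [0.0] * (m - 1) + [r_m]


# Toy inputs (assumptions, chosen so the arithmetic is exact).
toy_utility = {"a": 4.0, "b": 0.0}.get
toy_posterior = lambda _history: {"a": 0.25, "b": 0.75}
history = [("y1", ("o1", 0.1)), ("y2", ("o2", 0.2)), ("y3", ("o3", 0.3))]

print(environment_rewards(history))                               # [0.1, 0.2, 0.3]
print(constructed_rewards(history, toy_utility, toy_posterior))   # [0.0, 0.0, 1.0]
```

In this toy setup the two reward sequences differ, and the second never looks at the reward component of any observation. That is the sense in which the constructed r sub(m) is not a reward coming from the environment.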
If this is correct, doesn't the "characteristic behavior pattern" shown for reward maximizers in Appendix B, as stated in Section 3.1, also apply to O-maximizers?
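Since the construction does not yield a genuine reward maximizer, this argument does not hold: the behavior pattern shown for reward maximizers in Appendix B is not thereby established for O-maximizers.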
Daniel Dewey, 'Learning What to Value'
Abstract: I.J. Good's theory of an "intelligence explosion" predicts that ultraintelligent agents will undergo a process of repeated self-improvement. In the wake of such an event, how well our values are fulfilled will depend on whether these ultraintelligent agents continue to act desirably and as intended. We examine several design approaches, based on AIXI, that could be used to create ultraintelligent agents. In each case, we analyze the design conditions required for a successful, well-behaved ultraintelligent agent to be created. Our main contribution is an examination of value-learners, agents that learn a utility function from experience. We conclude that the design conditions on value-learners are in some ways less demanding than those on other design approaches.