New FAI paper: 'Learning What to Value' by Daniel Dewey

lukeprog

I requested feedback about this paper here.

One of my conclusions was that you could, in theory, train a Solomonoff Induction-based reinforcement learning agent to produce arbitrary finite sequences of actions (non-self-destructive ones anyway) in response to specified sets of finite sense data - assuming you are allowed to program its reward function and give it fake memories dating back from before it was born.

This is essentially the same result as is claimed for O-Maximisers in the paper. This undermines the thesis that O-Maximisers somehow exhibit different dynamics from reinforcement learning agents.

Update on 2011-04-30: Bill Hibbard makes an almost identical point to the observations I made in this comment. You can see it in his post - on the AGI mailing list - here.

danieldewey 15y 20

Response to Curt Welch:

Sadly, what he seems to have failed to realize is that any actual implementation of an O-Maximizer or his Value-learners must also be a reward maximizer. Is he really that stupid so as not to understand they are all reward maximizers?

Zing! I guess he didn't think I was going to be reading that. To be fair, it may seem to him that I've made a stupid error, thinking that O-maximizers behave differently than reward maximizers. I'll try to explain why he's mistaken.

A reward maximizer acts so as to bring about universes in which the ...

2 danieldewey 15y

Response to Bill Hibbard: [...] In the reward-maximization framework, rewards are part of observations and come from the environment. You cannot define "r sub(m)" to be equal to something mathematically, then call the result a reward-maximizer; therefore, Hibbard's formulation of an O-maximizer as a reward-maximizer doesn't work. [...] Since the construction was incorrect, this argument does not hold.
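To make the distinction above concrete, here is a minimal toy sketch in Python. It is not the formalism from the paper: the one-step `MODEL`, the action names, and `task_utility` are all hypothetical, chosen only to show where each agent's score comes from. The reward maximizer reads a reward channel that arrives inside the observation, while the O-maximizer applies a utility function fixed by the designer to the outcome.

```python
from typing import Callable, Dict, List, Tuple

# Hypothetical toy types: an observation is (percept, reward). The reward is
# a channel inside the observation, i.e. it is produced by the environment.
Observation = Tuple[str, float]
# action -> list of (probability, observation) pairs
Model = Dict[str, List[Tuple[float, Observation]]]


def reward_maximizer(model: Model) -> str:
    """Pick the action with the highest expected environment-supplied reward."""
    return max(model, key=lambda a: sum(p * obs[1] for p, obs in model[a]))


def o_maximizer(model: Model,
                utility: Callable[[str, Observation], float]) -> str:
    """Pick the action with the highest expected utility, where the utility
    function is fixed by the designer rather than read from the observation."""
    return max(model, key=lambda a: sum(p * utility(a, obs) for p, obs in model[a]))


# Assumed toy world: tampering with the sensor yields a larger reward signal
# than doing the task.
MODEL: Model = {
    "work": [(1.0, ("task_done", 0.5))],
    "tamper": [(1.0, ("sensor_hacked", 1.0))],
}


def task_utility(action: str, obs: Observation) -> float:
    """Hypothetical utility: only the percept matters, the reward channel is ignored."""
    percept, _reward = obs
    return 1.0 if percept == "task_done" else 0.0


print(reward_maximizer(MODEL))           # tamper
print(o_maximizer(MODEL, task_utility))  # work
```

In this toy world the two decision rules pick different actions because the reward channel and the designer's utility disagree. The sketch only illustrates the definitions; it does not settle whether one class of agents can be rewritten as the other.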

3danieldewey15y

Thanks for posting this around! It's great to see it creating discussion. I'm working on replies to the points you, Bill Hibbard, and Curt Welch have made. It looks like I have some explaining to do if I want to convince you that O-maximizers aren't a subset of reward maximizers; in particular, that my argument in appendix B doesn't apply to O-maximizers.
