
jsteinhardt comments on Stupid Questions Open Thread - Less Wrong Discussion

42 points · Post author: Costanza · 29 December 2011 11:23PM




Comment author: Andy_McKenzie 30 December 2011 04:40:12AM 7 points

In this interview between Eliezer and Luke, Eliezer says that the "solution" to the exploration-exploitation trade-off is to "figure out how much resources you want to spend on exploring, do a bunch of exploring, use all your remaining resources on exploiting the most valuable thing you’ve discovered, over and over and over again." His point is that humans don't do this, because we have our own, arbitrary value called boredom, while an AI would follow this "pure math."
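The strategy Eliezer describes can be sketched as follows. This is a minimal illustration, not code from the interview; the function name, round-robin exploration, and parameters are all assumptions made for the sake of the example.

```python
def epsilon_first(arms, budget, explore_fraction=0.1):
    """Epsilon-first strategy: spend a fixed fraction of the budget
    exploring, then pour all remaining resources into the best arm found.
    `arms` is a list of zero-argument callables returning a numeric
    reward (an illustrative interface, not from the interview)."""
    explore_pulls = int(budget * explore_fraction)
    totals = [0.0] * len(arms)
    counts = [0] * len(arms)

    # Exploration phase: cycle through the arms round-robin.
    for t in range(explore_pulls):
        i = t % len(arms)
        totals[i] += arms[i]()
        counts[i] += 1

    # Exploitation phase: commit to the best empirical mean.
    means = [tot / c if c else float("-inf")
             for tot, c in zip(totals, counts)]
    best = means.index(max(means))
    reward = sum(totals)
    for _ in range(budget - explore_pulls):
        reward += arms[best]()
    return reward, best
```

The key feature is the hard switch: after `explore_pulls` rounds, the agent never samples a non-best arm again.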

My potentially stupid question: doesn't this strategy assume that environmental conditions relevant to your goals do not change? It seems to me that if your environment can change, then you can never be sure that you're exploiting the most valuable choice. More specifically, why is Eliezer so sure that what Wikipedia describes as the epsilon-first strategy is always the optimal one? (Posting this here because I assume he has read more about this than me and that I am missing something.)
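A toy illustration of this worry, with made-up payoffs: two arms whose rewards swap halfway through the run. An epsilon-first agent that finishes exploring before the swap commits to what was the best arm and then keeps exploiting it after it has gone stale.

```python
def epsilon_first_reward_nonstationary(budget=100, explore_frac=0.2):
    """Toy nonstationary environment (all numbers made up): arm 0 pays
    1 for the first half of the run and 0 afterward, while arm 1 pays
    0 and then 1.  Runs an epsilon-first agent and returns its total
    reward."""
    half = budget // 2

    def payoff(arm, t):
        good_early = (arm == 0)
        return 1.0 if (t < half) == good_early else 0.0

    explore = int(budget * explore_frac)
    totals = [0.0, 0.0]
    reward = 0.0
    for t in range(explore):            # explore: alternate the arms
        r = payoff(t % 2, t)
        totals[t % 2] += r
        reward += r
    best = 0 if totals[0] >= totals[1] else 1   # commits to arm 0
    for t in range(explore, budget):    # exploit the now-stale choice
        reward += payoff(best, t)
    return reward
```

Here the agent ends with 40 out of a possible 100: it never revisits arm 1, so it misses the swap entirely. A strategy that kept exploring could notice the change.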

Edit 12/30 8:56 GMT: fixed typo in last sentence of second paragraph.

Comment author: jsteinhardt 30 December 2011 04:55:52PM 5 points

You got me curious, so I did some searching. This paper gives fairly tight bounds in the case where the payoffs are adaptive (i.e. can change in response to your previous actions) but bounded. The algorithm is on page 5.

Comment author: Andy_McKenzie 30 December 2011 06:23:23PM 3 points

Thanks for the link. Their algorithm, the "multiplicative update rule," which works by "selecting each arm randomly with probabilities that evolve based on their past performance," does not seem to me to be the same strategy as the one Eliezer describes. So does this contradict his argument?
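The quoted rule can be sketched like this. This is a generic textbook multiplicative-weights bandit rule (Exp3-style), not necessarily the exact algorithm from the linked paper; the parameter `gamma` and the assumption that rewards lie in [0, 1] are illustrative.

```python
import math
import random

def exp3(arms, rounds, gamma=0.1):
    """Multiplicative-weights bandit sketch: each arm carries a weight,
    arms are drawn at random with probabilities derived from the
    weights, and the drawn arm's weight is scaled up multiplicatively
    by an importance-weighted estimate of its reward.
    `arms`: zero-argument callables returning rewards in [0, 1]."""
    k = len(arms)
    weights = [1.0] * k
    total = 0.0
    for _ in range(rounds):
        z = sum(weights)
        # Mix in uniform exploration so every arm keeps some probability.
        probs = [(1 - gamma) * w / z + gamma / k for w in weights]
        i = random.choices(range(k), weights=probs)[0]
        r = arms[i]()
        total += r
        estimate = r / probs[i]  # unbiased estimate of arm i's reward
        weights[i] *= math.exp(gamma * estimate / k)
    return total, probs
```

Unlike epsilon-first, this rule never stops sampling the other arms (the `gamma / k` term keeps every probability positive), so it can track an environment whose payoffs change.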

Comment author: jsteinhardt 30 December 2011 11:10:33PM 1 point

Yes.