Larks comments on Stupid Questions Open Thread - Less Wrong Discussion
You are viewing a comment permalink. View the original post to see all comments and the full post content.
Comments (265)
In this interview between Eliezer and Luke, Eliezer says that the "solution" to the exploration-exploitation trade-off is to "figure out how much resources you want to spend on exploring, do a bunch of exploring, use all your remaining resources on exploiting the most valuable thing you’ve discovered, over and over and over again." His point is that humans don't do this, because we have our own, arbitrary value called boredom, while an AI would follow this "pure math."
My potentially stupid question: doesn't this strategy assume that environmental conditions relevant to your goals do not change? It seems to me that if your environment can change, then you can never be sure that you're exploiting the most valuable choice. More specifically, why is Eliezer so sure that what Wikipedia describes as the epsilon-first strategy is always the optimal one? (Posting this here because I assume he has read more about this than I have and that I am missing something.)
Edit 12/30 8:56 GMT: fixed typo in last sentence of second paragraph.
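For concreteness, here is a minimal sketch of the epsilon-first strategy being asked about: spend a fixed fraction of the budget pulling arms at random, then commit to the arm with the best observed average for every remaining pull. The function name `epsilon_first`, its parameters, and the constant-payout arms are illustrative assumptions, not anything taken from the interview.

```python
import random

def epsilon_first(arms, total_pulls, explore_fraction, rng):
    """Explore for a fixed fraction of the budget, then exploit only.

    arms: list of zero-argument callables, each returning one payout
          (a hypothetical stand-in for slot machines).
    Returns (index of the committed arm, total reward collected).
    """
    n_arms = len(arms)
    explore_pulls = int(total_pulls * explore_fraction)
    totals = [0.0] * n_arms   # summed payout seen per arm
    counts = [0] * n_arms     # number of pulls per arm
    reward = 0.0

    # Exploration phase: pick arms uniformly at random.
    for _ in range(explore_pulls):
        i = rng.randrange(n_arms)
        payout = arms[i]()
        totals[i] += payout
        counts[i] += 1
        reward += payout

    # Commit to the arm with the highest observed mean payout.
    def observed_mean(i):
        return totals[i] / counts[i] if counts[i] else float("-inf")

    best = max(range(n_arms), key=observed_mean)

    # Exploitation phase: the choice is never revisited.
    for _ in range(total_pulls - explore_pulls):
        reward += arms[best]()

    return best, reward


# Assumed toy setup: three arms with fixed payouts.
arms = [lambda: 1.0, lambda: 5.0, lambda: 2.0]
best, reward = epsilon_first(arms, total_pulls=1000,
                             explore_fraction=0.1, rng=random.Random(0))
```

Note that the exploitation loop never looks at new payouts, so if the arms' payouts drift after the exploration phase ends, this sketch keeps pulling the arm it committed to. That is exactly the non-stationarity worry raised above.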
You should probably be prepared to change how much you plan to spend on exploring based on the initial information received.
This has me confused as well.
Assume a large area divided into two regions. Region A has slot machines with average payout 50, while region B has machines with average payout 500. I am blindfolded and randomly dropped into region A or B. The first slot machine I try has payout 70. I update in the direction of being in region A. Doesn't this affect how many resources I wish to spend doing exploration?
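The update in this example can be sketched with Bayes' rule. The comment does not say how payouts are distributed, so the sketch below assumes an equal prior over the two regions and normally distributed payouts with a standard deviation of 100; both are illustrative assumptions, as are the function names.

```python
import math

def normal_pdf(x, mean, sd):
    """Density of a normal distribution at x."""
    z = (x - mean) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2.0 * math.pi))

def posterior_region_a(payout, prior_a=0.5, mean_a=50.0,
                       mean_b=500.0, sd=100.0):
    """P(region A | one observed payout), under the assumed model.

    mean_a and mean_b come from the example; prior_a and sd are
    assumptions made only for this sketch.
    """
    like_a = normal_pdf(payout, mean_a, sd)
    like_b = normal_pdf(payout, mean_b, sd)
    evidence = prior_a * like_a + (1.0 - prior_a) * like_b
    return prior_a * like_a / evidence

# One pull paying 70 shifts belief strongly toward region A.
p_a = posterior_region_a(70.0)
```

Under these assumptions a single payout of 70 already makes region A far more likely than region B, which is the sense in which the first observation could reasonably change how much further exploration looks worthwhile.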
Are you also assuming that you know all of those assumed facts about the area?
I would certainly expect that how many resources I want to spend on exploration will be affected by how much a priori knowledge I have about the system. Without such knowledge, the amount of exploration-energy I'd have to expend to be confident that there are two regions A and B with average payout as you describe is enormous.
Do you mean to set the parameter specifying the amount of resources (e.g., time steps) to spend exploring (before switching to full-exploiting) based on the info you receive upon your first observation? Also, what do you mean by "probably"?