gwern comments on The Absent-Minded Driver - Less Wrong
In #lesswrong, Boxo, Hyphen-ated, and I each wrote a simple simulation/calculation of the absent-minded driver and checked how the various strategies did. Our results:

As expected, 4/9 does distinctly worse than 1/3, which is the best strategy of the ones we tested. I'm a little surprised at Piccione & Rubinstein: didn't they run any quick* little simulation to see whether 4/9 was beaten by another probability, or did they realize this and have some reason why the bad performance was justifiable? (I haven't read P&R's paper.)
Boxo used his UDT in Haskell code ([eu (Strategy $ const $ chance p) absentMinded id | p <- [0, 1/2, 1/3, 1/4, 4/9]]); I used some quick and dirty Haskell pasted below; and I don't know what Hyphen-ated used.

* It's not like this is a complex simulation. I'm not familiar with Haskell's System.Random, so it took perhaps 20 or 30 minutes to get something working and then another 20 or 30 minutes to run the simulations with 1 or 10 million iterations because it's not optimized code, but Boxo and Hyphen-ated worked even faster than that.
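The Haskell itself is not shown here. As a stand-in, here is a minimal Python sketch of the kind of simulation described, not the code any of us actually ran. It assumes the usual payoffs for the problem (0 for exiting at the first intersection, 4 for exiting at the second, 1 for driving past both) and treats p as the probability of exiting at each intersection; the names `drive`, `simulate`, and `exact` are made up for illustration.

```python
import random
from fractions import Fraction

# Assumed payoffs (standard for this problem, but not stated in the comment itself).
EXIT_FIRST = 0   # exit at the first intersection
EXIT_SECOND = 4  # exit at the second intersection
CONTINUE = 1     # drive past both intersections


def drive(p_exit, rng):
    """Simulate one trip; the driver exits with probability p_exit at each intersection."""
    if rng.random() < p_exit:
        return EXIT_FIRST
    if rng.random() < p_exit:
        return EXIT_SECOND
    return CONTINUE


def simulate(p_exit, trips=200_000, seed=0):
    """Monte Carlo estimate of the average payoff for a fixed exit probability."""
    rng = random.Random(seed)
    return sum(drive(p_exit, rng) for _ in range(trips)) / trips


def exact(p_exit):
    """Closed-form expected payoff under the same assumptions, as an exact fraction."""
    p = Fraction(p_exit)
    return p * EXIT_FIRST + (1 - p) * p * EXIT_SECOND + (1 - p) ** 2 * CONTINUE


if __name__ == "__main__":
    for p in [Fraction(0), Fraction(1, 2), Fraction(1, 3), Fraction(1, 4), Fraction(4, 9)]:
        print(f"p_exit={str(p):>4}  exact={float(exact(p)):.4f}  simulated={simulate(float(p)):.4f}")
```

Under these assumptions the closed form gives 4/3 for p = 1/3 and 35/27 (about 1.296) for p = 4/9, so the simulated averages should land close to those values.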
I slapped together some C++ in under ten minutes, with an eye towards making it run fast. This runs a billion iterations in slightly under one minute on my machine.
Another variation of the simulation, in Python, with a somewhat better-defined strategy abstraction (though written mainly to improve my own understanding anyway):
To return to this a little, here's an implementation in R:
An agent could solve this for the optimal action in a number of ways. For example, taking a reinforcement learning perspective, we could discretize the action space into 100 probabilities: 0, 0.01, 0.02...0.98, 0.99, 1.0, and treat it as a multi-armed bandit problem with k=100 independent arms (without any assumptions of smoothness or continuity or structure over 0..1).
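One way to run such a bandit is Thompson sampling. The sketch below is a simplified Python illustration, not the R/MCMCglmm model: it swaps in an independent conjugate Normal posterior per arm, treating rewards as Gaussian with a known variance. The payoffs (0, 4, 1), the prior settings, and the names `reward` and `thompson` are all assumptions made for the example.

```python
import math
import random


def reward(p_exit, rng):
    # Assumed payoffs: 0 for exiting first, 4 for exiting second, 1 for driving past both.
    if rng.random() < p_exit:
        return 0.0
    if rng.random() < p_exit:
        return 4.0
    return 1.0


def thompson(arms, pulls=10_000, prior_mean=1.0, prior_var=4.0, noise_var=2.0, seed=0):
    """Thompson sampling with an independent Normal posterior on each arm's mean reward.

    Hypothetical simplification: rewards are treated as Gaussian with known
    variance noise_var, so every posterior has a closed form and no MCMC is needed.
    """
    rng = random.Random(seed)
    counts = [0] * len(arms)
    totals = [0.0] * len(arms)
    total_reward = 0.0

    def posterior(i):
        # Standard Normal-Normal update: precisions add, means are precision-weighted.
        precision = 1.0 / prior_var + counts[i] / noise_var
        mean = (prior_mean / prior_var + totals[i] / noise_var) / precision
        return mean, 1.0 / precision

    for _ in range(pulls):
        # Draw one plausible mean per arm, then play the arm whose draw is largest.
        draws = []
        for i in range(len(arms)):
            mean, var = posterior(i)
            draws.append(rng.gauss(mean, math.sqrt(var)))
        best = max(range(len(arms)), key=draws.__getitem__)
        r = reward(arms[best], rng)
        counts[best] += 1
        totals[best] += r
        total_reward += r

    means = [posterior(i)[0] for i in range(len(arms))]
    return counts, means, total_reward / pulls


if __name__ == "__main__":
    # Grid as listed in the text: 0.00, 0.01, ..., 1.00.
    grid = [i / 100 for i in range(101)]
    counts, means, avg = thompson(grid)
    most_played = max(range(len(grid)), key=counts.__getitem__)
    print(f"average reward {avg:.3f}, most-played arm p={grid[most_played]:.2f}")
```

Because each arm has its own posterior, nothing learned about p=0.33 transfers to p=0.34, which matches the "no smoothness assumed" framing above.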
Implementation-wise, we can treat the arm as a level in a categorical variable, so instead of writing out

Reward ~ Beta_p0 + Beta_p0.01 + ... + Beta_p1.0

we can write out

Reward ~ P

and let the library expand it out into 100 dummy variables. So here's a simple Thompson sampling MAB in R using MCMCglmm rather than JAGS since it's shorter:

We can see just from the sheer variability of the actions >0.5 that the MAB was shifting exploration away from them, accepting much larger uncertainty in exchange for focusing on the 0.2-0.4 range where the optimal action lies; it didn't manage to identify 0.33 as that optimum due to what looks like a few unlucky samples in this run which put 0.33 below, but it was definitely getting there. How much certainty did we gain about 0.33 being optimal as compared to a naive strategy of just allocating 10000 samples over all arms?
We can ask the posterior probability estimate of 0.33 being optimal by looking at a set of posterior sample estimates for arms 1-100 and seeing in how many sets 0.33 has the highest parameter estimate (=maximum reward); we can then compare the posterior from the MAB with the posterior from the fixed-sample trial:
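The counting step described here can be sketched in a few lines of Python. This is only an illustration of the idea, not the R code: `prob_optimal` is a hypothetical helper, and the three-arm posterior in the demo is made up rather than taken from either run.

```python
import random
from collections import Counter


def prob_optimal(posterior_samples):
    """Estimate each arm's posterior probability of being the best arm.

    posterior_samples: list of draws, each a list holding one estimated mean
    reward per arm. Returns {arm_index: fraction of draws in which that arm
    had the highest estimate}; arms that never win are omitted.
    """
    wins = Counter(max(range(len(draw)), key=draw.__getitem__) for draw in posterior_samples)
    n = len(posterior_samples)
    return {arm: wins[arm] / n for arm in sorted(wins)}


if __name__ == "__main__":
    # Made-up posterior for three hypothetical arms, purely for illustration.
    rng = random.Random(1)
    means, sds = [1.25, 1.33, 1.30], [0.05, 0.05, 0.05]
    samples = [[rng.gauss(m, s) for m, s in zip(means, sds)] for _ in range(5000)]
    print(prob_optimal(samples))
```

Running the same function on posterior draws from two different experiments gives directly comparable numbers for any one arm, which is the comparison being made between the MAB and the fixed-sample trial.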
So the MAB thinks 0.33 is more than twice as likely to be optimal as the fixed trial did, despite some bad luck. It also did so while gaining a much higher reward, roughly equivalent to choosing p=0.25 each time (optimal would've been ~1.33, so our MAB's regret is (1.33-1.20) * 10000 = 1300 vs the fixed's in-trial regret of 3400 and infinite regret thereafter since it did not find the optimum at all):

While I was at it, I thought I'd give some fancier algorithms a spin using Karpathy's Reinforce.js RL library (Github). The DQN may be able to do much better since it can potentially observe the similarity of payoffs and infer the underlying function to maximize.
First a JS implementation of the Absent-Minded Driver simulation:
Now we can use Reinforce.js to set up and run a deep Q-learning agent (but not too deep since this runs in-browser and there's no need for hundreds or thousands of neurons for a tiny problem like this):
Overall, I am not too impressed by the DQN. It looks like it is doing worse than the MAB was, despite having a much more flexible model to use. I don't know if this reflects the better exploration of Thompson sampling compared to the DQN's epsilon-greedy, or if my fiddling with the hyperparameters failed to help. It's a small enough problem that a Bayesian NN is probably computationally feasible, but I dunno how to use those.