gwern comments on The Absent-Minded Driver - Less Wrong

27 Post author: Wei_Dai 16 September 2009 12:51AM

You are viewing a comment permalink. View the original post to see all comments and the full post content.

Comments (139)

You are viewing a single comment's thread.

Comment author: gwern 02 February 2011 09:37:57PM *  4 points [-]

In #lesswrong, me, Boxo, and Hyphen-ated each wrote a simple simulation/calculation of the absent-minded driver and checked how the various strategies did. Our results:

  • 1/3 = 1.3337, 1.3338174; Hyphen-ated: 1.33332; Boxo: 1.3333333333333335
  • 1/4 = 1.3127752; Hyphen-ated: 1.31247; Boxo: 1.3125
  • 2/3 = 0.9882
  • 4/9 = 1.2745, 1.2898, 1.299709, 1.297746, 1.297292, 1.2968815; Hyphen-ated: 1.29634; Boxo: 1.2962962962962963

As expected, 4/9 does distinctly worse than 1/3, which is the best strategy of the ones we tested. I'm a little surprised at Piccione & Rubinstein - didn't they run any quick* little simulation to see whether 4/9 was beaten by another probability or did they realize this and had some reason bad performance was justifiable? (I haven't read P&R's paper.)

Boxo used his UDT in Haskell code ([eu (Strategy $ const $ chance p) absentMinded id | p <- [0, 1/2, 1/3, 1/4, 4/9]]); I used some quick and dirty Haskell pasted below; and I don't know what Hyphen-ated used.

import System.Random
import Control.Monad
main = foo >>= print -- test out the 1/4 strategy. obvious how to adapt
foo = let n = 10000000 in fmap (\x -> fromIntegral (sum x) / fromIntegral n) $ replicateM n run
run = do x <- getStdRandom (randomR (1,4))
y <- getStdRandom (randomR (1,4))
return $ trial x y
trial :: Int -> Int -> Int
trial x y | x == 1 = 0 -- exit at first turn with reward 0
| y == 1 = 4 -- second turn, reward 4
| otherwise = 1 -- missed both, reward 1

* It's not like this is a complex simulation. I'm not familiar with Haskell's System.Random so it took perhaps 20 or 30 minutes to get something working and then another 20 or 30 minutes to run the simulations with 1 or 10 million iterations because it's not optimized code, but Boxo and Hyphen-ated worked even faster than that.

Comment author: Hyphen-ated 02 February 2011 09:55:56PM *  2 points [-]

I slapped together some C++ in under ten minutes, with an eye towards making it run fast. This runs a billion iterations in slightly under one minute on my machine.

#include <cstdlib>
#include <ctime>
#include <iostream>
using namespace std;
long long threshhold = 4 * (RAND_MAX / 9);
int main() {
srand((unsigned)time(0));
long long utility = 0;
long long iterations = 1000000000;
for(long long i = 0; i < iterations; ++i) {
if(rand() < threshhold)
continue;
else if(rand() < threshhold) {
utility += 4;
continue;
}
++utility;
}
cout << "avg utility: " << (double)utility/iterations << endl;
return 0;
}
Comment author: HoverHell 31 December 2011 08:23:18PM 1 point [-]

Another variation of simulation, in Python, and bit more defined strategy-abstraction (though written for improving own understanding anyway):

from __future__ import division
import random
def run(strt): # compute (possibly stochastic) outcome of strategy execution
docont = strt() # ... no arguments.
if not docont: # at intersection X
return 0
docont = strt()
if not docont:
return 4
return 1
def evst(strt, fun=run, n=100): # evaluate strategy by running `n` times
r = 0.0
for i in range(n):
r += fun(strt)
return r/n
""" # sample use:
> ra = [(v/30, evst(lambda: random.random() < v/30, n=30000)) for v in range(30)]
> max(ra, key=lambda v: v[1]) # maximal stochastic value and its `p`
(0.6666666666666666, 1.3515666666666666)
"""
Comment author: gwern 19 January 2016 04:50:46PM *  0 points [-]

To return to this a little, here's an implementation in R:

absentMindedDriver <- function(p1, p2) { if(p1==1) { 0; } else { if(p2==1) { 4; } else { 1; } } }
runDriver <- function(p=(1/3), iters=100000) { p1 <- rbinom(n=iters, size=1, prob=p)
p2 <- rbinom(n=iters, size=1, prob=p)
return(mapply(absentMindedDriver, p1, p2)) }
sapply(c(0, 1/2, 1/3, 1/4, 4/9, 1), function(p) { mean(runDriver(p=p)); })
# [1] 1.00000 1.25183 1.33558 1.31156 1.29420 0.00000
## Plot payoff curve:
actions <- seq(0, 1, by=0.01)
ground <- data.frame(P=actions, Reward=sapply(actions, function(p) {mean(runDriver(p))}))
plot(ground)
# <https://i.imgur.com/sQGDjEU.png>

Payoffs to each probability

An agent could solve this for the optimal action in a number of ways. For example, taking a reinforcement learning perspective, we could discretize the action space into 100 probabilities: 0, 0.01, 0.02...0.98, 0.99, 1.0, and treat it as a multi-armed bandit problem with k=100 independent arms (without any assumptions of smoothness or continuity or structure over 0..1).

Implementation-wise, we can treat the arm as a level in a categorical variable, so instead of writing out Reward ~ Beta_p0 + Beta_p0.01 + ... Beta_p1.0 we can write out Reward ~ P and let the library expand it out into 100 dummy variables. So here's a simple Thompson sampling MAB in R using MCMCglmm rather than JAGS since it's shorter:

testdata <- data.frame(P=c(rep(0.00, 10), rep(0.33, 10), rep(0.66, 10), rep(0.9, 10)), Reward=c(runDriver(p=0, iters=10), runDriver(p=(0.33), iters=10), runDriver(p=(0.66), iters=10), runDriver(p=(0.9), iters=10))); testdata
# P Reward
# 1 0.00 1
# 2 0.00 1
# 3 0.00 1
# ...
summary(lm(Reward ~ as.factor(P), data=testdata))
# ...Coefficients:
# Estimate Std. Error t value Pr(>|t|)
# (Intercept) 1.0000000 0.3836955 2.60623 0.013232
# as.factor(P)0.33 0.7000000 0.5426274 1.29002 0.205269
# as.factor(P)0.66 -0.5000000 0.5426274 -0.92144 0.362954
# as.factor(P)0.9 -0.6000000 0.5426274 -1.10573 0.276178
library(MCMCglmm)
m <- MCMCglmm(Reward ~ as.factor(P), data=testdata)
summary(m$Sol)
# ...1. Empirical mean and standard deviation for each variable,
# plus standard error of the mean:
# Mean SD Naive SE Time-series SE
# (Intercept) 1.0215880 0.4114639 0.01301163 0.01301163
# as.factor(P)0.33 0.6680714 0.5801901 0.01834722 0.01632753
# as.factor(P)0.66 -0.5157360 0.5635405 0.01782072 0.01782072
# as.factor(P)0.9 -0.6056746 0.5736586 0.01814068 0.01884525
#
# 2. Quantiles for each variable:
# 2.5% 25% 50% 75% 97.5%
# (Intercept) 0.2440077 0.7633342 1.0259299 1.2851243 1.8066701
# as.factor(P)0.33 -0.4521587 0.2877682 0.6515954 1.0344911 1.8871638
# as.factor(P)0.66 -1.5708577 -0.8923558 -0.5300312 -0.1495413 0.5953416
# as.factor(P)0.9 -1.7544003 -0.9542541 -0.6397946 -0.2389132 0.6077439
# k=100
actions <- seq(0, 1, by=0.01)
# We seed each action with 1 trial, to make sure each action shows up as a level of the categorical variable:
mab <- data.frame(P=actions, Reward=sapply(actions, function(a) { runDriver(p=a, iters=1) }))
# We'll run for 10,000 sequential actions total (or ~100 trials per arm on average)
iters <- 10000
for (i in 1:iters) {
m <- MCMCglmm(Reward ~ as.factor(P), data=mab, verbose=FALSE)
estimates <- data.frame()
for (col in 2:ncol(m$Sol)) {
# take 1 estimate from MCMCglmm's posterior samples for each level of the categorical variable P:
estimates <- rbind(estimates,
data.frame(P=as.numeric(sub("as.factor\\(P\\)", "", names(m$Sol[1,col]))),
RewardSample=m$Sol[1,col]))
}
# since reward is already defined as utility, our utility function is just the identity:
utilityFunction <- function(estimate) { identity(estimate); }
actionChoice <- estimates[which.max(utilityFunction(estimates$RewardSample)),]$P
result <- runDriver(p=actionChoice, iters=1)
mab <- rbind(mab, data.frame(P=actionChoice, Reward=result))
cat(paste0("I=", i, "; tried at ", actionChoice, ", got reward: ", result, "\n"))
}
# ...I=9985; tried at 0.23, got reward: 1
# I=9986; tried at 0.45, got reward: 4
# I=9987; tried at 0.43, got reward: 4
# I=9988; tried at 0.39, got reward: 1
# I=9989; tried at 0.39, got reward: 0
# I=9990; tried at 0.53, got reward: 0
# I=9991; tried at 0.44, got reward: 0
# I=9992; tried at 0.84, got reward: 0
# I=9993; tried at 0.32, got reward: 4
# I=9994; tried at 0.16, got reward: 4
# I=9995; tried at 0.09, got reward: 1
# I=9996; tried at 0.15, got reward: 1
# I=9997; tried at 0.02, got reward: 1
# I=9998; tried at 0.28, got reward: 1
# I=9999; tried at 0.35, got reward: 1
# I=10000; tried at 0.35, got reward: 4
## Point-estimates of average reward from each arm:
plot(mab$P)
# <https://i.imgur.com/6dTvXC8.png>

Mean reward for each of the 100 arms in the MAB for the Absent-Minded Driver

We can see just from the sheer variability of the actions>0.5 that the MAB was shifting exploration away from them, accepting much larger uncertainty in exchange for focusing on the 0.2-0.4 where the optimal action lies in; it didn't manage to identify 0.33 as that optimum due to what looks like a few unlucky samples in this run which put 0.33 below, but it was definitely getting there. How much certainty did we gain about 0.33 being optimal as compared to a naive strategy of just allocating 10000 samples over all arms?

fixed <- data.frame()
for (a in actions) { fixed <- rbind(fixed, data.frame(P=a, Reward=runDriver(p=a, iters=100))); }
## We can see major variability compared to the true curve:
plot(aggregate(Reward ~ P, fixed, mean))
# <https://i.imgur.com/zmNdwfC.png>

Mean reward for each arm as estimated by a fixed-sample trial allocating 100 samples to each arm.

We can ask the posterior probability estimate of 0.33 being optimal by looking at a set of posterior sample estimates for arms 1-100 and seeing in how many sets 0.33 has the highest parameter estimate (=maximum reward); we can then compare the posterior from the MAB with the posterior from the fixed-sample trial:

posteriorProbabilityOf0.33 <- function (d) {
mcmc <- MCMCglmm(Reward ~ as.factor(P), data=d, verbose=FALSE)
posteriorEstimates <- mcmc$Sol[,-1]
trueOptimum <- which(colnames(posteriorEstimates) == "as.factor(P)0.33")
return(table(sapply(1:nrow(posteriorEstimates), function(i) { which.max(posteriorEstimates[i,]) == trueOptimum; }))) }
posteriorProbabilityOf0.33(mab)
# FALSE TRUE
# 998 2
posteriorProbabilityOf0.33(fixed)
# FALSE
# 1000

So the MAB thinks 0.33 is more than twice as likely to be optimal than the fixed trial did, despite some bad luck. It also did so while gaining a much higher reward, roughly equivalent to choosing p=0.25 each time (optimal would've been ~1.33, so our MAB's regret is (1.33-1.20) * 10000 = 1300 vs the fixed's in-trial regret of 3400 and infinite regret thereafter since it did not find the optimum at all):

mean(fixed$Reward) # [1] 0.9975247525
mean(df$Reward) # [1] 1.201861202
Comment author: gwern 19 January 2016 04:57:39PM 0 points [-]

While I was at it, I thought I'd give some fancier algorithms a spin using Karpathy's <code>Reinforce.js</code> RL library (Github). The DQN may be able to do much better since it can potentially observe the similarity of payoffs and infer the underlying function to maximize.

First a JS implementation of the Absent-Minded Driver simulation:

function getRandBinary(p=0.5) { rand = Math.random(); if (rand >= p) { return 0; } else { return 1; } }
function absentMindedDriver(p1, p2) { if(p1==1) { return 0; } else { if (p2==1) { return 4; } else { return 1; } } }
function runDriver(p) { p1 = getRandBinary(p)
p2 = getRandBinary(p)
return absentMindedDriver(p1, p2); }
function runDrivers(p=(1/3), iters=1000) {
var rewards = [];
for (i=0; i < iters; i++) {
rewards.push(runDriver(p));
}
return rewards;
}

Now we can use Reinforce.js to set up and run a deep Q-learning agent (but not too deep since this runs in-browser and there's no need for hundreds or thousands of neurons for a tiny problem like this):

// load Reinforce.js
var script = document.createElement("script");
script.src = "<https://raw.githubusercontent.com/karpathy/reinforcejs/master/lib/rl.js>";
document.body.appendChild(script);
// set up a DQN RL agent and run
var env = {};
env.getNumStates = function() { return 0; }
env.getMaxNumActions = function() { return 100; } // so actions are 1-100?
// create the DQN agent: relatively small, since this is an easy problem; greedy, since only one step; hold onto all data & process heavily
var spec = { num_hidden_units:25, experience_add_every:1, learning_steps_per_iteration:100, experience_size:10100 }
agent = new RL.DQNAgent(env, spec);
// OK, now let it free-run; we give it an extra 100 actions for equality with the MAB's initial 100 exploratory pulls
for (i=0; i < 10100; i++) {
s = [] // MAB is one-shot, so no global state
var action = agent.act([]);
//... execute action in environment and get the reward
reward = runDriver(action/100);
console.log("Action: " + action + "; reward: " + reward);
agent.learn(reward); // the agent improves its Q,policy,model, etc. reward is a float
}
// ...Action: 15; reward: 1
// Action: 34; reward: 0
// Action: 34; reward: 4
// Action: 21; reward: 4
// Action: 15; reward: 1
// Action: 43; reward: 0
// Action: 21; reward: 1
// Action: 43; reward: 1
// Action: 34; reward: 4
// Action: 15; reward: 1
// Action: 78; reward: 0
// Action: 15; reward: 0
// Action: 15; reward: 1
// Action: 34; reward: 1
// Action: 43; reward: 1
// Action: 34; reward: 4
// Action: 24; reward: 1
// Action: 0; reward: 1
// Action: 34; reward: 0
// Action: 3; reward: 1
// Action: 51; reward: 4
// Action: 3; reward: 1
// Action: 11; reward: 4
// Action: 3; reward: 1
// Action: 11; reward: 1
// Action: 43; reward: 4
// Action: 36; reward: 4
// Action: 36; reward: 0
// Action: 21; reward: 4
// Action: 77; reward: 0
// Action: 21; reward: 4
// Action: 26; reward: 4

Overall, I am not too impressed by the DQN. It looks like it is doing worse than the MAB was, despite having a much more flexible model to use. I don't know if this reflects the better exploration of Thompson sampling compared to the DQN's epsilon-greedy, or if my fiddling with the hyperparameters failed to help. It's a small enough problem that a Bayesian NN is probably computationally feasible, but I dunno how to use those.