gwern comments on The Absent-Minded Driver - Less Wrong

27 Post author: Wei_Dai 16 September 2009 12:51AM

You are viewing a comment permalink. View the original post to see all comments and the full post content.

Comments (139)

You are viewing a single comment's thread. Show more comments above.

Comment author: gwern 19 January 2016 04:50:46PM *  0 points [-]

To return to this a little, here's an implementation in R:

absentMindedDriver <- function(p1, p2) { if(p1==1) { 0; } else { if(p2==1) { 4; } else { 1; } } }
runDriver <- function(p=(1/3), iters=100000) { p1 <- rbinom(n=iters, size=1, prob=p)
p2 <- rbinom(n=iters, size=1, prob=p)
return(mapply(absentMindedDriver, p1, p2)) }
sapply(c(0, 1/2, 1/3, 1/4, 4/9, 1), function(p) { mean(runDriver(p=p)); })
# [1] 1.00000 1.25183 1.33558 1.31156 1.29420 0.00000
## Plot payoff curve:
actions <- seq(0, 1, by=0.01)
ground <- data.frame(P=actions, Reward=sapply(actions, function(p) {mean(runDriver(p))}))
plot(ground)
# <https://i.imgur.com/sQGDjEU.png>

Payoffs to each probability

An agent could solve this for the optimal action in a number of ways. For example, taking a reinforcement learning perspective, we could discretize the action space into 100 probabilities: 0, 0.01, 0.02...0.98, 0.99, 1.0, and treat it as a multi-armed bandit problem with k=100 independent arms (without any assumptions of smoothness or continuity or structure over 0..1).

Implementation-wise, we can treat the arm as a level in a categorical variable, so instead of writing out Reward ~ Beta_p0 + Beta_p0.01 + ... Beta_p1.0 we can write out Reward ~ P and let the library expand it out into 100 dummy variables. So here's a simple Thompson sampling MAB in R using MCMCglmm rather than JAGS since it's shorter:

testdata <- data.frame(P=c(rep(0.00, 10), rep(0.33, 10), rep(0.66, 10), rep(0.9, 10)), Reward=c(runDriver(p=0, iters=10), runDriver(p=(0.33), iters=10), runDriver(p=(0.66), iters=10), runDriver(p=(0.9), iters=10))); testdata
# P Reward
# 1 0.00 1
# 2 0.00 1
# 3 0.00 1
# ...
summary(lm(Reward ~ as.factor(P), data=testdata))
# ...Coefficients:
# Estimate Std. Error t value Pr(>|t|)
# (Intercept) 1.0000000 0.3836955 2.60623 0.013232
# as.factor(P)0.33 0.7000000 0.5426274 1.29002 0.205269
# as.factor(P)0.66 -0.5000000 0.5426274 -0.92144 0.362954
# as.factor(P)0.9 -0.6000000 0.5426274 -1.10573 0.276178
library(MCMCglmm)
m <- MCMCglmm(Reward ~ as.factor(P), data=testdata)
summary(m$Sol)
# ...1. Empirical mean and standard deviation for each variable,
# plus standard error of the mean:
# Mean SD Naive SE Time-series SE
# (Intercept) 1.0215880 0.4114639 0.01301163 0.01301163
# as.factor(P)0.33 0.6680714 0.5801901 0.01834722 0.01632753
# as.factor(P)0.66 -0.5157360 0.5635405 0.01782072 0.01782072
# as.factor(P)0.9 -0.6056746 0.5736586 0.01814068 0.01884525
#
# 2. Quantiles for each variable:
# 2.5% 25% 50% 75% 97.5%
# (Intercept) 0.2440077 0.7633342 1.0259299 1.2851243 1.8066701
# as.factor(P)0.33 -0.4521587 0.2877682 0.6515954 1.0344911 1.8871638
# as.factor(P)0.66 -1.5708577 -0.8923558 -0.5300312 -0.1495413 0.5953416
# as.factor(P)0.9 -1.7544003 -0.9542541 -0.6397946 -0.2389132 0.6077439
# k=100
actions <- seq(0, 1, by=0.01)
# We seed each action with 1 trial, to make sure each action shows up as a level of the categorical variable:
mab <- data.frame(P=actions, Reward=sapply(actions, function(a) { runDriver(p=a, iters=1) }))
# We'll run for 10,000 sequential actions total (or ~100 trials per arm on average)
iters <- 10000
for (i in 1:iters) {
m <- MCMCglmm(Reward ~ as.factor(P), data=mab, verbose=FALSE)
estimates <- data.frame()
for (col in 2:ncol(m$Sol)) {
# take 1 estimate from MCMCglmm's posterior samples for each level of the categorical variable P:
estimates <- rbind(estimates,
data.frame(P=as.numeric(sub("as.factor\\(P\\)", "", names(m$Sol[1,col]))),
RewardSample=m$Sol[1,col]))
}
# since reward is already defined as utility, our utility function is just the identity:
utilityFunction <- function(estimate) { identity(estimate); }
actionChoice <- estimates[which.max(utilityFunction(estimates$RewardSample)),]$P
result <- runDriver(p=actionChoice, iters=1)
mab <- rbind(mab, data.frame(P=actionChoice, Reward=result))
cat(paste0("I=", i, "; tried at ", actionChoice, ", got reward: ", result, "\n"))
}
# ...I=9985; tried at 0.23, got reward: 1
# I=9986; tried at 0.45, got reward: 4
# I=9987; tried at 0.43, got reward: 4
# I=9988; tried at 0.39, got reward: 1
# I=9989; tried at 0.39, got reward: 0
# I=9990; tried at 0.53, got reward: 0
# I=9991; tried at 0.44, got reward: 0
# I=9992; tried at 0.84, got reward: 0
# I=9993; tried at 0.32, got reward: 4
# I=9994; tried at 0.16, got reward: 4
# I=9995; tried at 0.09, got reward: 1
# I=9996; tried at 0.15, got reward: 1
# I=9997; tried at 0.02, got reward: 1
# I=9998; tried at 0.28, got reward: 1
# I=9999; tried at 0.35, got reward: 1
# I=10000; tried at 0.35, got reward: 4
## Point-estimates of average reward from each arm:
plot(mab$P)
# <https://i.imgur.com/6dTvXC8.png>

Mean reward for each of the 100 arms in the MAB for the Absent-Minded Driver

We can see just from the sheer variability of the actions>0.5 that the MAB was shifting exploration away from them, accepting much larger uncertainty in exchange for focusing on the 0.2-0.4 where the optimal action lies in; it didn't manage to identify 0.33 as that optimum due to what looks like a few unlucky samples in this run which put 0.33 below, but it was definitely getting there. How much certainty did we gain about 0.33 being optimal as compared to a naive strategy of just allocating 10000 samples over all arms?

fixed <- data.frame()
for (a in actions) { fixed <- rbind(fixed, data.frame(P=a, Reward=runDriver(p=a, iters=100))); }
## We can see major variability compared to the true curve:
plot(aggregate(Reward ~ P, fixed, mean))
# <https://i.imgur.com/zmNdwfC.png>

Mean reward for each arm as estimated by a fixed-sample trial allocating 100 samples to each arm.

We can ask the posterior probability estimate of 0.33 being optimal by looking at a set of posterior sample estimates for arms 1-100 and seeing in how many sets 0.33 has the highest parameter estimate (=maximum reward); we can then compare the posterior from the MAB with the posterior from the fixed-sample trial:

posteriorProbabilityOf0.33 <- function (d) {
mcmc <- MCMCglmm(Reward ~ as.factor(P), data=d, verbose=FALSE)
posteriorEstimates <- mcmc$Sol[,-1]
trueOptimum <- which(colnames(posteriorEstimates) == "as.factor(P)0.33")
return(table(sapply(1:nrow(posteriorEstimates), function(i) { which.max(posteriorEstimates[i,]) == trueOptimum; }))) }
posteriorProbabilityOf0.33(mab)
# FALSE TRUE
# 998 2
posteriorProbabilityOf0.33(fixed)
# FALSE
# 1000

So the MAB thinks 0.33 is more than twice as likely to be optimal than the fixed trial did, despite some bad luck. It also did so while gaining a much higher reward, roughly equivalent to choosing p=0.25 each time (optimal would've been ~1.33, so our MAB's regret is (1.33-1.20) * 10000 = 1300 vs the fixed's in-trial regret of 3400 and infinite regret thereafter since it did not find the optimum at all):

mean(fixed$Reward) # [1] 0.9975247525
mean(df$Reward) # [1] 1.201861202
Comment author: gwern 19 January 2016 04:57:39PM 0 points [-]

While I was at it, I thought I'd give some fancier algorithms a spin using Karpathy's <code>Reinforce.js</code> RL library (Github). The DQN may be able to do much better since it can potentially observe the similarity of payoffs and infer the underlying function to maximize.

First a JS implementation of the Absent-Minded Driver simulation:

function getRandBinary(p=0.5) { rand = Math.random(); if (rand >= p) { return 0; } else { return 1; } }
function absentMindedDriver(p1, p2) { if(p1==1) { return 0; } else { if (p2==1) { return 4; } else { return 1; } } }
function runDriver(p) { p1 = getRandBinary(p)
p2 = getRandBinary(p)
return absentMindedDriver(p1, p2); }
function runDrivers(p=(1/3), iters=1000) {
var rewards = [];
for (i=0; i < iters; i++) {
rewards.push(runDriver(p));
}
return rewards;
}

Now we can use Reinforce.js to set up and run a deep Q-learning agent (but not too deep since this runs in-browser and there's no need for hundreds or thousands of neurons for a tiny problem like this):

// load Reinforce.js
var script = document.createElement("script");
script.src = "<https://raw.githubusercontent.com/karpathy/reinforcejs/master/lib/rl.js>";
document.body.appendChild(script);
// set up a DQN RL agent and run
var env = {};
env.getNumStates = function() { return 0; }
env.getMaxNumActions = function() { return 100; } // so actions are 1-100?
// create the DQN agent: relatively small, since this is an easy problem; greedy, since only one step; hold onto all data & process heavily
var spec = { num_hidden_units:25, experience_add_every:1, learning_steps_per_iteration:100, experience_size:10100 }
agent = new RL.DQNAgent(env, spec);
// OK, now let it free-run; we give it an extra 100 actions for equality with the MAB's initial 100 exploratory pulls
for (i=0; i < 10100; i++) {
s = [] // MAB is one-shot, so no global state
var action = agent.act([]);
//... execute action in environment and get the reward
reward = runDriver(action/100);
console.log("Action: " + action + "; reward: " + reward);
agent.learn(reward); // the agent improves its Q,policy,model, etc. reward is a float
}
// ...Action: 15; reward: 1
// Action: 34; reward: 0
// Action: 34; reward: 4
// Action: 21; reward: 4
// Action: 15; reward: 1
// Action: 43; reward: 0
// Action: 21; reward: 1
// Action: 43; reward: 1
// Action: 34; reward: 4
// Action: 15; reward: 1
// Action: 78; reward: 0
// Action: 15; reward: 0
// Action: 15; reward: 1
// Action: 34; reward: 1
// Action: 43; reward: 1
// Action: 34; reward: 4
// Action: 24; reward: 1
// Action: 0; reward: 1
// Action: 34; reward: 0
// Action: 3; reward: 1
// Action: 51; reward: 4
// Action: 3; reward: 1
// Action: 11; reward: 4
// Action: 3; reward: 1
// Action: 11; reward: 1
// Action: 43; reward: 4
// Action: 36; reward: 4
// Action: 36; reward: 0
// Action: 21; reward: 4
// Action: 77; reward: 0
// Action: 21; reward: 4
// Action: 26; reward: 4

Overall, I am not too impressed by the DQN. It looks like it is doing worse than the MAB was, despite having a much more flexible model to use. I don't know if this reflects the better exploration of Thompson sampling compared to the DQN's epsilon-greedy, or if my fiddling with the hyperparameters failed to help. It's a small enough problem that a Bayesian NN is probably computationally feasible, but I dunno how to use those.