You're insanely fast. It's quite humbling. I've been doing a course on Reinforcement Learning which is based on the same book, so I've been reading it, too. Got an exam coming up soon. Didn't expect to see it mentioned on LW; I didn't feel like it was on the same level as the stuff in MIRI's research guide, though I do agree that it's pretty good.
One thing I found a bit irritating was that, during the multi-armed bandit chapter, it never mentioned what seems like the obvious approach: do all the exploring at the beginning, and then only exploit for the remainder of the time (given that the problem is stationary). This should be far better than being ε-greedy. My approach would have been to start from that idea and see how it can be optimized. In a talk on AI alignment, Eliezer actually briefly mentions the setting, and that's exactly the approach he says is best.
It is sort of like the optimistic approach the book mentions at some point, where you assign unrealistically high expected values to all bandits, such that any usage will update their estimate downward and thus greedy selection will alternate between them in the beginning. But that seems like a clunky way to do it.
An aside: your link to the book below "Final Thoughts" is broken, there's a bad ']' at the end.
So I think that only works in environments which are both stationary and deterministic. Otherwise, you'd need to frontload an infinite number of trials for each arm to ensure you have precise estimates; anything less would incur infinite regret as there would be a nonzero probability you'd spend infinite time exploiting a suboptimal arm. This reminds me of conditional convergence, where you can't always rearrange the components of a series and have it still converge to the same sum / at all. I think interleaving exploration and exploitation such that you minimize your regret is the best way to go here.
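Here's a toy sketch of what I mean (my own made-up numbers, nothing from the book): with noisy rewards and a small gap between arms, a finite exploration phase frequently locks in the wrong arm, whereas ε-greedy keeps correcting its estimates.

```python
import numpy as np

def run(strategy, true_means, steps=10_000, explore_per_arm=10, eps=0.1, seed=0):
    """Total reward for explore-then-commit vs. epsilon-greedy on a Gaussian bandit."""
    rng = np.random.default_rng(seed)
    k = len(true_means)
    counts, estimates, total = np.zeros(k), np.zeros(k), 0.0
    for t in range(steps):
        if strategy == "explore_then_commit":
            # Frontload all exploration, then exploit the best estimate forever.
            a = t % k if t < explore_per_arm * k else int(np.argmax(estimates))
        else:
            # epsilon-greedy keeps exploring (and correcting bad estimates) forever.
            a = rng.integers(k) if rng.random() < eps else int(np.argmax(estimates))
        r = rng.normal(true_means[a], 1.0)
        counts[a] += 1
        estimates[a] += (r - estimates[a]) / counts[a]   # incremental sample mean
        total += r
    return total

true_means = [0.0, 0.1]   # nearly indistinguishable arms
for s in ("explore_then_commit", "epsilon_greedy"):
    avg = np.mean([run(s, true_means, seed=i) for i in range(50)])
    print(s, avg)
```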
This more or less regresses to an offline supervised learning model in which a bunch of samples are collected upfront, a model is trained, and then that model is used to predict all future actions. While you might be framing your problem as an MDP, you're not doing reinforcement learning in this case. As TurnTrout mentioned in a sibling to this comment, it only works in the stationary, deterministic environments that serve as toy problems for the space; ultimately, the goal of RL is to function in non-stationary, non-deterministic environments, so it makes little sense to focus on this path.
Foreword
Let's get down to business.
Reinforcement Learning
1: Introduction
2: Multi-armed Bandits
Bandit basics, including nonstationarity, the value of optimism for incentivizing exploration, and upper-confidence-bound action selection.
Some explore/exploit results are relevant to daily life - I highly recommend reading Algorithms to Live By: The Computer Science of Human Decisions.
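As a concrete (if toy) illustration of the upper-confidence-bound rule from this chapter, here's a minimal sketch of my own, assuming Gaussian arms:

```python
import numpy as np

def ucb_bandit(true_means, steps=1000, c=2.0, seed=0):
    """Upper-confidence-bound action selection on a k-armed Gaussian bandit."""
    rng = np.random.default_rng(seed)
    k = len(true_means)
    counts, estimates = np.zeros(k), np.zeros(k)
    total = 0.0
    for t in range(1, steps + 1):
        # Untried arms get infinite priority; otherwise estimate + exploration
        # bonus, which shrinks as an arm is sampled more often.
        with np.errstate(divide="ignore", invalid="ignore"):
            bonus = c * np.sqrt(np.log(t) / counts)
        a = int(np.argmax(np.where(counts == 0, np.inf, estimates + bonus)))
        r = rng.normal(true_means[a], 1.0)
        counts[a] += 1
        estimates[a] += (r - estimates[a]) / counts[a]
        total += r
    return total / steps

print(ucb_bandit([0.1, 0.5, 0.9]))   # average reward approaches the best arm's mean
```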
3: Finite Markov Decision Processes
The framework.
4: Dynamic Programming
Policy evaluation, policy improvement, policy iteration, value iteration, generalized policy iteration. What a nomenclative nightmare.
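To make one corner of that nomenclature concrete, here's a compact sketch of value iteration on a made-up toy MDP (my own code, not the book's):

```python
import numpy as np

def value_iteration(P, R, gamma=0.9, tol=1e-8):
    """Value iteration on a finite MDP.

    P[a][s, s'] : transition probabilities; R[a][s] : expected reward for
    taking action a in state s.  Returns (V, greedy policy).
    """
    n_actions, n_states = len(P), P[0].shape[0]
    V = np.zeros(n_states)
    while True:
        # One sweep of the Bellman optimality backup for every state.
        Q = np.array([R[a] + gamma * P[a] @ V for a in range(n_actions)])
        V_new = Q.max(axis=0)
        if np.max(np.abs(V_new - V)) < tol:
            return V_new, Q.argmax(axis=0)
        V = V_new

# Toy 2-state, 2-action MDP (made-up numbers, just to exercise the code).
P = [np.array([[0.9, 0.1], [0.2, 0.8]]),
     np.array([[0.5, 0.5], [0.0, 1.0]])]
R = [np.array([1.0, 0.0]), np.array([0.0, 2.0])]
print(value_iteration(P, R))
```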
5: Monte Carlo Methods
Prediction, control, and importance sampling.
Importance Sampling
After gathering data with our behavior policy b, we then want to approximate the value function for the target policy π. In off-policy methods, the policy we use to gather the data is different from the one whose value vπ we're trying to learn; in other words, the distribution of states we sample is different. This gives us a skewed picture of vπ, so we must overcome this bias.
If $b$ can take all of the actions that $\pi$ can (i.e., $\forall a,s: \pi(a \mid s) > 0 \implies b(a \mid s) > 0$), we can overcome this bias by adjusting the return $G_t$ of taking a series of actions $A_t, \dots, A_{T-1}$ using the importance-sampling ratio $\rho_{t:T-1} := \prod_{k=t}^{T-1} \frac{\pi(A_k \mid S_k)}{b(A_k \mid S_k)}$. This cleanly recovers $v_\pi$ by the definition of expectation:

$$\mathbb{E}\left[\rho_{t:T-1} G_t \mid S_t = s\right] = v_\pi(s).$$
Then, after observing some set of returns (where $\{G_t\}_{t \in \mathcal{T}(s)}$ are the relevant returns for state $s$), we define the state's value as the average adjusted observed return:

$$V(s) := \frac{\sum_{t \in \mathcal{T}(s)} \rho_{t:T(t)-1} G_t}{|\mathcal{T}(s)|}.$$
However, the $\rho_{t:T(t)-1}$'s can be arbitrarily large (suppose $\pi(A \mid S) = .5$ and $b(A \mid S) = 1 \times 10^{-10}$; then $\frac{.5}{1 \times 10^{-10}} = .5 \times 10^{10}$), so the variance of this estimator can get pretty big. To get an estimator whose variance converges to 0, try the weighted average

$$V(s) := \frac{\sum_{t \in \mathcal{T}(s)} \rho_{t:T(t)-1} G_t}{\sum_{t \in \mathcal{T}(s)} \rho_{t:T(t)-1}}.$$
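A tiny sketch with made-up numbers, contrasting the ordinary estimator above with the weighted one; a single huge ratio dominates the former but not the latter:

```python
import numpy as np

def is_estimates(rhos, returns):
    """Ordinary vs. weighted importance-sampling estimates of v_pi(s).

    rhos[i]    : importance-sampling ratio rho_{t:T(t)-1} for the i-th return
    returns[i] : the corresponding observed return G_t under the behavior policy
    """
    rhos, returns = np.asarray(rhos), np.asarray(returns)
    ordinary = np.sum(rhos * returns) / len(returns)   # unbiased, high variance
    weighted = np.sum(rhos * returns) / np.sum(rhos)   # biased, variance shrinks to 0
    return ordinary, weighted

# One enormous ratio blows up the ordinary estimate; the weighted one stays sane.
print(is_estimates(rhos=[0.5, 1.2, 5e9], returns=[1.0, 0.8, 1.1]))
```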
Death and Discounting
Instead of viewing the discount factor γ as measuring how much we care about the future, think of it as encoding the probability we don't terminate on any given step. That is, we expect with 1−γ probability to die on the next turn, so we discount rewards accordingly.
This intuition hugs pre-theoretic understanding much more closely; if you have just 80 years to live, you might dive that big blue sky. If, however, you imagine there's a non-trivial chance that humanity can be more than just a flash in the pan, you probably care more about the far future.
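To spell out the equivalence (a quick derivation of my own): suppose the agent survives each step independently with probability $\gamma$ and receives nothing after termination, so the reward $R_{t+k+1}$ arrives only if the agent has survived $k$ more steps, which happens with probability $\gamma^k$. Then the expected undiscounted return is

$$\mathbb{E}\!\left[\sum_{k=0}^{\infty} \mathbb{1}[\text{alive at } t+k+1]\, R_{t+k+1}\right] = \sum_{k=0}^{\infty} \gamma^{k}\, \mathbb{E}[R_{t+k+1}],$$

which is exactly the usual discounted return (assuming survival is independent of the rewards).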
6: Temporal-Difference Learning
The tabular triple threat: TD(0), SARSA, and Q-learning.
Learning TD Learning
TD(0) ⋆ SARSA ⋆ Q-learning
Maximization Bias
With great branching factors come great biases, and optimistic bias is problematic for Q-learning.
Refresher: the Q-learning update rule for state $S_t$, action $A_t$, and new state $S_{t+1}$ is

$$Q(S_t, A_t) \leftarrow Q(S_t, A_t) + \alpha\left[R_{t+1} + \gamma \max_a Q(S_{t+1}, a) - Q(S_t, A_t)\right].$$
Suppose the rewards for all 100 actions at $S_{t+1}$ are distributed according to $\mathcal{N}(0,1)$. All of these actions have a true (expected) value of 0, but the probability that none of their estimates is greater than .8 after 1 draw each is $\Phi(.8)^{100} \approx 4.6 \times 10^{-11}$. The more actions you have, the higher the probability that the maximum is just an outlier. See: regression toward the mean and mutual funds.
To deal with this, we toss another Q-learner into the mix: at any given update, a coin flip decides which estimate gets updated, and it is updated toward the value the other estimate assigns to its greedy next action. The double Q-learning scheme works because the two learners are unlikely to be biased in the same way. For some reason, this wasn't discovered until 2010.
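A minimal sketch of a single tabular double Q-learning update (my own code; `Q1` and `Q2` are hypothetical array-indexed Q-tables):

```python
import numpy as np

def double_q_update(Q1, Q2, s, a, r, s_next, alpha=0.1, gamma=0.99, rng=None):
    """One tabular double Q-learning update.

    Q1 and Q2 are (n_states, n_actions) arrays.  A coin flip decides which
    table gets updated; that table picks the greedy next action, while the
    *other* table evaluates it, so selection noise and evaluation noise come
    from independent estimates.
    """
    rng = rng or np.random.default_rng()
    if rng.random() < 0.5:
        a_star = int(np.argmax(Q1[s_next]))
        Q1[s, a] += alpha * (r + gamma * Q2[s_next, a_star] - Q1[s, a])
    else:
        a_star = int(np.argmax(Q2[s_next]))
        Q2[s, a] += alpha * (r + gamma * Q1[s_next, a_star] - Q2[s, a])
```

The behavior policy can then act, say, ε-greedily with respect to the sum or average of the two tables.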
7: n-step Bootstrapping
n-step everything.
8: Planning and Learning with Tabular Methods
Models, prioritized sweeping, expected vs. sample updates, MCTS, and rollout algorithms.
Roles Models Play
Distribution models include the full range of possible futures and their probabilities. For example, a distribution model for two fair coins: {HH: .25, HT: .5, TT: .25}.
Sample models just let you, well, sample. HH, HT, HT, HT, HH, TT, HH, HT... You could sample thousands of times and gain a high degree of confidence that the above distribution model is generating the data, but this wouldn't be your only hypothesis (granted, it might have the lowest Kolmogorov complexity).
Note that a distribution model also can function as a sample model.
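A toy version of the two-coin example in code (my own illustration):

```python
import random
from collections import Counter

# Distribution model: the full outcome distribution for two fair coins,
# matching {HH: .25, HT: .5, TT: .25} above.
distribution_model = {"HH": 0.25, "HT": 0.5, "TT": 0.25}

def sample_model():
    """Sample model: you can only draw outcomes, not read off probabilities."""
    return "".join(sorted(random.choice("HT") for _ in range(2)))

# A distribution model can always act as a sample model...
print(random.choices(list(distribution_model),
                     weights=list(distribution_model.values()))[0])
# ...while a sample model only reveals the distribution through many draws.
print(Counter(sample_model() for _ in range(10_000)))
```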
9: On-policy Prediction with Approximation
We finally hit the good stuff: value-function approximators and stochastic-/semi-gradient methods.
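For concreteness, here's a sketch of one semi-gradient TD(0) update with a linear approximator (my own code; `phi` is a hypothetical feature map and `transition` a hypothetical environment sample):

```python
import numpy as np

def semi_gradient_td0(transition, phi, w, alpha=0.01, gamma=0.99):
    """One semi-gradient TD(0) update for a linear approximator v(s; w) = w . phi(s).

    `transition` is assumed to be a (state, reward, next_state, done) tuple and
    `phi` a feature map from states to vectors the size of w.  The gradient is
    taken only through the current estimate, not through the bootstrapped
    target -- hence "semi-gradient".
    """
    s, r, s_next, done = transition
    target = r if done else r + gamma * np.dot(w, phi(s_next))
    td_error = target - np.dot(w, phi(s))
    w += alpha * td_error * phi(s)   # grad_w v(s; w) = phi(s) for a linear approximator
    return w
```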
10: On-policy Control with Approximation
11: Off-policy Methods with Approximation
The deadly triad of function approximation, bootstrapping, and off-policy training converge to sow d̴iv̢erge͏nc̸e͟ and i͟nstab̕il̡i̡t̷y in your methods. Metal.
12: Eligibility Traces
I was told to skip this chapter; the first half was my favorite part of the book.
TD(λ) uses a backwards-view eligibility trace to update features, which I find elegant. Suppose you have some feature extraction function $\phi: \mathcal{S} \to \mathbb{R}^d$. Then you apply your TD update not only to the features relevant at the current state, but also to the time-decaying traces of the features of previously-visited states. $\lambda \in [0,1]$ sets how quickly this eligibility decay happens; $\lambda = 0$ recovers TD(0), while $\lambda = 1$ recovers Monte Carlo return methods.
When I was a kid, there was a museum I loved going to - it had all kinds of wacky interactive devices for kids. One of them took sounds and "held them in the air" at a volume which slowly decreased as a function of how long ago the sound was made. State features are sounds, and volume is eligibility.
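A rough sketch of the backward view for a linear approximator (my own code; `transitions` and `phi` are hypothetical):

```python
import numpy as np

def td_lambda_episode(transitions, phi, w, alpha=0.01, gamma=0.99, lam=0.9):
    """Backward-view TD(lambda) with an accumulating eligibility trace.

    `transitions` is assumed to be a list of (state, reward, next_state, done)
    tuples from one episode; `phi` is a feature map and `w` the weight vector
    of a linear value approximator, updated in place.
    """
    z = np.zeros_like(w)                 # eligibility trace ("volume" per feature)
    for s, r, s_next, done in transitions:
        z = gamma * lam * z + phi(s)     # decay old traces, mark current features
        target = r if done else r + gamma * np.dot(w, phi(s_next))
        td_error = target - np.dot(w, phi(s))
        w += alpha * td_error * z        # credit features in proportion to recency
    return w
```

Setting `lam=0` collapses the trace to the current features (TD(0)); `lam=1` keeps every visited feature "audible" for the rest of the episode, like the fading sounds above.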
13: Policy Gradient Methods
The policy gradient theorem, REINFORCE, and actor-critic.
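To make REINFORCE concrete, here's a sketch of one Monte Carlo policy-gradient update for a linear softmax policy (my own code; `phi(s, a)` is a hypothetical state-action feature map):

```python
import numpy as np

def reinforce_update(theta, episode, phi, n_actions, alpha=0.01, gamma=0.99):
    """One REINFORCE update for a softmax policy with linear preferences theta . phi(s, a).

    `episode` is assumed to be a list of (state, action, reward) tuples.
    """
    def softmax_probs(s):
        prefs = np.array([np.dot(theta, phi(s, a)) for a in range(n_actions)])
        prefs -= prefs.max()                       # numerical stability
        e = np.exp(prefs)
        return e / e.sum()

    G = 0.0
    # Work backwards so the return G_t can be accumulated incrementally.
    for t in reversed(range(len(episode))):
        s, a, r = episode[t]
        G = r + gamma * G
        probs = softmax_probs(s)
        # grad log pi(a|s) for linear softmax: phi(s,a) - sum_b pi(b|s) phi(s,b)
        grad_log_pi = phi(s, a) - sum(probs[b] * phi(s, b) for b in range(n_actions))
        theta += alpha * (gamma ** t) * G * grad_log_pi
    return theta
```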
14: Psychology
Creating a partial mapping between reinforcement learning and psychology.
Mental, Territorial
There was a word I was looking for that "mental model" didn't quite seem to fit: "the model with respect to which we mentally simulate courses of action". CFAR's "inner sim" terminology didn't quite map either, as to me, that points to the system-in-motion more than that-on-which-the-system-runs. The literature dubs this a cognitive map.
In light of my work on whitelisting, I've been thinking about why we're so "object-oriented" in our mental lives. An off-the-cuff hypothesis: to better integrate with the rest of our mental models, the visual system directly links up to our object schema. One such object is then recognized and engraved as a discrete "thing" in our map. Hence, we emotionally "know" that the world "really is" made up of objects, and isn't just a collection of particles.
Lahav and Mioduser's research somewhat supports this idea, suggesting that blind people not only have lower-fidelity and more declarative (as opposed to procedural / interactive) cognitive maps, they're also less likely to provide object-to-object descriptions.
Epistemic status: I made a not-obviously-wrong hypothesis and found two pieces of corroborating evidence. If anyone who knows more about this wants to chime in, please do!
15: Neuroscience
The reward prediction error hypothesis, dopamine, neural actor-critic, hedonistic neurons, and addiction.
16: Applications and Case Studies
From checkers to checkmate.
17: Frontiers
Temporal abstraction, designing reward signals, and the future of reinforcement learning. In particular, the idea I had for having a whitelist-enabled agent predict expected object-level transitions is actually one of the frontiers: general value functions. Rad.
Pandora's AI Boxing
This chapter talks a fair amount about the serious challenges in AI alignment (not sure if you all have heard of that problem), which is heartening.
I'm not sure about that. Admittedly, I'm not as familiar with those solutions, but the challenge here seems to be of a qualitatively different caliber. Conditional on true AI's achievement, we'd want to have extremely high confidence that it pans out before we flip the switch. The authors acknowledge that
I don't think it's impossible, but it's going to be extremely hard to get formal probabilistic guarantees. I mean, if you don't know an agent's rationality, you can't learn their utility function. If you do know their rationality but not their probability distribution over outcomes, you still can't learn their utility function.
This is just one of the many open problems in alignment theory. If that's not enough, there are always the unknown unknowns...
Final Thoughts
I read the "nearly-complete" draft of the second edition, available here. It was pretty good, but I did find most of the exercises either too easy or requiring considerable programming investment to set up the environment described. The former makes sense, as I've already taken a course on this, and I'm probably a bit above the introductory level.
Some graphs could have been more clearly designed, and some math in the proof of Linear TD(0)'s convergence (p. 206-207) is underspecified:
An additional precondition: y can't be the zero vector.
Unless I totally missed what "this type" entails, the claim is false if taken at face value: there is a matrix with nonnegative column sums which is nevertheless indefinite, having eigenvalues of 3 and -7.
However, the claim is true in the subtle way they use it - they're actually showing that since the matrix is symmetric, real, and strictly diagonally dominant with positive diagonal entries, it's also positive definite. This could be made clearer.
In all, reading this book was definitely a positive experience.
Forwards
I'll be finishing Analysis II before moving on to Jaynes's Probability Theory in preparation for a class in the fall.
Dota 2
Recently, OpenAI made waves with their OpenAI Five Dota 2 bot. To REINFORCE what I just learned and solidified, I might make a post in the near future breaking down how Five differs from the Alpha(Go) Zero approach, quantifying my expectations for The International for calibration.
No Longer a Spectator
Four months and one week ago, I started my journey through the MIRI reading list. In those dark days, attempting a proof induced a stupor similar to that I encountered approaching a crush in grade school, my words and thoughts leaving me.
Six textbooks later and with a little effort, I'm able to prove things like the convergence of Monte Carlo integration to the Riemann integral, threading together lessons from All of Statistics and Analysis I; target in mind, words and thoughts remaining firmly by my side.
If you are interested in working together to learn MIRI-relevant math, if you have a burning desire to knock the alignment problem down a peg, if you're in a scale-tilting mood - I would be more than happy to work with you. Messaging me may also net you an invitation to the MIRIx Discord server.