At the beginning of this summer I finished an undergraduate degree in maths and physics and decided to spend some time preparing for my master’s degree in AI by learning some reinforcement learning (RL). It’s not like there was a whole lot else to do this summer anyway. Before going into this I had done a module on applied ML (which basically consisted of putting data into scikit-learn functions without looking deeply at how they worked) and had a general idea of the basic structure of neural networks.
The first part of the post will outline the steps I took in learning ML and RL in case anyone with a similar background is interested, and in the second part I will discuss the challenges of implementing a Deep Q-Network (DQN) algorithm on Pong directly from the original paper. I’ll also compare my approach and experience to the blog post Deep Reinforcement Learning: Pong from Pixels by Andrej Karpathy, which I didn't read until after I'd written my DQN implementation.
Yes, this game was heavily cherry-picked but at least it works some of the time!
Part I - Background
I started by looking at Spinning Up by OpenAI and reading their introduction. While reading it I thought I was understanding it fairly well but when it came time to try the exercises and implement algorithms for myself I realised I had no clue what was going on and decided to take a step back.
I’m always happiest when I understand a topic from the very basics. It’s why I preferred pure maths to applied - you start from the axioms and rarely have to rely on anything that you haven’t proved yourself earlier. For this reason I started reading Sutton and Barto’s Reinforcement Learning: An Introduction (RLAI), a textbook by two of the early big names in RL. I haven’t read any other RL textbooks but I thoroughly enjoyed the style and pacing of this book - plenty of explanation and exercises.
I read RLAI until I reached a section on neural networks (Chapter 9), at which point I switched to Goodfellow’s Deep Learning (DL), a modern classic on (deep) neural networks. I found this book a little harder to follow as there were no exercises and fewer simple examples, but it was very helpful nonetheless.
To ensure that I was understanding, I wrote my own symbol-to-symbol backpropagation code in Python and used it to construct a basic neural net capable of achieving ~90% accuracy on MNIST. I also implemented a couple of different optimisers (SGD with momentum, RMSprop), and a horrifically slow convolutional layer using for-loops. The DL textbook provides enough detail to do all of this, but I checked my results against PyTorch as I went along because working in high dimensions is tricky.
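To give an idea of what ‘horrifically slow’ means, here is a minimal sketch of that kind of for-loop convolution, along with the sort of PyTorch sanity check I mean. The function name and shapes are illustrative, not my actual code.

```python
import numpy as np
import torch
import torch.nn.functional as F

def conv2d_naive(image, kernel):
    """Valid cross-correlation of a single-channel image with one kernel
    (no padding, stride 1) - the operation a conv layer actually computes."""
    H, W = image.shape
    kH, kW = kernel.shape
    out = np.zeros((H - kH + 1, W - kW + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(image[i:i + kH, j:j + kW] * kernel)
    return out

# Check the hand-written version against PyTorch
image = np.random.randn(8, 8)
kernel = np.random.randn(3, 3)
ours = conv2d_naive(image, kernel)
theirs = F.conv2d(torch.from_numpy(image).view(1, 1, 8, 8),
                  torch.from_numpy(kernel).view(1, 1, 3, 3))[0, 0].numpy()
assert np.allclose(ours, theirs)
```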
After I'd read to the end of Part II of DL and was happy that I understood the fundamentals of building and training neural networks, I went back to RLAI and read the rest of Part II of that book as well.
It was at this point that I started to stall out a little. I had decided to consolidate what I’d learnt by implementing all the different algorithms on the classic cart-pole benchmark environment (using the incredibly convenient OpenAI Gym). I initially tried using the neural networks that I’d written from scratch, and then switched to PyTorch. I was getting mixed results - often an agent would learn, but after holding a good score for a few dozen episodes its performance would crash and it would fail miserably. Sometimes it would take hundreds of episodes to start learning at all. Even now I’m not sure whether there were serious bugs in my implementations or whether it was just a combination of hyperparameter sensitivity and high-variance methods.
Either way, I was pleased that I had managed to train an agent using the neural network code I’d written from scratch, and decided that I should move on instead of writing the same thing over and over hoping it would be perfect. I needed a new challenge.
Part II - DQN
In deciding what to try next, I read two very useful essays - one about reproducing a Deep RL paper by Matthew Rahtz, and the original Spinning Up essay by Joshua Achiam. Both give very good advice about deep RL experiments in general, and the latter has some specific suggestions of projects to try. I followed Achiam's advice and, as I'd already done REINFORCE, decided to follow the paper Playing Atari with Deep Reinforcement Learning by Mnih et al. For brevity and clarity, I will call the overall algorithm DQN and refer to the Deep Q-Network itself as the Q-net.
I had seen Rahtz and others recommend the blog post Pong from Pixels by Andrej Karpathy. I am generally very reluctant to read tutorials or look at other implementations of the things that I am trying to do because I can’t shake the feeling that it’s too much like ‘cheating’. I think I sometimes take this too far, but I decided I wouldn’t look at that blog post or any other implementations of DQN until after I’d done it myself, and as of writing this section, I have still not read it. In the next subsection I will give my experience of implementing DQN on Pong, and in the one after that I will read Pong from Pixels and evaluate whether I think I learned more doing it my way, or if it would have been better to read it from the start.
My experience
The theory behind DQN is strikingly simple if you are already familiar with Q-learning (described in section 6.5 of RLAI and countless online tutorials). We simply replace the tabular Q-function with a convolutional neural network that takes a stack of recent pixel frames as input, and update its weights with gradient descent. The biggest difference is that we make use of an experience replay buffer.
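In code, that update is short. Below is a minimal sketch of the one-step Q-learning loss on a batch of transitions, using the same network for the bootstrapped target (as the 2013 paper does - the separate target network came later). The names and tensor shapes here are illustrative, not taken from my implementation.

```python
import torch
import torch.nn.functional as F

def dqn_loss(q_net, states, actions, rewards, next_states, dones, gamma):
    """One-step Q-learning loss on a batch of transitions.
    q_net maps a batch of states to a (batch, n_actions) tensor of Q-values."""
    # Q-value of the action actually taken in each state
    q_values = q_net(states).gather(1, actions.unsqueeze(1)).squeeze(1)
    with torch.no_grad():
        # Bootstrapped target: reward plus discounted value of the best next
        # action, zeroed out on terminal transitions.
        next_max = q_net(next_states).max(dim=1).values
        targets = rewards + gamma * next_max * (1 - dones.float())
    return F.mse_loss(q_values, targets)
```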
When you are doing RL, you normally process each reward as it comes, either immediately after receiving it (online learning) or at the end of the current episode (offline learning). Q-learning is online, so typically each update is made from the current transition, which is then discarded - each reward is used only once. Online learning also means that every update to the policy is based only on its most recent behaviour, which can, for example, cause the policy to get stuck always selecting and updating on the same action. If instead each reward is stored along with information about where it came from (in an experience replay buffer), it can be reused many times. And if the buffer is large enough, randomly sampling from it will return some experiences generated when the policy was different, which breaks up the correlation between updates and the current policy and helps to avoid problems like getting stuck on the same action over and over.
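A basic replay buffer can be as simple as a fixed-capacity deque sampled uniformly at random. This is a generic sketch, not my actual implementation (which I describe below):

```python
import random
from collections import deque

class ReplayBuffer:
    """Fixed-capacity store of (state, action, reward, next_state, done) tuples."""
    def __init__(self, capacity):
        self.buffer = deque(maxlen=capacity)  # old transitions fall off the front

    def add(self, state, action, reward, next_state, done):
        self.buffer.append((state, action, reward, next_state, done))

    def sample(self, batch_size):
        # Uniform random sampling breaks up the correlation between
        # consecutive transitions generated by the current policy.
        return random.sample(self.buffer, batch_size)
```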
I started by working on the cart-pole environment because it requires a simpler policy and therefore I would be able to see if it was working more easily. Here I ran into a theme that would continue to crop up: most of my time was spent on getting the pieces around the learning algorithm working, rather than the learning algorithm itself. Getting pixel output from the cart-pole environment without it rendering every episode was a bit tricky.
My initial implementation of the experience buffer was lazy, but it worked. I’d save every transition (state, action, reward, next state) in full and then just sample the tuples as needed. Each state was built from the four most recent frames (to preserve some notion of velocity), which meant that each individual frame was stored eight times! (Every frame belongs to four overlapping states, and each state appears in two transitions: once as the ‘state’ and once as the ‘next state’.) This used up a tremendous amount of memory and my poor laptop couldn’t handle it once I tried to store 100,000 of these transitions. I reworked it to store each frame once, together with references to which frames were needed to reconstruct each state - this saved a lot of memory and meant I could actually run large experiments. I could have saved even more memory by exploiting the fact that both cart-pole and Pong only use a few colours, but it worked without that.
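The reworked buffer stores each frame once and rebuilds states from indices when sampling. A rough sketch of the idea, ignoring episode boundaries, eviction, and the first few frames of the buffer:

```python
import numpy as np

class FrameBuffer:
    """Store each frame once; a 'state' is four consecutive frame indices."""
    def __init__(self):
        self.frames = []        # each entry is a single preprocessed frame
        self.transitions = []   # (index of last frame of the state, action, reward, done)

    def add(self, frame, action, reward, done):
        self.frames.append(frame)
        self.transitions.append((len(self.frames) - 1, action, reward, done))

    def state_at(self, t):
        # Stack the four most recent frames ending at index t (assumes t >= 3).
        return np.stack(self.frames[t - 3:t + 1])

    def get_transition(self, i):
        # Reconstruct the full (s, a, r, s') tuple only when it is sampled.
        t, action, reward, done = self.transitions[i]
        return self.state_at(t), action, reward, self.state_at(t + 1), done
```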
I ran DQN for one million frames on the cart-pole environment and it learned! Not perfectly, but it reached the perfect score of 200 more often than any other score, with an average return of 129.
Modifying the code to work with Pong was surprisingly simple, but before I did that I had to set up the environment, where the theme of my time being consumed by learning-adjacent issues continued. I had forgotten that OpenAI Gym already included Pong, and I spent over an hour trying to set up the Arcade Learning Environment (ALE) from scratch before finding the OpenAI implementation mentioned in an old issue on ALE’s GitHub page. This was all well and good for my laptop, which runs Linux, but when I tried to start a run on my Windows PC it failed and I had to spend more time fixing that.
When it finally came to modifying the code for Pong, all I had to do was change the way the pixel images were pre-processed, a couple of parameters determining the shape of the neural network, and the available actions. I tailored the code specifically to Pong rather than keeping it general as in the paper. I initially allowed six possible actions, but reduced them to just three - up, down, and do nothing. I also shifted the pixel data so that the background colour was 0, to reduce its effect on the output of the network.
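For reference, a typical Pong preprocessing function looks something like the sketch below. The crop rows, the choice of colour channel, and the background value are illustrative guesses rather than the exact numbers I used:

```python
import numpy as np

def preprocess(frame):
    """Turn a raw 210x160x3 Pong frame into a small 2D array with background = 0.
    The crop offsets and colour values below are illustrative - check them
    against the actual observations before relying on them."""
    frame = frame[34:194]            # crop away the scoreboard and borders
    frame = frame[::2, ::2, 0]       # downsample by 2 and keep one colour channel
    frame = frame.astype(np.float32)
    background = 144.0               # assumed dominant background value in this channel
    frame[frame == background] = 0.0 # one simple way to push the background to zero
    return frame
```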
People often say that hyperparameters don’t matter much and that if it’s not working then you’ve probably got a bug. That may be true, but hyperparameters can sometimes make or break an experiment, and it can be very difficult to balance searching for bugs that may not be there against tuning hyperparameters and starting numerous lengthy runs. Using an optimiser with an adaptive learning rate can reduce that burden, but not eliminate it. As an example, I somehow got it into my head that the original paper had used a discount factor, gamma, of 0.99. I tried a run of one million frames (which took eight hours, done overnight) and got nothing - the agent did no better than random. I suspected that the discounting was flattening out any useful information, so I tried 0.999 instead. This simple change was enough to train an agent that had clearly learnt to move its paddle in front of the ball and could return it a reasonable proportion of the time.
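A quick back-of-the-envelope calculation shows why the change mattered: in Pong the reward often arrives many frames after the action that earned it (the hundred-step horizon below is just an illustration), and the discount factor determines how much of that signal survives.

```python
# Fraction of a reward 100 steps in the future that survives discounting
for gamma in (0.99, 0.999):
    print(gamma, round(gamma ** 100, 2))   # 0.99 -> 0.37, 0.999 -> 0.9
```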
I looked at a benchmark comparison by OpenAI to get an idea of how long I’d need to run DQN for. It looks as if their DQN had basically converged on Pong after a million frames. Mine had not, so I tried a longer run of three million frames. To make this easier I spent a long time writing code that would let me pause a run, which in the end I was too worried to use in case it subtly broke the run. Even after the full three million frames my agent still lost most of the time, but it put up a good fight and won fairly regularly. I suspect that with different hyperparameters, such as no discounting, a slightly higher learning rate, and possibly a larger experience buffer (the size of which is also technically a hyperparameter), I could achieve better performance, but I don’t want to get caught in the same trap of repeatedly making small changes to something that basically works.
Overall I learned a lot about the intricacies of implementing a deep RL model and some of the potential failure modes. I spent at least 19 hours working on it, plus all of the tweaking and plotting I did between runs. I’m interested to see now how much time reading Pong from Pixels would have saved me, and if I still think it would have been ‘cheating’.
After reading
I see now that Karpathy’s blog post follows a different path and does not cover DQN. The post is excellent at building intuition for policy gradients and for the RL approach to the credit assignment problem in general, but I don’t think it would be sufficient to solve Pong from scratch without copying his code verbatim. I will admit that I was expecting it to be more in-depth, with step-by-step instructions, but the fault lies with my expectations and not with the post.
I certainly don’t think it would have been ‘cheating’ to read ‘Pong from Pixels’ ahead of time, especially as it uses a policy gradient method rather than a value function one. I would recommend reading it but not being discouraged if you still don’t understand how it works afterwards. Neither his post nor mine gives enough detail to do it all from scratch. However if it does get you interested, in my experience Sutton and Barto’s book is a great place to start.
All my (rather messy) code is available on GitHub. This is my first attempt at a piece of writing like this, so if you have any comments on content and/or style please do let me know.