This post is the first in a series of notes on Reinforcement Learning.
In the past few months, I have been learning about this topic, and I have found it completely fascinating. Aside from the computational beauty and usefulness of the algorithms, I find that it provides an interesting framework to think about how I approach learning from experience in my own life.
Reinforcement Learning is a branch of AI that tries to develop agents that can learn interactively from real experience. The intelligent agent is assumed to have a set of sensors by which it can observe the world, a set of actuators by which it may act on the world, and a way to perceive a reward signal. The reward signal may be positive or negative, telling the agent whether its behavior is desirable or undesirable.
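To make this interface concrete, here is a minimal sketch of the sense-act-reward loop in Python. The ToyEnvironment and RandomAgent below are invented purely for illustration (a one-dimensional world where the agent walks left or right); they are not taken from any RL library, and a real agent would of course do something smarter than acting at random.

```python
import random

class ToyEnvironment:
    """A hypothetical toy world: the agent walks along a line and is
    rewarded for reaching position +3, penalized for reaching -3."""
    def __init__(self):
        self.position = 0

    def observe(self):
        # What the agent's "sensors" report about the world.
        return self.position

    def act(self, action):
        # The agent's "actuators" change the world: action is -1 or +1.
        self.position += action
        if self.position >= 3:
            return +1.0, True    # positive reward, episode ends
        if self.position <= -3:
            return -1.0, True    # negative reward, episode ends
        return 0.0, False        # no signal anywhere in between

class RandomAgent:
    """An agent that has not learned anything yet: it acts at random."""
    def choose_action(self, observation):
        return random.choice([-1, +1])

env, agent = ToyEnvironment(), RandomAgent()
done, total_reward = False, 0.0
while not done:
    observation = env.observe()
    reward, done = env.act(agent.choose_action(observation))
    total_reward += reward
print("episode ended with total reward", total_reward)
```

Every reinforcement learning setup reduces to some version of this loop: observe, act, receive a reward, repeat.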
The reinforcement learning problem is fundamentally different from supervised learning problems such as computer vision. In computer vision, the goal is to learn a mapping from images to labels, and there is a clear error signal at every attempt: the difference between the proposed label and the true label. This means that computer vision comes with a clear specification of how to do the task. In contrast, reinforcement learning begins with a high-level description of “what” to do, but allows the agent to experiment and learn from experience the best “how”. This is especially useful when we humans cannot clearly articulate how we perform a task (such as playing chess, which relies heavily on intuition at each move), when we expect the AI system to eventually outperform human knowledge, or when we do not know how to do the task at all (such as folding complex proteins).
The key challenge is that rewards are usually given for reaching a specific state, but the agent may need to exhibit complex behavior over time to reach that state. For example:
A chess-playing agent receives a reward of +1 for checkmating its opponent, -1 for being checkmated, and zero in every other circumstance. However, the agent must play an entire game of chess before it observes any reward at all. Somehow, it needs to learn which types of moves in which positions are more likely to lead to victory (a sketch of what such a sparse reward function looks like in code follows these examples).
A cleaning robot receives a reward every time it finishes cleaning a room. However, it must sweep and mop the entire floor before it sees any reward. Somehow, the robot must learn the sequences of behaviors that accomplish the cleaning task.
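To make the sparsity of the chess example explicit, here is a hedged sketch of what its reward function might look like. The argument names (game_over, winner, agent_color) are hypothetical and not from any chess or RL library; the point is only that the function returns zero for almost every step of the game.

```python
def chess_reward(game_over, winner, agent_color):
    """Illustrative reward for the chess example above. The agent sees
    0.0 for every single move; a +1/-1 signal appears only once the game
    is over, and the whole learning problem lies in propagating that
    final signal back to the earlier moves that caused it."""
    if not game_over:
        return 0.0                      # dozens of moves, no feedback
    if winner is None:
        return 0.0                      # draw
    return 1.0 if winner == agent_color else -1.0
```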
This leads to several challenges that are unique to the field of reinforcement learning:
Exploration. The learning agent needs to find out how to obtain rewards in the first place. It needs to try actions, more or less at random, until it starts to see rewards. Only then can it start to optimize for those rewards.
The exploration-exploitation trade-off. Once the agent has started to see rewards, we are immediately faced with a dilemma: how aggressively should the agent optimize for the rewards it has already seen, and how much should it keep exploring in search of potentially greater rewards? (A minimal sketch of one standard answer, epsilon-greedy action selection, follows this list.)
The credit assignment problem. As the agent interacts with its environment and receives rewards, how can it determine which aspects of its behavior were most relevant to the rewards it observed? This is the problem of correctly learning which sequences of states and actions lead to good outcomes.
Reward hacking. Reward hacking is Goodhart’s Law applied to AI. Sometimes, when we create a reward function for our learning agent, we find that we did not clearly articulate the exact thing we wanted the agent to learn. In general, correctly specifying the behavior we want is a surprisingly difficult problem. For example, if we have a cleaning robot, we may reward it for every piece of trash it puts into a trash can. Reward hacking will result in a robot that creates its own trash by destroying objects in the room, then putting them into the trash can to collect its reward.
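As a concrete illustration of the exploration-exploitation trade-off mentioned above, here is a minimal sketch of epsilon-greedy action selection, one standard and very simple answer to that dilemma. The value estimates are assumed to come from elsewhere (for example, running averages of the rewards observed for each action); only the selection rule is shown, and the action names are made up for the example.

```python
import random

def epsilon_greedy(value_estimates, epsilon=0.1):
    """With probability epsilon, explore by picking a random action;
    otherwise exploit the action currently believed to be best."""
    if random.random() < epsilon:
        return random.choice(list(value_estimates))        # explore
    return max(value_estimates, key=value_estimates.get)   # exploit

# The agent has seen more reward from "mop" than "sweep" so far, but a
# small epsilon keeps it occasionally trying the other actions.
estimates = {"sweep": 0.2, "mop": 0.5, "idle": 0.0}
print(epsilon_greedy(estimates))
```

Setting epsilon to zero gives an agent that never explores, while setting it to one gives an agent that never exploits; most practical schemes sit somewhere in between, or decay epsilon over time.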
All of the challenges described above are central both to the algorithmic complexity of RL and to the difficulty of developing safe learning agents. When a cleaning robot is exploring the actions it can take in a new room, we would like to make sure that it will not try to stick its body parts into an electrical outlet, which could start a destructive fire. An agent that fails to properly balance exploration and exploitation can end up either as a robot that is “depressed”, sitting in a corner gathering small amounts of reward, or as a robot that is “restless”, always exploring and never committing to anything. Incorrect credit assignment can lead to behaviors that are remarkably similar to superstitions. Reward hacking can famously result in paperclip-manufacturing engines that dismantle the solar system in their quest to make more paperclips.
The next post in this series will talk about the assumptions and framing of reinforcement learning. We’ll discuss intuitions around the Markov property, and the decomposition of the problem into policies and value functions.
(cross-post from my blog)