In partially observable environments, stochastic policies can be optimal

Stuart_Armstrong

I always had the informal impression that optimal policies were deterministic (choosing the best option, rather than some mix of options). Of course, this is not the case when facing other agents, but I had the impression it would hold when facing the environment rather than other players.

But stochastic policies can also be needed if the environment is partially observable, at least if the policy is Markov (memoryless). Consider the following POMDP (partially observable Markov decision process):

There are two states, 1a and 1b, and the agent cannot tell which one they're in. Action A in state 1a, or action B in state 1b, gives a reward of -R and keeps the agent in the same state. Action B in state 1a, or action A in state 1b, gives a reward of R and moves the agent to the other state.

The returns for the two deterministic policies, A and B, are -R every turn, except maybe for the first, while the expected return for the stochastic policy 0.5A + 0.5B is 0 per turn.
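To make the comparison concrete, here is a minimal Python sketch that computes the exact expected per-turn reward of a memoryless policy choosing A with probability `p_a`. The function name `average_reward`, its parameters, and the default R = 1 are illustrative assumptions, not part of the original setup.

```python
def average_reward(p_a, R=1.0, start="1a", horizon=1000):
    """Exact expected per-turn reward of a memoryless policy that
    picks action A with probability p_a (and B otherwise)."""
    # prob_a is the probability of being in state 1a at the current turn.
    prob_a = 1.0 if start == "1a" else 0.0
    # In 1a: A gives -R, B gives +R. In 1b the roles are swapped.
    reward_in_a = R * (1 - 2 * p_a)
    reward_in_b = R * (2 * p_a - 1)
    total = 0.0
    for _ in range(horizon):
        total += prob_a * reward_in_a + (1 - prob_a) * reward_in_b
        # A in 1a stays in 1a; A in 1b moves to 1a.
        # So the agent is in 1a next turn exactly when it just played A.
        prob_a = prob_a * p_a + (1 - prob_a) * p_a
    return total / horizon


if __name__ == "__main__":
    for p in (0.0, 0.25, 0.5, 0.75, 1.0):
        print(p, average_reward(p))
```

Under these assumptions, the expected reward on every turn after the first is -R(2p - 1)^2, which is -R at p = 0 or p = 1 and reaches its maximum of 0 at p = 0.5.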

Of course, if the agent can observe the reward, the environment is no longer partially observable (though we can imagine the reward is delayed until later). And the general policy of "alternate A and B" is more effective than the 0.5A + 0.5B policy. Still, that stochastic policy is the best of the memoryless policies available in this POMDP.
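The "alternate A and B" policy can be checked with a short simulation. This is a sketch only: the helper names `step` and `run_alternating`, and the choice R = 1, are assumptions made for illustration.

```python
def step(state, action, R=1.0):
    """One transition of the two-state POMDP: returns (reward, next_state)."""
    bad = (state == "1a" and action == "A") or (state == "1b" and action == "B")
    if bad:
        # Penalised, and the agent stays where it is.
        return -R, state
    # Rewarded, and the agent moves to the other state.
    return R, ("1b" if state == "1a" else "1a")


def run_alternating(start, first_action, turns=5, R=1.0):
    """Rewards collected by a policy that alternates A and B.
    This policy needs one bit of memory: its previous action."""
    state, action, rewards = start, first_action, []
    for _ in range(turns):
        reward, state = step(state, action, R)
        rewards.append(reward)
        action = "B" if action == "A" else "A"
    return rewards


if __name__ == "__main__":
    print(run_alternating("1a", "A"))  # unlucky first action
    print(run_alternating("1a", "B"))  # lucky first action
```

In this sketch the alternating policy earns R on every turn after the first, whichever state it starts in. The reason is that the state after each turn is determined by the action just taken (A always leads to 1a, B to 1b), so switching actions is always the rewarded move.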


Take a game with a mixed-strategy Nash equilibrium. If you and the other player both follow it, each using a source of randomness that remains random to the other player, then it is never to your advantage to deviate from it. Now suppose you play this game, again and again, against another player or against the environment.

Consider an environment in which the opponent's strategies are in an evolutionary arms race, trying to best beat you; this is an environmental model. Under this, you'd tend to follow the Nash equilibrium on average, but, at (almost) any given turn, there's a deterministic choice that's a bit better than being stochastic, and it's determined by the current equilibrium of strategies of the opponent/environment.
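As a rough illustration, take matching pennies as the game (an assumed example, not one named above), with you as the player who wins +1 on a match and loses 1 otherwise. The function names below are hypothetical. When the opponent population's mix sits exactly at the equilibrium, both of your actions are equally good; as soon as it drifts, one deterministic action is slightly better.

```python
def payoff_vs_mix(my_action, q_heads):
    """Matcher's expected payoff in matching pennies against an
    opponent who plays heads with probability q_heads."""
    p_match = q_heads if my_action == "H" else 1 - q_heads
    return p_match * 1 + (1 - p_match) * (-1)


def best_response(q_heads):
    """Strictly best pure action against the mix, or None if indifferent."""
    h = payoff_vs_mix("H", q_heads)
    t = payoff_vs_mix("T", q_heads)
    if h == t:
        return None
    return "H" if h > t else "T"


if __name__ == "__main__":
    for q in (0.45, 0.5, 0.55):
        print(q, best_response(q))
```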

However, if you're facing another player, and you make deterministic choices, you're vulnerable if ever they figure out your choice. This is because they can peer into your algorithm, not just track your previous actions. To avoid this, you have to be stochastic.
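The vulnerability can be sketched in the same assumed matching-pennies setting (again an illustrative choice, with a hypothetical function name). Here the opponent can read your policy, meaning the probability `p_heads` with which you play heads, but cannot see the outcome of your private coin flip, and then picks whichever action hurts you most.

```python
def exploited_payoff(p_heads):
    """Matcher's expected payoff when the opponent knows p_heads
    (but not the realised flip) and best-responds to it."""
    # If the opponent plays heads, the matcher wins with probability p_heads.
    if_opp_heads = p_heads * 1 + (1 - p_heads) * (-1)
    # If the opponent plays tails, the matcher wins with probability 1 - p_heads.
    if_opp_tails = (1 - p_heads) * 1 + p_heads * (-1)
    # The opponent chooses the option that is worse for the matcher.
    return min(if_opp_heads, if_opp_tails)


if __name__ == "__main__":
    for p in (0.0, 0.5, 1.0):
        print(p, exploited_payoff(p))
```

In this sketch a deterministic policy (p_heads of 0 or 1) is fully exploited and loses 1 per round, while the 50/50 mix cannot be pushed below 0 even by an opponent who knows the policy.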

This seems like a potentially relevant distinction.
