I always had the informal impression that the optimal policies were deterministic
Really? I wouldn't have ever thought that at all. Why do you think you thought that?
when facing the environment rather than other players. But stochastic policies can also be needed if the environment is partially observable
Isn't that kind of what a player is? Part of the environment, with a strategy and only partially observable states?
Although for this player, don't you have an optimal strategy, except for the first move? The Markov "Player" seems to like change.
Isn't this strategy basically optimal? ABABABABABAB... Deterministic, just not the same every round. Am I missing something?
ABABABABABAB...
It's deterministic, but not memoryless.
But it really does seem that there is a difference between facing an environment and facing another player - the other player adapts to your strategy in a way the environment doesn't. The environment only adapts to your actions.
I think for unbounded agents facing the environment, a deterministic policy is always optimal, but this might not be the case for bounded agents.
I always had the informal impression that the optimal policies were deterministic (choosing the best option, rather than some mix of options). Of course, this is not the case when facing other agents, but I had the impression this would hold when facing the environment rather than other players.
But stochastic policies can also be needed if the environment is partially observable, at least if the policy is Markov (memoryless). Consider the following POMDP (partially observable Markov decision process):
There are two states, 1a and 1b, and the agent cannot tell which one they're in. Taking action A in state 1a, or action B in state 1b, gives a reward of -R and keeps the agent in the same state. Taking action B in state 1a, or action A in state 1b, gives a reward of R and moves the agent to the other state.
The returns for the two deterministic memoryless policies - always-A and always-B - are -R every turn except possibly the first. Meanwhile the return for the stochastic policy 0.5A + 0.5B is 0 per turn in expectation.
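These per-turn returns are easy to check by simulation. The sketch below encodes the POMDP above (state names and the choice R = 1 are mine) and runs the two deterministic memoryless policies against the coin-flip policy:

```python
import random

R = 1.0  # reward magnitude; the value is arbitrary, only the sign structure matters

def step(state, action):
    """POMDP dynamics: the 'matching' action (A in 1a, B in 1b) gives -R
    and keeps the agent in place; the other action gives +R and switches state."""
    if (state, action) in {("1a", "A"), ("1b", "B")}:
        return state, -R
    return ("1b" if state == "1a" else "1a"), R

def average_return(policy, start="1a", steps=100_000, seed=0):
    """Average per-turn reward of a memoryless policy. The policy gets no
    observation argument because the agent can't tell 1a from 1b."""
    rng = random.Random(seed)
    state, total = start, 0.0
    for _ in range(steps):
        state, reward = step(state, policy(rng))
        total += reward
    return total / steps

always_A = lambda rng: "A"
always_B = lambda rng: "B"
coin_flip = lambda rng: rng.choice("AB")  # the 0.5A + 0.5B policy

print(average_return(always_A))   # -R every turn from 1a
print(average_return(always_B))   # +R on the first turn, then -R forever
print(average_return(coin_flip))  # close to 0
```

Starting from 1b instead just swaps which deterministic policy gets the one good first move; either way the long-run average is -R, while the coin flip averages 0.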
Of course, if the agent can observe the reward, the environment is no longer partially observable (though we can imagine the reward is delayed until later). And the general policy of "alternate A and B" is more effective than the 0.5A + 0.5B policy. Still, that stochastic policy is the best of the memoryless policies available in this POMDP.
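To see why the alternating policy needs memory, note that it must track a step counter, which a memoryless policy in this POMDP cannot do (the observation never changes). A minimal check, with the same assumed dynamics and R = 1:

```python
def step(state, action):
    """Same two-state POMDP: matching action stays for -R, other switches for +R."""
    if (state, action) in {("1a", "A"), ("1b", "B")}:
        return state, -1.0
    return ("1b" if state == "1a" else "1a"), 1.0

def alternating_return(start="1a", steps=100_000):
    """Average per-turn reward of the ABAB... policy."""
    state, total = start, 0.0
    for t in range(steps):
        action = "AB"[t % 2]  # depends on t: this is the memory a Markov policy lacks
        state, reward = step(state, action)
        total += reward
    return total / steps

print(alternating_return())              # ~ +R: every turn after the first pays +R
print(alternating_return(start="1b"))    # exactly +R per turn from the lucky start
```

From 1a the first move matches and costs -R, but every move after that switches states for +R - deterministic, optimal except possibly the first move, and strictly better than the 0 per turn of the best memoryless policy.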