Note also that $v_0(P)$ (defined below) is maximized when $P$ has full support on the distribution of $Q$ and when $s$ has a high average on $P$. That is, it is within $\epsilon$ times the range of $s$ of being maximized when $P$ is $(1-\epsilon)$ times a delta function on an $s$-maximizing point, plus $\epsilon$ times the distribution of $Q$.
So essentially $\alpha = 0$ corresponds to a raw maximizer, and for $0 < \alpha < 1$, $P^*_\alpha$ interpolates between maximizing and softmax.
Summary: Given that both imitation and maximization have flaws, it might be reasonable to interpolate between these two extremes. It's possible to use Rényi divergence (a family of measures of distance between distributions) to define a family of these interpolations.
Rényi divergence is a measure of distance between two probability distributions. It is defined as
$$D_\alpha(P \| Q) = \frac{1}{\alpha - 1} \log \mathbb{E}_{X \sim P}\!\left[\left(\frac{P(X)}{Q(X)}\right)^{\alpha - 1}\right]$$
with $\alpha \ge 0$. The special values $D_0$, $D_1$, and $D_\infty$ can be filled in by their limits.
Particularly interesting values are $D_1(P \| Q) = D_{\mathrm{KL}}(P \| Q)$ and $D_\infty(P \| Q) = \log \max_x \frac{P(x)}{Q(x)}$.
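For concreteness, here is a minimal numerical sketch (not from the original text) of $D_\alpha$ for finite discrete distributions, with the $\alpha = 1$ and $\alpha = \infty$ cases filled in by their limits; the function name and example distributions are just for illustration.

```python
import numpy as np

def renyi_divergence(p, q, alpha):
    """D_alpha(P || Q) for finite discrete distributions given as arrays.

    alpha = 1 is handled by its KL limit and alpha = np.inf by the
    log-max-ratio limit. Assumes q > 0 wherever p > 0.
    """
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    support = p > 0
    ratio = p[support] / q[support]
    if alpha == 1:
        # D_1 = KL(P || Q) = E_{X ~ P}[log(P(X) / Q(X))]
        return np.sum(p[support] * np.log(ratio))
    if alpha == np.inf:
        # D_inf = log max_x P(x) / Q(x)
        return np.log(np.max(ratio))
    # General case: (1 / (alpha - 1)) * log E_{X ~ P}[(P(X) / Q(X))^(alpha - 1)]
    return np.log(np.sum(p[support] * ratio ** (alpha - 1))) / (alpha - 1)

# D_alpha is non-decreasing in alpha, with D_1 = KL and D_inf = log max ratio.
P = np.array([0.5, 0.3, 0.2])
Q = np.array([0.25, 0.25, 0.5])
for a in [0.5, 1, 2, 10, np.inf]:
    print(a, renyi_divergence(P, Q, a))
```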
Consider some agent choosing a distribution $P$ over actions to simultaneously maximize a score function $s$ and minimize Rényi divergence from some base distribution $Q$. That is, score $P$ according to
$$v_\alpha(P) = \mathbb{E}_{X \sim P}[s(X)] - \gamma D_\alpha(P \| Q)$$
where $\gamma > 0$ controls how much the secondary objective is emphasized. Define $P^*_\alpha = \operatorname{argmax}_P v_\alpha(P)$. We have $P^*_1(x) \propto Q(x) e^{s(x)/\gamma}$, and $P^*_\infty$ is a quantilizer with score function $s$ and base distribution $Q$ (with the amount of quantilization being some function of $\gamma$, $Q$, and $s$). For $1 < \alpha < \infty$, $P^*_\alpha$ will be some interpolation between $P^*_1$ and $P^*_\infty$.
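To illustrate the two endpoints, the following sketch computes $P^*_1$ and a quantilizer over a toy discrete action space. Since the post only says the amount of quantilization is some function of $\gamma$, $Q$, and $s$, the kept fraction of $Q$-mass (and the value of $\gamma$) are picked by hand here; all names and numbers are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy discrete action space with base distribution Q and score function s.
n_actions = 10
Q = rng.dirichlet(np.ones(n_actions))   # base distribution over actions
s = rng.normal(size=n_actions)          # score of each action
gamma = 0.5                             # arbitrary weight on the divergence penalty

# alpha = 1 endpoint: P*_1(x) proportional to Q(x) * exp(s(x) / gamma) (softmax-weighted Q).
P1 = Q * np.exp(s / gamma)
P1 /= P1.sum()

# alpha = infinity endpoint: a quantilizer, i.e. Q conditioned on (roughly) the top
# fraction of Q-probability mass by score. The post only says this fraction is some
# function of gamma, Q, and s, so it is picked by hand here.
q_fraction = 0.2
order = np.argsort(-s)                  # actions sorted from best to worst score
cum_mass = np.cumsum(Q[order])
keep = order[cum_mass <= q_fraction]
if len(keep) == 0:                      # always keep at least the best action
    keep = order[:1]
P_inf = np.zeros(n_actions)
P_inf[keep] = Q[keep]
P_inf /= P_inf.sum()

print("softmax endpoint P*_1:      ", np.round(P1, 3))
print("quantilizer endpoint P*_inf:", np.round(P_inf, 3))
```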
It's not necessarily possible to compute $v_\alpha(P)$ exactly. To approximate this quantity, take samples $x_1, \dots, x_n \sim P$ and compute
$$\frac{1}{n}\sum_{i=1}^n s(x_i) - \frac{\gamma}{\alpha - 1} \log\!\left(\frac{1}{n}\sum_{i=1}^n \left(\frac{P(x_i)}{Q(x_i)}\right)^{\alpha - 1}\right)$$
Of course, this requires $P$ and $Q$ to be specified in a form that allows efficiently estimating probabilities of particular values. For example, $P$ and $Q$ could both be variational autoencoders.
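Here is one way this estimator could look in code, assuming $P$ and $Q$ expose log-probabilities of sampled values; the interface, names, and example distributions below are my assumptions, not from the post.

```python
import numpy as np

def v_alpha_estimate(xs, s, log_p, log_q, alpha, gamma):
    """Monte Carlo estimate of v_alpha(P) from samples xs drawn from P.

    s(x) returns the score of x; log_p(x) and log_q(x) return log P(x) and
    log Q(x). This interface is an assumption about how P and Q are exposed.
    Requires alpha != 1; the limiting forms for alpha -> 1 and alpha -> infinity
    are given separately below.
    """
    scores = np.array([s(x) for x in xs])
    log_ratios = np.array([log_p(x) - log_q(x) for x in xs])
    # log of the empirical mean of (P(x)/Q(x))^(alpha - 1), via log-sum-exp for stability
    log_mean = np.logaddexp.reduce((alpha - 1) * log_ratios) - np.log(len(xs))
    return scores.mean() - gamma * log_mean / (alpha - 1)

# Illustrative usage with small categorical P and Q over {0, ..., 4}:
P = np.array([0.4, 0.3, 0.1, 0.1, 0.1])
Q = np.array([0.2, 0.2, 0.2, 0.2, 0.2])
xs = np.random.default_rng(1).choice(5, size=10_000, p=P)
print(v_alpha_estimate(xs, s=lambda x: float(x), log_p=lambda x: np.log(P[x]),
                       log_q=lambda x: np.log(Q[x]), alpha=2.0, gamma=0.5))
```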
As $\alpha$ approaches 1, the sampled approximation limits to
$$\frac{1}{n}\sum_{i=1}^n s(x_i) - \frac{\gamma}{n}\sum_{i=1}^n \log\frac{P(x_i)}{Q(x_i)} = \frac{1}{n}\sum_{i=1}^n \left(s(x_i) - \gamma\log P(x_i) + \gamma\log Q(x_i)\right)$$
As $\alpha$ approaches $\infty$, it limits to
$$\frac{1}{n}\sum_{i=1}^n s(x_i) - \gamma\log \max_i \frac{P(x_i)}{Q(x_i)}$$
Like a true quantilizer, a distribution $P$ trained to maximize this value (an approximate quantilizer) will avoid assigning much higher probability to any action than $Q$ does.
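The two limiting objectives are simple enough to write directly; this sketch reuses the (assumed) interface from the previous snippet and keeps the $\gamma$ weighting from the definition of $v_\alpha$.

```python
def v_1_estimate(xs, s, log_p, log_q, gamma):
    # alpha -> 1: average of s(x) - gamma * (log P(x) - log Q(x)) over the samples
    terms = [s(x) - gamma * (log_p(x) - log_q(x)) for x in xs]
    return sum(terms) / len(terms)

def v_inf_estimate(xs, s, log_p, log_q, gamma):
    # alpha -> infinity: mean score minus gamma * log of the largest sampled ratio P(x)/Q(x)
    mean_score = sum(s(x) for x in xs) / len(xs)
    max_log_ratio = max(log_p(x) - log_q(x) for x in xs)
    return mean_score - gamma * max_log_ratio
```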
These approximations yield training objectives for agents that will interpolate between imitating $Q$ and maximizing $s$. What do we use these for? Patrick suggested that $Q$ could be an estimate of the distribution of actions a human would take (trained using something like this training procedure). Then the distribution $P$ maximizing the combined objective $v_\alpha$ will try to maximize score in a somewhat human-like way; it will interpolate between imitation and score-maximization.
There are problems, though. Suppose a human and an AI can both solve Sudoku, but the AI can't solve it the way a human would. Suppose the AI trains a distribution $Q$ over ways of filling out the puzzle to imitate the human. $Q$ will usually not solve the puzzle, since the AI can't solve the puzzle the way a human would. Now suppose the AI chooses a distribution $P$ over ways of filling out the puzzle to maximize a combined objective based on solving the puzzle and having low Rényi divergence from $Q$. If $\alpha = \infty$, then $P$ will be an approximate quantilizer with base distribution $Q$, so it is unlikely to solve the puzzle unless $\gamma$ is very low (since $Q$ very rarely solves the puzzle). With $\alpha < \infty$, there is not much of a guarantee that the AI is solving the puzzle the way a human would; unlike a quantilizer, a distribution trained with $\alpha < \infty$ may assign much higher probability to some ways of filling out the puzzle than $Q$ does. Something like meeting halfway might be necessary to ensure that the AI solves the problem in a humanlike way.