Summary: It's possible to set up a zero-sum game between two agents so that, in any Nash equilibrium, one agent picks a policy that optimizes a particular objective subject to constraints on the expected features of the state resulting from the policy. This seems potentially useful for getting an approximate agent to maximize some objective subject to constraints.
Consider the following optimization problem:
Maximize f(π) subject to E[a(X)|do(π)]⪰b.
where π is a policy, f maps policies to reals, X∈X is a random
variable representing the state of the world, a∈X→Rn maps states to feature vectors,
and b∈Rn is a vector. This objective instructs the agent to maximize
f(π) while ensuring that the random variables a1(X),...,an(X) have expectations at least b1,...,bn respectively.
How might we design an agent that solves this constrained optimization problem, if all we have available are agents
that optimize unconstrained objectives? Consider a game between two agents, an actor and a critic:
The critic picks Lagrange multipliers λ∈Rn,λ⪰0.
The actor picks a policy π, which results in world state X.
The actor gets utility f(π)+λ⋅(a(X)−b). The critic gets the negation of this utility.
Let's assume the critic and actor pick their λ and π simultaneously, and look for Nash equilibria of this game.
Since the critic has an infinite action space, a Nash equilibrium is not guaranteed to exist.
To make the problem easier, let's assume that the set of allowed policies is
convex and that f is concave (that is, for any two policies π1 and π2
and any θ∈(0,1), if π3 is the policy formed by first picking
π1 with probability θ or π2 with probability 1−θ and
then running the chosen policy, then f(π3)≥θf(π1)+(1−θ)f(π2)).
We will prove:
Theorem: in any Nash equilibrium, the actor's policy solves the original optimization problem.
This is due to Lagrangian duality; the critic is picking Lagrange multipliers.
It's possible that a setup like this results in reasonable behavior if we use approximate agents instead of
optimal agents (and perhaps place some high absolute limit on λ); that would be the main
practical application of this idea, though I'm not sure whether it actually works.
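As a toy illustration of that approximate version, here is a minimal sketch (my own construction, not from the post) in which both agents take simultaneous gradient steps on the Lagrangian: the policy is a distribution over K actions, f is linear plus a small entropy bonus (so concave), and the expected features are linear in the policy. Gradient play only approximates a Nash equilibrium; the entropy bonus and the slower critic step size are choices made to keep the toy well-behaved.

```python
# Minimal sketch of the actor/critic Lagrangian game on a toy problem (illustration only).
# Policy: a distribution over K actions, pi = softmax(s).
# Objective: f(pi) = c . pi + 0.1 * H(pi)  (concave).
# Expected features: E[a(X)|do(pi)] = A @ pi, constrained to be >= b componentwise.
import numpy as np

rng = np.random.default_rng(0)
K, n = 5, 2                              # number of actions, number of constraints
c = rng.normal(size=K)                   # objective weights
A = rng.normal(size=(n, K))              # feature matrix: E[a(X)|do(pi)] = A @ pi
b = A @ np.full(K, 1.0 / K) - 0.1        # constraints the uniform policy strictly satisfies

s = np.zeros(K)                          # actor's logits
lam = np.zeros(n)                        # critic's Lagrange multipliers, lam >= 0
lr_actor, lr_critic = 0.1, 0.01

for _ in range(50_000):
    pi = np.exp(s - s.max()); pi /= pi.sum()
    slack = A @ pi - b                   # E[a(X)|do(pi)] - b
    # Actor: ascend f(pi) + lam . slack; gradient pushed through the softmax.
    r = c + 0.1 * (-np.log(pi) - 1.0) + A.T @ lam
    s += lr_actor * pi * (r - pi @ r)
    # Critic: descend the same quantity in lam, projected onto lam >= 0.
    lam = np.maximum(0.0, lam - lr_critic * slack)

f_val = c @ pi - 0.1 * (pi @ np.log(pi))
print("f(pi):", f_val, "\nconstraint slack:", slack, "\nlambda:", lam)
```

If the dynamics converge, the printed slack should be approximately nonnegative and λ should be approximately zero on any strictly satisfied constraint, mirroring the complementary slackness condition derived in the proof below.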
Proof of theorem
The actor's expected utility is gλ(π):=f(π)+λ⋅(E[a(X)|do(π)]−b).
With the assumptions above, we need only consider pure Nash equilibria.
The critic's utility is linear in λ and the actor's utility
is concave in π; thus a pure strategy λ weakly dominates any mixed strategy with mean λ, and a policy π weakly dominates any distribution over policies whose mean is π.
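Spelling that claim out (writing σ for a mixed strategy of the critic, ρ for a distribution over policies, and π̄ for the single policy obtained by sampling from ρ and running the sampled policy; this notation is mine, not the post's):

```latex
\begin{aligned}
\mathbb{E}_{\lambda\sim\sigma}\big[g_\lambda(\pi)\big] &= g_{\mathbb{E}[\lambda]}(\pi)
  &&\text{(the payoff is affine in $\lambda$)}\\
\mathbb{E}_{\pi\sim\rho}\big[g_\lambda(\pi)\big] &\le g_\lambda(\bar\pi)
  &&\text{(concavity of $f$; the constraint term is linear in the policy's state distribution)}
\end{aligned}
```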
First consider the game from the critic's perspective. Consider a few different cases:
Suppose E[ai(X)|do(π)]<bi for some i. Then the critic gets more expected utility the higher λi is;
thus there is no best-response λi value. Therefore, no Nash equilibria have this property.
Suppose E[ai(X)|do(π)]=bi for some i. Then the critic is indifferent about λi and may set it to anything.
Suppose E[ai(X)|do(π)]>bi for some i. Then the critic must set λi=0 to maximize expected utility.
So in any Nash equilibrium we must have:
Property 1: E[a(X)|do(π)]⪰b
Property 2: λ⋅(E[a(X)|do(π)]−b)=0; equivalently, gλ(π)=f(π)
These properties correspond to the primal feasibility and complementary slackness KKT conditions.
Now let's consider the game from the actor's perspective. The actor will pick π to maximize
gλ(π):=f(π)+λ⋅(E[a(X)|do(π)]−b)
Let π∗ solve the original optimization problem: it maximizes f subject to
E[a(X)|do(π∗)]⪰b. Now observe that, in any Nash equilibrium:
f(π)≤f(π∗). This is because π satisfies the constraints of the original optimization problem (Property 1) and π∗ maximizes f subject to those constraints.
f(π)≥f(π∗). This is because f(π′)≤gλ(π′) for any π′ satisfying the constraints (since λ⪰0); in particular, f(π∗)≤gλ(π∗)≤gλ(π)=f(π), as spelled out below.
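Writing out that chain with the justification for each step, where (π,λ) is the Nash equilibrium profile and π∗ is as above:

```latex
\begin{aligned}
f(\pi^*) &\le f(\pi^*) + \lambda\cdot\big(\mathbb{E}[a(X)\mid \mathrm{do}(\pi^*)]-b\big)
  &&\text{since $\lambda\succeq 0$ and $\pi^*$ is feasible}\\
&= g_\lambda(\pi^*)\\
&\le g_\lambda(\pi)
  &&\text{since $\pi$ is the actor's best response to $\lambda$}\\
&= f(\pi)
  &&\text{by Property 2.}
\end{aligned}
```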
So in any Nash equilibrium, we have
Property 3: f(π)=f(π∗)
Properties 1 and 3 together imply the theorem:
Theorem: in any Nash equilibrium, the actor's policy solves the original optimization problem.
When does a Nash equilibrium exist?
We've shown a property that all Nash equilibria of this game satisfy, but is a Nash equilibrium guaranteed to exist? Whenever some policy satisfies the constraints of the original optimization problem, I think one does, but I haven't been able to prove this.
There's no Nash equilibrium when no π satisfies the constraints; in that case the critic would always want to set the λi value corresponding to a violated constraint to an extremely high value.
Example: Secondary objectives
Consider the following optimization problem:
Maximize f(π) subject to E[a(X)|do(π)]≥b.
In effect, an agent solving this optimization problem will ensure that it solves the primary objective
of getting a high-enough expected a(X) value, and will otherwise attempt to optimize the secondary objective f(π).
In the game corresponding to this optimization problem, the critic picks λ to be the exchange rate between the primary and secondary objectives that causes the actor
to maximize the secondary objective subject to getting enough of the primary objective.
It's possible
that something like this could be used for mild optimization, though I haven't worked out the details. We would want to set f to be something simple that we're pretty comfortable with an AI approximately maximizing under the constraint, and I haven't found anything of this form yet.
Example: Generative adversarial exponential families
(note: this probably isn't actually useful)
Suppose we want to approximately sample from an exponential family pη(x)=exp(η⋅ϕ(x)−g(η))
with η set to maximize the likelihood of the training data x1,...,xm. Define μ(q):=EX∼q[ϕ(X)],
and μdata:=(1/m)∑iϕ(xi), the empirical mean of the features over the training data.
We would like to obtain a distribution q that maximizes H(q) subject to μ(q)=μdata.
By the maximum entropy theorem,
the optimal q will be pη∗, where η∗ is chosen to maximize the likelihood of the data according to pη∗.
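For reference, here is the standard Lagrangian calculation behind that theorem, sketched for a finite outcome space:

```latex
\begin{aligned}
&\max_q\; H(q)\quad\text{s.t.}\quad \textstyle\sum_x q(x)\,\phi(x)=\mu_{\mathrm{data}},\qquad \textstyle\sum_x q(x)=1.\\
&\frac{\partial}{\partial q(x)}\Big[H(q)+\eta\cdot\big(\textstyle\sum_x q(x)\,\phi(x)-\mu_{\mathrm{data}}\big)+\nu\big(\textstyle\sum_x q(x)-1\big)\Big]
  = -\log q(x)-1+\eta\cdot\phi(x)+\nu = 0\\
&\;\Longrightarrow\; q(x)\propto\exp\big(\eta\cdot\phi(x)\big).
\end{aligned}
```

So the maximizer is an exponential family with natural parameter η, and satisfying the moment constraint μ(q)=μdata is exactly the maximum-likelihood condition for η.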
Consider the following game, which resembles generative adversarial networks:
The critic picks η, a vector of the same dimension as ϕ (unconstrained in sign, since the constraints here are equalities).
The actor picks a distribution q over X, and the world state X is sampled from q.
The actor gets utility −logq(X)+η⋅(ϕ(X)−μdata); the critic gets the negation of this utility.
Note that, since X∼q, −logq(X) is an unbiased estimate of H(q), so we can consider H(q) to be the objective being optimized
subject to the constraints μ(q)=μdata.
At any Nash equilibrium, we will have:
The actor picks q:=pη∗
The critic picks η:=η∗
The first fact is proven as a special case of the original theorem (plus the maximum entropy theorem); the second fact is not difficult to show given the first fact.
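Here's a minimal numerical sketch of this game (the finite outcome space, feature map, and parametrization are my own illustration, not from the post): the actor maintains an explicit distribution q, the critic maintains η, and both take simultaneous gradient steps, which only approximates the Nash equilibrium.

```python
# Minimal sketch of the exponential-family game on a finite outcome space (illustration only).
import numpy as np

rng = np.random.default_rng(0)
K, d = 8, 3                                   # outcomes, feature dimension
phi = rng.normal(size=(K, d))                 # feature map phi(x) for x = 0..K-1

# "Training data": samples from an exponential family proportional to exp(eta_true . phi(x)).
eta_true = rng.normal(size=d)
p_true = np.exp(phi @ eta_true); p_true /= p_true.sum()
data = rng.choice(K, size=5000, p=p_true)
mu_data = phi[data].mean(axis=0)              # empirical feature mean

s = np.zeros(K)                               # actor's logits: q = softmax(s)
eta = np.zeros(d)                             # critic's natural parameter (unconstrained)
lr_actor, lr_critic = 0.2, 0.02

for _ in range(20_000):
    q = np.exp(s - s.max()); q /= q.sum()
    mu_q = q @ phi                            # mu(q) = E_{X~q}[phi(X)]
    # Actor: ascend H(q) + eta . (mu(q) - mu_data); gradient pushed through the softmax.
    g = -np.log(q) - 1.0 + phi @ eta
    s += lr_actor * q * (g - q @ g)
    # Critic: descend the same quantity in eta.
    eta -= lr_critic * (mu_q - mu_data)

# At equilibrium q should match p_eta, the maximum-likelihood exponential-family fit.
p_fit = np.exp(phi @ eta); p_fit /= p_fit.sum()
print("max |q - p_eta|:", np.abs(q - p_fit).max(), " moment gap:", np.abs(mu_q - mu_data).max())
```

If the dynamics converge, q should be close to pη∗ and the critic's η close to η∗, matching the two equilibrium facts above.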
I believe this approach is similar to contrastive divergence. To recover the power of the original generative adversarial networks, perhaps ϕ ought to be features output by a neural net. An advantage of this approach over generative adversarial networks is that the actor is penalized for differing
from the training distribution in a hard-to-detect way (such as by plagiarizing
from training data, or encoding steganographic messages; this is similar to the informed oversight problem), because the
exponential family distribution pη∗ is the uniquely optimal solution
(although this doesn't guarantee that the approximate solution will be very
close to pη∗).
Unfortunately, this approach requires estimating q(X), so q
must be represented as a variational
autoencoder or something similar. But if we're using variational autoencoders, then we might as well directly train a generative model. So the approach I'm sketching here is probably not directly useful on its own (though I can see it being inspiration for a more useful approach to training generative models in a safely scalable way).