Proposition 1:
Given some compact metric space of options $X$, if $U: X \to \mathbb{R}$ is a bounded function, then
$$\{\mu \mid \forall \nu \in \Delta X: \mu(U) \ge \nu(U)\} = \Delta\{x \mid \forall y \in X: U(x) \ge U(y)\}$$
PROOF: We must show that anything in the first set is in the second, and vice-versa. So, assume $\mu$ is in the first set. Then for any $\nu$, $\nu(U) \le \mu(U)$. However, if $\mu$ were not in the second set, it would have probability mass on some points $x$ where there's a $y$ s.t. $U(y) > U(x)$. Moving that probability mass to points of higher utility, we'd construct a new probability distribution $\nu$ s.t. $\nu(U) > \mu(U)$, which is impossible. So $\mu$ must be in the second set.
In the reverse direction, any probability distribution $\mu$ supported over the set of maximizers of $U$ has $\mu(U) = U(x)$, where $x$ is an arbitrary maximizer. Any $\nu$ is supported entirely over points with equal or lesser utility than this maximum value, so $\mu(U) \ge \nu(U)$. Both subset-inclusion directions have been proved, so the equality holds.
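As a quick sanity check, here is a minimal numerical sketch of this equivalence on a made-up five-point option space (all names and numbers below are illustrative, not from the original argument): a distribution attains the maximum possible expected utility exactly when all of its mass sits on the argmax set.

```python
import numpy as np

rng = np.random.default_rng(0)
U = np.array([0.3, 0.9, 0.9, 0.1, 0.5])                 # utilities of 5 options
argmax_set = np.flatnonzero(np.isclose(U, U.max()))     # {x | U(x) >= U(y) for all y}

def random_simplex(idx, n):
    """Random distribution on {0..n-1} supported only on the indices in idx."""
    mu = np.zeros(n)
    w = rng.exponential(size=len(idx))
    mu[idx] = w / w.sum()
    return mu

for _ in range(1000):
    # one distribution supported on the argmax set, one supported everywhere
    for idx in (argmax_set, np.arange(len(U))):
        mu = random_simplex(idx, len(U))
        # mu(U) >= nu(U) for all nu  <=>  mu(U) == max(U), since nu can be a point mass
        dominates_everything = np.isclose(mu @ U, U.max())
        assert dominates_everything == np.isclose(mu[argmax_set].sum(), 1.0)
```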
Theorem 1: Fundamental Theorem of Infrafunctions
There is a bijection between concave upper-semicontinuous functions of type $\Delta X \to \mathbb{R} \cup \{-\infty\}$, and closed convex upper-complete sets of continuous functions $X \to \mathbb{R}$.
Proof: For the reverse direction, going from sets to functions, letting $\hat{F}$ denote the set and $F$ denote the function, we have that
$$F(\mu) := \inf_{f \in \hat{F}} \mu(f)$$
With this definition made, let's verify that the resulting function fulfills the indicated properties. For concavity, we have
$$F(p\mu + (1-p)\nu) = \inf_{f \in \hat{F}} (p\mu + (1-p)\nu)(f) \ge p\inf_{f \in \hat{F}} \mu(f) + (1-p)\inf_{f \in \hat{F}} \nu(f) = pF(\mu) + (1-p)F(\nu)$$
For verifying upper-semicontinuity, let $\mu_n$ limit to $\mu$. For any particular continuous function $f^*$, due to $\mu_n$ limiting to $\mu$, we have
$$\limsup_{n \to \infty} \mu_n(f^*) = \mu(f^*)$$
So, fixing any $\delta > 0$, pick an $f^*$ where $\mu(f^*) \le \inf_{f \in \hat{F}} \mu(f) + \delta$. Then we can go
$$F(\mu) + \delta = \inf_{f \in \hat{F}} \mu(f) + \delta \ge \mu(f^*) = \limsup_{n \to \infty} \mu_n(f^*) \ge \limsup_{n \to \infty} \inf_{f \in \hat{F}} \mu_n(f) = \limsup_{n \to \infty} F(\mu_n)$$
Since $\delta$ was arbitrary, upper-semicontinuity is established, that
$$F(\mu) \ge \limsup_{n \to \infty} F(\mu_n)$$
For the forwards direction, going from functions to sets, letting $F$ denote the function, the corresponding set of functions is
$$\{f \mid \forall \mu \in \Delta X: \mu(f) \ge F(\mu)\}$$
We must show that the result is closed, convex, and upper-complete.
For closure, if $f_m$ limits to $f$, and all the $f_m$ are in the set (so $\mu(f_m) \ge F(\mu)$ for all $m$ and $\mu$), then for an arbitrary $\mu$, due to $f_m$ limiting to $f$, we have that $\mu(f) = \lim_{m \to \infty} \mu(f_m) \ge F(\mu)$. This argument works for all $\mu$, so $f$ is in the indicated set.
For convexity, our job is to show that if $f$ and $g$ are in the set, $pf + (1-p)g$ is in the set too. Fix an arbitrary $\mu$.
$$\mu(pf + (1-p)g) = p\mu(f) + (1-p)\mu(g) \ge pF(\mu) + (1-p)F(\mu) = F(\mu)$$
And convexity is shown. For upper-completeness, assume that $f$ is in the set and $g \ge f$ according to the usual ordering on functions. Then
$$\mu(g) \ge \mu(f) \ge F(\mu)$$
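To see the sets-to-functions direction concretely, here's a small sketch on a three-point $X$ with a finite stand-in for $\hat{F}$ (a toy, since the real $\hat{F}$ is an infinite upper-complete set): the induced $F$ is concave, and every generator lands back in the induced set.

```python
import numpy as np

rng = np.random.default_rng(1)
F_hat = [np.array([1.0, 0.0, 0.5]),          # each function X -> R is a length-3 vector
         np.array([0.2, 0.8, 0.4]),
         np.array([0.6, 0.6, 0.0])]

def F(mu):
    return min(mu @ f for f in F_hat)        # F(mu) := inf_{f in F_hat} mu(f)

def random_simplex(n):
    w = rng.exponential(size=n)
    return w / w.sum()

for _ in range(1000):
    mu, nu = random_simplex(3), random_simplex(3)
    p = rng.uniform()
    # concavity: F(p*mu + (1-p)*nu) >= p*F(mu) + (1-p)*F(nu)
    assert F(p * mu + (1 - p) * nu) >= p * F(mu) + (1 - p) * F(nu) - 1e-12
    # every f in F_hat lies in the induced set {f | mu(f) >= F(mu) for all mu}
    assert all(mu @ f >= F(mu) - 1e-12 for f in F_hat)
```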
Now, all that remains is to show that the two translation directions compose to identity in both directions.
For function-to-set-to-function being identity, a sufficient property is that, for every $\mu' \in \Delta X$ and $x > F(\mu')$, there exists a continuous function $f$ s.t. $\forall \mu \in \Delta X: \mu(f) \ge F(\mu)$ and $x \ge \mu'(f)$.
Why? Well, every function $f$ that lies in the set $\hat{F}$ has the property that $\forall \mu: \mu(f) \ge F(\mu)$. Therefore, we have that
$$\forall \mu': \inf_{f: \forall \mu, \mu(f) \ge F(\mu)} \mu'(f) \ge F(\mu')$$
Or, rephrasing,
$$\forall \mu': \inf_{f \in \hat{F}} \mu'(f) \ge F(\mu')$$
And, \emph{if} we had the property that for every $\mu' \in \Delta X$ and $x > F(\mu')$, there exists a continuous function $f$ s.t. $\forall \mu \in \Delta X: \mu(f) \ge F(\mu)$ and $x \ge \mu'(f)$, that would mean that regardless of $\mu'$, you could find functions $f$ that lie in $\hat{F}$ and where $\mu'(f)$ lies arbitrarily close to $F(\mu')$. This would get equality, that
$$\forall \mu': \inf_{f: \forall \mu, \mu(f) \ge F(\mu)} \mu'(f) = F(\mu')$$
Or, rephrasing,
$$\forall \mu': \inf_{f \in \hat{F}} \mu'(f) = F(\mu')$$
Or, rephrasing,
$$\left(\mu \mapsto \inf_{f \in \hat{F}} \mu(f)\right) = F$$
I.e., going from function to set to function again is the identity.
As for the other direction, in order for set-to-function-to-set to be identity, the $\subseteq$ inclusion direction is pretty easy to prove outright. Any $f$ in the original set of functions has $\mu(f) \ge \inf_{f' \in \hat{F}} \mu(f')$ for every $\mu$, so it ends up in the set generated by that inf.
The $\supseteq$ inclusion direction is harder. We'd need to show that, given any closed convex upper-complete set of continuous bounded functions $\hat{F}$, any function $g$ which is outside of that set has some $\mu$ where $\mu(g) < \inf_{f \in \hat{F}} \mu(f)$. The reason that showing this helps is that it establishes that $g$ cannot be in the set induced by the inf of everything in $\hat{F}$. Therefore, the set induced by the inf function can't be a strict superset of $\hat{F}$: any $g$ in the induced set but not in $\hat{F}$ would come with a witness $\mu$ showing that $g$ isn't in the induced set after all, a contradiction.
So, our two missing results to prove to establish the entire isomorphism theorem are:
1: Establishing that for every $\mu' \in \Delta X$ and $x > F(\mu')$, there exists a continuous function $f$ s.t. $\forall \mu \in \Delta X: \mu(f) \ge F(\mu)$ and $x \ge \mu'(f)$.
2: Establishing that for any closed convex upper-complete set of continuous bounded functions $\hat{F}$, any function $g$ which is outside of that set has some $\mu$ where $\mu(g) < \inf_{f \in \hat{F}} \mu(f)$.
For both of these, we'll use the Hahn-Banach theorem. For the first one, let the set $A$ (a subset of $M^{\pm}(X) \oplus \mathbb{R}$) be $\{(\mu, y) \mid y \le F(\mu)\}$, and let the set $B$ consist of the single point $(\mu', x)$. The set $B$ is clearly convex, compact, and nonempty. That leaves showing that $A$ is closed, convex, and nonempty.
What we'll do is prove our desired result outright if the set $A$ is empty. $A$ is the region under the graph of $F$, which can only be empty if $F = -\infty$ everywhere, in which case you could take any continuous function $f$ and subtract a sufficiently large constant from it to get a function whose expectation w.r.t. $\mu'$ lies below the $x$ you named (while trivially satisfying $\mu(f) \ge F(\mu) = -\infty$ for all $\mu$).
So, we may safely assume that the set $A$ is nonempty. Closure works because if $y_n$ limits to $y$ and $\mu_n$ limits to $\mu$ (with $y_n \le F(\mu_n)$), we have that
$$y = \lim_{n \to \infty} y_n \le \limsup_{n \to \infty} F(\mu_n) \le F(\mu)$$
by upper-semicontinuity. And convexity works because
$$py + (1-p)y' \le pF(\mu) + (1-p)F(\mu') \le F(p\mu + (1-p)\mu')$$
due to concavity of $F$.
We must also check that the two sets are disjoint, which is easy because the single point has $x > F(\mu')$. Thus the Hahn-Banach hyperplane separation theorem may be used to find a separating hyperplane, given by a continuous affine function, with $B$ on the top side of it and $A$ on the bottom side of it. All continuous affine functions $M^{\pm}(X) \to \mathbb{R}$ can be decomposed as a continuous linear functional $M^{\pm}(X) \to \mathbb{R}$ (which can be interpreted as expectation w.r.t. a bounded continuous function $f$), plus a constant. So, there's some bounded continuous function $f$ and constant $b$ s.t. $\forall (\mu, y) \in A: y \le \mu(f) + b$ (i.e., $A$ is below the hyperplane induced by the affine function induced by $f, b$), and where $\mu'(f) + b \le x$ (i.e., the point $(\mu', x)$ making up $B$ lies above that hyperplane).
Now, specializing to the case where $y = F(\mu)$, and observing that $\mu(f + b) = \mu(f) + b$ for probability distributions, we can take $f + b$ to be our function of interest, and get that
$$\forall \mu \in \Delta X: \mu(f + b) = \mu(f) + b \ge F(\mu)$$
and also $x \ge \mu'(f) + b = \mu'(f + b)$. So, we've found a function with the indicated properties, outright proving the first of our two unsolved problems.
This leaves the second one. The set $\hat{F}$ is closed and convex, and $g$ is just a single point, so we can invoke the Hahn-Banach theorem to get a separating linear functional: some signed measure $m$ where $m(g) < \inf_{f \in \hat{F}} m(f)$.
We need to show that this signed measure is a measure. If there were any negative region in the signed measure, then there would be a continuous function $h \ge 0$ that acted as a massive spike on the negative region of $m$, and $m(f + ah)$ (for very large constants $a$ and some $f \in \hat{F}$) would be arbitrarily negative, while $f + ah$ would lie in $\hat{F}$ by upper-completeness, contradicting $m(g) < \inf_{f \in \hat{F}} m(f)$.
So, $m$ is a measure. Multiplying it by a suitable positive constant turns it into a probability distribution $\mu$ that still fulfills the property that $\mu(g) < \inf_{f \in \hat{F}} \mu(f)$, and we're done.
Since our two missing parts of the proof have been cleaned up, we've proven the overall theorem.
Theorem 2: Lp Ball Theorem
For any $\epsilon > 0$, $p \in [1, \infty]$, $f: X \to \mathbb{R}$, and $\nu \in \Delta X$, the infrafunction corresponding to
$$\left\{g \,\middle|\, \left(\int |f - g|^p \, d\nu\right)^{1/p} \le \epsilon\right\}$$
(Knightian uncertainty over the $L^p$ ball of size $\epsilon$ centered at $f$, w.r.t. $\nu$) is
$$\mu \mapsto \mu(f) - \epsilon\left\|\frac{d\mu}{d\nu}\right\|_q \quad \text{where } \frac{1}{p} + \frac{1}{q} = 1.$$
Further, given a function $f$, the optimal $\mu$ to pick is the distribution $a \cdot \nu \cdot \max(f - b, 0)^{p-1}$ for some constants $a > 0$ and $b \le \max(f)$.
Proof: The $L^p$ norm of a function w.r.t. $\nu$ will be denoted as $\|g\|_p$. The function associated with the worst case amongst the $L^p$ ball centered at $f$ is
$$\lambda\mu. \inf_{g: \|f - g\|_p \le \epsilon} \mu(g)$$
We can rewrite $g$ as $f$ plus a function $h$ that has an $L^p$ norm of $\epsilon$ or less. So we get the equivalent function
$$= \lambda\mu. \inf_{h: \|h\|_p \le \epsilon} \mu(f + h) = \lambda\mu. \inf_{h: \|h\|_p \le \epsilon} \mu(f) + \mu(h) = \lambda\mu. \mu(f) + \inf_{h: \|h\|_p \le \epsilon} \mu(h) = \lambda\mu. \mu(f) - \sup_{h: \|h\|_p \le \epsilon} \mu(h)$$
Now, we'll show that $\sup_{h: \|h\|_p \le \epsilon} \mu(h) = \epsilon\left\|\frac{d\mu}{d\nu}\right\|_q$ where $\frac{1}{p} + \frac{1}{q} = 1$, by showing that it's both $\le$ and $\ge$ that quantity. For one inequality direction, we can go
$$\sup_{h: \|h\|_p \le \epsilon} \mu(h) = \sup_{h: \|h\|_p \le \epsilon} \int h \, d\mu = \sup_{h: \|h\|_p \le \epsilon} \int h \frac{d\mu}{d\nu} \, d\nu$$
And then apply Hölder's inequality.
$$\le \sup_{h: \|h\|_p \le \epsilon} \|h\|_p \cdot \left\|\frac{d\mu}{d\nu}\right\|_q = \epsilon\left\|\frac{d\mu}{d\nu}\right\|_q$$
And so, we have one inequality direction, that
$$\sup_{h: \|h\|_p \le \epsilon} \mu(h) \le \epsilon\left\|\frac{d\mu}{d\nu}\right\|_q$$
For the other inequality direction, we can let $h$ be defined as
$$h := \epsilon \cdot \frac{\left(\frac{d\mu}{d\nu}\right)^{\frac{1}{p-1}}}{\left(\int \left(\frac{d\mu}{d\nu}\right)^{\frac{p}{p-1}} d\nu\right)^{1/p}}$$
Our first order of business is showing that the $L^p$ norm of $h$ is $\epsilon$ or less. So, compute it.
$$\left(\int \left(\epsilon \cdot \frac{\left(\frac{d\mu}{d\nu}\right)^{\frac{1}{p-1}}}{\left(\int \left(\frac{d\mu}{d\nu}\right)^{\frac{p}{p-1}} d\nu\right)^{1/p}}\right)^p d\nu\right)^{1/p}$$
Distribute the power in.
$$= \left(\int \frac{\epsilon^p \cdot \left(\frac{d\mu}{d\nu}\right)^{\frac{p}{p-1}}}{\int \left(\frac{d\mu}{d\nu}\right)^{\frac{p}{p-1}} d\nu} \, d\nu\right)^{1/p}$$
Pull the $\epsilon^p$ out, group, and cancel.
$$= (\epsilon^p)^{1/p} = \epsilon$$
So, this is a valid choice of function $h$. So, we can compute (by pulling the multiplicative constants out)
$$\sup_{h: \|h\|_p \le \epsilon} \mu(h) \ge \mu\left(\epsilon \cdot \frac{\left(\frac{d\mu}{d\nu}\right)^{\frac{1}{p-1}}}{\left(\int \left(\frac{d\mu}{d\nu}\right)^{\frac{p}{p-1}} d\nu\right)^{1/p}}\right) = \epsilon \cdot \left(\int \left(\frac{d\mu}{d\nu}\right)^{\frac{p}{p-1}} d\nu\right)^{-1/p} \int \left(\frac{d\mu}{d\nu}\right)^{\frac{1}{p-1}} d\mu$$
Now, since
$$\int \left(\frac{d\mu}{d\nu}\right)^{\frac{1}{p-1}} d\mu = \int \left(\frac{d\mu}{d\nu}\right)^{\frac{1}{p-1}} \frac{d\mu}{d\nu} \, d\nu = \int \left(\frac{d\mu}{d\nu}\right)^{\frac{p}{p-1}} d\nu$$
we can re-express our old equation as
$$= \epsilon \cdot \left(\int \left(\frac{d\mu}{d\nu}\right)^{\frac{p}{p-1}} d\nu\right)^{-1/p} \left(\int \left(\frac{d\mu}{d\nu}\right)^{\frac{p}{p-1}} d\nu\right)^{p/p} = \epsilon \cdot \left(\int \left(\frac{d\mu}{d\nu}\right)^{\frac{p}{p-1}} d\nu\right)^{\frac{p-1}{p}} = \epsilon \cdot \left(\int \left(\frac{d\mu}{d\nu}\right)^q d\nu\right)^{1/q} = \epsilon\left\|\frac{d\mu}{d\nu}\right\|_q$$
So we have
$$\sup_{h: \|h\|_p \le \epsilon} \mu(h) \ge \epsilon\left\|\frac{d\mu}{d\nu}\right\|_q$$
Since that's the other inequality direction, we have equality. Now, since we were last at
$$= \lambda\mu. \mu(f) - \sup_{h: \|h\|_p \le \epsilon} \mu(h)$$
we can proceed to
$$= \lambda\mu. \mu(f) - \epsilon\left\|\frac{d\mu}{d\nu}\right\|_q$$
And that addresses one part of the theorem, about the objective function we're minimizing.
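Before moving on, here is a finite-dimensional sanity check of the formula just derived, with a weighted counting measure standing in for $\nu$ and made-up values of $f$, $p$, and $\epsilon$: the Hölder-extremal $h$ from the proof attains the closed-form value, and no other function in the ball beats it.

```python
import numpy as np

rng = np.random.default_rng(2)
n, p, eps = 6, 3.0, 0.5
q = p / (p - 1)

nu = rng.exponential(size=n); nu /= nu.sum()       # reference distribution
mu = rng.exponential(size=n); mu /= mu.sum()       # distribution being evaluated
f  = rng.normal(size=n)
dmu_dnu = mu / nu

def lp_norm(h):                                    # ||h||_p w.r.t. nu
    return (nu @ np.abs(h) ** p) ** (1 / p)

closed_form = mu @ f - eps * (nu @ dmu_dnu ** q) ** (1 / q)

# the extremal g = f - h, with h the Holder-extremal function from the proof
h_star = eps * dmu_dnu ** (1 / (p - 1)) / (nu @ dmu_dnu ** (p / (p - 1))) ** (1 / p)
assert np.isclose(lp_norm(h_star), eps)
assert np.isclose(mu @ (f - h_star), closed_form)

# no other g in the L^p ball around f gives a smaller value of mu(g)
for _ in range(5000):
    h = rng.normal(size=n)
    h *= eps * rng.uniform() / lp_norm(h)          # rescale into the L^p ball
    assert mu @ (f - h) >= closed_form - 1e-9
```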
Sadly, the last part, about the optimal μ, can only be done according to a physics level of rigor. Given two points x and y which the optimal μ assigns probability to, the increase in the objective function from adding a tiny little δ of probability mass to x must be the same as the increase from adding a tiny little δ of probability mass to y. This is because, otherwise, if x got a higher score than y, you could steal a tiny amount of probability mass from y, reallocate it to x, and you'd get a higher score.
So, we can ask: what is $\frac{\partial}{\partial x}\left(\mu(f) - \epsilon\left\|\frac{d\mu}{d\nu}\right\|_q\right)$ (the derivative w.r.t. the amount of probability mass on $x$) for the optimal $\mu$? We know that it will end up being equal for every $x$ which is in the support of the optimal $\mu$. Well, we can write it as
$$\frac{\partial}{\partial x}\left(\mu(f) - \epsilon\left\|\frac{d\mu}{d\nu}\right\|_q\right) = \frac{\partial}{\partial x}\left(\int f \, d\mu - \epsilon\left(\int \left(\frac{d\mu}{d\nu}\right)^q d\nu\right)^{1/q}\right) = \frac{\partial}{\partial x}\left(\int f \, d\mu\right) - \frac{\partial}{\partial x}\left(\epsilon\left(\int \left(\frac{d\mu}{d\nu}\right)^q d\nu\right)^{1/q}\right)$$
$$= f(x) - \epsilon \cdot \frac{1}{q}\left(\int \left(\frac{d\mu}{d\nu}\right)^q d\nu\right)^{\frac{1}{q}-1} \frac{\partial}{\partial x}\left(\int \left(\frac{d\mu}{d\nu}\right)^q d\nu\right) = f(x) - \epsilon \cdot \frac{1}{q}\left(\int \left(\frac{d\mu}{d\nu}\right)^q d\nu\right)^{\frac{1}{q}-1} \nu(x) \frac{\partial}{\partial x}\left(\left(\frac{d\mu}{d\nu}\right)^q\right)$$
$$= f(x) - \epsilon \cdot \frac{1}{q}\left(\int \left(\frac{d\mu}{d\nu}\right)^q d\nu\right)^{\frac{1}{q}-1} \nu(x) \cdot \frac{1}{\nu(x)^q} \frac{\partial}{\partial x}\left(\mu(x)^q\right) = f(x) - \epsilon \cdot \frac{1}{q}\left(\int \left(\frac{d\mu}{d\nu}\right)^q d\nu\right)^{\frac{1}{q}-1} \nu(x) \cdot \frac{1}{\nu(x)^q} \cdot q \cdot \mu(x)^{q-1}$$
And cancel:
$$= f(x) - \epsilon\left(\int \left(\frac{d\mu}{d\nu}\right)^q d\nu\right)^{\frac{1}{q}-1} \left(\frac{d\mu}{d\nu}(x)\right)^{q-1}$$
Now, since $\epsilon\left(\int \left(\frac{d\mu}{d\nu}\right)^q d\nu\right)^{\frac{1}{q}-1}$ is the same regardless of which $x$ was picked, we can abbreviate it as $\alpha$, and write things as
$$\frac{\partial}{\partial x}\left(\mu(f) - \epsilon\left\|\frac{d\mu}{d\nu}\right\|_q\right) = f(x) - \alpha\left(\frac{d\mu}{d\nu}(x)\right)^{q-1}$$
As previously explained, this quantity must be the same for all $x$ that are supported by the optimal $\mu$; call this common value $b$. So, for every $x$ in the support, we have that
$$b = f(x) - \alpha\left(\frac{d\mu}{d\nu}(x)\right)^{q-1}$$
We can rearrange this:
$$\alpha\left(\frac{d\mu}{d\nu}(x)\right)^{q-1} = f(x) - b$$
$$\left(\frac{d\mu}{d\nu}(x)\right)^{q-1} = \frac{f(x) - b}{\alpha}$$
$$\frac{d\mu}{d\nu}(x) = \left(\frac{f(x) - b}{\alpha}\right)^{\frac{1}{q-1}}$$
And now we see that, since the left-hand side is always nonnegative, $f(x) - b$ must be nonnegative too. Also, any $x$ which fulfills the above equation has its partial derivative in the $x$ direction equal to $b$, the maximal value, so $\mu$ would not be harmed by placing probability mass on that $x$. So, without loss of generality, we can assume that $\mu$ is supported over all $x$ which fulfill the above equation, and make it into an equation that's valid for \emph{all} $x$ by going
$$\frac{d\mu}{d\nu}(x) = \left(\frac{\max(f(x) - b, 0)}{\alpha}\right)^{\frac{1}{q-1}}$$
Rearrange.
$$\mu(x) = \nu(x) \cdot \left(\frac{\max(f(x) - b, 0)}{\alpha}\right)^{\frac{1}{q-1}}$$
Now, $\left(\frac{1}{\alpha}\right)^{\frac{1}{q-1}}$ is a constant, so call it $a$. And since it's valid over all $x$, we can rephrase it as
$$\mu = a \cdot \nu \cdot \max(f - b, 0)^{\frac{1}{q-1}}$$
And now, since $\frac{1}{p} + \frac{1}{q} = 1$, we can rearrange this into $q = \frac{p}{p-1}$, and so
$$\frac{1}{q-1} = \frac{1}{\frac{p}{p-1} - \frac{p-1}{p-1}} = \frac{1}{\frac{1}{p-1}} = p - 1$$
And so our equation rearranges into
$$\mu = a \cdot \nu \cdot \max(f - b, 0)^{p-1}$$
As desired.
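The claimed form of the optimal $\mu$ can also be spot-checked numerically on the same kind of finite toy space, by brute-forcing the constant $b$ (this is only a sketch; the grid ranges and tolerances are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(3)
n, p, eps = 6, 3.0, 0.5
q = p / (p - 1)
nu = rng.exponential(size=n); nu /= nu.sum()
f = rng.normal(size=n)

def objective(mu):
    # the infrafunction value mu(f) - eps * ||dmu/dnu||_q being maximized over mu
    return mu @ f - eps * (nu @ (mu / nu) ** q) ** (1 / q)

# candidates of the form  mu_b proportional to  nu * max(f - b, 0)^(p-1)
best_family = -np.inf
for b in np.linspace(f.max() - 20.0, f.max() - 1e-9, 20001):
    w = nu * np.maximum(f - b, 0.0) ** (p - 1)
    if w.sum() > 0:
        best_family = max(best_family, objective(w / w.sum()))

# random distributions should never beat the best member of the family
for _ in range(5000):
    mu = rng.exponential(size=n); mu /= mu.sum()
    assert objective(mu) <= best_family + 1e-5
```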
Theorem 3: Dynamic Consistency
For any environment $e$, finite history $h$, off-history policy $\pi_{\neg h}$, and infrafunctions $U$ and $V$, we have that
$$U((\pi_{\neg h} \bullet \mathrm{argmax}_{\pi^*}(U|e,h,\pi_{\neg h})(\pi^* \cdot (e|h))) \cdot e) \ge U((\pi_{\neg h} \bullet \mathrm{argmax}_{\pi^*} V(\pi^* \cdot (e|h))) \cdot e)$$
Or, restating in words: selecting the after-$h$ policy by argmaxing for $U|e,h,\pi_{\neg h}$ makes an overall policy that outscores the overall policy made by selecting the after-$h$ policy to argmax for $V$.
Proof:
$$U((\pi_{\neg h} \bullet \mathrm{argmax}_{\pi^*}(U|e,h,\pi_{\neg h})(\pi^* \cdot (e|h))) \cdot e)$$
Use $\pi^*$ to abbreviate the optimal policy for $U|e,h,\pi_{\neg h}$ interacting with the environment $e|h$.
$$= U((\pi_{\neg h} \bullet \pi^*) \cdot e) = \min_{U \in \hat{U}} \mathbb{E}_{(\pi_{\neg h} \bullet \pi^*) \cdot e}[U] = \min_{U \in \hat{U}} \mathbb{E}_{(\pi_{\neg h} \bullet \pi^*) \cdot e}[1_{\neg h} U] + \mathbb{E}_{(\pi_{\neg h} \bullet \pi^*) \cdot e}[1_h U] = \min_{U \in \hat{U}} \mathbb{E}_{\pi_{\neg h} \cdot e}[1_{\neg h} U] + \mathbb{E}_{\pi_{\neg h} \cdot e}[1_h] \cdot \mathbb{E}_{\pi^* \cdot (e|h)}[U_{\downarrow h}]$$
Now, since $\pi^*$ is optimal for $U|e,h,\pi_{\neg h}$ interacting with $e|h$, that means that for all other post-$h$ policies $\pi'$, we have that
$$(U|e,h,\pi_{\neg h})(\pi^* \cdot (e|h)) \ge (U|e,h,\pi_{\neg h})(\pi' \cdot (e|h))$$
Unpacking the definition of the updated infrafunction, since everything in the set associated with $U|e,h,\pi_{\neg h}$ came from the set associated with $U$, this turns into
$$\min_{U \in \hat{U}} \mathbb{E}_{\pi^* \cdot (e|h)}\left[\mathbb{E}_{\pi_{\neg h} \cdot e}[1_h] \cdot U_{\downarrow h} + \mathbb{E}_{\pi_{\neg h} \cdot e}[1_{\neg h} U]\right] \ge \min_{U \in \hat{U}} \mathbb{E}_{\pi' \cdot (e|h)}\left[\mathbb{E}_{\pi_{\neg h} \cdot e}[1_h] \cdot U_{\downarrow h} + \mathbb{E}_{\pi_{\neg h} \cdot e}[1_{\neg h} U]\right]$$
Reshuffling some expectations, this turns into
$$\min_{U \in \hat{U}} \mathbb{E}_{\pi_{\neg h} \cdot e}[1_{\neg h} U] + \mathbb{E}_{\pi_{\neg h} \cdot e}[1_h] \mathbb{E}_{\pi^* \cdot (e|h)}[U_{\downarrow h}] \ge \min_{U \in \hat{U}} \mathbb{E}_{\pi_{\neg h} \cdot e}[1_{\neg h} U] + \mathbb{E}_{\pi_{\neg h} \cdot e}[1_h] \mathbb{E}_{\pi' \cdot (e|h)}[U_{\downarrow h}]$$
Let $\pi'$ be the policy that optimizes $V(\pi' \cdot (e|h))$. Now, we were previously at
$$= \min_{U \in \hat{U}} \mathbb{E}_{\pi_{\neg h} \cdot e}[1_{\neg h} U] + \mathbb{E}_{\pi_{\neg h} \cdot e}[1_h] \cdot \mathbb{E}_{\pi^* \cdot (e|h)}[U_{\downarrow h}]$$
but using the above inequality, we can proceed to
$$\ge \min_{U \in \hat{U}} \mathbb{E}_{\pi_{\neg h} \cdot e}[1_{\neg h} U] + \mathbb{E}_{\pi_{\neg h} \cdot e}[1_h] \cdot \mathbb{E}_{\pi' \cdot (e|h)}[U_{\downarrow h}] = \min_{U \in \hat{U}} \mathbb{E}_{(\pi_{\neg h} \bullet \pi') \cdot e}[1_{\neg h} U] + \mathbb{E}_{(\pi_{\neg h} \bullet \pi') \cdot e}[1_h U] = \min_{U \in \hat{U}} \mathbb{E}_{(\pi_{\neg h} \bullet \pi') \cdot e}[U] = U((\pi_{\neg h} \bullet \pi') \cdot e) = U((\pi_{\neg h} \bullet \mathrm{argmax}_{\pi'} V(\pi' \cdot (e|h))) \cdot e)$$
And we're done; we've shown our desired inequality
$$U((\pi_{\neg h} \bullet \mathrm{argmax}_{\pi^*}(U|e,h,\pi_{\neg h})(\pi^* \cdot (e|h))) \cdot e) \ge U((\pi_{\neg h} \bullet \mathrm{argmax}_{\pi^*} V(\pi^* \cdot (e|h))) \cdot e)$$
And $U$, $V$, and everything else were arbitrary, so it's universally valid.
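Here is a very small toy illustrating the shape of this argument (not the full formalism): one fixed off-history branch with reaching probability $r$, a finite menu of post-$h$ policies, and finite credal sets standing in for $U$ and $V$. The policy picked by argmaxing the updated $U$ never does worse under $U$ than the one picked by argmaxing $V$. All numbers are made up.

```python
import numpy as np

rng = np.random.default_rng(8)
n_out = 4                                            # post-h outcomes
policies = [rng.dirichlet(np.ones(n_out)) for _ in range(6)]
U_hat = [rng.uniform(0, 1, size=n_out + 1) for _ in range(3)]   # last slot = off-h payoff
V_hat = [rng.uniform(0, 1, size=n_out + 1) for _ in range(3)]
r = 0.4                                              # P(reaching h) under the off-h policy

def overall(pi, cred):
    # min over the credal set of: off-history term + r * expected post-h utility
    return min((1 - r) * U[-1] + r * (pi @ U[:-1]) for U in cred)

pi_U = max(policies, key=lambda pi: overall(pi, U_hat))   # argmax of the updated U
pi_V = max(policies, key=lambda pi: overall(pi, V_hat))   # argmax of some other V
assert overall(pi_U, U_hat) >= overall(pi_V, U_hat)
```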
Proposition 2: Lp Double Integral Inequality
If $f \ge 0$, let $\oint_p f \, d\mu$ be an abbreviation for $\left(\int f^p \, d\mu\right)^{1/p}$. Then for all $p \in [-\infty, 1]$, $\mu \in \Delta X$, $\nu \in \Delta Y$, and $f: X \times Y \to \mathbb{R}_{\ge 0}$, we have that
$$\int \oint_p f(x,y) \, d\nu \, d\mu \le \oint_p \int f(x,y) \, d\mu \, d\nu$$
Here's why. Define the function $g: Y \to \mathbb{R}_{\ge 0}$ as $g(y) = \left(\int f(x,y) \, d\mu\right)^{p-1}$. The reverse Hölder inequality, because $p \le 1$, says that, for all $x$,
$$\int f(x,y) g(y) \, d\nu \ge \oint_p f(x,y) \, d\nu \cdot \oint_q g(y) \, d\nu$$
where $\frac{1}{p} + \frac{1}{q} = 1$. So, using that inequality, we can derive that
$$\frac{\int f(x,y) g(y) \, d\nu}{\oint_q g(y) \, d\nu} \ge \oint_p f(x,y) \, d\nu$$
And so we derive that
$$\int \oint_p f(x,y) \, d\nu \, d\mu \le \int \frac{\int f(x,y) g(y) \, d\nu}{\oint_q g(y) \, d\nu} \, d\mu$$
Pull the constant out, and group this into a double integral.
$$= \left(\oint_q g(y) \, d\nu\right)^{-1} \cdot \int \int f(x,y) g(y) \, d\nu \, d\mu$$
Swap the order of double integration, and unpack what $\oint_q$ is.
$$= \left(\left(\int g(y)^q \, d\nu\right)^{1/q}\right)^{-1} \cdot \int \int f(x,y) g(y) \, d\mu \, d\nu$$
Pull the constant $g(y)$ partway out of the inner integral, and also collapse the two powers on the left-hand side.
$$= \left(\int g(y)^q \, d\nu\right)^{-1/q} \cdot \int g(y) \int f(x,y) \, d\mu \, d\nu$$
Now, since $\frac{1}{q} + \frac{1}{p} = 1$, solving for $q$, we get that it is equal to $\frac{p}{p-1}$. Similarly, $-\frac{1}{q} = \frac{1}{p} - 1$. Make those substitutions.
$$= \left(\int g(y)^{\frac{p}{p-1}} \, d\nu\right)^{\frac{1}{p}-1} \cdot \int g(y) \int f(x,y) \, d\mu \, d\nu$$
Unpack what $g(y)$ is.
$$= \left(\int \left(\left(\int f(x,y) \, d\mu\right)^{p-1}\right)^{\frac{p}{p-1}} d\nu\right)^{\frac{1}{p}-1} \cdot \int \left(\int f(x,y) \, d\mu\right)^{p-1}\left(\int f(x,y) \, d\mu\right) d\nu$$
Simplify.
$$= \left(\int \left(\int f(x,y) \, d\mu\right)^p d\nu\right)^{\frac{1}{p}-1} \cdot \int \left(\int f(x,y) \, d\mu\right)^p d\nu$$
Cancel.
$$= \left(\int \left(\int f(x,y) \, d\mu\right)^p d\nu\right)^{1/p}$$
Rewrite.
$$= \oint_p \int f(x,y) \, d\mu \, d\nu$$
And we're done! We've shown that
$$\int \oint_p f(x,y) \, d\nu \, d\mu \le \oint_p \int f(x,y) \, d\mu \, d\nu$$
For some values of $p$, like $0$ or $1$ or $-\infty$, these quantities will be ill-defined, but this can be handled by taking limits of the $p$ where this argument works.
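Here's a small numerical spot check of this inequality with finite $X$ and $Y$, a strictly positive toy $f$, and several values of $p \le 1$ (including negative ones); all numbers are illustrative.

```python
import numpy as np

rng = np.random.default_rng(4)
nx, ny = 5, 7
mu = rng.exponential(size=nx); mu /= mu.sum()     # distribution on X
nu = rng.exponential(size=ny); nu /= nu.sum()     # distribution on Y
f = rng.uniform(0.1, 2.0, size=(nx, ny))          # f : X x Y -> R_{>0}

def p_int(vals, weights, p):                      # "oint_p vals" w.r.t. the given weights
    return (weights @ vals ** p) ** (1 / p)

for p in (0.5, 0.9, -1.0, -3.0):
    lhs = mu @ np.array([p_int(f[x, :], nu, p) for x in range(nx)])
    rhs = p_int(f.T @ mu, nu, p)                  # (f.T @ mu)[y] = integral of f(x,y) dmu
    assert lhs <= rhs + 1e-12
```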
Corollary 1: Lp-Averaging is Well-Defined
Given any distribution $\nu$ over a family of functions or infrafunctions $F_i$, define the $L^p$-average of this family (for $p \in [-\infty, 1]$) as the function
$$\mu \mapsto \oint_p F_i(\mu) \, d\nu$$
If all the infrafunctions $F_i$ are $\ge 0$, then the $L^p$-average will be an infrafunction as well.
Proof: The main thing we need to verify is concavity of the resulting function. So, fix some $\mu_1$ and $\mu_2$, and use $q$ as our mixing parameter. Then,
$$q\left(\oint_p F_i \, d\nu\right)(\mu_1) + (1-q)\left(\oint_p F_i \, d\nu\right)(\mu_2) = q \oint_p F_i(\mu_1) \, d\nu + (1-q) \oint_p F_i(\mu_2) \, d\nu$$
Now, this can actually be interpreted as an ordinary integral outside of a $p$-integral: the outer integral is over the two-point space $\{1, 2\}$ (with probabilities $q$ and $1-q$), with $f(1, F) = F(\mu_1)$ and $f(2, F) = F(\mu_2)$. So, applying our inequality from the previous proposition, we get
$$\le \oint_p q F_i(\mu_1) + (1-q) F_i(\mu_2) \, d\nu$$
And then apply concavity of each $F_i$.
$$\le \oint_p F_i(q\mu_1 + (1-q)\mu_2) \, d\nu$$
Concavity has been proved.
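A toy numerical check of the corollary, with a made-up finite family of nonnegative concave functions standing in for the infrafunctions:

```python
import numpy as np

rng = np.random.default_rng(5)
n, n_funcs = 4, 3
# each F_i(mu) = min_j mu @ f_ij with positive f_ij, hence concave and > 0
families = [rng.uniform(0.1, 1.0, size=(5, n)) for _ in range(n_funcs)]
weights = rng.exponential(size=n_funcs); weights /= weights.sum()   # nu over the family

def F(i, mu):
    return min(mu @ f for f in families[i])

def lp_average(mu, p):
    return (weights @ np.array([F(i, mu) for i in range(n_funcs)]) ** p) ** (1 / p)

def random_simplex(n):
    w = rng.exponential(size=n)
    return w / w.sum()

for p in (0.5, -1.0):
    for _ in range(2000):
        mu1, mu2, t = random_simplex(n), random_simplex(n), rng.uniform()
        mix = lp_average(t * mu1 + (1 - t) * mu2, p)
        assert mix >= t * lp_average(mu1, p) + (1 - t) * lp_average(mu2, p) - 1e-10
```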
Theorem 4: Crappy Optimizer Theorem
For any selection process $s$ where $Q(s)$ fulfills the four properties, $\forall f: Q(s)(f) = \max_{\mu \in \Psi} \mu(f)$ will hold for some closed convex set of probability distributions $\Psi$. Conversely, the function $f \mapsto \max_{\mu \in \Psi} \mu(f)$ for any closed convex set $\Psi$ will fulfill the four properties of an optimization process.
It's already a theorem that any function $\psi: (X \to \mathbb{R}) \to \mathbb{R}$ which fulfills the three properties of concavity, monotonicity, and $\psi(af + c) = a\psi(f) + c$, has a representation as minimizing over a closed convex set of probability distributions, and that minimizing over a closed convex set of probability distributions produces a functional which fulfills those three properties. It's somewhere in Less Basic Inframeasure Theory. Well, actually, I proved some stuff about ultradistributions, and all the proofs were the same, just flipping concave to convex and minimization to maximization in an extremely straightforward way; and I proved that those three properties (but with convex flipped to concave) were equivalent to minimizing over a closed convex set of probability distributions, so the same result should hold.
Very similar arguments prove that any function which fulfills the three properties of convexity, monotonicity, and being able to pull multiplicative and additive constants out, has a representation as maximizing over a closed convex set of probability distributions $\Psi$, and vice-versa.
So, that just leaves deriving the three defining properties of an ultradistribution from the four properties assumed of the function $Q(s)$, and vice-versa. As a recap, the four properties assumed were
$$P1: \forall f, c \in \mathbb{R}: Q(s)(f + c) = Q(s)(f) + c$$
$$P2: \forall f, a \ge 0: Q(s)(af) = aQ(s)(f)$$
$$P3: \forall f, g: Q(s)(f + g) \le Q(s)(f) + Q(s)(g)$$
$$P4: \forall f: f \le 0 \to Q(s)(f) \le 0$$
And the three defining properties of ultradistributions are
$$U1: \forall f, g, p \in [0,1]: \psi(pf + (1-p)g) \le p\psi(f) + (1-p)\psi(g)$$
$$U2: \forall f, g: f \le g \to \psi(f) \le \psi(g)$$
$$U3: \forall f, c \in \mathbb{R}, a \ge 0: \psi(af + c) = a\psi(f) + c$$
Proof of U1 from P2 and P3:
$$Q(s)(pf + (1-p)g) \le Q(s)(pf) + Q(s)((1-p)g) = pQ(s)(f) + (1-p)Q(s)(g)$$
Proof of U2 from P3 and P4 and $f \le g$:
$$Q(s)(f) = Q(s)((f - g) + g) \le Q(s)(f - g) + Q(s)(g) \le Q(s)(g)$$
Proof of U3 from P1 and P2:
$$Q(s)(af + c) = Q(s)(af) + c = aQ(s)(f) + c$$
Now for the reverse direction!
Proof of P1 from U3:
$$\psi(f + c) = \psi(f) + c$$
Proof of P2 from U3:
$$\psi(af) = a\psi(f)$$
Proof of P3 from U1 and U3:
$$\psi(f + g) = \psi\left(p\frac{1}{p}f + (1-p)\frac{1}{1-p}g\right) \le p\psi\left(\frac{1}{p}f\right) + (1-p)\psi\left(\frac{1}{1-p}g\right) = \psi(f) + \psi(g)$$
Proof of P4 from U2 and U3 and $f \le 0$:
$$\psi(f) \le \psi(0) = \psi(0 \cdot 0) = 0 \cdot \psi(0) = 0$$
And done!
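The reverse direction is also easy to check numerically: take $\Psi$ to be the convex hull of a few toy distributions (so the max is attained at a vertex) and verify P1 through P4 on random inputs. A minimal sketch:

```python
import numpy as np

rng = np.random.default_rng(6)
n = 4
vertices = [rng.exponential(size=n) for _ in range(3)]
vertices = [v / v.sum() for v in vertices]                # extreme points of Psi

def Q(f):
    return max(v @ f for v in vertices)                   # max over Psi = max over vertices

for _ in range(2000):
    f, g = rng.normal(size=n), rng.normal(size=n)
    a, c = rng.exponential(), rng.normal()
    assert np.isclose(Q(f + c), Q(f) + c)                 # P1
    assert np.isclose(Q(a * f), a * Q(f))                 # P2
    assert Q(f + g) <= Q(f) + Q(g) + 1e-12                # P3
    assert Q(np.minimum(f, 0.0)) <= 1e-12                 # P4: f <= 0 implies Q(f) <= 0
```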
Proposition 3:
The space of infrafunctions is a $\square$-algebra, with the function $\mathrm{flat}_{FX}: \square FX \to FX$ being defined as $\lambda\psi.\lambda\mu.\psi(\lambda F.F(\mu))$.
We just need to show two commuting diagrams. For the square one, fix an arbitrary $\Psi: \square\square FX$ and $\mu \in \Delta X$.
$$\mathrm{flat}_{FX}(\mathrm{flat}_{\square FX}(\Psi))(\mu)$$
Unpack the outer layer of flattening.
$$= \mathrm{flat}_{\square FX}(\Psi)(\lambda F.F(\mu))$$
Unpack the next layer of flattening, via the usual way that flattening works for infradistributions.
$$= \Psi(\lambda\psi.\psi(\lambda F.F(\mu)))$$
Regroup this back up in a different way, with how $\mathrm{flat}_{FX}$ works.
$$= \Psi(\lambda\psi.\mathrm{flat}_{FX}(\psi)(\mu))$$
Write this as the beta reduction of the more complicated term
$$= \Psi(\lambda\psi.(\lambda F.F(\mu))(\mathrm{flat}_{FX}(\psi)))$$
Write this as a pushforward along an infrakernel.
$$= \mathrm{flat}_{FX*}(\Psi)(\lambda F.F(\mu))$$
Write this as a flattening.
$$= \mathrm{flat}_{FX}(\mathrm{flat}_{FX*}(\Psi))(\mu)$$
Now, since this holds for all $\mu$, that means that
$$\mathrm{flat}_{FX}(\mathrm{flat}_{\square FX}(\Psi)) = \mathrm{flat}_{FX}(\mathrm{flat}_{FX*}(\Psi))$$
And since this holds for all $\Psi$, this means that
$$\mathrm{flat}_{FX} \circ \mathrm{flat}_{\square FX} = \mathrm{flat}_{FX} \circ \mathrm{flat}_{FX*}$$
This demonstrates the first commutative diagram of an algebra of a monad, the one that looks like a square. Now for the forward-and-back diagram composing to identity.
We need that the usual embedding $e: FX \to \square FX$, given by $e(F) = \lambda g.g(F)$ where $g: FX \to \mathbb{R}$, has $\mathrm{flat}_{FX} \circ e = \mathrm{id}_{FX}$. Let's compute it. Fix an arbitrary $F: FX$ and a $\mu: \Delta X$. Then
$$\mathrm{flat}_{FX}(e(F))(\mu) = \mathrm{flat}_{FX}(\lambda g.g(F))(\mu) = (\lambda g.g(F))(\lambda G.G(\mu)) = (\lambda G.G(\mu))(F) = F(\mu) = \mathrm{id}_{FX}(F)(\mu)$$
And this holds for all $\mu$ and $F$, so composing flattening and the usual embedding produces the identity, which is the second property needed to declare something a $\square$-algebra.
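Since the whole computation is just beta reduction, it can be transcribed directly into higher-order functions. Below is a tiny sketch with a made-up infrafunction on a two-point space, verifying $\mathrm{flat}_{FX} \circ e = \mathrm{id}$ on examples:

```python
import numpy as np

flat = lambda psi: lambda mu: psi(lambda F: F(mu))   # flat_FX : []FX -> FX
e = lambda F: (lambda g: g(F))                       # embedding e : FX -> []FX

# a toy concave infrafunction on distributions over two points:
# F(mu) = min of two linear functionals of mu (mu is a length-2 array summing to 1)
F = lambda mu: min(mu @ np.array([1.0, 0.0]), mu @ np.array([0.3, 0.8]))

for t in np.linspace(0, 1, 11):
    mu = np.array([t, 1 - t])
    assert np.isclose(flat(e(F))(mu), F(mu))         # flat after e is the identity here
```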
Proposition 4:
All continuous infrakernels $k: X \to \square Y$ induce a function $FY \to FX$ via $G \mapsto (\mu \mapsto G(k_*(\mu)))$.
We've got two things to verify: concavity and upper-semicontinuity of $\lambda\mu.G(k_*(\mu))$. Upper-semicontinuity is easy, because if $\mu_n$ converges to $\mu$, then $k_*(\mu_n)$ converges to $k_*(\mu)$, and then upper-semicontinuity of $G$ takes over. For concavity, we have
$$G(k_*(p\mu_1 + (1-p)\mu_2)) = G(pk_*(\mu_1) + (1-p)k_*(\mu_2)) \ge pG(k_*(\mu_1)) + (1-p)G(k_*(\mu_2))$$
And it's proved.
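For the especially simple case where the kernel is purely probabilistic (a stochastic matrix on finite spaces, rather than a general infrakernel), this is easy to check numerically; a toy sketch:

```python
import numpy as np

rng = np.random.default_rng(7)
nx, ny = 4, 3
K = rng.exponential(size=(nx, ny)); K /= K.sum(axis=1, keepdims=True)   # row x = k(x)
gens = rng.normal(size=(5, ny))                     # G(nu) = min_j nu @ gens[j], concave

def G(nu):
    return min(nu @ g for g in gens)

def pulled_back(mu):                                # mu |-> G(k_*(mu)), with k_*(mu) = mu @ K
    return G(mu @ K)

def random_simplex(n):
    w = rng.exponential(size=n)
    return w / w.sum()

for _ in range(2000):
    mu1, mu2, p = random_simplex(nx), random_simplex(nx), rng.uniform()
    lhs = pulled_back(p * mu1 + (1 - p) * mu2)
    assert lhs >= p * pulled_back(mu1) + (1 - p) * pulled_back(mu2) - 1e-12
```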