Proposition 48: A function $K:X\to\square Y$ is an infrakernel and fulfills the property "$\forall C_X\in\mathcal{K}(X)$: the set $\{K(x)|x\in C_X\}$ is b-uniform" (where $\mathcal{K}(X)$ is the space of compact subsets of $X$) iff the function $K$ is bounded and continuous when $\square Y$ is equipped with the IKR metric.
So, our first direction is to assume that the relevant function is bounded and continuous, and show that it fulfills the infrakernel properties and the one extra compact b-uniformity property. The second direction is to show that, assuming the b-uniformity property and being an infrakernel, the function is bounded and continuous.
Anyways, so let's assume the function K is bounded and continuous. Ie,
$$\sup_{x\in X}\left(\sup_{f\in CB\text{-}lip(Y)}\frac{|K(x)(f)|}{\max(Li(f),||f||,1)}\right)<\infty$$
And, if $x_n$ limits to x, then $K(x_n)$ limits to K(x) in Hausdorff-distance/the infra-KR metric.
First, for purposes of disproof, let's assume K doesn't have bounded Lipschitz constant/amount of measure present. That is,
$$\forall\lambda\,\exists x\in X,(m,b)\in K(x):m(1)\ge\lambda$$
Then, you can let λ be any number you want, no matter how large, and find some $x_\lambda$ where $(m_\lambda,b_\lambda)\in K(x_\lambda)$ and $m_\lambda(1)\ge\lambda$ and $b_\lambda\ge 1$. Now, let's look at the norm of the infradistribution $K(x_\lambda)$. Plug in the constant function $-b_\lambda$ and we have:
$$K(x_\lambda)(-b_\lambda)\le m_\lambda(-b_\lambda)+b_\lambda\le-\lambda b_\lambda+b_\lambda$$
This is because $m_\lambda$ had λ or more measure present. Then, making that substitution, we have:
$$\sup_{x\in X}\left(\sup_{f\in C_{lip}}\frac{|K(x)(f)|}{\max(Li(f),||f||,1)}\right)\ge\sup_{f\in C_{lip}}\frac{|K(x_\lambda)(f)|}{\max(Li(f),||f||,1)}\ge\frac{|K(x_\lambda)(-b_\lambda)|}{\max(Li(-b_\lambda),||-b_\lambda||,1)}$$
And at this point we observe that the Lipschitz constant of a constant is 0, and the norm of a constant is just its absolute value, and $b_\lambda\ge 1$ by assumption, to get:
$$=\frac{|K(x_\lambda)(-b_\lambda)|}{b_\lambda}$$
Now we use our earlier inequalities and the absolute value (everything in our earlier inequalities is a negative quantity) to get:
$$\ge\frac{\lambda b_\lambda-b_\lambda}{b_\lambda}=\lambda-1$$
But λ can be unboundedly high, so our maximum norm is infinity, which we stipulated isn't the case. Thus, K does have bounded Lipschitz constant (this was derived from K's bounded IKR-norm alone).
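A tiny numeric restatement of this arithmetic (the scaffolding is mine; it just re-runs the inequality above for concrete numbers):

```python
# An a-measure (m, b) with m(1) = lam and b >= 1, evaluated at the constant
# function -b, gives m(-b) + b = -lam*b + b, so the norm ratio is lam - 1.
for lam in [10.0, 100.0, 1000.0]:
    b = 1.0
    ratio = abs(-lam * b + b) / b
    print(f"lam = {lam:6.0f}: norm ratio = {ratio}")
# Unbounded as lam grows, contradicting the assumed finite IKR norm.
```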
Now, we just need to derive the other two conditions on an infrakernel from boundedness and continuity of K.
If K is a continuous function X→□Y, and $C_X$ is an arbitrary compact subset of X, then $K(C_X)$ must be a compact subset of □Y, which, from our characterization of the necessary-and-sufficient conditions for compactness in Proposition 46, means that it fulfills the shared compact-almost-support property and the b-uniformity condition. Thus the function K fulfills the compact-shared compact-almost-support property, and the compact b-uniformity condition. Finally, for the pointwise-convergence condition of an infrakernel, continuity of the function X→□Y means that if $x_n$ limits to x, then $K(x_n)$ limits to K(x) in Hausdorff-distance, so by Proposition 45, for all continuous bounded functions f, $K(x_n)(f)$ limits to K(x)(f), and this is the last condition we need to call K an infrakernel.
So, bounded and continuous functions X→□Y imply that the function is an infrakernel and fulfills the compact b-uniformity condition. Time for the reverse direction. Let's assume that K is an infrakernel X→□Y and fulfills the compact b-uniformity condition. For showing boundedness, there's some Lipschitz constant upper bound $\lambda^{\odot}$, and we can go:
$$\sup_{x\in X}\left(\sup_{f\in C_{lip}}\frac{|K(x)(f)|}{\max(Li(f),||f||,1)}\right)$$
$$\le\sup_{x\in X}\left(\sup_{f\in C_{lip}}\frac{\lambda^{\odot}||f||}{\max(Li(f),||f||,1)}\right)$$
$$\le\sup_{x\in X}\left(\sup_{f\in C_{lip}}\lambda^{\odot}\right)=\lambda^{\odot}<\infty$$
And we have established boundedness. For continuity, it will take a more detailed argument. Let xn limit to x. Our task is to show that K(xn) limits to K(x) in Hausdorff-distance/infra-KR distance.
Now, the set $\{x_n\}_{n\in\mathbb{N}\cup\{\infty\}}$ is compact, so by the conditions for an infrakernel, there's a bound on the λ value of the minimal points of the $K(x_n)$ infradistributions, and the $K(x_n)$ sets have shared compact-almost-supports. We're also assuming the compact b-uniformity condition, so the $K(x_n)$ sets have the b-uniformity condition on them. By our necessary and sufficient conditions for precompactness, the set of points $\{K(x_n)\}_{n\in\mathbb{N}\cup\{\infty\}}$ is precompact in □Y, so we can isolate a convergent subsequence of the sequence of infradistributions $K(x_n)$.
What we'll do is show that all convergent subsequences of K(xn) have the same limit point, namely K(x). Here's the argument for it. Index the convergent subsequence with m and have K∞ be the limiting infradistribution of it. From the pointwise convergence condition on an infrakernel,
$$K(x)(f)=\lim_{n\to\infty}K(x_n)(f)=\lim_{m\to\infty}K(x_m)(f)=K_\infty(f)$$
Thus, our arbitrarily generated limit point K∞ must perfectly match up with K(x) and be identical to it, so since there's only one limit point, the sequence K(xn) actually converges to K(x). So, because convergent sequences in the input induce convergent sequences in the output, the function K:X→□Y must be continuous, and we've shown the other direction, that infrakernels fulfilling the compact b-uniformity condition must be bounded and continuous.
Proposition 49: The space □X equipped with the IKR-topology is a Polish space if X is too.
So, we've got a metric, the infra-KR distance, i.e. the Hausdorff-distance. We know that the notion of convergence remains independent of which metric on X these metrics are being induced by, so the topology is unchanged, and they're all metrizations. We also know from Proposition 43 that □X is complete under any infra-KR distance, i.e. Hausdorff-distance. So, □X is completely metrizable by any infra-KR distance induced by a complete metric on X; all that remains is to show that it's a separable space, i.e. to find a countable dense subset.
We know that a single point is compact, and we also know that compactness implies that for all ε, there is a $b^*$ value where $d_{hau}(H,H_{b^*})\le\epsilon$, by Proposition 46's b-uniformity.
So, the space of infradistributions with the b values of their minimal points having an upper bound is dense in the space of infradistributions. All we need to do now is to have a dense countable subset of the space of infradistributions with upper bounds on the b values of their minimal points.
The way we'll be doing this is taking a dense countable subset of the space of a-measures. There are only countably many ways to choose finitely many points from this subset, and a choice of finitely many a-measures can be taken to define an infradistribution by convex hull, closure and upper-completion. Assuming the fiddly details work out, we'll have countably many infradistributions which are (hopefully) dense in the space of infradistributions with bounded b values, which are dense in the space of infradistributions, so we've got our dense countable subset.
There's still much to do. We need to show that there's a dense countable subset of the space of a-measures at all, we need to deal with technicalities involving normalization to show that we always get an infradistribution by this process, and we need to show that given any infradistribution with bounded b values, we can make an infradistribution of this form arbitrarily close to it in Hausdorff-distance.
For our first part, the cone of a-measures can be viewed as M+(X)×R≥0, and you can also view the space of measures over X as ΔX×R≥0 (a pair of a distribution and a scaling term). This isn't quite right, because there's an identification of everything when the amount of measure is 0, but this doesn't really matter. So, we're trying to find a dense countable subset of
ΔX×R≥0×R≥0
Now, when X is a Polish space, ΔX equipped with the KR-metric is also a Polish space, so there's a dense countable subset. You can take the product of this with two copies of the dense countable subset of $\mathbb{R}_{\ge 0}$ consisting of the rational numbers 0 and above, to get a dense countable subset of the cone of a-measures.
For the normalization technicalities, we need to make sure that whenever we pick finitely many points from our dense countable subset, there's a point with b=0, a point with λ+b=1, and all points have λ+b≥1, in order to have normalization work and your batch of points define an actual infradistribution. This can always be done; throwing out the "bad" choices of finitely many points still leaves you with a countable set.
Finally, given an infradistribution $H_{b^*}$ with its minimal points all having $b\le b^*$, can we make infradistributions of this form, where we just specify finitely many points, arbitrarily close to it in Hausdorff-distance?
Well, yes. Because of the compact-projection property of infradistribution sets, and truncating at $b^*$, we can see that all the minimal points of $H_{b^*}$ lie in a compact set. So, given any ε, we can cover $H_{b^*}$ (well, the part with b values below $b^*$) with ε-sized balls centered at its points, get a finite subcover from compactness, and then use denseness of our countable subset to take our finitely many balls with center points and pick something from the dense countable subset in each ball. This gets us: a collection of finitely many points from the dense countable subset where each point in $H_{b^*}$ with b value below $b^*$ is ε distance away from one of these finitely many points, so the closed convex hull of these finitely many points is only ε distance away from $H_{b^*}$ (but truncated at $b^*$), and then when upper completion is done, they stay only ε distance away from each other.
So, given any infradistribution with bounded b values, and any ϵ, we can find finitely many points from a dense countable subset of the space of a-measures, which induces an infradistribution that's only ϵ distance away. Because there are only countably many ways to pick finitely many points from a countable set, we have a countable set of infradistributions where any infradistribution has infradistributions of this restricted form arbitrarily close to them, and so we have separability, the last condition we need for Polishness.
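To make this density argument concrete, here's a minimal numeric sketch (the setup is my own illustration, not the original construction: X=[0,1], a rational grid, and the fact that the KR distance on Δ[0,1] is the 1-d Wasserstein distance, per the earthmover equivalence used under Theorem 3 below): rounding a finitely-supported distribution to rational atoms with rational weights drives the KR distance to 0 as the grid refines.

```python
import numpy as np
from scipy.stats import wasserstein_distance

def rational_round(atoms, weights, denom):
    """Round atom locations and weights to multiples of 1/denom,
    renormalizing so the weights are again a probability vector."""
    locs = np.round(np.asarray(atoms) * denom) / denom
    w = np.maximum(np.round(np.asarray(weights) * denom), 0.0)
    if w.sum() == 0:
        w[0] = 1.0
    return locs, w / w.sum()

rng = np.random.default_rng(0)
atoms = rng.random(20)              # an arbitrary distribution on [0,1]...
weights = rng.random(20)
weights /= weights.sum()            # ...with 20 weighted atoms

for denom in [4, 16, 64, 256]:
    locs, w = rational_round(atoms, weights, denom)
    print(f"denominator {denom:4d}: "
          f"KR distance {wasserstein_distance(atoms, locs, weights, w):.5f}")
# The distance shrinks toward 0: rational finite-support distributions
# are dense in ΔX, mirroring the countable dense subset built above.
```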
Proposition 50: The infra-Vietoris topology and the topology induced by the IKR metrics are the same.
This proof will proceed in three parts. The first is showing that the collection of sets declared-to-be-open in the infra-Vietoris topology is closed under finite intersections, and thus forms a basis (instead of a sub-basis). The second is taking any basis open set from the basis induced by the IKR metric (we'll be using the strongly equivalent Hausdorff-metric for this) and any point/infradistribution in it, and finding an infra-Vietoris open set from the basis which is a subset of the open ball in the IKR metric. Finally, we take any basis open set from the infra-Vietoris topology, and any point/infradistribution in such a set, and find a tiny open ball w.r.t. the IKR metric that fits within the infra-Vietoris open set.
So, here's what we're going to do: First, we're going to show that the intersection of two open sets from the basis is also in the basis. Obviously, we only need to show this for pairwise intersection, as that shows closure under finite intersection.
Consider the open set O1 of infradistributions to be generated by the finite family of bounded-b open sets Oi, and the open set O2 of infradistributions to be generated by the finite family of bounded-b open sets Oj.
Claim: The finite family of bounded-b open sets $O_i\cap(\bigcup_j O_j^{uc})$ and $O_j\cap(\bigcup_i O_i^{uc})$, for all the i and j, is a suitable batch of open sets in the a-measures to perfectly replicate the $O_1\cap O_2$ open set of infradistributions.
The properties for H∈O1∩O2 are:
$$H\subseteq\bigcup_i O_i^{uc}$$
$$H\subseteq\bigcup_j O_j^{uc}$$
$$\forall i:H\cap O_i\ne\emptyset$$
$$\forall j:H\cap O_j\ne\emptyset$$
$$\overline{pr_m(H)}\subseteq pr_m\left(\bigcup_i O_i\right)$$
$$\overline{pr_m(H)}\subseteq pr_m\left(\bigcup_j O_j\right)$$
And, the properties for H lying in the set induced by the $O_i\cap(\bigcup_j O_j^{uc})$ and $O_j\cap(\bigcup_i O_i^{uc})$ families of open sets (upper completion of an open set is open) are:
$$H\subseteq\bigcup_i\left(O_i\cap\bigcup_j O_j^{uc}\right)^{uc}\cup\bigcup_j\left(O_j\cap\bigcup_i O_i^{uc}\right)^{uc}$$
$$\forall i:H\cap O_i\cap\bigcup_j O_j^{uc}\ne\emptyset$$
$$\forall j:H\cap O_j\cap\bigcup_i O_i^{uc}\ne\emptyset$$
$$\overline{pr_m(H)}\subseteq pr_m\left(\bigcup_i\left(O_i\cap\bigcup_j O_j^{uc}\right)\cup\bigcup_j\left(O_j\cap\bigcup_i O_i^{uc}\right)\right)$$
So let's get started on showing equivalence of these things. We'll use M for an a-measure, and use a subscript of h,i, or j to denote which sort of set it's in.
First, let's take the conjunction of these two things.
$$H\subseteq\bigcup_i O_i^{uc}$$
$$H\subseteq\bigcup_j O_j^{uc}$$
This is equivalent to "for any $M_h$, it is above some $M_i$ and also above some $M_j$". Since the only notion of upper completion is adding the +b term, this is equivalent to "for any $M_h$, we are in one of two cases: either there is a $M_i,M_j$ such that $M_h\ge M_i\ge M_j$, or $M_h\ge M_j\ge M_i$".
Similarly,
$$H\subseteq\bigcup_i\left(O_i\cap\bigcup_j O_j^{uc}\right)^{uc}\cup\bigcup_j\left(O_j\cap\bigcup_i O_i^{uc}\right)^{uc}$$
is saying "for any $M_h$, we are in one of two cases: either there is a $M_i,M_j$ such that $M_h\ge M_i\ge M_j$, or vice-versa". And so these two conditions are equivalent.
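Since this equivalence is pure order-theoretic bookkeeping about upper completions, it can be sanity-checked on a finite toy model (the model is entirely my own scaffolding: abstract measure components, a discrete grid of b values, and random finite "open" sets):

```python
import random

random.seed(0)
MS, BS = range(3), range(4)                    # measure components, b grid
POINTS = [(m, b) for m in MS for b in BS]      # toy a-measures (m, b)

def uc(S):
    """Upper completion: only the b component can be raised."""
    return {(m, b2) for (m, b) in S for b2 in BS if b2 >= b}

def union_uc(Os):
    return set().union(*(uc(O) for O in Os))

for trial in range(2000):
    H = set(random.sample(POINTS, 3))
    Oi = [set(random.sample(POINTS, 2)) for _ in range(2)]
    Oj = [set(random.sample(POINTS, 2)) for _ in range(2)]
    lhs = H <= union_uc(Oi) and H <= union_uc(Oj)
    rhs = H <= (union_uc([O & union_uc(Oj) for O in Oi])
                | union_uc([O & union_uc(Oi) for O in Oj]))
    assert lhs == rhs
print("the two subset conditions agree on all random trials")
```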
Also, if you have both of:
$$\forall i:H\cap O_i\cap\bigcup_j O_j^{uc}\ne\emptyset$$
$$\forall j:H\cap O_j\cap\bigcup_i O_i^{uc}\ne\emptyset$$
Then you trivially have
$$\forall i:H\cap O_i\ne\emptyset$$
$$\forall j:H\cap O_j\ne\emptyset$$
And in the reverse direction, if you have:
$$H\subseteq\bigcup_i O_i^{uc}$$
$$H\subseteq\bigcup_j O_j^{uc}$$
$$\forall i:H\cap O_i\ne\emptyset$$
$$\forall j:H\cap O_j\ne\emptyset$$
then you'll get both of
$$\forall i:H\cap O_i\cap\bigcup_j O_j^{uc}\ne\emptyset$$
$$\forall j:H\cap O_j\cap\bigcup_i O_i^{uc}\ne\emptyset$$
(any point of $H\cap O_i$ lies in H, and so lies in $\bigcup_j O_j^{uc}$; symmetrically for the j case).
That just leaves the projection property.
If you have
$$\overline{pr_m(H)}\subseteq pr_m\left(\bigcup_i O_i\right)$$
$$\overline{pr_m(H)}\subseteq pr_m\left(\bigcup_j O_j\right)$$
then, let's say (m,0) is your arbitrary point in the closure of the projection of H. This is saying $(m,0)\le M_i$ and $(m,0)\le M_j$ for some $M_i$ and $M_j$. Again, by how upper completion is just adding b, this is equivalent to "given an arbitrary point (m,0) in the closure of the projection of H, either $(m,0)\le M_i\le M_j$ or $(m,0)\le M_j\le M_i$". And the condition
$$\overline{pr_m(H)}\subseteq pr_m\left(\bigcup_i\left(O_i\cap\bigcup_j O_j^{uc}\right)\cup\bigcup_j\left(O_j\cap\bigcup_i O_i^{uc}\right)\right)$$
is saying "given an arbitrary point (m,0) in the closure of the projection of H, either $(m,0)\le M_j\le M_i$ or $(m,0)\le M_i\le M_j$", which is clearly equivalent. So, we've got our bidirectional equivalence, and any binary (and thus finite) intersection of infra-Vietoris basis opens is an infra-Vietoris basis open, so it is indeed a basis.
Now we must show that this infra-Vietoris topology is identical to the topology induced by the infra-KR metric/Hausdorff distance metric. This can be done by taking any ball in any Hausdorff distance metric centered at a point/infradistribution, and finding an infra-Vietoris open set that contains the same point and is a subset of the open ball. Conversely, we should be able to take any infra-Vietoris open set, and any infradistribution/point within it, and find an IKR open ball centered on that point that lies entirely within the infra-Vietoris open set, showing that the two topologies are equal.
So, first, assuming we've got a Hausdorff-distance ball of radius ε centered at some infradistribution, can we make an infra-Vietoris set that lies entirely within the ball and engulfs the center point?
Well, there's some b cutoff where the projection of the infradistribution H (cut off at that b value) is within Hausdorff-distance ε/4 of the projection of the entire infradistribution H to the measure component. Accordingly, we have one of our open sets be an ε/2-thickening of the chopped-off infradistribution, as this upper completion engulfs the entire infradistribution H within it with ε/2 room to spare. The rest is done by covering the compact chopped-off portion with all possible open balls of size ε/4, and using compactness of the chopped-off infradistribution to pick a finite subcover/finitely many open balls. H has the closure of its projection lie entirely within the projection of the ε/2-thickening of the lower part, has nonempty intersection with each open ball, and is a subset of the upper completion of the union of everything.
Now, our task is to show that any infradistribution that lies in the open set induced by that finite collection of open sets lies in the Hausdorff-distance-of-ε ball centered at H. So, let our arbitrary new infradistribution H′ be a subset of the union of the upper completions of all these open sets, have nonempty intersection with each of them, and have the closure of its projection lie in the projection of the union of them. Note that the initial open set which engulfs the entire bottom portion has all the other open sets as subsets of it, so the union and upper completion of the opens is just the upper completion of the "big" open set. Any point in H′ must lie in the upper completion of this "big" open set, and it's made by thickening up H (but chopped off) and taking the upper completion, witnessing that the distance from H′ to H is ε/2 or less.
For the other direction, pick an arbitrary point in H. It's within ε/4 of the upper completion of one of the finitely many balls used for the open cover. And because those center points are center points of ε/4-open balls which H′ must have nonempty intersection with, and H′ is upper-complete, it has a point of distance ε/2 or less from your starting point in H. Thus, anything in this basis open set has Hausdorff-distance of ε/2 or less from the center point H, so all open balls in the infra-KR metric topology can have their central point engulfed by an infra-Vietoris open set which is a subset of the initial open ball. So we're half-done: we know that all open sets of the infra-KR metric are open in the infra-Vietoris topology.
In the other direction, we need to take an infra-Vietoris open set, and an infradistribution in it, and find some tiny Hausdorff-distance ball around that arbitrary infradistribution that remains in the infra-Vietoris open set. Let H be your infradistribution, which is a subset of the union of the upper completions of the $O_i$, has nonempty intersection with each $O_i$, and has the closure of its projection be a subset of the projection of the union of the $O_i$.
Now, there's one important fact we have to establish first: it is impossible for a sequence $M_n\in H$ to get arbitrarily close to the complement of $\bigcup_i O_i^{uc}$. Fix some sequence $M_n\in H$ and assume, for contradiction, that this sequence gets arbitrarily close to the complement of $\bigcup_i O_i^{uc}$.
One of two things can happen. The first thing that can happen is that $M_n$ has a subsequence with bounded b value. Then, we can pick out a convergent subsequence of that subsequence, because infradistributions, when cut off at any finite b value, are compact, and take the limit point $M_\infty$, which lies in the set H due to closure, and yet has distance 0 from the exterior of $\bigcup_i O_i^{uc}$, which, being the complement of an open set, is closed. So, we've got a point in H which is also not in $\bigcup_i O_i^{uc}$, which is impossible.
Therefore, the $M_n$ sequence must have unboundedly large b value as n increases. It projects down to make an $m_n$ sequence of measures in the compact set $\overline{pr_m(H)}$, and we can isolate a convergent subsequence of those, with limit point $m_\infty$. Now, something interesting happens.
For the $M_n$ sequence of a-measures, eventually their b value goes well past the largest b value occurring in any of the finitely many bounded-b open sets. And when we go a tiny distance away to a point $M^*_n$ which lies outside of $\bigcup_i O_i^{uc}$, then... well, it's got nothing from any of the $O_i$ below it. And nothing from any of the $O_i$ above it either, because the b value of $M^*_n$ is too high, since it's near $M_n$. So, $M^*_n$ projects down to some measure $m^*_n$ which lies outside of $pr_m(\bigcup_i O_i)$. And since $M^*_n$ is really close to $M_n$, $m_n$ (in the closure of the projection of H) is really close to the exterior of the projection of $\bigcup_i O_i$, as witnessed by $m^*_n$. And that exterior is a closed set, because the complement of an open set is closed, and the projection of an open set is open.
This lets us show that our $m_\infty$ limit point lies in the complement of $pr_m(\bigcup_i O_i)$, which is impossible, because $m_\infty$ lies in $\overline{pr_m(H)}$, which is a subset of $pr_m(\bigcup_i O_i)$.
So our original assumption was wrong, and there is some δ where any point in H is always δ or more distance away from the exterior of $\bigcup_i O_i^{uc}$. Further, for each $O_i$, since $H\cap O_i\ne\emptyset$, you can pick a point in the intersection, and find some $\epsilon_i$-sized open ball around it that lies entirely within the open set $O_i$.
Now, within a Hausdorff-distance of $\min(\delta,\min_i\epsilon_i)/2$ around H, all infradistributions that close must lie in the same Vietoris-open set. Here's why. Since all points in your new infradistribution H′ are at most $\min(\delta,\min_i\epsilon_i)/2$ distance away from a point in H, and given any point in H you need to travel δ distance to get to the exterior of $\bigcup_i O_i^{uc}$, H′ is a subset of that open set as well.
Also, $\forall i:H'\cap O_i\ne\emptyset$, because given the central points in the $H\cap O_i$ sets, you can travel $\min(\delta,\min_i\epsilon_i)/2$ distance to get to a point in H′, which lies in the $\epsilon_i$-sized open ball around said point, which still lies entirely within $O_i$. So any H′ that close must intersect the same collection of open sets.
Finally, the projection thing is taken care of by the same sort of Hausdorff-distance argument we used to argue that said H′ must lie in the union of upper-completions of the open sets. So, given any Vietoris-open, and any point in it, we can find an IKR-open which engulfs said point and lies entirely in the Vietoris open.
Therefore, the two topologies are the same.
Theorem 3: Given a crisp infrakernel $K:X\xrightarrow{ik}X$, if there's a c<1 where, regardless of x and x′, $d_{hau}(K(x),K(x'))\le c\cdot d(x,x')$, then there's a unique crisp infradistribution S where $K_*(S)=S$, and it can be found by starting with any crisp infradistribution on X and repeatedly applying K.
This theorem is going to require showing a lot of fiddly measure-theoretic stuff, and then once everything's all set up, the actual theorem itself falls pretty fast. So, we're going to start off by noting or proving three things needed to make progress.
The first is: "given any measurable function s:X→ΔX, and any probability distribution μ∈ΔX, and any ε, there is a continuous function g:X→ΔX s.t. $\int_X||g-s||d\mu<\epsilon$".
This is an analogue of the celebrated result that continuous functions are dense in the measurable functions. I'm gonna skip small bits of it, to be reconstructed if you ask. Pretty much, we use the fact that ΔX is separable in the KR-metric, and the results from this pdf to get a few results. ΔX is separable according to the KR-metric, and s has all its output values in that separable space, so it's separably valued. Also s is measurable, so by proposition 1.8 in the paper, s is strongly measurable, so it's the pointwise limit of a sequence of simple functions. The actual proof of this fact goes through Theorem 1.5 in that paper, and looking carefully at Theorem 1.5, the simple functions constructed have their value in ΔX, because the proof is like "fix a dense countable subset of your space of interest, then modify s to be a simple function by rounding off the outputs to the nearest of a finite collection of points from that subset". Proposition 1.10 from that paper says that since s is strongly measurable, then given any μ∈ΔX, s is strongly μ-measurable, and then proposition 1.16 from that paper says that s is Bochner-integrable because ∫X||s(x)||dμ=1<∞, because all the s(x) are probability distributions. Further, part of the proof of proposition 1.16 from that paper shows that the ΔX-valued simple functions, due to their bounded norm, converge to s in the L1 norm by the dominated convergence theorem.
Anyways, at this point, we've shown that any measurable function s:X→ΔX has simple ΔX-valued functions that limit to it, so the simple functions are dense in the measurable functions X→ΔX. Thus, at this point, we just need to show that we can make continuous functions X→ΔX that are arbitrarily close to some simple function χ:X→ΔX.
Let's begin. Our simple functions χ, from looking carefully at the proof of Theorem 1.5 from the linked paper, take the form of partitioning X into finitely many disjoint and covering measurable sets Ai, and mapping each of them to some probability distribution νi∈ΔX. Let n be how many disjoint sets there are.
Thus, our task is to take a simple function χ:X→ΔX, and find a continuous function g:X→ΔX where $\int_X||\chi(x)-g(x)||d\mu<\epsilon$. If we can do this for any ε, then it will show that continuous functions are dense in the simple functions, which are dense in the measurable functions, in $L^1$.
So, here's what you do. Every probability distribution over a Polish space has the property that the measure of any measurable set A can be approximated from below by the measure of a compact subset $C\subseteq A$. We have finitely many disjoint measurable sets $A_i$, and we associate each of them with a compact subset $C_i\subseteq A_i$ where $\mu(A_i\setminus C_i)<\epsilon/n$.
Let's abuse notation a bit and have d(x,Ci) be the minimum distance from x to a point in Ci. Then, let g be defined as:
$$g(x):=\sum_{i\le n}\left(\frac{1/d(x,C_i)}{\sum_j 1/d(x,C_j)}\right)\nu_i$$
Admittedly, the distance may be 0, leading to an ∞/∞ weight; by convention, such a weight is considered to be 1 (and the others 0). The reason this works and is continuous is because all the $C_i$ are bounded away from each other. This is because they're all subsets of the $A_i$, which are disjoint, so all the $C_i$ are disjoint, and any two compact sets which are disjoint must be bounded away from each other, because otherwise you could fix a sequence in one compact set getting closer and closer to the other compact set, isolate a convergent subsequence, and it'd limit to be distance 0 from the other compact set, and so lie in it, contradicting disjointness. Now, we can go:
$$\int_X||g(x)-\chi(x)||d\mu$$
$$=\sum_{i\le n}\left(\int_{A_i}||g(x)-\chi(x)||d\mu\right)$$
$$=\sum_{i\le n}\left(\int_{C_i}||g(x)-\chi(x)||d\mu+\int_{A_i\setminus C_i}||g(x)-\chi(x)||d\mu\right)$$
And then, for the next step, we observe that when $x\in C_i$, χ is like "well, it's in $A_i$, I'll return $\nu_i$ as my output", and g is like "well, it's distance 0 from $C_i$, so I turn into $\frac{\infty}{\infty}\nu_i=\nu_i$ by convention". Further, g always returns probability distributions, and the maximum KR-distance two probability distributions can have (when the original space X has its metric bounded above by 1, which can always be imposed without altering the topology) is 1. So, we get
$$\le\sum_{i\le n}\left(\int_{C_i}||\nu_i-\nu_i||d\mu+\int_{A_i\setminus C_i}1\,d\mu\right)$$
$$=\sum_{i\le n}\left(\int_{A_i\setminus C_i}1\,d\mu\right)=\sum_{i\le n}\mu(A_i\setminus C_i)$$
$$<\sum_{i\le n}\frac{\epsilon}{n}=\epsilon$$
And we're done. We showed that any measurable function s:X→ΔX could be arbitrarily closely approximated in L1 by a simple function χ:X→ΔX, and that any simple function could be arbitrarily closely approximated in L1 by a continuous function g:X→ΔX.
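For concreteness, here's a minimal sketch of the inverse-distance-weighting construction (the specific spaces and sets are my own illustrative assumptions: X=[0,1], cores $C_1=[0,0.4]$ and $C_2=[0.6,1]$, and two output distributions over a 3-point space): the normalized $1/d(x,C_i)$ weights blend the $\nu_i$ continuously and agree exactly with χ on the compact cores.

```python
import numpy as np

nu = np.array([[1.0, 0.0, 0.0],     # nu_1, returned on A_1
               [0.0, 0.5, 0.5]])    # nu_2, returned on A_2

def dist_to_interval(x, lo, hi):
    """Distance from x to the compact core [lo, hi]."""
    return max(lo - x, 0.0, x - hi)

def g(x):
    """Inverse-distance-weighted blend of the nu_i."""
    d = np.array([dist_to_interval(x, 0.0, 0.4),
                  dist_to_interval(x, 0.6, 1.0)])
    if np.any(d == 0):                  # on a core: the infinity/infinity
        w = (d == 0).astype(float)      # convention gives that nu_i weight 1
    else:
        w = 1.0 / d
    return (w / w.sum()) @ nu

for x in [0.2, 0.4, 0.45, 0.5, 0.55, 0.6, 0.8]:
    print(f"g({x:.2f}) = {np.round(g(x), 3)}")
# g is continuous, equals nu_1 on C_1 and nu_2 on C_2, and blends in between,
# so it disagrees with the simple function chi only on the small gaps A_i\C_i.
```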
The next fact we'll need is that, when X has its distance metric bounded above by 1, which can always be done without altering the topology, then the KR-distance on probability distributions is exactly equal to the Earthmover distance, the first Wasserstein metric.
The third and final fact we need is that, if you have a function K:X→K(ΔX) (function to the space of compact subsets of ΔX) which is continuous w.r.t. the Hausdorff metric and always produces compact convex sets, then the function ψ:X×X→K(ΔX) (defined relative to some continuous function g:X→ΔX and some ε), defined by:
$$\psi(x,x'):=\overline{\left\{\nu\in K(x')\,\middle|\,d(\nu,g(x))<\inf_{\nu'\in K(x')}d(\nu',g(x))+\epsilon\right\}}$$
is lower-hemicontinuous.
The requirement for being lower-hemicontinuous is if you have some sequence (xn,x′n) limiting to (x,x′), and some ν∈ψ(x,x′), then you should be able to find some sequence of points νn∈ψ(xn,x′n) where νn limits to ν.
First, we'll need that, if $x'_n$ limits to x′, then $\bigcup_{n\in\mathbb{N}\cup\{\infty\}}K(x'_n)$ is a compact set. We know that K is continuous and $\{x'_n\}_{n\in\mathbb{N}\cup\{\infty\}}$ is compact, so Lemma 3 steps in to show that said set is compact.
The relevant implication of this, when paired with Hausdorff-continuity of K, is that you can take any sequence of distributions νn∈K(x′n), and find a convergent subsequence that converges to something in K(x′).
The second piece for this argument we'll need is that:
$$\lim_{n\to\infty}\inf_{\nu'\in K(x'_n)}d(\nu',g(x_n))=\inf_{\nu'\in K(x')}d(\nu',g(x))$$
So, first, we observe that since all the $K(x'_n)$ are compact, we can find actual minimizers $\nu^{min}_n$. We don't know that the limit actually exists, so pass to a subsequence indexed by m where we're taking the liminf. Pass to another subsequence where the $\nu^{min}_m$ actually converge to a point in K(x′), which we can do by the immediately preceding point about compactness. Then,
Alright, that's part two of the argument for part three of the preliminaries for this theorem down. Time for part three of the argument. We'll assume that $x_n$ limits to x, $x'_n$ limits to x′, and pick an arbitrary ν s.t. ν∈K(x′) and
$$d(\nu,g(x))<\inf_{\nu'\in K(x')}d(\nu',g(x))+\epsilon$$
Let νn∈K(x′n) be some point in K(x′n) that's as close as possible to ν. Obviously, these limit to ν, due to Hausdorff-continuity of K. Non-obviously, a tail of them (all but finitely many), will lie in ψ(xn,x′n). Here's why.
Let δ be:
$$\delta:=\inf_{\nu'\in K(x')}d(\nu',g(x))+\epsilon-d(\nu,g(x))$$
There is some finite n by which you are guaranteed three things forever afterwards.
First, $d_{hau}(K(x'_n),K(x'))<\delta/3$, because K is Hausdorff-continuous and $x'_n$ limits to x′.
Second, $d(g(x_n),g(x))<\delta/3$, because g is continuous and $x_n$ limits to x.
Third,
$$\inf_{\nu'\in K(x')}d(\nu',g(x))<\inf_{\nu'\in K(x'_n)}d(\nu',g(x_n))+\delta/3$$
Because we know that this minimum-distance quantity limits to the minimum-distance quantity of the limit, we showed it above. Now, at this point, we can go:
$$d(\nu_n,g(x_n))\le d(\nu_n,\nu)+d(\nu,g(x))+d(g(x),g(x_n))$$
$$\le d_{hau}(K(x'_n),K(x'))+d(\nu,g(x))+d(g(x),g(x_n))$$
$$<d(\nu,g(x))+\frac{2\delta}{3}$$
$$=\left(\inf_{\nu'\in K(x')}d(\nu',g(x))+\epsilon-\delta\right)+\frac{2\delta}{3}$$
$$<\left(\left(\inf_{\nu'\in K(x'_n)}d(\nu',g(x_n))+\frac{\delta}{3}\right)+\epsilon-\delta\right)+\frac{2\delta}{3}$$
$$=\inf_{\nu'\in K(x'_n)}d(\nu',g(x_n))+\epsilon$$
This was splitting the distance up, observing that $\nu_n$ was selected to be as close as possible to ν (so it'd be as close or closer than the Hausdorff distance between the sets they were selected from), using that two of the distances were small, using how δ was defined, and using our bound on the minimum-distance quantity. Anyways, this shows that forever after some timestep,
$$d(\nu_n,g(x_n))<\inf_{\nu'\in K(x'_n)}d(\nu',g(x_n))+\epsilon$$
ie, a tail of the $\nu_n$ lies in $\psi(x_n,x'_n)$, as claimed.
Time for our final part! Showing lower-hemicontinuity of ψ requires showing that, for all sequences $(x_n,x'_n)$ limiting to (x,x′), and all ν∈ψ(x,x′), we can find a sequence $\nu_n\in\psi(x_n,x'_n)$ limiting to ν. We just showed that for ν fulfilling the defining inequality. But we took the closure, didn't we? What if our ν was added in the closure? Then we can take a sequence $\nu_m\in\psi(x,x')$ (but without the closure part) that does fulfill the defining inequality, and do a thing where we make a sequence $\nu_{n,m}\in\psi(x_n,x'_n)$ which limits to $\nu_m$, one sequence for each m. Then, we basically proceed for a while in $\nu_{n,0}$ until we have a guarantee that $\nu_{n,1}$ will remain close to $\nu_1$, then switch to picking from the sequence $\nu_{n,1}$, and repeat, slowly ratcheting up the second index over time. Said sequence will get arbitrarily close to the $\nu_m$ values, which get arbitrarily close to ν itself, so this "step up the sequence index when you've waited long enough" sequence can limit to ν.
And with this argument, we know that ψ is lower-hemicontinuous.
Alright, now we can start embarking on the true proof instead of just setting everything up. Remember, we assumed that there was a crisp infrakernel K:X→□X, all the K(x) have compact sets of minimal points (by CAS), and that there was some fixed c<1 where $d_{hau}(K(x)^{min},K(x')^{min})\le c\,d(x,x')$ regardless of x and x′, and our goal is to show that there's a unique stationary crisp infradistribution which can be found by selecting any crisp infradistribution and repeatedly iterating K.
There's a bit of setup to do first, but thankfully it isn't as demanding as we've done before. First, we'll abuse notation a bit by using H and H′ and K(x) and K(x′) to refer, not to the infradistribution as a whole, but to their compact convex sets of minimal points, ie, sets of probability distributions.
Second, we recall from an earlier proposition's proof the form of $K_*(H)$ (the pushforward of H). Well, actually, the form of its set of minimal points. It is:
$$\overline{c.h.}\left(\bigcup_{\mu\in H}E_\mu[K(x)]\right)$$
Ie, the closed convex hull of the union of mixtures of the sets of minimal points of K(x). Mixing sets with regard to a probability distribution is done by fixing a measurable selection function s:X→ΔX s.t. for all x, s(x)∈K(x). $E_\mu[s(x)]$ is a probability distribution, and then you just union all those points together for all the possible selection functions to make your set $E_\mu[K(x)]$.
Our final thing to recall (well, actually, look up on Wikipedia) is that, since the Earthmover distance, 1st Wasserstein metric, and KR-distance are all the same, we'll be using an alternate formulation of distance. If μ,μ′ are probability distributions over X, then
$$d(\mu,\mu')=\inf_{\xi'\in\Gamma(\mu,\mu')}\int_{X\times X}d(x,x')\,d\xi'$$
Ie, of all probability distributions over X×X, restrict to those which make μ and μ′ when you project it down to the appropriate coordinates, and try to pick one that makes the integral of the distance function as low as possible. The lowest integral of distance you can get is the earthmover distance. You can check Wikipedia for this formulation. We'll be using this form once.
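As a concrete check of this coupling formulation (a sketch under the assumption of finitely-supported distributions on the line; the helper function is mine), the coupling problem is a small linear program, and its value matches scipy's built-in 1-d Wasserstein distance:

```python
import numpy as np
from scipy.optimize import linprog
from scipy.stats import wasserstein_distance

def earthmover_lp(x, mu, y, nu):
    """Minimize sum_ij d(x_i,y_j)*xi_ij over couplings xi of (mu, nu)."""
    n, m = len(x), len(y)
    cost = np.abs(x[:, None] - y[None, :]).ravel()
    A_eq = []
    for i in range(n):                   # row marginals: sum_j xi_ij = mu_i
        row = np.zeros((n, m)); row[i, :] = 1.0; A_eq.append(row.ravel())
    for j in range(m):                   # column marginals: sum_i xi_ij = nu_j
        col = np.zeros((n, m)); col[:, j] = 1.0; A_eq.append(col.ravel())
    res = linprog(cost, A_eq=np.array(A_eq),
                  b_eq=np.concatenate([mu, nu]), bounds=(0, None))
    return res.fun

x = np.array([0.0, 0.3, 0.9]); mu = np.array([0.2, 0.5, 0.3])
y = np.array([0.1, 0.7]);      nu = np.array([0.6, 0.4])
print(earthmover_lp(x, mu, y, nu))         # LP over couplings
print(wasserstein_distance(x, y, mu, nu))  # scipy's W1: same value
```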
Let's begin. First, by the form of the set of minimal points of the pushforward,
$$d_{hau}(K_*(H),K_*(H'))=d_{hau}\left(\overline{c.h.}\left(\bigcup_{\mu\in H}E_\mu[K(x)]\right),\overline{c.h.}\left(\bigcup_{\mu'\in H'}E_{\mu'}[K(x)]\right)\right)$$
And then, we observe that taking the closure doesn't affect the Hausdorff distance at all, so we can go:
$$=d_{hau}\left(c.h.\left(\bigcup_{\mu\in H}E_\mu[K(x)]\right),c.h.\left(\bigcup_{\mu'\in H'}E_{\mu'}[K(x)]\right)\right)$$
$$\le d_{hau}\left(\bigcup_{\mu\in H}E_\mu[K(x)],\bigcup_{\mu'\in H'}E_{\mu'}[K(x)]\right)$$
The reason for that inequality is that convex hull doesn't increase the Hausdorff distance. If you've got a finite mix of points from ⋃μ∈HEμ[K(x)] in the convex hull, then you could just take each of those, find the closest point from the analogue for H′, and mix those together to make a point close to your thing added in the convex hull, so distance doesn't go up when you do convex hull.
Our mission is to impose a suitable bound on the second term, and once we've done that, show that $K_*$ has type K(ΔX)→K(ΔX) (maps compact sets of probability distributions to compact sets of probability distributions), and then we can apply the Banach fixpoint theorem on K(ΔX) to get a fix-set of probability distributions that doesn't alter when you apply $K_*$ (pushforward), and can be found by iteration.
For our first part, imposing a bound on
$$d_{hau}\left(\bigcup_{\mu\in H}E_\mu[K(x)],\bigcup_{\mu'\in H'}E_{\mu'}[K(x)]\right)$$
We shall do the following. We will pick a point in the first set, and find a nearby point in the second set. A point from the first set must be associated with some μ∈H, and some measurable selection function s:X→ΔX where, for all x, s(x)∈K(x). Said point is written as $E_\mu[s(x)]$, or alternatively, as $\int_X s(x)d\mu$. Our task is to find a nearby point in the second set.
Construct it as follows: Select μ′∈H′ to be as close to μ as you can get. Pick an ϵ. Then, pick some ξ∈Δ(X×X) s.t. the projection of ξ to the first coordinate is μ and the projection to the second coordinate is μ′, and:
$$\int_{X\times X}d(x,x')\,d\xi<\inf_{\xi'\in\Gamma(\mu,\mu')}\int_{X\times X}d(x,x')\,d\xi'+\epsilon$$
Such a ξ can always be found. Pick a continuous function g:X→ΔX such that
$$\int_X||s(x)-g(x)||d\mu<\epsilon$$
We can always pick such a function because we've already shown that continuous functions are dense in the measurable functions.
At this point, recall our function ψ:X×X→K(ΔX) defined as:
$$\psi(x,x'):=\overline{\left\{\nu\in K(x')\,\middle|\,d(\nu,g(x))<\inf_{\nu'\in K(x')}d(\nu',g(x))+\epsilon\right\}}$$
We know that said function will be lower-hemicontinuous. Further, it always produces closed sets (we explicitly took the closure), and they're nonempty (just look at it a bit), and are convex. Convexity is a bit more involved, and it occurs because a mix of points a certain distance or less from some set point (g(x) in this case), will also be the same or less distance away from said point, and the K(x′) are all convex. At this point, we can invoke the Michael selection theorem to get a continuous function g′:X×X→ΔX where, for all x,x′, g′(x,x′)∈ψ(x,x′).
At this point, remember our distribution ξ∈Δ(X×X) which projects down to μ′? Well, by the disintegration theorem (the usual probability theory version), ξ can be written as μ′⋉k′ where k′(x′)∈ΔX gives the conditional probability distribution over X if you see some x′.
And now we can finally define our measurable selection function s′. It is defined as:
$$s'(x'):=\int_{x\in X}g'(x,x')\,dk'(x')$$
And, our final point, that we will pick to be close to $E_\mu[s(x)]$, will be $E_{\mu'}[s'(x')]$.
Uh... Is s′ even a legit selection function? What we need is that for all x′, s′(x′)∈K(x′). Well, fear not.
And, because K(x′) is closed and we took the closure of a batch of points within it, we have:
$$g'(x,x')\in\psi(x,x')\subseteq K(x')$$
And, because K(x′) is compact, convex, and closed, making s′(x′) by integrating over a bunch of different choices of x to plug in just makes an average of a batch of points within K(x′), which lies in K(x′) by convexity and closure. So we're good: s′(x′)∈K(x′) regardless of x′, and it's a legit selection function. Now, we can get into inequality shuffling!
And then... s(x)∈K(x), and ν′ is as close as possible within K(x′), so we have:
$$\le\int_{X\times X}d_{hau}(K(x),K(x'))\,d\xi+3\epsilon$$
And we know that, as one of our starting assumptions, $d_{hau}(K(x),K(x'))\le c\,d(x,x')$ where c is a constant below 1. So,
$$\le\int_{X\times X}c\,d(x,x')\,d\xi+3\epsilon$$
$$=c\int_{X\times X}d(x,x')\,d\xi+3\epsilon$$
But wait, we selected ξ such that:
$$\int_{X\times X}d(x,x')\,d\xi<\inf_{\xi'\in\Gamma(\mu,\mu')}\int_{X\times X}d(x,x')\,d\xi'+\epsilon$$
So, making that substitution, we have:
$$<c\left(\inf_{\xi'\in\Gamma(\mu,\mu')}\int_{X\times X}d(x,x')\,d\xi'+\epsilon\right)+3\epsilon$$
$$<c\left(\inf_{\xi'\in\Gamma(\mu,\mu')}\int_{X\times X}d(x,x')\,d\xi'\right)+4\epsilon$$
And this is just the definition of earthmover distance, so we have:
$$=c\,d(\mu,\mu')+4\epsilon$$
And, μ′ was selected from H′ to be as close as possible to μ, so we've got one last inequality.
$$\le c\,d_{hau}(H,H')+4\epsilon$$
Our final inequality, putting all this together, is that
$$d(E_\mu[s(x)],E_{\mu'}[s'(x')])<c\,d_{hau}(H,H')+4\epsilon$$
So, given an arbitrary point in H, we can craft a point in H′ that isn't too far away, and switch H′ and H to get:
$$d_{hau}\left(\bigcup_{\mu\in H}E_\mu[K(x)],\bigcup_{\mu'\in H'}E_{\mu'}[K(x)]\right)<c\,d_{hau}(H,H')+4\epsilon$$
But wait, ϵ was arbitrary! So we can shrink it to 0 to get:
$$d_{hau}\left(\bigcup_{\mu\in H}E_\mu[K(x)],\bigcup_{\mu'\in H'}E_{\mu'}[K(x)]\right)\le c\,d_{hau}(H,H')$$
Putting this bound together with what we were working on at the start,
$$d_{hau}(K_*(H),K_*(H'))\le c\,d_{hau}(H,H')$$
Or, alternately,
$$\frac{d_{hau}(K_*(H),K_*(H'))}{d_{hau}(H,H')}\le c<1$$
So, K∗ is definitely a contraction mapping. But we do need to check that if H is a compact set of probability distributions, then K∗(H) is compact as well. Fortunately, this is doable because the pushforward of an infradistribution is an infradistribution, and you can chop off an infradistribution at any +b level and get a compact set of a-measures. So if we chop it off at 0, we get exactly what we want.
The Hausdorff-limit of a sequence of convex compact sets is convex compact, and we know that for all H,H′ that are compact convex,
$$\frac{d_{hau}(K_*(H),K_*(H'))}{d_{hau}(H,H')}\le c<1$$
So, $K_*$ is a contraction mapping on this closed complete metric space (the space of compact subsets of a complete space is complete), and we can apply the Banach fixpoint theorem to find a unique fixset S s.t. $S=K_*(S)$. S is a convex compact set of probability distributions, so it corresponds to a crisp infradistribution that can be found by taking any crisp infradistribution and iterating $K_*$ (the pushforward), and we're done!
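To see the fixpoint construction in action, here's a minimal sketch (all the discretization choices are mine, not the theorem's: a 3-point space X with the discrete metric, so the KR distance on distributions is total variation; a crisp set represented by finitely many generating probability vectors; one two-vertex K(x)): iterating the pushforward visibly contracts.

```python
import numpy as np
from itertools import product

def tv(p, q):
    """Total variation = KR distance for the discrete metric on X."""
    return 0.5 * np.abs(p - q).sum()

def hausdorff(A, B):
    return max(max(min(tv(a, b) for b in B) for a in A),
               max(min(tv(a, b) for a in A) for b in B))

# Illustrative crisp kernel: K(x) is a finite set of probability vectors
# generating the crisp set at x; here only K(0) has two generators.
K = [
    [np.array([0.6, 0.2, 0.2]), np.array([0.4, 0.4, 0.2])],
    [np.array([0.3, 0.5, 0.2])],
    [np.array([0.2, 0.3, 0.5])],
]

def pushforward(H):
    """Generators of K_*(H): mix each vertex-selection s through each mu."""
    return [sum(m * v for m, v in zip(mu, s)) for mu in H for s in product(*K)]

H = [np.array([1.0, 0.0, 0.0])]         # start with any crisp distribution
for step in range(7):
    H2 = pushforward(H)
    print(f"step {step}: d_hau(H, K*(H)) = {hausdorff(H, H2):.6f}")
    H = H2
# The successive distances shrink toward 0, illustrating convergence of the
# iterates to the unique fix-set S with K_*(S) = S.
```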
Proposition 51: If h is the infradistribution corresponding to a probability distribution μ, then for α∈(1,∞), $H_\alpha(h)=H_\alpha(\mu)$, where the latter $H_\alpha$ is the conventional definition of Renyi entropy for probability distributions, and the former $H_\alpha$ is the definition of Renyi entropy for infradistributions.
For an infradistribution, the Renyi entropy would be:
$$H_\alpha(h)=\frac{\alpha}{1-\alpha}\ln\left(\max_q h\left(q^{\frac{\alpha-1}{\alpha}}\right)\right)$$
But h corresponds to a probability distribution μ, so the h turns into an expectation.
$$=\frac{\alpha}{1-\alpha}\ln\left(\max_q\left(\sum_i\mu_i q_i^{\frac{\alpha-1}{\alpha}}\right)\right)$$
Note that q∈ΔX; it's a probability distribution. To solve the maximization, we will be using Lagrange multipliers. We maximize $\sum_i\mu_i q_i^{\frac{\alpha-1}{\alpha}}$ subject to the constraint that $\sum_i q_i-1=0$. Thus, we make the function
$$\sum_i\mu_i q_i^{\frac{\alpha-1}{\alpha}}-\lambda\left(\sum_i q_i-1\right)$$
and set all the partial derivatives w.r.t. the qi and λ to 0, so we get a series of equations of the form:
$$\frac{\alpha-1}{\alpha}\mu_n q_n^{-\frac{1}{\alpha}}-\lambda=0$$
for the various q's, and for λ, we have:
$$\sum_i q_i-1=0$$
And then, if we can derive what the various $q_n$ and what λ are, we can conclude that those maximize our function (the objective is concave in q for α>1, so a critical point on the simplex is indeed the maximum). Solving this yields:
$$q_n=\frac{\mu_n^\alpha}{\sum_i\mu_i^\alpha}$$
$$\lambda=\frac{\alpha-1}{\alpha}\left(\sum_i\mu_i^\alpha\right)^{\frac{1}{\alpha}}$$
Now, for the partial derivative w.r.t. λ, we have:
$$\sum_i q_i-1=\sum_i\frac{\mu_i^\alpha}{\sum_j\mu_j^\alpha}-1=\frac{\sum_i\mu_i^\alpha}{\sum_i\mu_i^\alpha}-1=1-1=0$$
and for the partial derivative w.r.t. qn, we have:
$$\frac{\alpha-1}{\alpha}\mu_n q_n^{-\frac{1}{\alpha}}-\lambda=\frac{\alpha-1}{\alpha}\mu_n\left(\frac{\mu_n^\alpha}{\sum_i\mu_i^\alpha}\right)^{-\frac{1}{\alpha}}-\frac{\alpha-1}{\alpha}\left(\sum_i\mu_i^\alpha\right)^{\frac{1}{\alpha}}$$
$$=\frac{\alpha-1}{\alpha}\left(\mu_n\left(\frac{\mu_n^\alpha}{\sum_i\mu_i^\alpha}\right)^{-\frac{1}{\alpha}}-\left(\sum_i\mu_i^\alpha\right)^{\frac{1}{\alpha}}\right)$$
$$=\frac{\alpha-1}{\alpha}\left(\mu_n\mu_n^{-1}\left(\sum_i\mu_i^\alpha\right)^{\frac{1}{\alpha}}-\left(\sum_i\mu_i^\alpha\right)^{\frac{1}{\alpha}}\right)$$
$$=\frac{\alpha-1}{\alpha}\left(\left(\sum_i\mu_i^\alpha\right)^{\frac{1}{\alpha}}-\left(\sum_i\mu_i^\alpha\right)^{\frac{1}{\alpha}}\right)=0$$
So, these are legit. Now, when we substitute these values into our maximization problem, we get:
$$\frac{\alpha}{1-\alpha}\ln\left(\max_q\left(\sum_i\mu_i q_i^{\frac{\alpha-1}{\alpha}}\right)\right)$$
$$=\frac{\alpha}{1-\alpha}\ln\left(\sum_n\mu_n\left(\frac{\mu_n^\alpha}{\sum_i\mu_i^\alpha}\right)^{\frac{\alpha-1}{\alpha}}\right)$$
$$=\frac{\alpha}{1-\alpha}\ln\left(\sum_n\mu_n\frac{\mu_n^{\alpha-1}}{\left(\sum_i\mu_i^\alpha\right)^{\frac{\alpha-1}{\alpha}}}\right)$$
$$=\frac{\alpha}{1-\alpha}\ln\left(\frac{\sum_i\mu_i^\alpha}{\left(\sum_i\mu_i^\alpha\right)^{\frac{\alpha-1}{\alpha}}}\right)$$
$$=\frac{\alpha}{1-\alpha}\ln\left(\left(\sum_i\mu_i^\alpha\right)^{\frac{1}{\alpha}}\right)$$
$$=\frac{1}{1-\alpha}\ln\left(\sum_i\mu_i^\alpha\right)=H_\alpha(\mu)$$
Which is the standard formulation of Renyi entropy for a probability distribution.
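A quick numeric sanity check of this identity (the random distribution and the comparison against simplex samples are my own scaffolding): the closed-form maximizer derived above attains the max, and the two entropy formulas agree.

```python
import numpy as np

rng = np.random.default_rng(1)
mu = rng.random(5); mu /= mu.sum()
alpha = 2.5
expo = (alpha - 1) / alpha
obj = lambda q: np.sum(mu * q**expo)

q_star = mu**alpha / np.sum(mu**alpha)     # maximizer from the Lagrange step
H_infra = alpha / (1 - alpha) * np.log(obj(q_star))
H_renyi = 1 / (1 - alpha) * np.log(np.sum(mu**alpha))
print(H_infra, H_renyi)                    # the two entropies agree

qs = rng.dirichlet(np.ones(5), size=100000)      # random simplex points...
print(obj(q_star) >= ((qs**expo) @ mu).max())    # ...never beat q_star: True
```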
Proposition 52: $\lim_{\alpha\to\infty}H_\alpha(h)$ exists for all infradistributions h, and equals $-\ln(\sup_q h(q))$.
Our task will be to bound the difference between
$$\frac{\alpha}{1-\alpha}\ln\left(\sup_q h\left(q^{\frac{\alpha-1}{\alpha}}\right)\right)$$
and
$$-\ln\left(\sup_q h(q)\right)$$
as α→∞.
Well, for this endeavor, our first task is to bound the distance between an arbitrary function q and the function $q^{\frac{\alpha-1}{\alpha}}$, for sufficiently large α.
Remember, the distance between these functions is $\sup_i|q_i^{\frac{\alpha-1}{\alpha}}-q_i|$. Now, since all our functions are bounded in [0,1], we're raising a number below 1 to a power slightly below 1, so it actually stays the same or gets larger. So, we don't need the absolute value here.
Thus, to figure out what the largest possible value of this quantity is for a given α, let's let $q_i$ be our variable, and α be our constant. Then we differentiate, set to 0, solve the equation for $q_i$, and substitute that back in to find the largest possible value that this quantity can be, and thus the largest possible difference between the functions $q^{\frac{\alpha-1}{\alpha}}$ and q, as a function of α.
Differentiating $q_i^{\frac{\alpha-1}{\alpha}}-q_i$ w.r.t. $q_i$ and setting it equal to 0, we must solve for $q_i$ in the following equation:
$$\frac{\alpha-1}{\alpha}q_i^{-\frac{1}{\alpha}}-1=0$$
The value of $q_i$ that solves this equation is:
$$q_i=\left(\frac{\alpha-1}{\alpha}\right)^\alpha$$
as can be seen by substituting this back in. This is our qi which attains the maximum value (no weird stuff going on, as shown by graphing the function), and we're raising a number below 1 to a high power, so it's in [0,1], and thus a permissible function.
Substituting this back into $q_i^{\frac{\alpha-1}{\alpha}}-q_i$, we get:
$$q_i^{\frac{\alpha-1}{\alpha}}-q_i=q_i\left(q_i^{-\frac{1}{\alpha}}-1\right)$$
$$=\left(\frac{\alpha-1}{\alpha}\right)^\alpha\left(\left(\left(\frac{\alpha-1}{\alpha}\right)^\alpha\right)^{-\frac{1}{\alpha}}-1\right)$$
$$=\left(\frac{\alpha-1}{\alpha}\right)^\alpha\left(\left(\frac{\alpha-1}{\alpha}\right)^{-1}-1\right)$$
$$=\left(\frac{\alpha-1}{\alpha}\right)^\alpha\left(\frac{\alpha}{\alpha-1}-\frac{\alpha-1}{\alpha-1}\right)$$
$$=\left(\frac{\alpha-1}{\alpha}\right)^\alpha\left(\frac{1}{\alpha-1}\right)$$
And, this is the maximum possible difference between the function q and $q^{\frac{\alpha-1}{\alpha}}$, as a function of α.
Therefore,
$$\forall q: d\left(q^{\frac{\alpha-1}{\alpha}},q\right)\le\left(\frac{\alpha-1}{\alpha}\right)^\alpha\left(\frac{1}{\alpha-1}\right)\le\frac{1}{\alpha-1}$$
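A quick numeric check of this bound (the grid is my own scaffolding): for each α, the maximum of $q^{\frac{\alpha-1}{\alpha}}-q$ over [0,1] matches the closed form and sits below $\frac{1}{\alpha-1}$.

```python
import numpy as np

q = np.linspace(0.0, 1.0, 200001)
for alpha in [2.0, 5.0, 20.0, 100.0]:
    gap = (q**((alpha - 1) / alpha) - q).max()
    closed_form = ((alpha - 1) / alpha)**alpha / (alpha - 1)
    print(f"alpha={alpha:5.0f}: max gap={gap:.6f}, "
          f"closed form={closed_form:.6f}, bound={1/(alpha - 1):.6f}")
# The grid maximum matches the closed form, and the bound 1/(alpha-1) -> 0.
```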
Now, due to Lipschitzness of h, in the α→∞ limit,
$$\left|h(q)-h\left(q^{\frac{\alpha-1}{\alpha}}\right)\right|$$
limits to 0 uniformly in q. Therefore, in the α→∞ limit,
$\sup_q h\left(q^{\frac{\alpha-1}{\alpha}}\right)$ limits to $\sup_q h(q)$.
Therefore, the ln's of these quantities limit to each other. And $\frac{\alpha}{1-\alpha}$ limits to −1, so $H_\alpha(h)$ limits to $-\ln(\sup_q h(q))$, as claimed.
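And a final numeric illustration (a sketch for the crisp case where h is expectation under a probability distribution μ, so that $\sup_q h(q)=\max_i\mu_i$; the example distribution is mine): the Renyi entropies converge to the min-entropy as α grows.

```python
import numpy as np

# Crisp case: h(q) = sum_i mu_i q_i, so sup_q h(q) = max_i mu_i.
mu = np.array([0.5, 0.3, 0.15, 0.05])
for alpha in [2.0, 10.0, 100.0, 1000.0]:
    print(alpha, 1 / (1 - alpha) * np.log(np.sum(mu**alpha)))
print("limit:", -np.log(mu.max()))   # -ln(sup_q h(q)), the min-entropy
```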
Proposition 48: A function K:X→□Y is an infrakernel and fulfills the property:
∀CX∈K(X):{K(x)|x∈CX} is b-uniform
iff the function K is bounded and continuous when □Y is equipped with the IKR metric.
So, our first direction is to assume that the relevant function is bounded and continuous, and show that it fulfills the infrakernel properties and the one extra compact b-uniformity property. The second direction is to show that, assuming the b-uniformity property and being an infrakernel, the function is bounded and continuous.
Anyways, so let's assume the function K is bounded and continuous. Ie,
supx∈X(supf∈CB−lip(Y)|K(x)(f)|max(Li(f),||f||,1))<∞
And, if xn limits to x, then K(xn) limits to K(x) in Hausdorff-distance/the infra-KR metric.
First, for purposes of disproof, let's assume K doesn't have bounded Lipschitz constant/amount of measure present. That is,
∀λ∃x∈X,(m,b)∈K(x):m(1)≥λ
Then, you can let λ be any number you want, no matter how large, and find some xλ where (mλ,bλ)∈K(xλ) and m(1)≥λ and bλ≥1. Now, let's look at the norm of the infradistribution K(xλ). Plug in the constant function −bλ and we have:
K(xλ)(−bλ)≤mλ(−bλ)+bλ≤−λbλ+bλ
This is because mλ had λ or more measure present. Then, making that substitution, we have:
supx∈X(supf∈Clip|K(x)(f)|max(Li(f),||f||,1))≥supf∈Clip|K(xλ)(f)|max(Li(f),||f||,1)≥|K(xλ)(−bλ)|max(Li(−bλ),||−bλ||,1)
And at this point we observe that the Lipschitz constant of a constant is 0, and the norm of a constant is just its absolute value, and bλ≥1 by assumption, to get:
=|K(xλ)(−bλ)|bλ
Now we use our earlier inequalities and the absolute value (everything in our earlier inequalities is a negative quantity) to get:
≥λbλ−bλbλ=λ−1
But λ can be unboundedly high, so our maximum norm is infinity, which we stipulated isn't the case. Thus, K does have bounded Lipschitz constant (this was derived from K's bounded IKR-norm and continuity).
Now, we just need to derive the other two conditions on an infrakernel from boundedness and continuity of K.
If K is a continuous function X→□Y, and CX is an arbitrary compact subset of X, then K(CX) must be a compact subset of □Y, which, from our characterization of the iff conditions of compactness in Proposition 46, means that it fulfills the shared compact-almost-support property and the b-uniformity condition. Thus the function K fulfills the compact-shared compact-almost-support property, and the compact b-uniformity condition. Finally, for the pointwise-convergence condition of an infrakernel, continuity of the function X→□Y means that if xn limits to x, then K(xn) limits to K(x) in Hausdorff-distance, so by Proposition 45, for all continuous bounded functions f, K(xn)(f) limits to K(x)(f), and this is the last condition we need to call K an infrakernel.
So, bounded and continuous functions X→□Y imply that the function is an infrakernel and fulfills the compact b-uniformity condition. Time for the reverse direction. Let's assume that K is an infrakernel X→□Y and fufills the compact b-uniformity condition. For showing boundedness, there's some Lipschitz constant upper bound λ⊙, and we can go:
supx∈X(supf∈Clip|K(x)(f)|max(Li(f),||f||,1))
≤supx∈X(supf∈Clipλ⊙||f||max(Li(f),||f||,1))
≤supx∈X(supf∈Clipλ⊙)=λ⊙<∞
And we have established boundedness. For continuity, it will take a more detailed argument. Let xn limit to x. Our task is to show that K(xn) limits to K(x) in Hausdorff-distance/infra-KR distance.
Now, the set {xn}n∈N∪{∞} is compact, so by the conditions for an infrakernel, there's a bound on the λ value of the minimal points of the K(xn) infradistributions, and the K(xn) sets have shared compact-almost-supports. We're also assuming the compact b-uniformity condition, so the K(xn) sets have the b-uniformity condition on them. By our necessary and sufficient conditions to be precompact, we have that the set of points {K(xn)}n∈N∪{∞} is precompact in □Y, so we can isolate a convergent subsequence of the sequence of infradistributions K(xn).
What we'll do is show that all convergent subsequences of K(xn) have the same limit point, namely K(x). Here's the argument for it. Index the convergent subsequence with m and have K∞ be the limiting infradistribution of it. From the pointwise convergence condition on an infrakernel,
K(x)(f)=limn→∞K(xn)(f)=limm→∞K(xm)(f)=K∞(f)
Thus, our arbitrarily generated limit point K∞ must perfectly match up with K(x) and be identical to it, so since there's only one limit point, the sequence K(xn) actually converges to K(x). So, because convergent sequences in the input induce convergent sequences in the output, the function K:X→□Y must be continuous, and we've shown the other direction, that infrakernels fulfilling the compact b-uniformity condition must be bounded and continuous.
Proposition 49: The space □X equipped with the IKR-topology is a Polish space if X is too.
So, we've got a metric, the infra-KR distance ie the Hausdorff-distance. We know that the notion of convergence remains independent of which metric on X these metrics are being induced by, so the topology is unchanged, and they're all metrizations. We also know from Proposition 43 that □X is complete under any infra-KR distance ie Hausdorff-distance. So, □X is completely metrizable by any infra-KR distance induced by a complete metric on X, all that remains is to show that it's a separable space, we need to find a countable dense subset.
We know that a single point is compact, and we also know that compactness implies that for all ϵ, there is a b∗ value where dhau(H,Hb∗)≤ϵ by Proposition 46, b-uniformity.
So, the space of infradistributions with the b values of their minimal points having an upper bound is dense in the space of infradistributions. All we need to do now is to have a dense countable subset of the space of infradistributions with upper bounds on the b values of their minimal points.
The way we'll be doing this is taking a dense countable subset of the space of a-measures. There are only countably many ways to choose finitely many points from this subset, and a choice of finitely many a-measures can be taken to define an infradistribution by convex hull, closure and upper-completion. Assuming the fiddly details work out, we'll have countably many infradistributions which are (hopefully) dense in the space of infradistributions with bounded b values, which are dense in the space of infradistributions, so we've got our dense countable subset.
There's still much to do. We need to show that there's a dense countable subset of the space of a-measures at all, we need to deal with technicalities involving normalization to show that we always get an infradistribution by this process, and we need to show that given any infradistribution with bounded b values, we can make an infradistribution of this form arbitrarily close to it in Hausdorff-distance.
For our first part, the cone of a-measures can be viewed as M+(X)×R≥0, and you can also view the space of measures over X as ΔX×R≥0 (a pair of a distribution and a scaling term). This isn't quite right, because there's an identification of everything when the amount of measure is 0, but this doesn't really matter. So, we're trying to find a dense countable subset of
ΔX×R≥0×R≥0
Now, when X is a Polish space, ΔX equipped with the KR-metric is also a Polish space, so there's a dense countable subset. You can take the product of this with the dense countable subset of R≥0 consisting of the rational numbers 0 and above, to get a dense countable subset of the cone of a-measures.
For the normalization technicalities, we need to make sure that whenever we pick finitely many points from our dense countable subset, there's a point with b=0, a point with λ+b=1, and all points have λ+b≥1, in order to have normalization work and your batch of points define an actual infradistribution. This can always be done, throwing out the "bad" choices of finitely many points still leaves you with a countable set.
Finally, given an infradistribution Hb∗ with its minimal points all having b≤b∗, can we make infradistributions arbitrarily close to it in Hausdorff-distance of this form where we just specify finitely many points?
Well, yes. Because of the compact-projection property of infradistribution sets, and truncating at b∗, we can see that all the minimal points of Hb∗ lie in a compact set. So, given any ϵ, we can cover Hb∗ (well, the part with b values below b∗) with ϵ-sized balls centered at its points, get a finite subcover from compactness, and then use denseness of our countable subset to take our finitely many balls with center points and pick something from the dense countable subset in each ball in order to get: A collection of finitely many points from the dense countable subset where each point in Hb∗ with b value below b∗ is ϵ distance away from one of these finitely many points, so the closed convex hull of these finitely many points is only ϵ distance away from Hb∗ (but truncated at b∗), and then when upper completion is done, they stay only ϵ distance away from each other.
So, given any infradistribution with bounded b values, and any ϵ, we can find finitely many points from a dense countable subset of the space of a-measures, which induces an infradistribution that's only ϵ distance away. Because there are only countably many ways to pick finitely many points from a countable set, we have a countable set of infradistributions where any infradistribution has infradistributions of this restricted form arbitrarily close to them, and so we have separability, the last condition we need for Polishness.
Proposition 50: The infra-Vietoris topology and the topology induced by the IKR metrics are the same.
This proof will proceed in three ways. The first is showing that this collection of sets declared-to-be-open in the infra-Vietoris topology is closed under finite intersections, and thus forms a basis (instead of a sub-basis). The second way is taking any basis open set from the basis induced by the IKR metric (we'll be using the strongly equivalent Hausdorff-metric for this) and any point/infradistribution in it, and finding a Infra-Vietoris open set from the basis which is a subset of the open ball in the IKR metric. Finally, we taken any basis open set from the infra-Vietoris topology, and any point/infradistribution in such a set, and find a tiny open ball w.r.t. the IKR metric that fits within the infra-Vietoris open set.
So, here's what we're going to do: First, we're going to show that the intersection of two open sets from the basis is also in the basis. Obviously, we only need to show this for pairwise intersection, as that shows closure under finite intersection.
Consider the open set O1 of infradistributions to be generated by the finite family of bounded-b open sets Oi, and the open set O2 of infradistributions to be generated by the finite family of bounded-b open sets Oj.
Claim: The finite family of bounded-b open sets Oi∩(⋃jOucj) and Oj∩(⋃iOuci), for all the i and j, is a suitable batch of open sets in the a-measures to perfectly replicate the O1∩O2 open set of infradistributions.
The properties for H∈O1∩O2 are:
H⊆⋃iOuci
H⊆⋃jOucj
∀i:H∩Oi≠∅
∀j:H∩Oj≠∅
¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯prm(H)⊆prm(⋃iOi)
¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯prm(H)⊆prm(⋃jOj)
And, the properties for H lying in the set induced by the Oi∩(⋃jOucj) and Oj∩(⋃iOuci) families of open sets (upper completion of open set is open) is:
H⊆⋃i(Oi∩⋃jOucj)uc∪⋃j(Oj∩⋃iOuci)uc
∀i:H∩Oi∩⋃jOucj≠∅
∀j:H∩Oj∩⋃iOuci≠∅
¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯prm(H)⊆prm(⋃i(Oi∩⋃jOucj)∪⋃j(Oj∩⋃iOuci))
So let's get started on showing equivalence of these things. We'll use M for an a-measure, and use a subscript of h,i, or j to denote which sort of set it's in.
First, let's take the conjunction of these two things.
H⊆⋃iOuci
H⊆⋃jOucj
This is equivalent to "for any Mh, it is above some Mi and also above some Mj". Since the only notion of upper completion is adding the +b term, this is equivalent to "for any Mh, we are in one of two cases, either there is a Mi,Mj such that Mh≥Mi≥Mj, or that Mh≥Mj≥Mi"
Similarly,
H⊆⋃i(Oi∩⋃jOucj)uc∪⋃j(Oj∩⋃iOuci)uc
Is saying "for any Mh, we are one of two cases, either there is a Mi,Mj such that Mh≥Mi≥Mj, or vice-versa". And so these two conditions are equivalent.
Also, if you have both of:
∀i:H∩Oi∩⋃jOucj≠∅
∀j:H∩Oj∩⋃iOuci≠∅
Then you trivially have
∀i:H∩Oi≠∅
∀j:H∩Oj≠∅
And in the reverse direction, if you have:
H⊆⋃iOuci
H⊆⋃jOucj
∀i:H∩Oi≠∅
∀j:H∩Oj≠∅
then you'll get both of
∀i:H∩Oi∩⋃jOucj≠∅
∀j:H∩Oj∩⋃iOuci≠∅
That just leaves the projection property.
If you have
¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯prm(H)⊆prm(⋃iOi)
¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯prm(H)⊆prm(⋃jOj)
then, let's say (m,0) is your arbitrary point in the closure of the projection of H. This is saying (m,0)≤Mi and (m,0)≤Mj. Again, by how upper completion is just adding b, this is equivalent to "given an arbitary point (m,0) in the closure of the projection of H, either (m,0)≤Mi≤Mj or (m,0)≤Mj≤Mi"
And the condition
¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯prm(H)⊆prm(⋃i(Oi∩⋃jOucj)∪⋃j(Oj∩⋃iOuci))
is saying "given an arbitary point (m,0) in the closure of the projection of H, either (m,0)≤Mj≤Mi or (m,0)≤Mi≤Mj", which is clearly equivalent. So, we've got our bidirectional equivalence, and any binary (and thus finite) intersection of infra-Vietoris basis opens is an infra-Vietoris basis open, so it is indeed a basis.
Now we must show that this infra-Vietoris topology is identical to the topology induced by the infra-KR metric/Hausdorff distance metric. This can be done by taking any ball in any Hausdorff distance metric centered at a point/infradistribution, and finding an infra-Vietoris open set that contains the same point and is a subset of the open ball. Conversely, we should be able to take any infra-Vietoris open set, and any infradistribution/point within it, and find an IKR open ball centered on that point that lies entirely within the infra-Vietoris open set, showing that the two topologies are equal.
So, first, assuming we've got a Hausdorff-distance of ϵ centered at some infradistribution, can we make an infra-Vietoris set that lies entirely within the ball and engulfs the center point?
Well, there's some b cutoff where the projection of the infradistribution H (cut off at that b value) is within Hausdorff-distance ϵ4 of the projection of the entire infradistribution H to the measure component. Accordingly, we have one of our open sets be a ϵ2-thickening of the chopped-off infradistribution, as this upper completion engulfs the entire infradistribution H within it with ϵ2 room to spare. The rest is done by covering the compact chopped-off portion with all possible open balls of size ϵ4, and using compactness of the chopped-off infradistribution to pick a finite subcover/finitely many open balls. H has the projection of its closure lie entirely within the ϵ2-thickening of the lower part, has nonempty intersection with each open ball, and is a subset of the upper completion of the union of everything.
Now, our task is to show that any infradistribution that lies in the open set induced by that finite collection of open sets lies in the Hausdorff-distance-of-ϵ ball centered at H. So, let our arbitrary new infradistribution H′ be a subset of the union of the upper completion of all these open sets, and have nonempty intersection with each of them, and have its closure lie in the projection of the union of them. Note that the initial open set which engulfs the entire bottom portion has all the other open sets as subsets of it, so the union and upper closure of the opens is just the upper completion of the "big" open set. Any point in H′ must lie in this "big" open set, and it's made by thickening up H (but chopped off) and taking the upper completion, witnessing that the distance from H′ to H is ϵ2 or less.
For the other direction, pick an arbitrary point in H. It's within ϵ4 of the upper completion of one of the finitely many balls used for the open cover. And because those center points are center points of ϵ4-open balls which H′ must have nonempty intersection with, and H′ is upper-complete, it has a point of distance ϵ2 or less from your starting point in H. Thus, anything in this basis open set has Hausdorff-distance of ϵ2 or less from the center point H, so all open balls in the infra-KR metric topology can have their central point engulfed by an infra-Vietoris open set, which is a subset of the initial open ball. So we're half-done, we know that all open sets of the infra-KR metric are open in the infra-Vietoris topology.
In the other direction, we need to take an infra-Vietoris open set and an infradistribution in it, and find some tiny Hausdorff-distance ball around that arbitrary infradistribution that remains within the infra-Vietoris open set. Let H be your infradistribution, which is a subset of the union of the upper completions of the $O_i$, has nonempty intersection with each $O_i$, and has the closure of its projection be a subset of the projection of the union of the $O_i$.
Now, there's one important fact we have to establish first: it is impossible for a sequence $M_n \in H$ to get arbitrarily close to the complement of $\bigcup_i O_i^{uc}$. For purposes of contradiction, fix some sequence $M_n \in H$ and assume that this sequence gets arbitrarily close to said complement.
One of two things can happen. The first possibility is that $M_n$ has a subsequence with bounded b value. Then we can pick out a convergent subsequence of that subsequence, because infradistributions, when cut off at any finite b value, are compact, and take the limit point $M_\infty$, which lies in the set H due to closure, and yet has distance 0 from the exterior of $\bigcup_i O_i^{uc}$, which, being the complement of an open set, is closed. So we've got a point in H and also not in $\bigcup_i O_i^{uc}$, which is impossible.
Therefore, the $M_n$ sequence must have unboundedly large b value as n increases. This projects down to make an $m_n$ sequence of measures in the compact set $\overline{pr_m(H)}$, and we can isolate a convergent subsequence of those, with limit point $m_\infty$. Now, something interesting happens.
For the $M_n$ sequence of a-measures, the b value eventually goes well past the largest b value of any of the finitely many bounded-b open sets. And when we go a tiny distance away to a point $M^*_n$ which lies outside of $\bigcup_i O_i^{uc}$, then... well, it's got nothing from any of the $O_i$ below it. And nothing from any of the $O_i$ above it either, because the b value of $M^*_n$ is too high, since it's near $M_n$. So $M^*_n$ projects down to some measure $m^*_n$ which lies outside of $pr_m(\bigcup_i O_i)$. And since $M^*_n$ is really close to $M_n$, $m_n$ (in the closure of the projection of H) is really close to the exterior of the projection of $\bigcup_i O_i$, as witnessed by $m^*_n$. And this exterior is a closed set, because the complement of an open set is closed, and the projection of an open set is open.
This lets us show that our $m_\infty$ limit point lies in the complement of $pr_m(\bigcup_i O_i)$, which is impossible, because $m_\infty$ lies in $\overline{pr_m(H)}$, which is a subset of $pr_m(\bigcup_i O_i)$.
So our original assumption was wrong, and there is some $\delta$ where any point in H is always $\delta$ or more away from the exterior of $\bigcup_i O_i^{uc}$. Further, for each $O_i$, since $H \cap O_i \neq \emptyset$, you can pick a point in that intersection and find some $\epsilon_i$-sized open ball around it that lies entirely within the open set $O_i$.
Now, within Hausdorff-distance $\frac{\min(\delta, \min_i \epsilon_i)}{2}$ of H, all infradistributions must lie in the same Vietoris-open set. Here's why. Since all points in your new infradistribution H′ are at most $\frac{\min(\delta, \min_i \epsilon_i)}{2}$ away from a point in H, and given any point in H you need to travel $\delta$ distance to get to the exterior of $\bigcup_i O_i^{uc}$, H′ is a subset of that open set as well.
Also, $\forall i: H' \cap O_i \neq \emptyset$, because from the designated point in each $H \cap O_i$, you can travel at most $\frac{\min(\delta, \min_i \epsilon_i)}{2}$ distance to get to a point in H′, which then lies in the $\epsilon_i$-sized open ball around said point, which lies entirely within $O_i$. So any H′ that close must intersect the same collection of open sets.
Finally, the projection thing is taken care of by the same sort of Hausdorff-distance argument we used to argue that said H′ must lie in the union of upper-completions of the open sets. So, given any Vietoris-open, and any point in it, we can find an IKR-open which engulfs said point and lies entirely in the Vietoris open.
Therefore, the two topologies are the same.
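Since everything in this equivalence argument runs on the Hausdorff metric, here's a minimal numerical sketch of that metric for finite point sets, a toy stand-in for the compact sets above. The code and names are mine, not part of the post's formal machinery:

```python
import numpy as np

def hausdorff(A, B):
    """Hausdorff distance between finite point sets A (n x d) and B (m x d):
    the max, over both sets, of the distance from a worst-case point to the other set."""
    D = np.linalg.norm(A[:, None, :] - B[None, :, :], axis=-1)  # n x m pairwise distances
    return max(D.min(axis=1).max(), D.min(axis=0).max())

A = np.array([[0.0, 0.0], [1.0, 0.0]])
B = np.array([[0.0, 0.1], [1.0, 0.0]])
print(hausdorff(A, B))  # 0.1: every point of either set is within 0.1 of the other set
```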
Theorem 3: Given a crisp infrakernel $K: X \xrightarrow{ik} X$, if there's a $c < 1$ where, regardless of x and x′, $d_{hau}(K(x), K(x')) \le c \cdot d(x, x')$, then there's a unique crisp infradistribution S where $K_*(S) = S$, and it can be found by starting with any crisp infradistribution on X and repeatedly applying K.
This theorem is going to require showing a lot of fiddly measure-theoretic stuff, and then once everything's all set up, the actual theorem itself falls pretty fast. So, we're going to start off by noting or proving three things needed to make progress.
The first is: "given any measurable function $s: X \to \Delta X$, any probability distribution $\mu \in \Delta X$, and any $\epsilon$, there is a continuous function $g: X \to \Delta X$ s.t. $\int_X \|g - s\|\, d\mu < \epsilon$".
This is an analogue of the celebrated result that continuous functions are dense in the measurable functions. I'm gonna skip small bits of it, to be reconstructed if you ask. Pretty much, we use the fact that $\Delta X$ is separable in the KR-metric, and the results from this pdf, to get a few results. $\Delta X$ is separable according to the KR-metric, and s has all its output values in that separable space, so it's separably valued. Also s is measurable, so by Proposition 1.8 in the paper, s is strongly measurable, so it's the pointwise limit of a sequence of simple functions. The actual proof of this fact goes through Theorem 1.5 in that paper, and looking carefully at Theorem 1.5, the simple functions constructed have their values in $\Delta X$, because the proof is like "fix a dense countable subset of your space of interest, then modify s to be a simple function by rounding off the outputs to the nearest of a finite collection of points from that subset". Proposition 1.10 from that paper says that since s is strongly measurable, then given any $\mu \in \Delta X$, s is strongly $\mu$-measurable, and then Proposition 1.16 from that paper says that s is Bochner-integrable because $\int_X \|s(x)\|\, d\mu = 1 < \infty$, since all the s(x) are probability distributions. Further, part of the proof of Proposition 1.16 from that paper shows that the $\Delta X$-valued simple functions, due to their bounded norm, converge to s in the $L^1$ norm by the dominated convergence theorem.
Anyways, at this point, we've shown that any measurable function s:X→ΔX has simple ΔX-valued functions that limit to it, so the simple functions are dense in the measurable functions X→ΔX. Thus, at this point, we just need to show that we can make continuous functions X→ΔX that are arbitrarily close to some simple function χ:X→ΔX.
Let's begin. Our simple functions χ, from looking carefully at the proof of Theorem 1.5 from the linked paper, take the form of partitioning X into finitely many disjoint and covering measurable sets Ai, and mapping each of them to some probability distribution νi∈ΔX. Let n be how many disjoint sets there are.
Thus, our task is to take a simple function $\chi: X \to \Delta X$ and find a continuous function $g: X \to \Delta X$ where $\int_X \|\chi(x) - g(x)\|\, d\mu < \epsilon$. If we can do this for any $\epsilon$, then it will show that continuous functions are dense (in $L^1$) in the simple functions, which are dense in the measurable functions.
So, here's what you do. Every probability distribution over a Polish space has the property that the measure of any measurable set A can be approximated from below by the measure of a compact subset C. We have finitely many disjoint measurable sets $A_i$, and we associate each of them with a compact subset $C_i$ where $\mu(A_i \setminus C_i) < \frac{\epsilon}{n}$.
Let's abuse notation a bit and have d(x,Ci) be the minimum distance from x to a point in Ci. Then, let g be defined as:
$$g(x) := \sum_{i \le n} \left( \frac{\frac{1}{d(x, C_i)}}{\sum_j \frac{1}{d(x, C_j)}} \nu_i \right)$$
Admittedly, the distance may be 0, leading to an $\frac{\infty}{\infty}$ weight; by convention, this will be considered to be 1. The reason this works and is continuous is that all the $C_i$ are bounded away from each other. This is because they're all subsets of the $A_i$, which are disjoint, so all the $C_i$ are disjoint, and any two disjoint compact sets must be bounded away from each other: otherwise you could fix a sequence in one compact set getting closer and closer to the other compact set, isolate a convergent subsequence, and it'd limit to be distance 0 from the other compact set, and so in it, contradicting disjointness. Now, we can go:
$$\int_X \|g(x) - \chi(x)\|\, d\mu$$
$$= \sum_{i \le n} \left( \int_{A_i} \|g(x) - \chi(x)\|\, d\mu \right)$$
$$= \sum_{i \le n} \left( \int_{C_i} \|g(x) - \chi(x)\|\, d\mu + \int_{A_i \setminus C_i} \|g(x) - \chi(x)\|\, d\mu \right)$$
And then, for the next step, we observe that when $x \in C_i$, $\chi$ is like "well, it's in $A_i$, I'll return $\nu_i$ as my output", and g is like "well, it's distance 0 from $C_i$, so I turn into $\frac{\infty}{\infty}\nu_i = \nu_i$ by convention". Further, g always returns probability distributions, and the maximum KR-distance two probability distributions can have (when the original space X has its metric bounded above by 1, which can always be imposed without altering the topology) is 1. So, we get:
$$\le \sum_{i \le n} \left( \int_{C_i} \|\nu_i - \nu_i\|\, d\mu + \int_{A_i \setminus C_i} 1\, d\mu \right)$$
$$= \sum_{i \le n} \left( \int_{A_i \setminus C_i} 1\, d\mu \right) = \sum_{i \le n} \mu(A_i \setminus C_i)$$
$$< \sum_{i \le n} \frac{\epsilon}{n} = \epsilon$$
And we're done. We showed that any measurable function s:X→ΔX could be arbitrarily closely approximated in L1 by a simple function χ:X→ΔX, and that any simple function could be arbitrarily closely approximated in L1 by a continuous function g:X→ΔX.
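To make the inverse-distance-weighting construction concrete, here's a small sketch on $X = [0,1]$: two made-up "compact cores" $C_1, C_2$ (finite stand-ins) and two output distributions over a three-point space. The $\frac{\infty}{\infty} := 1$ convention is approximated by clamping the distance away from 0; all the specific sets and distributions are invented for illustration.

```python
import numpy as np

# Two disjoint compact cores C_1, C_2 inside [0,1], and the distributions nu_1, nu_2
# (over a 3-point space) that the simple function chi assigns to A_1 and A_2.
C = [np.array([0.1, 0.2]), np.array([0.7, 0.9])]
nu = [np.array([1.0, 0.0, 0.0]), np.array([0.0, 0.5, 0.5])]

def g(x, tiny=1e-12):
    """Continuous approximation of chi: weight each nu_i by 1/d(x, C_i), normalized.
    On C_i itself the clamped weight dominates, effectively recovering nu_i
    (approximating the infinity/infinity := 1 convention)."""
    d = np.array([np.abs(c - x).min() for c in C])  # d(x, C_i)
    w = 1.0 / np.maximum(d, tiny)                   # inverse-distance weights
    return (w / w.sum()) @ np.stack(nu)

print(g(0.1))   # ~nu_1, since 0.1 lies in C_1
print(g(0.45))  # a continuous blend of nu_1 and nu_2 between the cores
```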
The next fact we'll need is that, when X has its distance metric bounded above by 1, which can always be done without altering the topology, then the KR-distance on probability distributions is exactly equal to the Earthmover distance, the first Wasserstein metric.
The third and final fact we need is that, if you have a function $K: X \to \mathcal{K}(\Delta X)$ (a function to the space of compact subsets of $\Delta X$) which is continuous w.r.t. the Hausdorff metric and always produces compact convex sets, then the function $\psi: X \times X \to \mathcal{K}(\Delta X)$ (defined relative to some continuous function $g: X \to \Delta X$ and some $\epsilon$), defined by:
$$\psi(x, x') := \overline{\left\{\nu \in K(x') \,\middle|\, d(\nu, g(x)) < \inf_{\nu' \in K(x')} d(\nu', g(x)) + \epsilon\right\}}$$
is lower-hemicontinuous.
The requirement for being lower-hemicontinuous is: if you have some sequence $(x_n, x'_n)$ limiting to $(x, x')$, and some $\nu \in \psi(x, x')$, then you should be able to find some sequence of points $\nu_n \in \psi(x_n, x'_n)$ where $\nu_n$ limits to $\nu$.
First, we'll need that, if x′n limits to x′, then ⋃n∈N∪{∞}K(x′n) is a compact set. We know that K is continuous and {x′n}n∈N∪{∞} is compact, so Lemma 3 steps in to show that said set is compact.
The relevant implication of this, when paired with Hausdorff-continuity of K, is that you can take any sequence of distributions νn∈K(x′n), and find a convergent subsequence that converges to something in K(x′).
The second piece for this argument we'll need is that:
$$\lim_{n \to \infty} \left( \inf_{\nu' \in K(x'_n)} d(\nu', g(x_n)) \right) = \inf_{\nu' \in K(x')} d(\nu', g(x))$$
So, first, we observe that since all the $K(x'_n)$ are compact, we can find actual minimizers $\nu^{min}_n$. We don't know that the limit actually exists, so pass to a subsequence, indexed by m, realizing the liminf. Pass to another subsequence where the $\nu^{min}_m$ actually converge to a point in $K(x')$, which we can do by the immediately preceding point about compactness. Then,
$$\liminf_{n \to \infty} \left( \inf_{\nu' \in K(x'_n)} d(\nu', g(x_n)) \right) = \lim_{m \to \infty} \left( \inf_{\nu' \in K(x'_m)} d(\nu', g(x_m)) \right)$$
$$= \lim_{m \to \infty} d(\nu^{min}_m, g(x_m)) = d(\nu^{min}, g(x)) \ge \inf_{\nu' \in K(x')} d(\nu', g(x))$$
This was done by g being continuous, so $g(x_m)$ converges to $g(x)$, and by $\nu^{min} \in K(x')$.
Second, fix some minimizer $\nu^{min} \in K(x')$; by Hausdorff-continuity of K, we can find a sequence $\nu_n \in K(x'_n)$ which limits to it. Then,
$$\limsup_{n \to \infty} \left( \inf_{\nu' \in K(x'_n)} d(\nu', g(x_n)) \right) \le \limsup_{n \to \infty} d(\nu_n, g(x_n))$$
$$= \lim_{n \to \infty} d(\nu_n, g(x_n)) = d(\nu^{min}, g(x)) = \inf_{\nu' \in K(x')} d(\nu', g(x))$$
And at this point, we can go:
$$\liminf_{n \to \infty} \left( \inf_{\nu' \in K(x'_n)} d(\nu', g(x_n)) \right) \ge \inf_{\nu' \in K(x')} d(\nu', g(x)) \ge \limsup_{n \to \infty} \left( \inf_{\nu' \in K(x'_n)} d(\nu', g(x_n)) \right)$$
So, we have
$$\lim_{n \to \infty} \left( \inf_{\nu' \in K(x'_n)} d(\nu', g(x_n)) \right) = \inf_{\nu' \in K(x')} d(\nu', g(x))$$
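A toy numerical illustration of this min-distance convergence, with a finite set-valued K and a scalar-valued g (both invented for the example):

```python
import numpy as np

# K(x') is a finite set of reals moving continuously with x'; g is continuous.
K = lambda xp: np.array([xp, xp + 1.0, xp + 3.0])
g = lambda x: 0.6 * x

x, xp = 2.0, 1.0
for n in [1, 10, 100, 1000]:
    xn, xpn = x + 1.0 / n, xp - 1.0 / n
    print(n, np.abs(K(xpn) - g(xn)).min())  # min distance along the sequence...
print(np.abs(K(xp) - g(x)).min())           # ...approaches the min distance in the limit
```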
Alright, that's part two of the argument for part 3 of the preliminaries for this theorem down. Time for part 3 of the argument. We'll assume that xn limits to x, x′n limits to x′, and pick an arbitrary ν s.t. ν∈K(x′), and
$$d(\nu, g(x)) < \inf_{\nu' \in K(x')} d(\nu', g(x)) + \epsilon$$
Let νn∈K(x′n) be some point in K(x′n) that's as close as possible to ν. Obviously, these limit to ν, due to Hausdorff-continuity of K. Non-obviously, a tail of them (all but finitely many), will lie in ψ(xn,x′n). Here's why.
Let δ be:
$$\inf_{\nu' \in K(x')} d(\nu', g(x)) + \epsilon - d(\nu, g(x))$$
There is some finite n by which you are guaranteed three things forever afterwards.
First, $d_{hau}(K(x'_n), K(x')) < \frac{\delta}{3}$, because K is Hausdorff-continuous and $x'_n$ limits to x′.
Second, $d(g(x_n), g(x)) < \frac{\delta}{3}$, because g is continuous and $x_n$ limits to x.
Third,
$$\inf_{\nu' \in K(x')} d(\nu', g(x)) < \inf_{\nu' \in K(x'_n)} d(\nu', g(x_n)) + \frac{\delta}{3}$$
because, as we showed above, this minimum-distance quantity limits to the minimum-distance quantity of the limit. Now, at this point, we can go:
$$d(\nu_n, g(x_n)) \le d(\nu_n, \nu) + d(\nu, g(x)) + d(g(x), g(x_n))$$
$$\le d_{hau}(K(x'_n), K(x')) + d(\nu, g(x)) + d(g(x), g(x_n))$$
$$< d(\nu, g(x)) + \frac{2\delta}{3}$$
$$= \left( \inf_{\nu' \in K(x')} d(\nu', g(x)) + \epsilon - \delta \right) + \frac{2\delta}{3}$$
$$< \left( \left( \inf_{\nu' \in K(x'_n)} d(\nu', g(x_n)) + \frac{\delta}{3} \right) + \epsilon - \delta \right) + \frac{2\delta}{3}$$
$$= \inf_{\nu' \in K(x'_n)} d(\nu', g(x_n)) + \epsilon$$
This was splitting the distance up, observing that νn was selected to be as close as possible to ν, so it'd be as close or closer than the Hausdorff distance between the sets they were selected from, using that two of the distances were small, using how δ was defined, and using our bound on the minimum-distance quantity. Anyways, this shows that forever after some timestep,
$$d(\nu_n, g(x_n)) < \inf_{\nu' \in K(x'_n)} d(\nu', g(x_n)) + \epsilon$$
And therefore, since
$$\psi(x, x') := \overline{\left\{\nu \in K(x') \,\middle|\, d(\nu, g(x)) < \inf_{\nu' \in K(x')} d(\nu', g(x)) + \epsilon\right\}}$$
We have that
$$\nu_n \in \psi(x_n, x'_n)$$
Time for our final part! Showing lower-hemicontinuity of $\psi$ requires showing that, for all sequences $(x_n, x'_n)$ limiting to $(x, x')$, and all $\nu \in \psi(x, x')$, we can find a sequence $\nu_n \in \psi(x_n, x'_n)$ limiting to $\nu$. We just showed that for $\nu$ fulfilling the defining inequality. But we took the closure, didn't we? What if our $\nu$ was added in the closure? Then we can take a sequence $\nu_m \in \psi(x, x')$ (but without the closure part) that does fulfill the defining inequality, and for each m make a sequence $\nu_{n,m} \in \psi(x_n, x'_n)$ which limits to $\nu_m$. Then, we basically proceed for a while in $\nu_{n,0}$ until we have a guarantee that $\nu_{n,1}$ will remain close to $\nu_1$, switch to picking from the sequence $\nu_{n,1}$, and repeat, slowly ratcheting up the second index over time. Said sequence gets arbitrarily close to the $\nu_m$ values, which get arbitrarily close to $\nu$ itself, so this "step up the sequence index when you've waited long enough" sequence limits to $\nu$.
And with this argument, we know that ψ is lower-hemicontinuous.
Alright, now we can start embarking on the true proof instead of just setting everything up. Remember, we assumed that there was a crisp infrakernel $K: X \to \Delta X$ (all the K(x) have compact sets of minimal points, by CAS), and that there was some fixed $c < 1$ where $d_{hau}(K(x)^{min}, K(x')^{min}) \le c \cdot d(x, x')$ regardless of x and x′, and our goal is to show that there's a unique stationary crisp infradistribution, which can be found by selecting any crisp infradistribution and repeatedly iterating K.
There's a bit of setup to do first, but thankfully it isn't as demanding as we've done before. First, we'll abuse notation a bit by using H and H′ and K(x) and K(x′) to refer, not to the infradistribution as a whole, but to their compact convex sets of minimal points, ie, sets of probability distributions.
Second, we recall, from the proof of the relevant earlier proposition, the form of $K_*(H)$ (the pushforward of H). Well, actually, the form of its set of minimal points. It is:
$$\overline{c.h.}\left( \bigcup_{\mu \in H} \mathbb{E}_\mu[K(x)] \right)$$
Ie, the closed convex hull of the union of mixtures of the sets of minimal points of the K(x). Mixing sets with regard to a probability distribution is done by fixing a measurable selection function $s: X \to \Delta X$ s.t. for all x, $s(x) \in K(x)$. Then $\mathbb{E}_\mu[s(x)]$ is a probability distribution, and you union all those points together, over all the possible selection functions, to make your set $\mathbb{E}_\mu[K(x)]$.
Our final thing to recall (well, actually, look up from Wikipedia), is that since the Earthmover distance/1st Wasserstein metric/KR-distance is all the same, we'll be using an alternate formulation of distance. If μ,μ′ are probability distributions over X, then
$$d(\mu, \mu') = \inf_{\xi' \in \Gamma(\mu, \mu')} \int_{X \times X} d(x, x')\, d\xi'$$
Ie, of all probability distributions over X×X, restrict to those which make μ and μ′ when you project it down to the appropriate coordinates, and try to pick one that makes the integral of the distance function as low as possible. The lowest integral of distance you can get is the earthmover distance. You can check Wikipedia for this formulation. We'll be using this form once.
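Since we'll use this coupling form once, here's a minimal sketch computing the earthmover distance for finite distributions by solving the coupling problem as a linear program. This is standard optimal transport, nothing specific to this post; scipy's linprog does the optimization, and the function names are mine:

```python
import numpy as np
from scipy.optimize import linprog

def earthmover(mu, mu2, D):
    """W1 distance between finite distributions mu, mu2 with cost matrix D:
    minimize sum_{x,x'} D[x,x'] * xi[x,x'] over couplings xi with the right marginals."""
    n, m = len(mu), len(mu2)
    A_eq = np.zeros((n + m, n * m))
    for i in range(n):
        A_eq[i, i * m:(i + 1) * m] = 1.0  # row i of xi sums to mu[i]
    for j in range(m):
        A_eq[n + j, j::m] = 1.0           # column j of xi sums to mu2[j]
    b_eq = np.concatenate([mu, mu2])
    res = linprog(D.ravel(), A_eq=A_eq, b_eq=b_eq, bounds=(0, None))
    return res.fun

# Three points on a line, metric |x - x'| (bounded above by 1, as the proof assumes).
pts = np.array([0.0, 0.5, 1.0])
D = np.abs(pts[:, None] - pts[None, :])
print(earthmover(np.array([1.0, 0.0, 0.0]), np.array([0.0, 0.0, 1.0]), D))  # 1.0
print(earthmover(np.array([0.5, 0.5, 0.0]), np.array([0.0, 0.5, 0.5]), D))  # 0.5
```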
Let's begin. First, by the form of the set of minimal points of the pushforward,
$$d_{hau}(K_*(H), K_*(H'))$$
$$= d_{hau}\left( \overline{c.h.}\left( \bigcup_{\mu \in H} \mathbb{E}_\mu[K(x)] \right), \overline{c.h.}\left( \bigcup_{\mu' \in H'} \mathbb{E}_{\mu'}[K(x)] \right) \right)$$
And then, we observe that taking the closure doesn't affect the Hausdorff distance at all, so we can go:
$$= d_{hau}\left( c.h.\left( \bigcup_{\mu \in H} \mathbb{E}_\mu[K(x)] \right), c.h.\left( \bigcup_{\mu' \in H'} \mathbb{E}_{\mu'}[K(x)] \right) \right)$$
$$\le d_{hau}\left( \bigcup_{\mu \in H} \mathbb{E}_\mu[K(x)], \bigcup_{\mu' \in H'} \mathbb{E}_{\mu'}[K(x)] \right)$$
The reason for that inequality is that taking the convex hull doesn't increase the Hausdorff distance. If you've got a finite mix of points from $\bigcup_{\mu \in H} \mathbb{E}_\mu[K(x)]$ in the convex hull, then you can take each of those points, find the closest point from the analogous set for H′, and mix those together to make a nearby point in the other convex hull, so the distance doesn't go up when you take convex hulls.
At this point, we have the inequality:
$$d_{hau}(K_*(H), K_*(H')) \le d_{hau}\left( \bigcup_{\mu \in H} \mathbb{E}_\mu[K(x)], \bigcup_{\mu' \in H'} \mathbb{E}_{\mu'}[K(x)] \right)$$
Our mission is to impose a suitable bound on the second term. Once we've done that, we show that $K_*$ has type $\mathcal{K}(\Delta X) \to \mathcal{K}(\Delta X)$ (maps compact sets of probability distributions to compact sets of probability distributions), and then we can apply the Banach fixpoint theorem on $\mathcal{K}(\Delta X)$ to get a fix-set of probability distributions that doesn't alter when you apply $K_*$ (the pushforward), and can be found by iteration.
For our first part, imposing a bound on
$$d_{hau}\left( \bigcup_{\mu \in H} \mathbb{E}_\mu[K(x)], \bigcup_{\mu' \in H'} \mathbb{E}_{\mu'}[K(x)] \right)$$
we shall do the following. We will pick a point in the first set and find a nearby point in the second set. A point from the first set must be associated with some $\mu \in H$ and some measurable selection function $s: X \to \Delta X$ where, for all x, $s(x) \in K(x)$. Said point is written as $\mathbb{E}_\mu[s(x)]$, or alternatively as $\int_X s(x)\, d\mu$. Our task is to find a suitable point in the second set.
Construct it as follows: Select μ′∈H′ to be as close to μ as you can get. Pick an ϵ. Then, pick some ξ∈Δ(X×X) s.t. the projection of ξ to the first coordinate is μ and the projection to the second coordinate is μ′, and:
$$\int_{X \times X} d(x, x')\, d\xi < \inf_{\xi' \in \Gamma(\mu, \mu')} \int_{X \times X} d(x, x')\, d\xi' + \epsilon$$
Such a ξ can always be found. Pick a continuous function g:X→ΔX such that
$$\int_X \|s(x) - g(x)\|\, d\mu < \epsilon$$
We can always pick such a function because we've already shown that continuous functions are dense in the measurable functions.
At this point, recall our function ψ:X×X→K(ΔX) defined as:
$$\psi(x, x') := \overline{\left\{\nu \in K(x') \,\middle|\, d(\nu, g(x)) < \inf_{\nu' \in K(x')} d(\nu', g(x)) + \epsilon\right\}}$$
We know that said function will be lower-hemicontinuous. Further, it always produces closed sets (we explicitly took the closure), they're nonempty (just look at it a bit), and they're convex. Convexity is a bit more involved: it occurs because a mix of points that are each within a certain distance of some fixed point (g(x) in this case) will also be within the same distance of said point, and the K(x′) are all convex. At this point, we can invoke the Michael selection theorem to get a continuous function $g': X \times X \to \Delta X$ where, for all x, x′, $g'(x, x') \in \psi(x, x')$.
At this point, remember our distribution ξ∈Δ(X×X) which projects down to μ′? Well, by the disintegration theorem (the usual probability theory version), ξ can be written as μ′⋉k′ where k′(x′)∈ΔX gives the conditional probability distribution over X if you see some x′.
And now we can finally define our measurable selection function s′. It is defined as:
$$s'(x') = \int_{x \in X} g'(x, x')\, dk'(x')$$
And, our final point that we will pick to be close to Eμ[s(x)] will be: Eμ′[s′(x′)].
Uh... Is s′ even a legit selection function? What we need is that for all x′, s′(x′)∈K(x′). Well, fear not.
$$g'(x, x') \in \psi(x, x') = \overline{\left\{\nu \in K(x') \,\middle|\, d(\nu, g(x)) < \inf_{\nu' \in K(x')} d(\nu', g(x)) + \epsilon\right\}}$$
And, because K(x′) is closed and we took the closure of a batch of points within it, we have:
$$g'(x, x') \in \psi(x, x') \subseteq K(x')$$
And, because K(x′) is compact and convex, making s′(x′) by integrating over a bunch of different choices of x just takes an average of a batch of points within K(x′), which lies in K(x′) by convexity and closedness. So we're good: $s'(x') \in K(x')$ regardless of x′, and it's a legit selection function. Now we can get into inequality shuffling!
$$d(\mathbb{E}_\mu[s(x)], \mathbb{E}_{\mu'}[s'(x')])$$
$$= \left\| \int_X s(x)\, d\mu - \int_X s'(x')\, d\mu' \right\|$$
$$\le \left\| \int_X s(x)\, d\mu - \int_X g(x)\, d\mu \right\| + \left\| \int_X g(x)\, d\mu - \int_X s'(x')\, d\mu' \right\|$$
$$= \left\| \int_X s(x)\, d\mu - \int_X g(x)\, d\mu \right\| + \left\| \int_X g(x)\, d\mu - \int_{x' \in X} \left( \int_{x \in X} g'(x, x')\, dk'(x') \right) d\mu' \right\|$$
And then we can notice something really interesting. We can rewrite that final integral as
$$\mathbb{E}_{\mu'}\left[ \lambda x'. \mathbb{E}_{k'(x')}(\lambda x. g'(x, x')) \right]$$
Look familiar? That's just the semidirect product!! It's the exact same as
$$\mathbb{E}_{\mu' \ltimes k'}[g'(x, x')]$$
And, by our construction, μ′⋉k′=ξ. So, writing this as an integral, we have:
$$= \left\| \int_X s(x)\, d\mu - \int_X g(x)\, d\mu \right\| + \left\| \int_X g(x)\, d\mu - \int_{X \times X} g'(x, x')\, d\xi \right\|$$
Also, because ξ projects to μ in the first coordinate, we can write our integral over μ as an integral over ξ instead; it doesn't change the value.
$$= \left\| \int_X s(x)\, d\mu - \int_X g(x)\, d\mu \right\| + \left\| \int_{X \times X} g(x)\, d\xi - \int_{X \times X} g'(x, x')\, d\xi \right\|$$
$$= \left\| \int_X (s(x) - g(x))\, d\mu \right\| + \left\| \int_{X \times X} (g(x) - g'(x, x'))\, d\xi \right\|$$
And move the norm inside to get:
$$\le \int_X \|s(x) - g(x)\|\, d\mu + \int_{X \times X} \|g(x) - g'(x, x')\|\, d\xi$$
We know that s and g are close w.r.t. μ, so we can get:
$$< \epsilon + \int_{X \times X} \|g(x) - g'(x, x')\|\, d\xi$$
Now, we should observe something interesting. Because g′(x,x′)∈ψ(x,x′), and using how ψ(x,x′) was defined, we have that:
$$d(g'(x, x'), g(x)) \le \inf_{\nu' \in K(x')} d(\nu', g(x)) + \epsilon$$
Using the equivalence between the distance between two points and the norm of their difference, we have:
$$\le \epsilon + \int_{X \times X} \left( \inf_{\nu' \in K(x')} \|\nu' - g(x)\| + \epsilon \right) d\xi$$
$$= \int_{X \times X} \left( \inf_{\nu' \in K(x')} \|\nu' - g(x)\| \right) d\xi + 2\epsilon$$
And then we can use another distance inequality to go:
$$\le \int_{X \times X} \left( \inf_{\nu' \in K(x')} \left( \|\nu' - s(x)\| + \|s(x) - g(x)\| \right) \right) d\xi + 2\epsilon$$
And now, since the latter term isn't affected by the inf, we can pull it out, and get:
$$= \int_{X \times X} \left( \inf_{\nu' \in K(x')} \|\nu' - s(x)\| \right) d\xi + \int_{X \times X} \|s(x) - g(x)\|\, d\xi + 2\epsilon$$
Again, the projection of ξ to the first coordinate is μ, so we have:
$$= \int_{X \times X} \left( \inf_{\nu' \in K(x')} \|\nu' - s(x)\| \right) d\xi + \int_X \|s(x) - g(x)\|\, d\mu + 2\epsilon$$
And, because s and g are close w.r.t μ, we have:
$$< \int_{X \times X} \left( \inf_{\nu' \in K(x')} \|\nu' - s(x)\| \right) d\xi + 3\epsilon$$
And then... s(x)∈K(x), and ν′ is as close as possible within K(x′), so we have:
$$\le \int_{X \times X} d_{hau}(K(x), K(x'))\, d\xi + 3\epsilon$$
And we know that, as one of our starting assumptions, dhau(K(x),K(x′))≤cd(x,x′) where c is a constant below 1. So,
$$\le \int_{X \times X} c \cdot d(x, x')\, d\xi + 3\epsilon$$
$$= c \int_{X \times X} d(x, x')\, d\xi + 3\epsilon$$
But wait, we selected ξ such that:
$$\int_{X \times X} d(x, x')\, d\xi < \inf_{\xi' \in \Gamma(\mu, \mu')} \int_{X \times X} d(x, x')\, d\xi' + \epsilon$$
So, making that substitution, we have:
$$< c \left( \inf_{\xi' \in \Gamma(\mu, \mu')} \int_{X \times X} d(x, x')\, d\xi' + \epsilon \right) + 3\epsilon$$
$$< c \left( \inf_{\xi' \in \Gamma(\mu, \mu')} \int_{X \times X} d(x, x')\, d\xi' \right) + 4\epsilon$$
And this is just the definition of earthmover distance, so we have:
$$= c \cdot d(\mu, \mu') + 4\epsilon$$
And, μ′ was selected from H′ to be as close as possible to μ, so we've got one last inequality.
$$\le c \cdot d_{hau}(H, H') + 4\epsilon$$
Our final inequality, putting all this together, is that
$$d(\mathbb{E}_\mu[s(x)], \mathbb{E}_{\mu'}[s'(x')]) < c \cdot d_{hau}(H, H') + 4\epsilon$$
So, given an arbitrary point in H, we can craft a point in H′ that isn't too far away, and swap the roles of H and H′ to get:
$$d_{hau}\left( \bigcup_{\mu \in H} \mathbb{E}_\mu[K(x)], \bigcup_{\mu' \in H'} \mathbb{E}_{\mu'}[K(x)] \right) < c \cdot d_{hau}(H, H') + 4\epsilon$$
But wait, ϵ was arbitrary! So we can shrink it to 0 to get:
$$d_{hau}\left( \bigcup_{\mu \in H} \mathbb{E}_\mu[K(x)], \bigcup_{\mu' \in H'} \mathbb{E}_{\mu'}[K(x)] \right) \le c \cdot d_{hau}(H, H')$$
Putting this bound together with what we were working on at the start,
$$d_{hau}(K_*(H), K_*(H')) \le c \cdot d_{hau}(H, H')$$
Or, alternately,
$$\frac{d_{hau}(K_*(H), K_*(H'))}{d_{hau}(H, H')} \le c < 1$$
So, K∗ is definitely a contraction mapping. But we do need to check that if H is a compact set of probability distributions, then K∗(H) is compact as well. Fortunately, this is doable because the pushforward of an infradistribution is an infradistribution, and you can chop off an infradistribution at any +b level and get a compact set of a-measures. So if we chop it off at 0, we get exactly what we want.
The Hausdorff-limit of a sequence of convex compact sets is convex compact, and we know that, for all H, H′ that are compact convex,
$$\frac{d_{hau}(K_*(H), K_*(H'))}{d_{hau}(H, H')} \le c < 1$$
So, $K_*$ is a contraction mapping on this complete metric space (the space of compact subsets of a complete space is complete, and the convex compact sets form a closed subspace), and we can apply the Banach fixpoint theorem to find a unique fixset S s.t. $S = K_*(S)$. S is a convex compact set of probability distributions, so it corresponds to a crisp infradistribution, and it can be found by taking any crisp infradistribution and iterating $K_*$ (the pushforward). And we're done!
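As a sanity check on this last step, here's a toy sketch of the same Banach-fixpoint-on-compact-sets phenomenon, stripped of all the infradistribution bookkeeping: a set-valued map on $[0,1]$ whose induced map on compact sets contracts the Hausdorff metric with ratio $c = \frac{1}{2}$, so iteration Hausdorff-converges to the unique fixed set. The specific map is invented for illustration.

```python
import numpy as np

def hausdorff_1d(A, B):
    D = np.abs(A[:, None] - B[None, :])
    return max(D.min(axis=1).max(), D.min(axis=0).max())

# A toy set-valued contraction on [0,1]: each point x maps to the set {x/2, x/2 + 1/2}.
# The induced map on compact sets, F(H) = union of the pointwise images, contracts
# the Hausdorff metric with ratio 1/2, so Banach gives a unique fixed set,
# found by iterating from any starting set.
def F(H):
    return np.unique(np.concatenate([H / 2, H / 2 + 0.5]))

H = np.array([0.3])  # arbitrary starting set, standing in for "any crisp infradistribution"
for step in range(10):
    H_next = F(H)
    print(step, hausdorff_1d(H, H_next))  # successive gaps shrink geometrically (ratio 1/2)
    H = H_next
```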
Proposition 51: If h is the infradistribution corresponding to a probability distribution μ, then if $\alpha \in (1, \infty)$, $H_\alpha(h) = H_\alpha(\mu)$, where the latter $H_\alpha$ is the conventional definition of Renyi entropy for probability distributions, and the former $H_\alpha$ is the definition of Renyi entropy for infradistributions.
For an infradistribution, the Renyi entropy would be:
$$H_\alpha(h) = \frac{\alpha}{1-\alpha} \ln\left( \max_q h\left( q_i^{\frac{\alpha-1}{\alpha}} \right) \right)$$
But h corresponds to the probability distribution μ, so the h turns into an expectation:
$$= \frac{\alpha}{1-\alpha} \ln\left( \max_q \left( \sum_i \mu_i q_i^{\frac{\alpha-1}{\alpha}} \right) \right)$$
Note that $q \in \Delta X$; it's a probability distribution. To solve the maximization, we will be using Lagrange multipliers. We maximize $\sum_i \mu_i q_i^{\frac{\alpha-1}{\alpha}}$ subject to the constraint that $\sum_i q_i - 1 = 0$. Thus, we make the function
$$\sum_i \mu_i q_i^{\frac{\alpha-1}{\alpha}} - \lambda \left( \sum_i q_i - 1 \right)$$
and set all the partial derivatives w.r.t. the qi and λ to 0, so we get a series of equations of the form:
$$\frac{\alpha-1}{\alpha} \mu_n q_n^{-\frac{1}{\alpha}} - \lambda = 0$$
for the various q's, and for λ, we have:
$$\sum_i q_i - 1 = 0$$
And then, if we can derive what the various $q_n$ are and what λ is, we can conclude that those maximize our function. Solving this yields:
$$q_n = \frac{\mu_n^\alpha}{\sum_i \mu_i^\alpha}$$
$$\lambda = \frac{\alpha-1}{\alpha} \left( \sum_i \mu_i^\alpha \right)^{\frac{1}{\alpha}}$$
Now, for the partial derivative w.r.t. λ, we have:
$$\sum_i q_i - 1 = \frac{\sum_i \mu_i^\alpha}{\sum_i \mu_i^\alpha} - 1 = 1 - 1 = 0$$
and for the partial derivative w.r.t. qn, we have:
$$\frac{\alpha-1}{\alpha} \mu_n q_n^{-\frac{1}{\alpha}} - \lambda = \frac{\alpha-1}{\alpha} \mu_n \left( \frac{\mu_n^\alpha}{\sum_i \mu_i^\alpha} \right)^{-\frac{1}{\alpha}} - \frac{\alpha-1}{\alpha} \left( \sum_i \mu_i^\alpha \right)^{\frac{1}{\alpha}}$$
$$= \frac{\alpha-1}{\alpha} \left( \mu_n \left( \frac{\mu_n^\alpha}{\sum_i \mu_i^\alpha} \right)^{-\frac{1}{\alpha}} - \left( \sum_i \mu_i^\alpha \right)^{\frac{1}{\alpha}} \right)$$
$$= \frac{\alpha-1}{\alpha} \left( \mu_n \mu_n^{-1} \left( \sum_i \mu_i^\alpha \right)^{\frac{1}{\alpha}} - \left( \sum_i \mu_i^\alpha \right)^{\frac{1}{\alpha}} \right)$$
$$= \frac{\alpha-1}{\alpha} \left( \left( \sum_i \mu_i^\alpha \right)^{\frac{1}{\alpha}} - \left( \sum_i \mu_i^\alpha \right)^{\frac{1}{\alpha}} \right) = 0$$
So, these are legit. Now, when we substitute these values into our maximization problem, we get:
$$\frac{\alpha}{1-\alpha} \ln\left( \max_q \left( \sum_i \mu_i q_i^{\frac{\alpha-1}{\alpha}} \right) \right)$$
$$= \frac{\alpha}{1-\alpha} \ln\left( \sum_n \mu_n \left( \frac{\mu_n^\alpha}{\sum_i \mu_i^\alpha} \right)^{\frac{\alpha-1}{\alpha}} \right)$$
$$= \frac{\alpha}{1-\alpha} \ln\left( \sum_n \mu_n \left( \frac{\mu_n^{\alpha-1}}{\left( \sum_i \mu_i^\alpha \right)^{\frac{\alpha-1}{\alpha}}} \right) \right)$$
$$= \frac{\alpha}{1-\alpha} \ln\left( \frac{\sum_i \mu_i^\alpha}{\left( \sum_i \mu_i^\alpha \right)^{\frac{\alpha-1}{\alpha}}} \right)$$
$$= \frac{\alpha}{1-\alpha} \ln\left( \left( \sum_i \mu_i^\alpha \right)^{\frac{1}{\alpha}} \right)$$
$$= \frac{1}{1-\alpha} \ln\left( \sum_i \mu_i^\alpha \right) = H_\alpha(\mu)$$
Which is the standard formulation of Renyi entropy for a probability distribution.
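A quick numerical check of this equality, using the Lagrange-multiplier maximizer derived above. The distribution is made up, and renyi_standard / renyi_via_maximization are my names:

```python
import numpy as np

mu = np.array([0.5, 0.25, 0.125, 0.125])

def renyi_standard(mu, alpha):
    return np.log(np.sum(mu ** alpha)) / (1 - alpha)

def renyi_via_maximization(mu, alpha):
    q = mu ** alpha / np.sum(mu ** alpha)  # the Lagrange-multiplier maximizer q_n
    return (alpha / (1 - alpha)) * np.log(np.sum(mu * q ** ((alpha - 1) / alpha)))

for alpha in [1.5, 2.0, 5.0]:
    print(renyi_standard(mu, alpha), renyi_via_maximization(mu, alpha))  # the two agree

# And the claimed maximizer really does beat randomly sampled q's:
rng = np.random.default_rng(0)
alpha = 2.0
q_star = mu ** alpha / np.sum(mu ** alpha)
objective = lambda q: np.sum(mu * q ** ((alpha - 1) / alpha), axis=-1)
assert objective(q_star) >= objective(rng.dirichlet(np.ones(4), size=1000)).max()
```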
Proposition 52: $\lim_{\alpha \to \infty} H_\alpha(h)$ exists for all infradistributions h, and equals $-\ln(\sup_q h(q))$.
Our task will be to bound the difference between
$$\frac{\alpha}{1-\alpha} \ln\left( \sup_q h\left( q^{\frac{\alpha-1}{\alpha}} \right) \right)$$
and
$$-\ln\left( \sup_q h(q) \right)$$
as $\alpha \to \infty$.
Well, for this endeavor, our first task is to bound the distance between an arbitrary function q and the function $q^{\frac{\alpha-1}{\alpha}}$, for sufficiently large α.
Remember, the distance between these functions is $\sup_i \left| q_i^{\frac{\alpha-1}{\alpha}} - q_i \right|$.
Now, since all our functions are bounded in [0,1], we're raising a number below 1 to a power slightly below 1, so it actually stays the same or gets larger. So, we don't need the absolute value here.
Thus, to figure out the largest possible value of this quantity for a given α, let's let $q_i$ be our variable and α be our constant. Then we differentiate, set the derivative to 0, solve the equation for $q_i$, and substitute that back in to find the largest possible value this quantity can take, and thus the largest possible difference between the functions $q^{\frac{\alpha-1}{\alpha}}$ and q, as a function of α.
Differentiating $q_i^{\frac{\alpha-1}{\alpha}} - q_i$ w.r.t. $q_i$ and setting it equal to 0, we must solve for $q_i$ in the following equation:
$$\frac{\alpha-1}{\alpha} q_i^{-\frac{1}{\alpha}} - 1 = 0$$
The value of $q_i$ that solves this equation is:
$$q_i = \left( \frac{\alpha-1}{\alpha} \right)^\alpha$$
as can be seen by substituting this back in. This is our qi which attains the maximum value (no weird stuff going on, as shown by graphing the function), and we're raising a number below 1 to a high power, so it's in [0,1], and thus a permissible function.
Substituting this back into $q_i^{\frac{\alpha-1}{\alpha}} - q_i$, we get:
$$= q_i \left( q_i^{-\frac{1}{\alpha}} - 1 \right)$$
$$= \left( \frac{\alpha-1}{\alpha} \right)^\alpha \left( \left( \left( \frac{\alpha-1}{\alpha} \right)^\alpha \right)^{-\frac{1}{\alpha}} - 1 \right)$$
$$= \left( \frac{\alpha-1}{\alpha} \right)^\alpha \left( \left( \frac{\alpha-1}{\alpha} \right)^{-1} - 1 \right)$$
$$= \left( \frac{\alpha-1}{\alpha} \right)^\alpha \left( \frac{\alpha}{\alpha-1} - \frac{\alpha-1}{\alpha-1} \right)$$
$$= \left( \frac{\alpha-1}{\alpha} \right)^\alpha \left( \frac{1}{\alpha-1} \right)$$
And this is the maximum possible difference between the functions q and $q^{\frac{\alpha-1}{\alpha}}$, as a function of α.
Therefore,
$$\forall q: d\left( q^{\frac{\alpha-1}{\alpha}}, q \right) \le \left( \frac{\alpha-1}{\alpha} \right)^\alpha \left( \frac{1}{\alpha-1} \right) \le \frac{1}{\alpha-1}$$
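A quick numeric check of this maximum-gap claim, comparing a grid search over q against the closed form derived above:

```python
import numpy as np

for alpha in [2.0, 5.0, 50.0]:
    q = np.linspace(0.0, 1.0, 100001)
    gap = q ** ((alpha - 1) / alpha) - q           # pointwise difference, nonnegative on [0, 1]
    q_star = ((alpha - 1) / alpha) ** alpha        # the claimed argmax
    closed_form = q_star * (1.0 / (alpha - 1))     # the claimed maximum gap
    print(alpha, gap.max(), closed_form, 1 / (alpha - 1))  # grid max matches the closed form,
                                                           # and both are <= 1/(alpha-1)
```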
Now, due to Lipschitzness of h, in the $\alpha \to \infty$ limit,
$$\left| h(q) - h\left( q^{\frac{\alpha-1}{\alpha}} \right) \right|$$
limits to 0 uniformly in q. Therefore, in the $\alpha \to \infty$ limit, $\sup_q h\left( q^{\frac{\alpha-1}{\alpha}} \right)$ limits to $\sup_q h(q)$.
Therefore, the ln of these quantities limit to each other. And $\frac{\alpha}{1-\alpha}$ limits to $-1$.
Accordingly, in the $\alpha \to \infty$ limit,
$$\frac{\alpha}{1-\alpha} \ln\left( \sup_q h\left( q^{\frac{\alpha-1}{\alpha}} \right) \right)$$
limits to
$$-\ln\left( \sup_q h(q) \right)$$
and so we can define the Renyi entropy for $\alpha = \infty$.
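For the crisp case where h corresponds to a probability distribution μ, $\sup_q h(q) = \max_i \mu_i$ (the sup puts all of q's mass on the likeliest point), so this $\alpha = \infty$ value is the familiar min-entropy. A numerical illustration with a made-up distribution:

```python
import numpy as np

mu = np.array([0.5, 0.25, 0.125, 0.125])
h_infinity = -np.log(mu.max())  # -ln(sup_q h(q)): all of q's mass goes on argmax mu
for alpha in [2.0, 10.0, 100.0, 1000.0]:
    h_alpha = np.log(np.sum(mu ** alpha)) / (1 - alpha)
    print(alpha, h_alpha, h_infinity)  # H_alpha(mu) approaches the min-entropy as alpha grows
```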