A change in terminology: It is convenient when important concepts have short names. The concept of an "optimal predictor scheme" seems much more important than its historical predecessor, the "optimal predictor". Therefore "optimal predictor schemes" will be henceforth called just "optimal predictors" while the previous concept of "optimal predictor" might be called "flat optimal predictor".
We study systems of computations which have access to optimal predictors for each other. We expect such systems to play an important role in decision theory (where self-prediction is required to define logical counterfactuals and mutual prediction is required for a collection of agents in a game) and Vingean reflection (where the different computations correspond to different successor agents). The previously known existence theorems for optimal predictors are not directly applicable to this case. To overcome this we prove new, specifically tailored existence theorems.
The Results section states the main novelties, Appendix A contains adaptations of old theorems, Appendix B proves selected claims from Appendix A, and Appendix C proves the novel results.
Results
Notation
Given sets X and Y, X→Y will denote the set of mappings from X to Y.
Before taking on reflection, we introduce a stronger concept of optimal predictor, to which the previous existence theorems still apply.
Definition 1
Let r be a positive integer. A proto-error space of rank r is a set E of bounded functions from $\mathbb{N}^r$ to $\mathbb{R}_{\ge 0}$ s.t.
(i) If $\delta_1, \delta_2 \in E$ then $\delta_1 + \delta_2 \in E$.
(ii) If $\delta_1 \in E$ and $\delta_2 \le \delta_1$ then $\delta_2 \in E$.
(iii) There is a polynomial $h: \mathbb{N}^r \to \mathbb{R}$ s.t. $2^{-h} \in E$.
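As a quick sanity check (an illustrative example of ours, not part of the original development), the bounded functions with exponential decay form a proto-error space of rank 1:
$$E_{\exp} := \{\delta: \mathbb{N} \to \mathbb{R}_{\ge 0} \text{ bounded} \mid \exists c > 0\; \forall k: \delta(k) \le c \cdot 2^{-k}\}$$
Conditions (i) and (ii) are immediate, and taking $h(k) = k$ witnesses condition (iii).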
Proposition 1
If E is a proto-error space of rank r and $\alpha \in \mathbb{R}_{>0}$, then $E^\alpha := \{\delta^\alpha \mid \delta \in E\}$ is also a proto-error space of rank r.
Proposition 2
If E is a proto-error space of rank r, $\alpha, \gamma \in \mathbb{R}_{>0}$ and $\alpha < \gamma$ then $E^\gamma \subseteq E^\alpha$.
Definition 3
Fix E a proto-error space of rank 2 and (f,μ) a distributional estimation problem. Consider ^P a (poly,log)-predictor.
^P is called an E(poly,log)-optimal predictor for (f,μ) when for any (poly,log)-predictor ^Q, there is δ∈E s.t.
$$\mathrm{E}_{\mu^k \times U^{r_P(k,j)}}[(\hat{P}^{kj}(x) - f(x))^2] \le \mathrm{E}_{\mu^k \times U^{r_Q(k,j)}}[(\hat{Q}^{kj}(x) - f(x))^2] + \delta(k,j)$$
^P is called an E∗(poly,log)-optimal predictor for (f,μ) when there is α>0 s.t. ^P is an $E^\alpha$(poly,log)-optimal predictor for (f,μ).
E∗(poly,log)-optimal predictors have properties closely parallel to Δ(poly,log)-optimal predictors. The corresponding theorems are listed in Appendix A. Most theorems are given without proof, as the proofs are closely analogous to before, with the exception of Theorem A.4 which is proven in Appendix B.
We now consider a generalization in which the advice is allowed to be random.
Definition 4
Given appropriate sets X and Y, consider $Q: \mathbb{N}^2 \times X \times (\{0,1\}^*)^2 \xrightarrow{\text{alg}} Y$, $r_Q: \mathbb{N}^2 \to \mathbb{N}$ polynomial and $\{\sigma_Q^{kj}: \{0,1\}^* \to [0,1]\}_{k,j \in \mathbb{N}}$ a family of probability measures. The triple $\hat{Q} = (Q, r_Q, \sigma_Q)$ is called a (poly,rlog)-bischeme of signature X→Y when
(i) The runtime of Qkj(x,y,z) is bounded by p(k,j) with p polynomial.
(ii) $\operatorname{supp} \sigma_Q^{kj} \subseteq \{0,1\}^{\le c \lfloor \log(k+2) + \log(j+2) \rfloor}$ for some c∈N.
We will use the notation ^Qkj(x,y) to stand for the obvious Y-valued σkjQ-random variable and the notation ^Qkj(x) to stand for the obvious Y-valued UrQ(k,j)×σkjQ-random variable.
A (poly,rlog)-bischeme of signature {0,1}∗→Y will also be called a Y-valued (poly,rlog)-bischeme.
A [0,1]-valued (poly,rlog)-bischeme will also be called a (poly,rlog)-predictor.
Note 1
Conceptually advice corresponds to precomputation which might be expensive or even "hyperprecomputation". Random advice corresponds to random precomputation. Random logarithmic advice can be always replaced by deterministic polynomial advice at the cost of a small rounding error.
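To make the rounding remark concrete, here is a minimal Python sketch (ours, with hypothetical names): the support of a random logarithmic advice measure has only polynomially many strings, so deterministic polynomial advice can simply tabulate the measure to polynomially many bits, and a [0,1]-valued predictor can output the correspondingly weighted average (which, by convexity of squared error, does at least as well up to the rounding error).

    from fractions import Fraction

    def derandomize_advice(predictor, advice_dist, precision_bits):
        # advice_dist: {advice_string: probability}; the support has size
        # 2^O(log) = poly, so the rounded table is polynomial-size advice.
        table = {a: Fraction(round(p * 2**precision_bits), 2**precision_bits)
                 for a, p in advice_dist.items()}

        def deterministic_predictor(x, y):
            # Weighted average of the randomized predictor's outputs; the
            # rounding error is at most |support| * 2**-precision_bits.
            return sum(w * Fraction(predictor(x, y, a)) for a, w in table.items())

        return deterministic_predictor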
Definition 5
Fix E a proto-error space of rank 2 and (f,μ) a distributional estimation problem. Consider ^P a (poly,rlog)-predictor. ^P is called an E(poly,rlog)-optimal predictor for (f,μ) when for any (poly,rlog)-predictor ^Q, there is δ∈E s.t.
$$\mathrm{E}_{\mu^k \times U^{r_P(k,j)} \times \sigma_P^{kj}}[(\hat{P}^{kj}(x) - f(x))^2] \le \mathrm{E}_{\mu^k \times U^{r_Q(k,j)} \times \sigma_Q^{kj}}[(\hat{Q}^{kj}(x) - f(x))^2] + \delta(k,j)$$
The concept of an E(poly,rlog)-optimal predictor is essentially of the same strength as that of an E(poly,log)-optimal predictor, as seen in the following two Propositions.
Proposition 3
Fix E a proto-error space of rank 2 and (f,μ) a distributional estimation problem. Consider ^P an E(poly,log)-optimal predictor. Then it defines an E(poly,rlog)-optimal predictor by setting σkjP(akjP)=1.
Proposition 4
Fix E a proto-error space of rank 2 and (f,μ) a distributional estimation problem. If (f,μ) admits an E(poly,rlog)-optimal predictor then it admits an E(poly,log)-optimal predictor.
We are now ready to introduce the key abstraction.
Definition 6
Given a set Σ, denote ΠΣ:=Σ×N2→{0,1}∗×N equipped with the product topology.
A reflective system is a triple (Σ,f,μ) where Σ is a set, {μkn:{0,1}∗→[0,1]}n∈Σ,k∈N is a collection of probability measures where we regard each μn as a word ensemble and {fn:suppμn×ΠΣ→[0,1]}n∈Σ is a collection of continuous functions (here suppμn has the discrete topology or, equivalently, we only require continuity in the second variable).
The motivation behind this definition is regarding ΠΣ as the space of possible predictor programs, where the first factor of {0,1}∗×N is the program itself (including advice) and the second factor is the number of intrinsic coin flips. Thus the fn represent a system of computations in which each has access to the source code of predictors for the entire system.
Definition 7
Consider a reflective system R=(Σ,f,μ) and a collection of (poly,rlog)-predictors {^Qn=(Qn,rn,σn)}n∈Σ.
Denote AΣ:=Σ×N2→{0,1}∗ equipped with the product σ-algebra. Denote σ:=∏n,k,jσkjn, a probability measure on AΣ.
Given n∈Σ, k,j∈N and a∈{0,1}∗ denote Qkjn[a]∈{0,1}∗ to be the program that computes Qkjn(x,y,a) given input x,y∈{0,1}∗. Given a∈AΣ, define ^Q[a]∈ΠΣ by ^Q[a]kjn:=(Qkjn[akjn],rn(k,j)).
Given n∈Σ, R[^Q]n:suppμn→[0,1] is defined as follows
$$R[\hat{Q}]_n(x) := \mathrm{E}_{\sigma}[f_n(x, \hat{Q}[a])]$$
The expectation value is well-defined thanks to the continuity of fn.
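The construction of Qkjn[a] is an s-m-n style operation: hard-code the advice into the source. A toy Python rendering (illustrative only; the names are ours):

    def curry_advice(program_source: str, advice: str) -> str:
        # Return the source of a program computing Q(x, y, advice):
        # the advice string becomes a hard-coded constant, so the result
        # is an ordinary two-argument program, as required for Q^{kj}_n[a].
        return (f"ADVICE = {advice!r}\n"
                f"{program_source}\n"
                f"def main(x, y):\n"
                f"    return Q(x, y, ADVICE)\n")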
Definition 8
Fix E a proto-error space of rank 2 and R=(Σ,f,μ) a reflective system. A collection {Pn}n∈Σ of (poly,rlog)-predictors is called an E(poly,rlog)-optimal predictor system for R when for any n∈Σ, Pn is an E(poly,rlog)-optimal predictor for (R[P]n,μn).
Construction 1
Given ϕ∈Φ, denote $E_1(\phi)$ the set of bounded functions $\delta: \mathbb{N} \to \mathbb{R}_{\ge 0}$ s.t.
$$\forall \epsilon \in (0,1): \lim_{k \to \infty} \phi(k)^{1-\epsilon}\, \delta(k) = 0$$
Given ϕ∈Φ, denote $E_2(ll, \phi)$ the set of bounded functions $\delta: \mathbb{N}^2 \to \mathbb{R}_{\ge 0}$ s.t.
$$\forall \psi \in \Phi: \psi \le \phi \implies \mathrm{E}_{\lambda^k_\psi}[\delta(k,j)] \in E_1(\psi)$$
Denote
$$E_2(ll) := \bigcap_{\phi \in \Phi} E_2(ll, \phi)$$
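For intuition, a worked example of ours (assuming, as we expect from the earlier posts, that elements of Φ are non-decreasing and unbounded): $\delta(k) := \phi(k)^{-1}$ lies in $E_1(\phi)$, since for every $\epsilon \in (0,1)$
$$\phi(k)^{1-\epsilon}\, \delta(k) = \phi(k)^{-\epsilon} \to 0$$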
Proposition 5
If $\phi \in \Phi$ is s.t. $\exists n: \lim_{k \to \infty} 2^{-k^n} \phi(k) = 0$, then $E_1(\phi)$ is a proto-error space.
For any ϕ∈Φ, E2(ll,ϕ) is a proto-error space.
E2(ll) is a proto-error space.
Note 2
E∗2(ll)-optimality is strictly stronger than Δ2ll-optimality. We exploit this strength in Theorem 2 below, which is the main motivation for introducing E-optimality at this point.
Theorem 1 (general existence theorem)
Any reflective system has an E2(ll)(poly,rlog)-optimal predictor system.
We also prove a more restricted existence theorem which allows deterministic advice.
Definition 9
Consider a set Σ and {μkn:{0,1}∗→[0,1]}n∈Σ,k∈N a collection of probability measures. Denote Dμ:={(n∈Σ,k∈N,j∈N,x∈{0,1}∗)∣μkn(x)>0}. Denote Ωμ:=Dμ→[0,1].
Fix a set Σ and a collection {ϕn∈Φ}n∈Σ. A ϕ-reflective system is a pair (f,μ) where μ is as above and $\{f_n: \operatorname{supp} \mu_n \times \Omega_\mu \to [0,1]\}_{n \in \Sigma}$ is a collection of functions s.t. there are collections $\{\psi_{nm} \in \Phi\}_{n,m \in \Sigma}$, $\{\alpha_n \in (0,1]\}_{n \in \Sigma}$, $\{c_n \in \mathbb{R}_{>0}\}_{n \in \Sigma}$ and probability measures $\{\rho_n: \Sigma \to [0,1]\}_{n \in \Sigma}$ satisfying
$$\forall n,m \in \Sigma: \psi_{nm} \ge \phi_n$$
$$\forall n \in \Sigma, k \in \mathbb{N}, q, \tilde{q} \in \Omega_\mu: \mathrm{E}_{\mu^k_n}[(f_n(x,q) - f_n(x,\tilde{q}))^2] \le c_n \left( \mathrm{E}_{\rho_n}\left[ \mathrm{E}_{\lambda^k_{\psi_{nm}} \times \mu^k_m}[(q^{kj}_m(x) - \tilde{q}^{kj}_m(x))^2] \right] \right)^{\alpha_n}$$
Note 3
The condition on f is a kind of Hölder condition, uniform over k.
Proposition 6
Fix a set Σ and a collection {ϕn∈Φ}n∈Σ. For any ϕ-reflective system (f,μ) and n∈Σ, fn is continuous with respect to the product topology on Ωμ.
Definition 10
Fix a set Σ and a collection {ϕn∈Φ}n∈Σ.
Given m∈N, x,y1,y2…ym∈{0,1}∗, we let ev(x,y1,y2…ym)∈{0,1}≤ω stand for the evaluation of program x on inputs y1,y2…ym without a time limit (we assume that on the output tape the machine head moves right iff it produces a symbol in {0,1} and cannot be moved left). We also extend β to be defined on {0,1}≤ω in the obvious way.
We define $ex_\mu: \Pi_\Sigma \to \Omega_\mu$ by
$$ex_\mu(\pi)^{kj}_n(x) := \mathrm{E}_{U^{(\pi^{kj}_n)_2}}[\beta(ev((\pi^{kj}_n)_1, x, y))]$$
Consider a ϕ-reflective system R=(f,μ) and a collection of (poly,rlog)-predictors {^Qn=(Qn,rn,σn)}n∈Σ. Given n∈Σ, R[^Q]n:suppμn→[0,1] is defined as follows
$$R[\hat{Q}]_n(x) := \mathrm{E}_{\sigma}[f_n(x, ex_\mu(\hat{Q}[a]))]$$
The expectation value is well-defined due to Proposition 6.
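As a concrete (and necessarily approximate) illustration of exμ — ours, not the original's — one can Monte Carlo estimate the expectation over the intrinsic coin flips, assuming the program is run by some halting evaluator:

    import random

    def ex_estimate(beta_ev, program, x, coin_flips, samples=1000):
        # Estimate ex_mu(pi)(x) = E_y[beta(ev(program, x, y))] for
        # y uniform in {0,1}^coin_flips. `beta_ev` is assumed to run the
        # program and decode its output tape into [0,1] via beta; the
        # definition itself imposes no time limit, so this is only a sketch.
        total = 0.0
        for _ in range(samples):
            y = ''.join(random.choice('01') for _ in range(coin_flips))
            total += beta_ev(program, x, y)
        return total / samples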
Definition 11
Fix a set Σ, {En}n∈Σ a collection of proto-error spaces of rank 2, a collection {ϕn∈Φ}n∈Σ and R=(f,μ) a ϕ-reflective system. A collection {Pn}n∈Σ of (poly,log)-predictors is called an E∗(poly,log)-optimal predictor system for R when for any n∈Σ there is $\gamma \in \mathbb{R}_{>0}$ s.t. Pn is an $E_n^\gamma$(poly,log)-optimal predictor for (R[P]n,μn).
Theorem 2
Consider a finite set Σ and a collection {ϕn∈Φ}n∈Σ. Any ϕ-reflective system has an E∗2(ll,ϕ)(poly,log)-optimal predictor system.
Appendix A
Definition A.1
E, a proto-error space of rank r, is called ample when there is a polynomial $h: \mathbb{N}^r \to \mathbb{R}_{>0}$ s.t. $\frac{1}{h} \in E$.
Fix E, a proto-error space of rank 2.
Theorem A.1
Assume E is ample. Consider (f,μ) a distributional estimation problem, ^P an E(poly,rlog)-optimal predictor for (f,μ) and $\{p_{kj} \in [0,1]\}_{k,j \in \mathbb{N}}$, $\{q_{kj} \in [0,1]\}_{k,j \in \mathbb{N}}$. Define $\gamma: \mathbb{N}^2 \to \mathbb{R}_{>0}$ by
$$\gamma(k,j) := \Pr_{\mu^k \times U^{r_P(k,j)} \times \sigma_P^{kj}}[p_{kj} \le \hat{P}^{kj} \le q_{kj}]^{-1}$$
Define
$$\phi_{kj} := \mathrm{E}_{\mu^k \times U^{r_P(k,j)} \times \sigma_P^{kj}}[f - \hat{P}^{kj} \mid p_{kj} \le \hat{P}^{kj} \le q_{kj}]$$
Assume that either $p_{kj}, q_{kj}$ have a number of digits logarithmically bounded in k,j or $\hat{P}^{kj}$ produces outputs with a number of digits logarithmically bounded in k,j. Then, $|\phi| \in (\gamma E)^{\frac{1}{2}}$.
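Theorem A.1 is a calibration statement: conditioned on the prediction landing in [p,q], the truth is close to the prediction on average. A small Python sketch of the estimated quantity ϕkj (names ours):

    def calibration_gap(samples, p, q):
        # samples: iterable of (prediction, truth) pairs drawn from
        # mu^k x U^{r_P(k,j)} x sigma_P^{kj}.
        # Returns the empirical analogue of phi_{kj}: the mean of
        # truth - prediction conditioned on p <= prediction <= q.
        window = [truth - pred for pred, truth in samples if p <= pred <= q]
        return sum(window) / len(window) if window else 0.0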
Theorem A.2
Consider μ a word ensemble and f1,f2:suppμ→[0,1] s.t. f1+f2≤1. Suppose ^P1 is an E∗(poly,rlog)-optimal predictor for (f1,μ) and ^P2 is an E∗(poly,rlog)-optimal predictor for (f2,μ). Then, ^P1+^P2 is an E∗(poly,rlog)-optimal predictor for (f1+f2,μ).
Theorem A.3
Consider μ a word ensemble and f1,f2:suppμ→[0,1] s.t. f1+f2≤1. Suppose ^P1 is an E∗(poly,rlog)-optimal predictor for (f1,μ) and ^P2 is an E∗(poly,rlog)-optimal predictor for (f1+f2,μ). Then, ^P2−^P1 is an E∗(poly,rlog)-optimal predictor for (f2,μ).
Definition A.2
Given E a proto-error space of rank r, the associated error space is $E^{\frac{1}{\infty}} := \bigcup_{\epsilon > 0} E^{\epsilon}$.
Theorem A.4
Consider $(f_1,\mu_1)$, $(f_2,\mu_2)$ distributional estimation problems with respective E∗(poly,rlog)-optimal predictors $\hat{P}_1$ and $\hat{P}_2$. Assume $\mu_1$ is $E^{\frac{1}{\infty}}$(log)-sampleable and $(f_2,\mu_2)$ is $E^{\frac{1}{\infty}}$(log)-generatable. Then, $\hat{P}_1 \times \hat{P}_2$ is an E∗(poly,rlog)-optimal predictor for $(f_1 \times f_2, \mu_1 \times \mu_2)$.
Theorem A.5
Consider μ a word ensemble, f:suppμ→[0,1] and D⊆{0,1}∗. Assume ^PD is an E∗(poly,rlog)-optimal predictor for (D,μ) and ^Pf∣D is an E∗(poly,rlog)-optimal predictor for (f,μ∣D). Then ^PD^Pf∣D is an E∗(poly,rlog)-optimal predictor for (χDf,μ).
Definition A.3
We define the stabilizer of E, denoted stabE, to be the set of functions γ:N2→R>0 s.t. for any δ∈E we have γδ∈E.
Theorem A.6
Fix h a polynomial s.t. $2^{-h} \in E$. Consider μ a word ensemble, f:suppμ→[0,1] and D⊆{0,1}∗. Assume $\mu^k(D)^{-1} \in \operatorname{stab} E$. Assume $\hat{P}_D$ is an E∗(poly,rlog)-optimal predictor for (D,μ) and $\hat{P}_{\chi_D f}$ is an E∗(poly,rlog)-optimal predictor for $(\chi_D f, \mu)$. Define $\hat{P}_{f \mid D}$ by
$$\hat{P}^{kj}_{f \mid D}(x) := \begin{cases} 1 & \text{if } \hat{P}^{kj}_D(x) = 0 \\ \eta\left(\frac{\hat{P}^{kj}_{\chi_D f}(x)}{\hat{P}^{kj}_D(x)}\right) \text{ rounded to } h(k,j) \text{ binary places} & \text{if } \hat{P}^{kj}_D(x) > 0 \end{cases}$$
Then, ^Pf∣D is an E∗(poly,rlog)-optimal predictor for (f,μ∣D).
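A direct Python transcription of the predictor defined in Theorem A.6 (a sketch under the assumption, consistent with the earlier posts, that η denotes the projection onto [0,1]):

    def conditional_predictor(p_D, p_chiDf, h_kj):
        # P_{f|D}: output 1 when the estimated probability of D vanishes,
        # otherwise the clipped ratio, rounded to h(k,j) binary places.
        def eta(t):  # projection onto [0, 1]
            return min(max(t, 0.0), 1.0)

        def predict(x):
            d = p_D(x)
            if d == 0:
                return 1.0
            return round(eta(p_chiDf(x) / d) * 2**h_kj) / 2**h_kj

        return predict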
Definition A.4
Consider μ a word ensemble, ^Q1, ^Q2 (poly,rlog)-predictors.
We say ^Q1 is E-similar to ^Q2 relative to μ (denoted $\hat{Q}_1 \overset{\mu}{\simeq}_E \hat{Q}_2$) when $\mathrm{E}_{\mu^k \times U^{r_1(k,j)} \times U^{r_2(k,j)} \times \sigma_1^{kj} \times \sigma_2^{kj}}[(\hat{Q}_1^{kj}(x) - \hat{Q}_2^{kj}(x))^2] \in E$.
Let Δ be an error space.
We say ^Q1 is Δ-similar to ^Q2 relative to μ (denoted $\hat{Q}_1 \overset{\mu}{\simeq}_\Delta \hat{Q}_2$) when $\mathrm{E}_{\mu^k \times U^{r_1(k,j)} \times U^{r_2(k,j)} \times \sigma_1^{kj} \times \sigma_2^{kj}}[(\hat{Q}_1^{kj}(x) - \hat{Q}_2^{kj}(x))^2] \in \Delta$.
Theorem A.7 (uniqueness theorem)
Consider (f,μ) a distributional estimation problem, ^P an E(poly,rlog)-optimal predictor for (f,μ) and ^Q a (poly,rlog)-predictor.
If ^Q is an E(poly,rlog)-optimal predictor for (f,μ) then $\hat{P} \overset{\mu}{\simeq}_{E^{1/2}} \hat{Q}$.
Conversely, if $\hat{P} \overset{\mu}{\simeq}_{E^{1/\infty}} \hat{Q}$ then ^Q is an E∗(poly,rlog)-optimal predictor for (f,μ).
Definition A.5
E is called stable when for any non-constant polynomial p:N→N there is $\alpha_p > 0$ s.t. for any δ∈E, the function $\delta'(k,j) := \delta(p(k), j)$ is in $E^{\alpha_p}$.
Proposition A.1
E2(ll) is stable.
Theorem A.8
Assume E is ample and stable. Consider (f,μ), (g,ν) distributional estimation problems, $\hat{\zeta}$ an $E^{\frac{1}{\infty}}$-pseudo-invertible reduction of (f,μ) to (g,ν) and $\hat{P}_g$ an E∗(poly,rlog)-optimal predictor for (g,ν). Define $\hat{P}_f$ by $\hat{P}^{kj}_f(x) := \hat{P}^{kj}_g(\hat{\zeta}^{kj}(x))$. Then, $\hat{P}_f$ is an E∗(poly,rlog)-optimal predictor for (f,μ).
Theorem A.9
Assume E is ample and stable. Consider (f,μ), (g,ν) distributional estimation problems, $\hat{\zeta}$ an $E^{\frac{1}{\infty}}$-pseudo-invertible weak reduction of (f,μ) to (g,ν) and $\hat{P}_g$ an E∗(poly,rlog)-optimal predictor for (g,ν). Choose $h: \mathbb{N}^2 \to \mathbb{N}$ a polynomial with $\frac{1}{h} \in E$ and define $\hat{P}_f$ by
$$\hat{P}^{kj}_f(x) := \frac{1}{h(k,j)} \sum_{i=1}^{h(k,j)} \hat{P}^{kj}_g(\hat{\zeta}^{kj}(x \ldots) \ldots)$$
Here, the ellipses signify that each term corresponds to an independent sampling of Urζ(k,j)×Urg(k,j)×σkjg. Then, ^Pf is an E∗(poly,rlog)-optimal predictor for (f,μ).
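The averaging in Theorem A.9 is plain Monte Carlo over the pseudo-inverse's and predictor's internal randomness; a minimal sketch (names ours, each call assumed to draw its own fresh coins, matching the independent samplings indicated by the ellipses):

    def averaged_reduction(p_g, zeta, h_kj):
        # P_f(x): average of h(k,j) independent evaluations of P_g(zeta(x)).
        def predict(x):
            return sum(p_g(zeta(x)) for _ in range(h_kj)) / h_kj
        return predict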
Theorem A.10
Consider (f,μ) a distributional estimation problem, ϕ∈Φ and ^G a weak Δ2sqp,ϕ(log)-generator for (f,μ). Then, ^Λ[^G] is an E2(ll,ϕ)(poly,log)-optimal predictor for (f,μ).
Appendix B
Definition B.1
Given n∈N, a function δ:N2+n→R≥0 is called E-moderate when
(i) δ is non-decreasing in arguments 3 to 2+n.
(ii) For any collection of polynomials {pi:N2→N}i<n, δ(k,j,p0(k,j)…pn−1(k,j))∈E
Lemmas B.1 and B.2 below are given only for future reference (and as an aid in spelling out the proofs of other Theorems in Appendix A).
Lemma B.1
Fix (f,μ) a distributional estimation problem and ^P a (poly,rlog)-predictor. Then, ^P is E(poly,rlog)-optimal iff there is an E-moderate function $\delta: \mathbb{N}^4 \to [0,1]$ s.t. for any $k,j,s \in \mathbb{N}$, $Q: (\{0,1\}^*)^2 \xrightarrow{\text{alg}} [0,1]$
$$\mathrm{E}_{\mu^k \times U^{r_P(k,j)} \times \sigma_P^{kj}}[(P^{kj}(x,y,z) - f(x))^2] \le \mathrm{E}_{\mu^k \times U^s}[(Q(x,y) - f(x))^2] + \delta(k,j, T^\mu_Q(k,s), 2^{|Q|})$$
Proof of Lemma B.1
Define
$$\delta(k,j,t,u) := \max_{\substack{T^\mu_Q(k,s) \le t \\ |Q| \le \log u}} \max\left( \mathrm{E}_{\mu^k \times U^{r_P(k,j)} \times \sigma_P^{kj}}[(P^{kj}(x,y,z) - f(x))^2] - \mathrm{E}_{\mu^k \times U^s}[(Q(x,y) - f(x))^2],\; 0 \right)$$
Lemma B.2
Assume E is ample. Fix (f,μ) a distributional estimation problem and ^P a corresponding E(poly,rlog)-optimal predictor. Consider ^Q a (poly,rlog)-predictor, M>0, $\hat{w}$ a $\mathbb{Q} \cap [0,M]$-valued (poly,rlog)-bischeme. Assume $r_w \ge \max(r_P, r_Q)$ and $u_P, u_Q: \{0,1\}^* \to \{0,1\}^*$ are s.t. $u_{P*}(\sigma_w) = \sigma_P$ and $u_{Q*}(\sigma_w) = \sigma_Q$. Then there is δ∈E s.t.
$$\mathrm{E}_{\mu^k \times U^{r_w(k,j)} \times \sigma_w^{kj}}[w^{kj}(x,y,z)(P^{kj}(x, y_{\le r_P(k,j)}, u_P(z)) - f(x))^2] \le \mathrm{E}_{\mu^k \times U^{r_w(k,j)} \times \sigma_w^{kj}}[w^{kj}(x,y,z)(Q^{kj}(x, y_{\le r_Q(k,j)}, u_Q(z)) - f(x))^2] + \delta(k,j)$$
Proof of Lemma B.2
Suppose $h: \mathbb{N}^2 \to \mathbb{N}$ is a polynomial s.t. $\frac{1}{h} \in E$. Given $t \in [0,M]$, define $\alpha_{kj}(t)$ to be t rounded within error $h(k,j)^{-1}$. Thus, the number of digits in $\alpha_{kj}(t)$ is logarithmic in k and j. Consider $\hat{Q}_t := (Q_t, r_w, \sigma_u)$ the (poly,rlog)-predictor defined by
$$\sigma_u := (1 \times u_P \times u_Q)_* \sigma_w$$
$$Q_t^{kj}(x, y, (a,b,c)) := \begin{cases} Q^{kj}(x, y_{\le r_Q(k,j)}, c) & \text{if } w^{kj}(x,y,a) \ge \alpha_{kj}(t) \\ P^{kj}(x, y_{\le r_P(k,j)}, b) & \text{if } w^{kj}(x,y,a) < \alpha_{kj}(t) \end{cases}$$
$\hat{Q}_t$ satisfies bounds on runtime and advice size uniform in t. Therefore, Lemma B.1 implies that there is δ∈E s.t.
$$\mathrm{E}[(\hat{P}^{kj}(x) - f(x))^2] \le \mathrm{E}[(\hat{Q}_t^{kj}(x) - f(x))^2] + \delta(k,j)$$
$$\mathrm{E}[(\hat{P}^{kj}(x) - f(x))^2 - (\hat{Q}_t^{kj}(x) - f(x))^2] \le \delta(k,j)$$
$$\mathrm{E}[\theta(\hat{w}^{kj}(x) - \alpha_{kj}(t))\,((\hat{P}^{kj}(x) - f(x))^2 - (\hat{Q}^{kj}(x) - f(x))^2)] \le \delta(k,j)$$
$$\mathrm{E}\left[\int_0^M \theta(\hat{w}^{kj}(x) - \alpha_{kj}(t))\, dt\; ((\hat{P}^{kj}(x) - f(x))^2 - (\hat{Q}^{kj}(x) - f(x))^2)\right] \le M\, \delta(k,j)$$
$$\mathrm{E}[\hat{w}^{kj}(x)\,((\hat{P}^{kj}(x) - f(x))^2 - (\hat{Q}^{kj}(x) - f(x))^2)] \le M\, \delta(k,j) + h(k,j)^{-1}$$
Lemma B.3 (orthogonality lemma)
Consider (f,μ) a distributional estimation problem and ^P an E(poly,rlog)-optimal predictor for (f,μ). Then there are $c_1, c_2 \in \mathbb{R}$ and an $E^{\frac{1}{2}}$-moderate function $\delta: \mathbb{N}^4 \to [0,1]$ s.t. for any $k,j,s \in \mathbb{N}$, $Q: (\{0,1\}^*)^2 \xrightarrow{\text{alg}} \mathbb{Q}$
$$\left| \mathrm{E}_{\mu^k \times U^s \times U^{r_P(k,j)} \times \sigma_P^{kj}}[Q(\hat{P}^{kj} - f)] \right| \le \left( c_1 + c_2\, \mathrm{E}_{\mu^k \times U^s}[Q^2] \right) \delta(k,j, T^\mu_Q(k,s), 2^{|Q|})$$
Conversely, consider $M \in \mathbb{Q}$ and ^P a $\mathbb{Q} \cap [-M,+M]$-valued (poly,rlog)-bischeme. Suppose that for any $\mathbb{Q} \cap [-M-1,+M]$-valued (poly,log)-bischeme (Q,s,b) we have $|\mathrm{E}[Q(P - f)]| \in E$.
Define $\tilde{P}$ to be s.t. computing $\tilde{P}^{kj}$ is equivalent to computing $\eta(\hat{P}^{kj})$ rounded to h(k,j) digits after the binary point, where $2^{-h} \in E$. Then, $\tilde{P}$ is an E(poly,rlog)-optimal predictor for (f,μ).
We omit the proofs of Lemma B.3 and Lemma B.4 below since they are closely analogous to before.
Lemma B.4
Consider (f,μ) a distributional estimation problem, ^P, ^Q (poly,rlog)-predictors. Suppose $p: \mathbb{N}^2 \to \mathbb{N}$ a polynomial, ϕ∈Φ and $\delta \in E_2(ll,\phi)$ are s.t.
$$\forall i,k,j \in \mathbb{N}: \mathrm{E}[(\hat{P}^{k,p(k,j)+i} - f)^2] \le \mathrm{E}[(\hat{Q}^{kj} - f)^2] + \delta(k,j)$$
Then $\exists \delta' \in E_2(ll,\phi)$ s.t.
$$\mathrm{E}[(\hat{P}^{kj} - f)^2] \le \mathrm{E}[(\hat{Q}^{kj} - f)^2] + \delta'(k,j)$$
Proof of Theorem A.4
Denote $\hat{P} := \hat{P}_1 \times \hat{P}_2$. We have
$$\hat{P}(x_1,x_2) - (f_1 \times f_2)(x_1,x_2) = (\hat{P}_1(x_1) - f_1(x_1)) f_2(x_2) + \hat{P}_1(x_1)(\hat{P}_2(x_2) - f_2(x_2))$$
Therefore, for any $\mathbb{Q} \cap [-1,+1]$-valued (poly,log)-bischeme $\hat{Q}$
$$|\mathrm{E}[\hat{Q}(\hat{P} - f_1 \times f_2)]| \le |\mathrm{E}[\hat{Q}(x_1,x_2)(\hat{P}_1(x_1) - f_1(x_1)) f_2(x_2)]| + |\mathrm{E}[\hat{Q}(x_1,x_2) \hat{P}_1(x_1)(\hat{P}_2(x_2) - f_2(x_2))]|$$
By Lemma B.3, it is sufficient to show an appropriate bound for each of the terms on the right hand side. Suppose $\hat{G}$ is an $E^{\frac{1}{\infty}}$(log)-generator for $(f_2,\mu_2)$. For the first term, we have
$$|\mathrm{E}[\hat{Q}^{kj}(x_1,x_2)(\hat{P}_1^{kj}(x_1) - f_1(x_1)) f_2(x_2)]| \le |\mathrm{E}[\hat{Q}^{kj}(x_1, \hat{G}_1^{kj})(\hat{P}_1^{kj}(x_1) - f_1(x_1)) \hat{G}_2^{kj}]| + \delta_2(k,j)$$
where $\delta_2 \in E^{\frac{1}{\infty}}$ doesn't depend on Q. Applying Lemma B.3 for $\hat{P}_1$, we get
$$|\mathrm{E}[\hat{Q}^{kj}(x_1,x_2)(\hat{P}_1^{kj}(x_1) - f_1(x_1)) f_2(x_2)]| \le \delta_{Q,1}(k,j) + \delta_2(k,j)$$
where $\delta_{Q,1} \in E^{\alpha_1}$ for some $\alpha_1 \in \mathbb{R}_{>0}$ that doesn't depend on Q.
Suppose $\hat{S}$ is an $E^{\frac{1}{\infty}}$(log)-sampler for $\mu_1$. For the second term, we have
$$|\mathrm{E}[\hat{Q}^{kj}(x_1,x_2) \hat{P}_1(x_1)(\hat{P}_2^{kj}(x_2) - f_2(x_2))]| \le |\mathrm{E}[\hat{Q}^{kj}(\hat{S}^{kj}, x_2) \hat{P}_1(\hat{S}^{kj})(\hat{P}_2^{kj}(x_2) - f_2(x_2))]| + \delta_1(k,j)$$
where $\delta_1 \in E^{\frac{1}{\infty}}$ doesn't depend on Q. Applying Lemma B.3 for $\hat{P}_2$, we get
$$|\mathrm{E}[\hat{Q}^{kj}(x_1,x_2) \hat{P}_1(x_1)(\hat{P}_2^{kj}(x_2) - f_2(x_2))]| \le \delta_{Q,2}(k,j) + \delta_1(k,j)$$
where $\delta_{Q,2} \in E^{\alpha_2}$ for some $\alpha_2 \in \mathbb{R}_{>0}$ that doesn't depend on Q. Again, we get the required bound.
Proof of Proposition A.1
Consider a non-constant polynomial $p: \mathbb{N} \to \mathbb{N}$ and $\delta \in E_2(ll)$. Define $\delta'(k,j) := \delta(p(k),j)$. To get the desired condition for δ′ and ϕ∈Φ, consider any ϕ′∈Φ s.t. for sufficiently large k we have $\phi'(p(k)) = \phi(k)$. For any $\epsilon \in (0,1)$ we have
$$\lim_{k \to \infty} \phi'(k)^{\epsilon}\, \mathrm{E}_{\lambda^k_{\phi'}}[\delta(k,j)] = 0$$
In particular
$$\lim_{k \to \infty} \phi'(p(k))^{\epsilon}\, \mathrm{E}_{\lambda^{p(k)}_{\phi'}}[\delta(p(k),j)] = 0$$
$$\lim_{k \to \infty} \phi(k)^{\epsilon}\, \mathrm{E}_{\lambda^k_{\phi}}[\delta'(k,j)] = 0$$
Appendix C
Proof of Proposition 1
To check condition (i), consider δ1,δ2∈E.
If $\alpha > 1$, $(\delta_1^\alpha + \delta_2^\alpha)^{\frac{1}{\alpha}} \le \delta_1 + \delta_2 \in E$, hence $(\delta_1^\alpha + \delta_2^\alpha)^{\frac{1}{\alpha}} \in E$ and $\delta_1^\alpha + \delta_2^\alpha \in E^\alpha$.
If $\alpha \le 1$, $(\delta_1^\alpha + \delta_2^\alpha)^{\frac{1}{\alpha}} = 2^{\frac{1}{\alpha}} \left( \frac{\delta_1^\alpha + \delta_2^\alpha}{2} \right)^{\frac{1}{\alpha}} \le 2^{\frac{1}{\alpha}}\, \frac{\delta_1 + \delta_2}{2} \in E$, hence $(\delta_1^\alpha + \delta_2^\alpha)^{\frac{1}{\alpha}} \in E$ and $\delta_1^\alpha + \delta_2^\alpha \in E^\alpha$.
Conditions (ii) and (iii) are obvious.
Proof of Proposition 2
Consider δ∈E. We need to show that $\delta^\gamma \in E^\alpha$, i.e. that $\delta^{\frac{\gamma}{\alpha}} \in E$. But $\frac{\gamma}{\alpha} > 1$, hence $\delta^{\frac{\gamma}{\alpha}} = (\sup \delta)^{\frac{\gamma}{\alpha}} \left( \frac{\delta}{\sup \delta} \right)^{\frac{\gamma}{\alpha}} \le (\sup \delta)^{\frac{\gamma}{\alpha}}\, \frac{\delta}{\sup \delta} \in E$.
Proof of Proposition 3
Follows immediately from Lemma B.1.
Proof of Proposition 4
Suppose ^P is an E(poly,rlog)-optimal predictor. Set
$$a^{kj} := \underset{z \in \operatorname{supp} \sigma_P^{kj}}{\arg\min}\; \mathrm{E}_{\mu^k \times U^{r_P(k,j)}}[(P^{kj}(x,y,z) - f(x))^2]$$
Replacing σ by a we get the desired E(poly,log)-optimal predictor.
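In code, this derandomization step is just a minimum over the (polynomially many) advice strings in the support; some string must match or beat the σ-average, so optimality survives. A sketch of ours:

    def best_advice(expected_sq_error, support):
        # expected_sq_error(z) estimates
        # E_{mu^k x U^{r_P(k,j)}}[(P^{kj}(x, y, z) - f(x))^2];
        # selecting the minimizer realizes the argmin a^{kj} above.
        return min(support, key=expected_sq_error)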
Proposition C.1
For any ψ∈Φ,
$$\min\left( \frac{\log\log(k+3)}{\log\log(j+3)},\, 1 \right) \psi(k) \in E_2(ll)$$
In particular, this implies $\frac{1}{j+1} \in E_2(ll)$, so $E_2(ll)$ is ample.
Proof of Proposition C.1
Denote $\delta_\psi(k,j) := \min\left( \frac{\log\log(k+3)}{\log\log(j+3)},\, 1 \right) \psi(k)$. Consider ϕ∈Φ, ϵ∈(0,1). We have
$$\mathrm{E}_{\lambda^k_\phi}[\delta_\psi(k,j)] = \Pr_{\lambda^k_\phi}[j < t_{\phi^{\epsilon/2}}(k)]\, \mathrm{E}_{\lambda^k_\phi}[\delta_\psi(k,j) \mid j < t_{\phi^{\epsilon/2}}(k)] + \Pr_{\lambda^k_\phi}[j \ge t_{\phi^{\epsilon/2}}(k)]\, \mathrm{E}_{\lambda^k_\phi}[\delta_\psi(k,j) \mid j \ge t_{\phi^{\epsilon/2}}(k)]$$
$$\limsup_{k \to \infty} \phi(k)^{1-\epsilon}\, \mathrm{E}_{\lambda^k_\phi}[\delta_\psi(k,j)] \le \limsup_{k \to \infty} \phi(k)^{1-\epsilon} \left( \phi(k)^{\frac{\epsilon}{2}-1} (\sup \delta_\psi) + \sup_{j \ge t_{\phi^{\epsilon/2}}(k)} \delta_\psi(k,j) \right)$$
$$\limsup_{k \to \infty} \phi(k)^{1-\epsilon}\, \mathrm{E}_{\lambda^k_\phi}[\delta_\psi(k,j)] \le \limsup_{k \to \infty} \phi(k)^{1-\epsilon} \left( \phi(k)^{\frac{\epsilon}{2}-1} + \phi(k)^{-\frac{\epsilon}{2}} \psi(k) \right)$$
$$\limsup_{k \to \infty} \phi(k)^{1-\epsilon}\, \mathrm{E}_{\lambda^k_\phi}[\delta_\psi(k,j)] \le \limsup_{k \to \infty} \left( \phi(k)^{-\frac{\epsilon}{2}} + \phi(k)^{1-\epsilon-\frac{\epsilon}{2}} \psi(k) \right)$$
$$\lim_{k \to \infty} \phi(k)^{1-\epsilon}\, \mathrm{E}_{\lambda^k_\phi}[\delta_\psi(k,j)] = 0$$
Proof of Proposition 5
The only part that is not entirely obvious is condition (iii) for $E_2(ll)$, which follows from Proposition C.1 (since $2^{-j} \in E_2(ll)$).
Construction C.1
Fix R=(Σ,f,μ) a reflective system.
For any j∈N, denote $W_j \subseteq \{0,1\}^*$ the set of the first j words in lexicographic order. Denote $\Delta_j$ the space of probability distributions on $W_j$. Denote $\Delta_\Sigma := \prod_{n \in \Sigma,\, k,j \in \mathbb{N}} \Delta_j$. Denote $W_\Sigma := \{(n \in \Sigma, k \in \mathbb{N}, j \in \mathbb{N}, x \in W_j)\}$. Let $V_\Sigma$ be the locally convex topological vector space $W_\Sigma \to \mathbb{R}$, where the topology is the product topology.
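For concreteness, a small Python enumeration of Wj (we read "lexicographic" as the usual length-then-lexicographic order on binary words; an assumption on our part):

    from itertools import count, product

    def first_words(j):
        # W_j: the first j binary words in shortlex order:
        # '', '0', '1', '00', '01', '10', '11', '000', ...
        words = []
        for length in count(0):
            for bits in product('01', repeat=length):
                if len(words) == j:
                    return words
                words.append(''.join(bits))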
ΔΣ is a compact (by Tychonoff's theorem) convex subset of VΣ. Each ϑ∈ΔΣ can be regarded as a probability measure on AΣ.
Given $a \in A_\Sigma$, we define $\bar{a} \in \Pi_\Sigma$ by $\bar{a}^{kj}_n := (a^{kj}_n, j)$.
Given k,j∈N and n∈Σ, define ϵkjR,n:ΔΣ×Δj→[0,1] as follows
$$\epsilon^{kj}_{R,n}(\vartheta, \zeta) := \mathrm{E}_{\zeta \times \mu^k_n \times U^j \times \vartheta}[(ev_j(x,y,z) - f_n(y, \overline{\Upsilon[w]}))^2]$$
Define κR⊆ΔΣ×ΔΣ by
$$\kappa_R = \{(\vartheta_1, \vartheta_2) \mid \forall k,j \in \mathbb{N}, n \in \Sigma, \zeta \in \Delta_j: \epsilon^{kj}_{R,n}(\vartheta_1, (\vartheta_2)^{kj}_n) \le \epsilon^{kj}_{R,n}(\vartheta_1, \zeta)\}$$
Proposition C.2
κR is a Kakutani map.
Proof of Proposition C.2
ϵkjR,n is continuous in the 2nd argument and Δj is compact for any j∈N. Therefore, for any ϑ∈ΔΣ, κR(ϑ)⊆ΔΣ is non-empty by the extreme value theorem. It is compact by Tychonoff's theorem and it is obviously convex.
Given Y⊆suppμkn finite, define ϵkjR,Y,n:ΔΣ×Δj→[0,1] by
$$\epsilon^{kj}_{R,Y,n}(\vartheta, \zeta) := \sum_{y \in Y} \mu^k_n(y)\, \mathrm{E}_{\zeta \times U^j \times \vartheta}[(ev_j(x,y,z) - f_n(y, \overline{\Upsilon[w]}))^2]$$
ϵkjR,Y,n is continuous since f is continuous. Regarding the Ys as a net by set inclusion, ϵkjR,Y,n uniformly converges to ϵkjR,n, therefore ϵkjR,n is continuous.
Given n∈Σ, k,j∈N, define κkjR,n⊆ΔΣ×ΔΣ by
$$\kappa^{kj}_{R,n} = \{(\vartheta_1, \vartheta_2) \mid \forall \zeta \in \Delta_j: \epsilon^{kj}_{R,n}(\vartheta_1, (\vartheta_2)^{kj}_n) \le \epsilon^{kj}_{R,n}(\vartheta_1, \zeta)\}$$
$\kappa^{kj}_{R,n}$ is closed since $\epsilon^{kj}_{R,n}$ is continuous. Therefore $\kappa_R = \bigcap_{n \in \Sigma,\, k,j \in \mathbb{N}} \kappa^{kj}_{R,n}$ is also closed.
Proof of Theorem 1
Using Proposition C.2, we apply the Kakutani-Glicksberg-Fan theorem to get $(\sigma, \sigma) \in \kappa_R$. Define $\hat{P}_n := (\Upsilon, j, \sigma_n)$.
Consider n∈Σ, ^Q a (poly,log)-predictor. Choose p:N2→N a polynomial and {qkj∈Wp(k,j)}k,j∈N s.t. p≥rQ and
$$\forall i,k,j \in \mathbb{N}, x \in \operatorname{supp} \mu^k_n, y \in \{0,1\}^{p(k,j)+i}: ev_{p(k,j)+i}(q^{kj}, x, y) = Q^{kj}(x, y_{<r_Q(k,j)}, a^{kj}_Q)$$
By definition of κR, we have
$$\forall i,k,j \in \mathbb{N}: \epsilon^{k,p(k,j)+i}_{R,n}(\sigma, \sigma^{k,p(k,j)+i}_n) \le \epsilon^{k,p(k,j)+i}_{R,n}(\sigma, q^{kj})$$
Here we implicitly used the natural injection Wm→Δm.
$$\forall i,k,j \in \mathbb{N}: \mathrm{E}_{\sigma^{k,p(k,j)+i}_n \times \mu^k_n \times U^{p(k,j)+i} \times \sigma}[(ev_{p(k,j)+i}(x,y,z) - f_n(y, \overline{\Upsilon[w]}))^2] \le \mathrm{E}_{\mu^k_n \times U^{p(k,j)+i} \times \sigma}[(ev_{p(k,j)+i}(q^{kj}, y, z) - f_n(y, \overline{\Upsilon[w]}))^2]$$
$$\forall i,k,j \in \mathbb{N}: \mathrm{E}_{\sigma^{k,p(k,j)+i}_n \times \mu^k_n \times U^{p(k,j)+i} \times \sigma}[(P^{k,p(k,j)+i}_n(y,z,x) - f_n(y, \hat{P}[w]))^2] \le \mathrm{E}_{\mu^k_n \times U^{r_Q(k,j)} \times \sigma}[(\hat{Q}^{kj}(y,z) - f_n(y, \hat{P}[w]))^2]$$
Denoting $\hat{\sigma}^{k,p(k,j)+i}_n := \sigma^{k,p(k,j)+i}_n \times \mu^k_n \times U^{p(k,j)+i}$, the left hand side satisfies
$$\mathrm{E}_{\hat{\sigma}^{k,p(k,j)+i}_n \times \sigma}[(P^{k,p(k,j)+i}_n(y,z,x) - f_n(y, \hat{P}[w]))^2] = \mathrm{E}_{\hat{\sigma}^{k,p(k,j)+i}_n}[(P^{k,p(k,j)+i}_n(y,z,x) - \mathrm{E}_\sigma[f_n(y, \hat{P}[w])])^2] + \mathrm{Var}_\sigma[f_n(y, \hat{P}[w])]$$
Similarly, for the right hand side we have
$$\mathrm{E}_{\mu^k_n \times U^{r_Q(k,j)} \times \sigma}[(\hat{Q}^{kj}(y,z) - f_n(y, \hat{P}[w]))^2] = \mathrm{E}_{\mu^k_n \times U^{r_Q(k,j)}}[(\hat{Q}^{kj}(y,z) - \mathrm{E}_\sigma[f_n(y, \hat{P}[w])])^2] + \mathrm{Var}_\sigma[f_n(y, \hat{P}[w])]$$
Combining the two together, we get
$$\mathrm{E}_{\hat{\sigma}^{k,p(k,j)+i}_n}[(P^{k,p(k,j)+i}_n(y,z,x) - \mathrm{E}_\sigma[f_n(y, \hat{P}[w])])^2] \le \mathrm{E}_{\mu^k_n \times U^{r_Q(k,j)}}[(\hat{Q}^{kj}(y,z) - \mathrm{E}_\sigma[f_n(y, \hat{P}[w])])^2]$$
$$\mathrm{E}_{\hat{\sigma}^{k,p(k,j)+i}_n}[(\hat{P}^{k,p(k,j)+i}_n(y) - R[\hat{P}]_n(y))^2] \le \mathrm{E}_{\mu^k_n \times U^{r_Q(k,j)}}[(\hat{Q}^{kj}(y) - R[\hat{P}]_n(y))^2]$$
Applying Lemma B.4 we conclude that there is δ∈E2(ll) s.t.
$$\mathrm{E}_{\hat{\sigma}^{kj}_n}[(\hat{P}^{kj}_n(y) - R[\hat{P}]_n(y))^2] \le \mathrm{E}_{\mu^k_n \times U^{r_Q(k,j)}}[(\hat{Q}^{kj}(y) - R[\hat{P}]_n(y))^2] + \delta(k,j)$$
Proof of Proposition 6
We need to show that for every n∈Σ, $x \in \operatorname{supp} \mu_n$, $q \in \Omega_\mu$ and ϵ>0, there is a finite set $X \subseteq D_\mu$ and δ>0 s.t. for every $\tilde{q} \in \Omega_\mu$ with $\forall (n,k,j,y) \in X: |q^{kj}_n(y) - \tilde{q}^{kj}_n(y)| < \delta$ we have $|f_n(x,q) - f_n(x,\tilde{q})| < \epsilon$.
Choose k∈N s.t. $x \in \operatorname{supp} \mu^k_n$, Z⊆Σ finite and $Y \subseteq \{0,1\}^*$ finite. Define $X := \{(k, j \in \mathbb{N}, m \in Z, y \in Y) \mid j < t_{\psi_{nm}}(k)\}$. We get
$$\mathrm{E}_{\mu^k_n}[(f_n(y,q) - f_n(y,\tilde{q}))^2] \le c_n \left( \mathrm{E}_{\rho_n}\left[ \mathrm{E}_{\lambda^k_{\psi_{nm}} \times \mu^k_m}[(q^{kj}_m(y) - \tilde{q}^{kj}_m(y))^2] \right] \right)^{\alpha_n}$$
$$\mu^k_n(x)\,(f_n(x,q) - f_n(x,\tilde{q}))^2 \le c_n\,(\rho_n(\Sigma \setminus Z) + \delta^2)^{\alpha_n}$$
By choosing Z with ρn(Σ∖Z) sufficiently small and δ sufficiently small we get the desired condition.
Proposition C.3
Fix a finite set Σ and a collection {ϕn∈Φ}n∈Σ. Consider R=(f,μ) a ϕ-reflective system and two collections of (poly,rlog)-predictors $\{\hat{Q}_{1n}\}_{n \in \Sigma}$ and $\{\hat{Q}_{2n}\}_{n \in \Sigma}$. Assume that for some $\gamma \in (0,1]$, $\forall n \in \Sigma: \hat{Q}_{1n} \overset{\mu_n}{\simeq}_{E_2(ll)^\gamma} \hat{Q}_{2n}$. Then
$$\forall n \in \Sigma: \mathrm{E}_{\mu^k_n}[(R[\hat{Q}_1]_n(x) - R[\hat{Q}_2]_n(x))^2] \in E_1(\phi_n)^{\frac{1}{\infty}}$$
Proof of Proposition C.3
$$\mathrm{E}_{\mu^k_n}[(R[\hat{Q}_1]_n(x) - R[\hat{Q}_2]_n(x))^2] = \mathrm{E}_{\mu^k_n}[(\mathrm{E}_{\sigma_1}[f_n(x, ex_\mu(\hat{Q}_1[a]))] - \mathrm{E}_{\sigma_2}[f_n(x, ex_\mu(\hat{Q}_2[a]))])^2]$$
$$\mathrm{E}_{\mu^k_n}[(R[\hat{Q}_1]_n(x) - R[\hat{Q}_2]_n(x))^2] \le \mathrm{E}_{\mu^k_n \times \sigma_1 \times \sigma_2}[(f_n(x, ex_\mu(\hat{Q}_1[a_1])) - f_n(x, ex_\mu(\hat{Q}_2[a_2])))^2]$$
$$\mathrm{E}_{\mu^k_n}[(R[\hat{Q}_1]_n(x) - R[\hat{Q}_2]_n(x))^2] \le \mathrm{E}_{\sigma_1 \times \sigma_2}\left[ c_n \left( \mathrm{E}_{\rho_n}\left[ \mathrm{E}_{\lambda^k_{\psi_{nm}} \times \mu^k_m}[(\mathrm{E}_{U^{r_1(k,j)}}[Q_{1m}^{kj}(x,y,a_1)] - \mathrm{E}_{U^{r_2(k,j)}}[Q_{2m}^{kj}(x,y,a_2)])^2] \right] \right)^{\alpha_n} \right]$$
$$\mathrm{E}_{\mu^k_n}[(R[\hat{Q}_1]_n(x) - R[\hat{Q}_2]_n(x))^2] \le \mathrm{E}_{\sigma_1 \times \sigma_2}\left[ c_n \left( \mathrm{E}_{\rho_n}\left[ \mathrm{E}_{\lambda^k_{\psi_{nm}}}\left[ \mathrm{E}_{\mu^k_m \times U^{r_1(k,j)} \times U^{r_2(k,j)}}[(Q_{1m}^{kj}(x,y,a_1) - Q_{2m}^{kj}(x,y,a_2))^2] \right] \right] \right)^{\alpha_n} \right]$$
$$\mathrm{E}_{\mu^k_n}[(R[\hat{Q}_1]_n(x) - R[\hat{Q}_2]_n(x))^2] \le c_n \left( \mathrm{E}_{\rho_n}\left[ \mathrm{E}_{\lambda^k_{\psi_{nm}}}\left[ \mathrm{E}_{\mu^k_m \times U^{r_1(k,j)} \times U^{r_2(k,j)} \times \sigma_1 \times \sigma_2}[(Q_{1m}^{kj}(x,y,a_1) - Q_{2m}^{kj}(x,y,a_2))^2] \right] \right] \right)^{\alpha_n}$$
Using the similarity of $\hat{Q}_1$ and $\hat{Q}_2$ there are $\{\delta_n: \mathbb{N}^2 \to [0,1]\}_{n \in \Sigma}$ in $E_2(ll)$ s.t.
$$\mathrm{E}_{\mu^k_n}[(R[\hat{Q}_1]_n(x) - R[\hat{Q}_2]_n(x))^2] \le c_n \left( \mathrm{E}_{\rho_n}[\mathrm{E}_{\lambda^k_{\psi_{nm}}}[\delta_m(k,j)^\gamma]] \right)^{\alpha_n}$$
$$\mathrm{E}_{\mu^k_n}[(R[\hat{Q}_1]_n(x) - R[\hat{Q}_2]_n(x))^2] \le c_n \left( \mathrm{E}_{\rho_n}[\mathrm{E}_{\lambda^k_{\psi_{nm}}}[\delta_m(k,j)]] \right)^{\alpha_n \gamma}$$
$\delta_m \in E_2(ll)$ hence $\mathrm{E}_{\lambda^k_{\psi_{nm}}}[\delta_m(k,j)] \in E_1(\psi_{nm}) \subseteq E_1(\phi_n)$. This implies $\mathrm{E}_{\rho_n}[\mathrm{E}_{\lambda^k_{\psi_{nm}}}[\delta_m(k,j)]] \in E_1(\phi_n)$ and $\mathrm{E}_{\mu^k_n}[(R[\hat{Q}_1]_n(x) - R[\hat{Q}_2]_n(x))^2] \in E_1(\phi_n)^{\alpha_n \gamma}$
Definition C.1
Fix a set Σ and a collection {ϕn∈Φ}n∈Σ. Given R=(f,μ) a ϕ-reflective system, the associated reflective system $ex^{-1}R = (\Sigma, ex^{-1}f, \mu)$ is defined by
$$(ex^{-1}f_n)(x, \pi) := f_n(x, ex_\mu(\pi))$$
$ex^{-1}f_n$ is continuous thanks to Proposition 6, since $ex_\mu$ is continuous.
Proof of Theorem 2
Fix a finite set Σ and a collection {ϕn∈Φ}n∈Σ. Consider R a ϕ-reflective system. By Theorem 1, there is $\hat{R}$, an $E_2(ll)$(poly,rlog)-optimal predictor system for $ex^{-1}R$. For each n∈Σ we can use Proposition 4 to choose $\hat{P}_n$, an $E_2(ll)$(poly,log)-optimal predictor for $(ex^{-1}R)[\hat{R}]_n = R[\hat{R}]_n$. By Theorem A.7, we have $\hat{P}_n \overset{\mu_n}{\simeq}_{E_2(ll)^{1/2}} \hat{R}_n$. By Proposition C.3 this implies $\mathrm{E}_{\mu^k_n}[(R[\hat{P}]_n(x) - R[\hat{R}]_n(x))^2] \in E_1(\phi_n)^{\frac{1}{\infty}}$. This means $\hat{P}_n$ is an $E_2(ll,\phi_n)^*$(poly,log)-optimal predictor for $(R[\hat{P}]_n, \mu_n)$ and $\hat{P}$ is an $E_2(ll,\phi)^*$(poly,log)-optimal predictor system for R.