In Why quantitative finance is so hard I explained why the entropy of your dataset must exceed the entropy of your hypothesis space. I used a simple hypothesis space with n equally likely hypotheses each with m tunable parameters. Real life is not usually so homogeneous.
No Tunable Parameters
Consider an inhomogeneous hypothesis space with zero tunable parameters. Instead of $H = \ln n$, which works for homogeneous hypothesis spaces, we must use a more complicated entropy equation.

$$H = -\sum_{i=1}^{n} \rho_i \ln \rho_i$$

This equation makes intuitive sense. It vanishes when one $\rho_{i=j}$ equals 1 and all the other $\rho_{i \neq j}$ equal 0. It is maximized when all the $\rho_i$ are equal at $\frac{1}{n}$; $H = \ln n$ is the maximal case, attained when $\rho_i = \frac{1}{n} \; \forall i \in \{1, \dots, n\}$.[1]
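Both extremes are easy to check numerically. A minimal sketch in plain Python (the function name is my own): a certain hypothesis contributes zero entropy, and a uniform distribution over $n$ hypotheses gives $\ln n$.

```python
import math

def entropy(rho):
    """Shannon entropy (in nats) of a discrete distribution rho."""
    # By convention 0 * ln 0 = 0, so zero-probability hypotheses drop out.
    return sum(-p * math.log(p) for p in rho if p > 0)

n = 4
uniform = [1 / n] * n           # maximum-entropy case
certain = [1.0, 0.0, 0.0, 0.0]  # one hypothesis is certain

print(entropy(uniform))  # ln(4) ≈ 1.3863
print(entropy(certain))  # 0.0
```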
With Tunable Parameters
Suppose each hypothesis $i$ has $m_i$ tunable parameters. We can plug $m_i$ into our entropy equation.

$$H = \sum_{i=1}^{n} \rho_i \left( m_i - \ln \rho_i \right)$$

Our old equation $H = m + \ln n$ is just the special case where all the $\rho_i$ are homogeneous and the $m_i$ are homogeneous too.
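As a sanity check, the homogeneous special case can be verified numerically. A minimal sketch, with a hypothetical helper name:

```python
import math

def hypothesis_space_entropy(rho, m):
    """H = sum_i rho_i * (m_i - ln rho_i) for weights rho, parameter counts m."""
    return sum(p * (mi - math.log(p)) for p, mi in zip(rho, m) if p > 0)

# Homogeneous case: n equally likely hypotheses, each with m parameters.
n, m = 8, 3
H = hypothesis_space_entropy([1 / n] * n, [m] * n)
print(abs(H - (m + math.log(n))) < 1e-12)  # True: reduces to H = m + ln n
```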
We have so far treated $m_i$ as representing each hypothesis's tunable parameters. More generally, $m_i$ represents each hypothesis's internal entropy. If we think of hypotheses as a weighted tree, $m_i$ is what you get when you iterate one level down the tree. Our variable $H$ identifies the root of the tree. Suppose the $i$th branch of the next level down is called $H_i$.

$$H = \sum_{i=1}^{n} \rho_i \left( H_i - \ln \rho_i \right)$$
We can define the entropy of the rest of the tree with a recursive equation.
$$H_\mu = \sum_{i=1}^{n} \rho_i \left( H_{\mu,i} - \ln \rho_i \right) = \sum_{i=1}^{n} \left( \rho_i H_{\mu,i} - \rho_i \ln \rho_i \right)$$

There are two parts to this equation: the recursive component $\rho_i H_{\mu,i}$ and the branching component $-\rho_i \ln \rho_i$.
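The recursion can be sketched directly as a tree walk. In this illustrative encoding (my own, not from the post), a leaf stores its internal entropy and an internal node is a list of weighted subtrees:

```python
import math

def tree_entropy(node):
    """Entropy of a weighted hypothesis tree via
    H = sum_i rho_i * (H_i - ln rho_i).

    A leaf is its internal entropy (a float); an internal node is a
    list of (rho, subtree) pairs whose weights sum to 1.
    """
    if isinstance(node, (int, float)):  # leaf: internal entropy H_i
        return float(node)
    return sum(rho * (tree_entropy(child) - math.log(rho))
               for rho, child in node if rho > 0)

# Two levels: one leaf with 2 nats of internal entropy, one even 50/50 split.
tree = [(0.5, 2.0), (0.5, [(0.5, 0.0), (0.5, 0.0)])]
print(tree_entropy(tree))  # 1 + 1.5 * ln 2 ≈ 2.04
```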
Branching component $-\rho_i \ln \rho_i$

The $-\rho_i \ln \rho_i$ component is maximized when $\rho_i = \frac{1}{e}$.

$$-\rho_i \ln \rho_i = -\frac{1}{e} \ln \frac{1}{e} = \frac{1}{e} \ln e = \frac{1}{e}$$

The branching component tops out at $\frac{1}{e}$. It can never contribute a massive quantity of entropy to your hypothesis space because it is limited to $\frac{1}{e}$ entropy per level of the tree.

$$0 \leq -\rho_i \ln \rho_i \leq \frac{1}{e}$$
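A quick numerical scan confirms the bound (the helper name is hypothetical):

```python
import math

def branching_term(rho):
    """The branching component -rho * ln(rho)."""
    return -rho * math.log(rho)

# Scan the open interval (0, 1); the peak sits at rho = 1/e.
rhos = [k / 10000 for k in range(1, 10000)]
best = max(rhos, key=branching_term)
print(best)                  # ≈ 0.3679 ≈ 1/e
print(branching_term(best))  # ≈ 0.3679, i.e. the 1/e cap per tree level
```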
The branching factor is mostly unimportant. The bulk of our entropy comes from the recursive component.
Recursive component $\rho_i H_{\mu,i}$

Fix $\rho_i$ at a positive value. There is no limit to how big $H_{\mu,i}$ can become. You can make it arbitrarily large just by adding parameters. Consequently $\rho_i H_{\mu,i}$ can become arbitrarily large too. In real-world situations we should expect the recursive components of our hypothesis space to dominate the branching components.

If $\rho_i$ vanishes then the recursive component disappears. This might explain why human minds like to round "extremely unlikely" $\epsilon > \rho_i > 0$ to "impossible" $\rho_i = 0$ when $H_{\mu,i}$ is large. It removes lots of entropy from our hypothesis space while still being right almost all of the time. This may be related to synaptic pruning.
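A toy example of this pruning effect, with made-up numbers: one branch has probability 0.01 but 500 nats of internal entropy, so it dominates the total; rounding it to impossible collapses the entropy while costing almost no probability mass.

```python
import math

def hypothesis_space_entropy(rho, H_sub):
    """H = sum_i rho_i * (H_i - ln rho_i), skipping zero-weight branches."""
    return sum(p * (h - math.log(p)) for p, h in zip(rho, H_sub) if p > 0)

# A rare but internally very complex branch dominates the total entropy.
rho   = [0.99, 0.01]
H_sub = [1.0, 500.0]  # the 1%-likely branch carries 500 nats internally

full   = hypothesis_space_entropy(rho, H_sub)
pruned = hypothesis_space_entropy([1.0, 0.0], H_sub)  # round 0.01 down to 0

print(full)    # ≈ 6.05 nats, mostly from 0.01 * 500
print(pruned)  # 1.0 nat
```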
Lessons for Hypothesis Space Design
Once again, we have confirmed that having hypotheses with lots of parameters is a worse problem than having lots of hypotheses to choose between. More generally, one or more hypotheses with exceptionally high entropy dominate the total entropy of your hypothesis space. If you want better priors, then the first step of your optimization should be to eliminate these complex subtrees from your hypothesis space.
[1] Proof: $H = -\sum_{i=1}^{n} \rho_i \ln \rho_i = -\sum_{i=1}^{n} \frac{1}{n} \ln \frac{1}{n} = -\ln \frac{1}{n} = -\ln\left(n^{-1}\right) = \ln n$