I think there's a mistake in 17: $\sin(x)$ is not a diffeomorphism between $(-\pi,\pi)$ and $(-1,1)$ (since it is e.g. not bijective between these domains). Either you mean $\sin(x/2)$ or the interval bounds should be $(-\pi/2, \pi/2)$.
There shouldn't be a negative sign here (14a).
(will edit this comment over time to collect typos as I find them)
Thanks to Jesse Hoogland and George Wang for feedback on these exercises.
In learning singular learning theory (SLT), I found it was often much easier to understand the material by working through examples rather than by trying to work through the (fairly technical) theorems in their full generality. These exercises are an attempt to collect the sorts of examples that I worked through to understand SLT.
Before doing these exercises, you should have read the Distilling Singular Learning Theory (DSLT) sequence, watched the SLT summit YouTube videos, or studied something equivalent. DSLT is a good reference to keep open while solving these problems, perhaps alongside Watanabe's textbook, the Gray Book. Note that some of these exercises cover the basics, which are well covered in the above distillations, but some cover material that will likely be new to you (because it's buried deep in a textbook, because it's only found in adjacent literature, etc.).
Exercises are presented mostly in conceptual order: later exercises freely use concepts developed in earlier exercises. Starred (*) exercises are what I consider the most essential exercises, and the ones I recommend you complete first.
Recall that the learning coefficient is a volume scaling exponent, such that $V(\epsilon) \propto \epsilon^{\lambda}$ [4] as $\epsilon \to 0$. Based on this, interpret your results. How does this make the cubically-parameterized normal model different from the ordinary normal model?
Even though the asymptotic learning coefficient ($\epsilon \to 0$) only changes when $\mu_0 = 0$ exactly, note how the non-asymptotic volume ($\epsilon$ finite) is affected in a larger neighborhood.
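To make the volume-scaling picture concrete, here is a minimal numerical sketch that estimates $\lambda$ by fitting the slope of $\log V(\epsilon)$ against $\log \epsilon$. It assumes the cubically-parameterized model means $p(x|\mu) = N(\mu^3, 1)$ with $\mu_0 = 0$, so that $K(\mu) = \tfrac{1}{2}\mu^6$, and a uniform prior on $[-1, 1]$; these modelling choices are illustrative assumptions, not taken from the exercise statement.

```python
import numpy as np

def volume(K, eps, lo=-1.0, hi=1.0, n_grid=2_000_001):
    # V(eps) = prior volume of { mu : K(mu) < eps }, with a uniform prior on [lo, hi]
    mu = np.linspace(lo, hi, n_grid)
    return np.mean(K(mu) < eps) * (hi - lo)

K_regular = lambda mu: 0.5 * mu**2   # ordinary normal model: mean = mu, mu_0 = 0
K_cubic   = lambda mu: 0.5 * mu**6   # assumed cubic parameterization: mean = mu^3

eps = np.logspace(-8, -2, 7)
for name, K in [("regular", K_regular), ("cubic", K_cubic)]:
    V = np.array([volume(K, e) for e in eps])
    slope = np.polyfit(np.log(eps), np.log(V), 1)[0]  # fitted exponent, approx. lambda
    print(f"{name}: estimated lambda = {slope:.3f}")
# Expected: roughly 1/2 for the regular model and roughly 1/6 for the cubic one.
```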
This type of model is called minimally singular, for reasons that will become clear shortly. Minimally singular models can be dealt with more easily than more general singular models.
Show that your answer from part b) matches Watanabe's formula.
For illustration, we will consider the parameterized normal model
$$p(x|\mu) = \frac{1}{\sqrt{2\pi}} \exp\left(-\frac{1}{2}\left(x - \mu(\mu-2)^2\right)^2\right)$$
for true parameter $\mu_0 = 0$.
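For reference (this is a standard computation, not part of the exercise statement): the KL divergence between two unit-variance normals with means $a$ and $b$ is $\tfrac{1}{2}(a-b)^2$, so the population loss of this model at true parameter $\mu_0 = 0$ is
$$K(\mu) = D_{\mathrm{KL}}\!\left(p(\cdot\,|\,0)\,\|\,p(\cdot\,|\,\mu)\right) = \tfrac{1}{2}\left(\mu(\mu-2)^2\right)^2 = \tfrac{1}{2}\,\mu^2(\mu-2)^4,$$
which vanishes on the set $\{0, 2\}$, not just at $\mu_0$ itself.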
Still, much like e.g. air resistance in physics, moving from idealized population quantities to realistic empirical quantities adds new complications, but much of the fundamental intuition continues to hold.
obtained as the negative log of the normalizing constant (partition function) of the tempered Bayesian posterior at inverse temperature $\beta$. Perhaps the most central result of SLT is the asymptotic expansion of the free energy:
$$F_n = n\beta S_n + \lambda \log n + O(\log\log n),$$
where the empirical entropy $S_n = -\frac{1}{n}\sum_i \log q(X_i)$ [Gray Book, Main Formula II].
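As a rough sanity check on this expansion, here is a minimal quadrature sketch for the ordinary normal model $p(x|\mu) = N(\mu, 1)$ with $\mu_0 = 0$ (so $\lambda = 1/2$), $\beta = 1$, and a uniform prior on $[-5, 5]$; the model and prior are assumptions chosen for illustration, and the free energy is computed by direct numerical integration.

```python
import numpy as np

rng = np.random.default_rng(0)
beta = 1.0
mu = np.linspace(-5, 5, 20001)      # quadrature grid over the prior's support
dmu = mu[1] - mu[0]
log_prior = np.log(1 / 10)          # uniform prior density on [-5, 5]

for n in [100, 1000, 10000]:
    X = rng.normal(0.0, 1.0, size=n)
    # empirical NLL: L_n(mu) = (1/2)log(2*pi) + (1/(2n)) sum_i (X_i - mu)^2,
    # expanded so we never form an n-by-grid matrix
    Ln = 0.5 * np.log(2 * np.pi) + 0.5 * (np.mean(X**2) - 2 * mu * np.mean(X) + mu**2)
    Sn = 0.5 * np.log(2 * np.pi) + 0.5 * np.mean(X**2)   # empirical entropy
    # F_n = -log of the integral of exp(-n*beta*L_n(mu)) * phi(mu), via log-sum-exp
    log_integrand = -n * beta * Ln + log_prior
    m = log_integrand.max()
    Fn = -(m + np.log(np.sum(np.exp(log_integrand - m)) * dmu))
    print(n, Fn - n * beta * Sn, 0.5 * np.log(n))
# The middle column should track (1/2) log n up to O(1) fluctuations.
```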
Suppose we have a (potentially nonlinear) regression model given by a map $f: W \to F$ from a parameter space $W = \mathbb{R}^d$ to a function space $F$ with outputs in $\mathbb{R}^n$, for which we use mean squared error loss.[10] We may write this as a statistical model:
$$p(y|x,w) = N\big(f(w)(x), I\big) = (2\pi)^{-n/2} \exp\left(-\tfrac{1}{2}\,\|y - f(w)(x)\|^2\right)$$
where $N$ denotes the multivariate normal distribution and $I$ is the identity matrix [See here].
Conclude that a regression model is singular at a parameter $w \in W$ if and only if there exists a nonzero vector $v \in W$ such that the directional derivative $\nabla_v f(w)(x) = 0$ for all inputs $x$ in the support of $q(x)$.
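A small numerical illustration of this criterion (the toy model $f(a,b)(x) = abx$, unit-variance noise, and $q(x) = N(0,1)$ below are assumptions made up for this sketch): for Gaussian regression models with identity covariance, the Fisher information at $w$ is $\mathbb{E}_x[J(x)^\top J(x)]$, where $J(x)$ is the Jacobian of the output with respect to the parameters, so a zero directional derivative corresponds exactly to a rank-deficient FIM.

```python
import numpy as np

rng = np.random.default_rng(0)
xs = rng.normal(size=100_000)        # inputs drawn from q(x) = N(0, 1)

def fim(a, b):
    # f(a, b)(x) = a*b*x, so the Jacobian w.r.t. (a, b) is [b*x, a*x]
    J = np.stack([b * xs, a * xs], axis=1)
    return J.T @ J / len(xs)         # Monte Carlo estimate of E_x[J^T J]

for a, b in [(1.0, 2.0), (0.0, 1.0), (0.0, 0.0)]:
    # explicit tolerance so Monte Carlo noise is not mistaken for full rank
    print((a, b), "FIM rank =", np.linalg.matrix_rank(fim(a, b), tol=1e-6))
# The rank is at most 1 everywhere (the direction v = (a, -b) satisfies
# grad_v f = 0), and drops to 0 at the origin, so this two-parameter model
# is singular at every parameter.
```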
In this example, what information is the learning coefficient giving you that is missing from the rank of the Fisher information matrix?
Let the random variable $\tilde{w}_\beta$ be a sample from the tempered posterior $\tilde{w}_\beta \sim \frac{1}{Z_n} e^{-n\beta L_n(w)} \varphi(w)$. Then $L_n(\tilde{w}_\beta)$ is a real-valued random variable giving the negative log-likelihood of a randomly sampled parameter from the tempered posterior.
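For intuition, here is a quadrature sketch of the expectation of this random variable at inverse temperature $\beta = 1/\log n$ (the WBIC temperature), for the ordinary normal model $p(x|\mu) = N(\mu, 1)$ with $\mu_0 = 0$ and a uniform prior on $[-5, 5]$; all of these modelling choices are assumptions for illustration. The result should be close to $nS_n + \lambda \log n$ with $\lambda = 1/2$.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 10_000
X = rng.normal(size=n)
beta = 1 / np.log(n)                 # the WBIC inverse temperature

mu = np.linspace(-5, 5, 20001)       # quadrature grid; uniform prior on [-5, 5]
dmu = mu[1] - mu[0]
# empirical NLL L_n(mu) for p(x|mu) = N(mu, 1), expanded to stay memory-light
Ln = 0.5 * np.log(2 * np.pi) + 0.5 * (np.mean(X**2) - 2 * mu * np.mean(X) + mu**2)

# tempered posterior density on the grid (the flat prior cancels on normalizing)
logw = -n * beta * Ln
w = np.exp(logw - logw.max())
w /= w.sum() * dmu

wbic = np.sum(w * n * Ln) * dmu      # expectation of n*L_n under the tempered posterior
Sn = 0.5 * np.log(2 * np.pi) + 0.5 * np.mean(X**2)
print(wbic, n * Sn + 0.5 * np.log(n))   # should agree up to O(sqrt(log n)) fluctuations
```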
To demonstrate this, redo 1c) and 1d) but with a different prior distribution that also has support at $\mu_0$.
References
Technically, if we follow Watanabe's terminology, a regular model is non-strictly-singular rather than non-singular. Watanabe defines singular models as a superset of regular models, so every regular model is also singular; non-regular models are referred to as strictly singular models.
Technically, the learning coefficient is not the same thing as the real log-canonical threshold (RLCT); the learning coefficient is an invariant of a statistical system (model, truth, prior triplet), whereas the RLCT is an invariant of an analytic function. However, the RLCT of the model/truth KL divergence coincides with the learning coefficient if the prior is supported along the true parameter set.
Note that in practice, analytically calculating or numerically estimating the learning coefficient directly via this volume scaling formula is completely intractable. Instead, methods based on the WBIC and MCMC are necessary.
In the case where the multiplicity $m = 1$.
This kind of two-layer linear model is sometimes called reduced rank regression.
By 14d), this is equivalent to the null space of the FIM.
Note that the Gray Book uses $L_n(w)$ to refer to the likelihood instead of the negative log-likelihood.
Note that this pointwise expected value only tells us about $L_n(w)$ for a fixed $w$; it does not give us enough information to talk about properties of $L_n(w)$ which depend on many $w$, like volume scaling, the free energy, etc. Establishing this is highly nontrivial, and the Gray Book spends a significant amount of time doing so.
Implicitly, with improper prior $\varphi(w) = 1$, but the prior isn't important for intuition here.
A neural network under mean squared error loss would satisfy these assumptions.
In some cases, the rank of the Hessian may also be used here, given the correspondence between the FIM and the Hessian (see problem 16).