
Leon Lang, 2y

Thanks for the answer, Liam! I especially liked the further context on the connection between Bayesian posteriors and SGD. Below are a few more comments on some of your answers:

The partition function is equal to the model evidence $Z_{n} = p (D_{n})$ , yep. It isn’t equal to $p ((Y_{i}) | (X_{i})),$ (I assume $i$ is fixed here?) but is instead expressed in terms of the model likelihood and prior (and can simply be thought of as the “normalising constant” of the posterior),
$p (D_{n}) = \int_{W} φ (w) \prod_{i = 1}^{n} p (y_{i}, x_{i} | w) d w$

and then under this supervised learning setup where we know $q (x_{i})$ , we have $p (y_{i}, x_{i} | w) = p (y_{i} | x_{i}, w) q (x_{i})$ . Also note that this does “factor over $i$ ” (if I’m interpreting you correctly) since the data is independent and identically distributed.

I think I still disagree. I think everything in these formulas needs to be conditioned on the $X$ -part of the dataset. In particular, I think the notation $p (D_{n})$ is slightly misleading, but maybe I'm missing something here.

I'll walk you through my reasoning: When I write $(X_{i})$ or $(Y_{i})$ , I mean the whole vectors, e.g., $(X_{i})_{i = 1, \dots, n}$ . Then I think the posterior computation works as follows:

p (w ∣ D_{n}) = p (w ∣ (Y_{i}), (X_{i})) = \frac{p ((Y_{i}) ∣ (X_{i}), w) \cdot p (w ∣ (X_{i}))}{p ((Y_{i}) ∣ (X_{i}))} .

That is just Bayes rule, conditioned on $(X_{i})$ in every term. Then, $p (w ∣ (X_{i})) = φ (w)$ because from $X$ alone you don't get any new information about the conditional $q (Y ∣ X)$ (A more formal way to see this is to write down the Bayesian network of the model and to see that $w$ and $X_{i}$ are d-separated). Also, conditioned on $w$ , $p$ is independent over data points, and so we obtain

p (w ∣ D_{n}) = \frac{1}{p ((Y_{i}) ∣ (X_{i}))} \cdot e^{- n L_{n} (w)} \cdot φ (w) .

So, comparing with your equations, we must have $Z_{n} = p ((Y_{i}) ∣ (X_{i})) .$ Do you think this is correct?

Btw., I still don't think this "factors over $i$ ". I think that

$Z_{n} \neq \prod_{i = 1}^{n} p (Y_{i} ∣ X_{i}) .$

The reason is that old data points should inform the parameter $w$ , which should have an influence on future updates. I think the independence assumption only holds for the true distribution and the model conditioned on $w$ .
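The non-factorisation can be checked on a toy case. The sketch below is purely illustrative and makes assumptions that are not part of the setup above: it drops the inputs $X_{i}$ entirely, uses a two-point parameter space with a uniform prior, and takes a Bernoulli model $p (y = 1 ∣ w) = w$ .

```python
# Toy check that the evidence does not factor over data points.
# Assumptions (hypothetical, for illustration only): no inputs x, a two-point
# parameter space, a uniform prior, and a Bernoulli model p(y=1 | w) = w.
PRIOR = {0.2: 0.5, 0.8: 0.5}


def likelihood(y, w):
    """Bernoulli likelihood p(y | w) for y in {0, 1}."""
    return w if y == 1 else 1.0 - w


def evidence(ys):
    """Z_n = sum_w prior(w) * prod_i p(y_i | w)."""
    total = 0.0
    for w, prior_w in PRIOR.items():
        term = prior_w
        for y in ys:
            term *= likelihood(y, w)
        total += term
    return total


ys = [1, 1]
joint = evidence(ys)                      # evidence of the whole dataset
product = evidence([1]) * evidence([1])   # product of single-point evidences
print(joint, product)
```

The two numbers differ because seeing the first data point shifts the posterior over $w$ , which changes the predictive probability of the second.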

If you expand that term out you find that
$\int_{W} (w - w_{0})^{T} \frac{\partial φ}{\partial w} \Big|_{w = w_{0}} \exp (- (w - w_{0})^{T} I (w_{0}) (w - w_{0})) d w = \frac{\partial φ}{\partial w} \Big|_{w = w_{0}} \int_{W} (w - w_{0})^{T} \exp (- (w - w_{0})^{T} I (w_{0}) (w - w_{0})) d w = 0$
because the second integral is the first central moment of a Gaussian. The derivative of the prior is irrelevant.

Right, that makes sense, thank you! (I think you missed a factor of $n / 2$ , but that doesn't change the conclusion.)

Thanks also for the corrected volume formula, it makes sense now :)


Distilling Singular Learning Theory

DSLT 1. The RLCT Measures the Effective Dimension of Neural Networks

by Liam Carroll

16th Jun 2023

AI Alignment Forum

16 min read


TLDR; This is the first post of Distilling Singular Learning Theory (DSLT), an introduction to which can be read at DSLT0. In this post I explain how singular models (like neural networks) differ from regular ones (like linear regression), give examples of singular loss landscapes, and then explain why the Real Log Canonical Threshold (aka the learning coefficient) is the correct measure of effective dimension in singular models.

When a model class is singular (like neural networks), the complexity of a parameter in parameter space $W \subset R^{d}$ needs a new interpretation. Instead of being defined by the total parameters available to the model $d$ , the complexity (or effective dimensionality) of $w$ is defined by a positive rational $λ \in Q_{> 0}$ called the Real Log Canonical Threshold (RLCT), also known as the learning coefficient. The geometry of the loss $K (w)$ is fundamentally defined by the singularity structure of its minima, which $λ$ measures. Moreover, in regular models like linear regression the RLCT is $λ = \frac{d}{2}$ , but in singular models it satisfies $λ \leq \frac{d}{2}$ in general. At its core, then, Sumio Watanabe's Singular Learning Theory (SLT) shows the following key insight:

The RLCT $λ \in Q_{> 0}$ is the correct measure of effective dimensionality of a model $w \in W$ .

Watanabe shows that the RLCT $λ$ has strong effects on the learning process: it is the correct generalisation of model complexity in the Bayesian Information Criterion for singular models, and therefore plays a central role in the asymptotic generalisation error, thereby inheriting the name "learning coefficient".

In this first post, after outlining the Bayesian setup of SLT, we will start by defining what a singular model is and explain what makes them fundamentally different to regular models. After examining different examples of singular $K (w)$ loss landscapes, we will define the RLCT to be the scaling exponent of the volume integral of nearly true parameters, and conclude by summarising how this quantity correctly generalises dimensionality.

Preliminaries of SLT

The following section introduces some necessary technical terminology, so use it as a reference point, not necessarily something to cram into your head on a first read through. A more thorough setup can be found in [Car21, Chapter 2], which follows [Wat09] and [Wat18].

SLT is established in the Bayesian paradigm, where the Bayesian posterior on the parameter space $W$ is the primary object of focus, containing information on which parameters $w \in W$ correspond to "good" models.

Our statistical learning setup consists of the following data:

A dataset $D_{n} = \{(X_{1}, Y_{1}), \dots, (X_{n}, Y_{n})\}$ , where for $i = 1, \dots, n$ each $X_{i} \in R^{N}$ is an input and $Y_{i} \in R^{M}$ is an output (so we are in the supervised learning setting).
We suppose the sequence in $D_{n}$ is independent and identically distributed according to a true distribution $q (y, x) = q (y | x) q (x)$ . For our purposes, we assume the true distribution of inputs $q (x)$ to be known, but the true distribution of outputs $q (y | x)$ to be unknown.
We then choose a model class $p (y | x, w)$ defined by parameters $w$ in a compact parameter space $W \subseteq R^{d}$ that contains the origin. We hope to find model parameters $w$ that will adequately approximate the truth, or in other words, learn how to accurately predict an output given an input. For example, a model class could be a fixed neural network architecture with Gaussian noise, as below.
We can select a prior distribution $φ (w)$ of our choosing^[1] that is non-zero on $W$ , so $φ (w) > 0$ .

Using this data, the error of the model $w$ on the dataset $D_{n}$ is defined by the empirical negative log likelihood (NLL), $L_{n} (w)$ ,

L_{n} (w) = - \frac{1}{n} \sum_{i = 1}^{n} \log p (y_{i} | x_{i}, w),

where $e^{- n L_{n} (w)} = \prod_{i = 1}^{n} p (y_{i} | x_{i}, w) = p (D_{n} | w)$ is the model likelihood. ^[2] ^[3]

This gives rise to the Bayesian posterior of $w$ defined by ^[4]

p (w | D_{n}) := \frac{1}{Z_{n}} φ (w) e^{- n L_{n} (w)}

where the partition function (or in Bayesian terms the evidence) is given by

Z_{n} = \int_{W} φ (w) e^{- n L_{n} (w)} d w .

The partition function $Z_{n}$ measures posterior density, and thus contains a lot of macroscopic data about a system. Inspired by its role in physics, and for theoretical ease, we consider the free energy

F_{n} = - log Z_{n} .

Performing asymptotic analysis on $Z_{n}$ (and therefore $F_{n}$ ) is the main task of SLT. The learning goal is to find small regions of parameter space with high posterior density, and therefore low free energy.
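To make these objects concrete, here is a minimal numerical sketch of $L_{n}$ , $Z_{n}$ and $F_{n}$ . Everything specific in it is an assumption made for illustration: a one-parameter model $f (x, w) = w x$ with unit-variance Gaussian noise, a uniform prior on $W = [- 2, 2]$ , synthetic data generated with $w = 0.5$ , and a midpoint rule standing in for the integral over $W$ .

```python
import math
import random


def nll(w, xs, ys):
    """Empirical NLL L_n(w) for the unit-variance Gaussian model with f(x, w) = w * x."""
    n = len(xs)
    sq = sum((y - w * x) ** 2 for x, y in zip(xs, ys))
    return 0.5 * math.log(2 * math.pi) + 0.5 * sq / n


def free_energy(xs, ys, w_min=-2.0, w_max=2.0, grid=2000):
    """F_n = -log Z_n, with Z_n approximated by a midpoint rule on [w_min, w_max]."""
    n = len(xs)
    h = (w_max - w_min) / grid
    prior = 1.0 / (w_max - w_min)  # uniform prior density on W
    # Log of each summand; combined with log-sum-exp to avoid underflow.
    logs = [math.log(prior * h) - n * nll(w_min + (i + 0.5) * h, xs, ys)
            for i in range(grid)]
    top = max(logs)
    log_z = top + math.log(sum(math.exp(v - top) for v in logs))
    return -log_z


rng = random.Random(0)
xs = [rng.uniform(-1.0, 1.0) for _ in range(50)]
ys = [0.5 * x + rng.gauss(0.0, 1.0) for x in xs]  # synthetic truth: w = 0.5
F_n = free_energy(xs, ys)
print(F_n)
```

Since the prior integrates to one, $Z_{n} \leq e^{- n \min_{w} L_{n} (w)}$ , so the free energy is always at least $n \min_{w} L_{n} (w)$ ; the gap between the two is the complexity term that the rest of the post is about.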

Though we never have access to the truth in the learning procedure, for theoretical purposes we nonetheless define the empirical entropy of the true distribution

S_{n} := - \frac{1}{n} \sum_{i = 1}^{n} \log q (y_{i} | x_{i}) .

Even though this quantity is always inaccessible in real settings, there is almost sure convergence $S_{n} \to S$ as $n \to \infty$ to a constant $S$ that doesn't depend on $n$ , ^[5]

S = E_{X} [- log q (y | x)] = - \iint_{R^{N + M}} q (y, x) log q (y | x) d x d y,

To study the posterior, we define the Kullback-Leibler divergence $K (w)$ between the truth and the model,

K (w) := \iint_{R^{N + M}} q (y | x) q (x) log \frac{q (y | x)}{p (y | x, w)} d x d y,

which is the infinite-data limit of its empirical counterpart,

K_{n} (w) := \frac{1}{n} \sum_{i = 1}^{n} \log \frac{q (y_{i} | x_{i})}{p (y_{i} | x_{i}, w)} = L_{n} (w) - S_{n} .

The KL divergence is usually thought of as a "loss metric"^[6] between the truth and the model since

$K (w) \geq 0$ for all $w \in W$ , and;
$K (w) = 0$ if and only if $p (y | x, w) = q (y | x)$ for all $x \in R^{N}$ and all $y \in R^{M}$ .

As such, I will colloquially refer to $K (w)$ as the loss landscape. ^[7] The current state of results in SLT require $K (w)$ to be an analytic function, but it seems likely that the results can be generalised to non-analytic settings with suitable hypotheses and constraints.

To analyse where the loss $K (w)$ is minimised, we are then led to defining the set of true parameters,

W_{0} = \{w \in W | K (w) = 0\} = \{w \in W | p (y | x, w) = q (y | x)\} .

We say that the true distribution $q (y | x)$ is realisable by the model class $p (y | x, w)$ if $W_{0}$ is non-empty, that is, there exists some $w^{(0)} \in W$ such that $q (y | x) = p (y | x, w^{(0)})$ for all $x, y$ . Despite being unrealistic in real world applications, this is nonetheless an important starting point to the theory, which will then generalise to the set of optimal parameters in DSLT2.

We are going to restrict our attention to a particular kind of model: neural networks with Gaussian noise. We will formally define a neural network $f (x, w)$ in a following chapter of this sequence, but for now it suffices to say that it is a function $f : R^{N} \times W \to R^{M}$ with $N$ inputs and $M$ outputs defined by some parameter $w \in W$ . Then our model density is going to be given by

p (y | x, w) = \frac{1}{(2 π)^{\frac{M}{2}}} exp (- \frac{1}{2} ∥ y - f (x, w) ∥^{2}) .

From here on in, we will assume we are working with a (model, truth, prior) triple $(p (y | x, w), q (y | x), φ (w))$ as specified in this section.

Loss in our setting

To put these technical quantities into perspective, let me make clear two key points:

Under the regression model, the NLL is equivalent to the mean-squared error of the neural network $f (x, w)$ on the dataset $D_{n}$ (up to a constant),

L_{n} (w) = \frac{M}{2} \log 2 π + \frac{1}{n} \sum_{i = 1}^{n} \frac{1}{2} ∥ y_{i} - f (x_{i}, w) ∥^{2} .

In the realisable case where $q (y | x) = p (y | x, w^{(0)})$ , the KL divergence is half the squared Euclidean distance between the model and the truth, averaged over the input distribution $q (x)$ ,

K (w) = \frac{1}{2} \int_{R^{N}} ∥ f (x, w) - f (x, w^{(0)}) ∥^{2} q (x) d x .
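For completeness, here is a sketch of why this second point holds, assuming the unit-variance Gaussian model above and writing $f_{0} (x) = f (x, w^{(0)})$ (a shorthand introduced only for this derivation). The normalising constants cancel in the log ratio, leaving

```latex
\begin{aligned}
\log \frac{q(y|x)}{p(y|x,w)}
  &= \tfrac{1}{2}\|y - f(x,w)\|^{2} - \tfrac{1}{2}\|y - f_{0}(x)\|^{2} \\
  &= \tfrac{1}{2}\|f_{0}(x) - f(x,w)\|^{2} + (y - f_{0}(x))^{T}\,(f_{0}(x) - f(x,w)).
\end{aligned}
```

Under $q (y | x)$ the output $y$ has mean $f_{0} (x)$ , so the cross term vanishes when integrating over $y$ , and integrating what remains against $q (x)$ gives the stated formula.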

Singular vs Regular Models

What is a singular model?

The key quantity that distinguishes regular and singular models is the Fisher information matrix $I (w)$ , whose entries are defined by

I_{j, k} (w) = \iint_{R^{N + M}} (\frac{\partial}{\partial w_{j}} log p (y | x, w)) (\frac{\partial}{\partial w_{k}} log p (y | x, w)) p (y | x, w) q (x) d x d y .

It can be shown that when evaluated at a point on the set of true parameters $w^{(0)} \in W_{0}$ , the Fisher information matrix $I (w)$ is simply the Hessian of $K (w)$ , so

I_{j, k} (w^{(0)}) = \frac{\partial^{2}}{\partial w_{j} \partial w_{k}} K (w) {∣ ∣ ∣}_{w = w^{(0)}} .

A regular statistical model class is one which is identifiable (so $p (y | x, w_{1}) = p (y | x, w_{2})$ implies that $w_{1} = w_{2}$ ), and has positive definite Fisher information matrix $I (w)$ for all $w \in W$ . Regular model classes, such as standard linear regression, are the backbone of classical statistics for which all pre-existing literature on Bayesian statistics applies. But, from the point of view of SLT, regular model classes are... boring.

If a model class is not regular, then it is strictly singular. The non-identifiability condition can be easily dealt with, but it is the degeneracy of the Fisher information matrix that fundamentally changes the nature of the posterior and its asymptotics. We will say a model defined by a fixed $w^{(0)} \in W$ (not necessarily a true parameter) is strictly singular if the Fisher information at the point, $I (w^{(0)})$ , is degenerate, meaning

$\operatorname{rank} (I (w^{(0)})) < d$ , where $d$ is the number of dimensions in parameter space $W \subset R^{d}$ , or equivalently;
$\det I (w^{(0)}) = 0$ .

Then the model class is strictly singular if there exists a $w^{(0)} \in W$ such that $I (w^{(0)})$ is degenerate. A singular model class can be either regular or strictly singular - Watanabe's theory thus generalises regular models, regardless of the model non-identifiability or degenerate Fisher information.

It can be easily shown that, under the regression model, $I (w^{(0)})$ is degenerate if and only if the set of derivatives

\{\frac{\partial}{\partial w_{j}} f (x, w)\}_{j = 1}^{d}

is linearly dependent.
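As a hedged illustration of this criterion, the sketch below computes the Fisher information for a hypothetical one-neuron model $f (x, w) = w_{1} \tanh (w_{2} x)$ with inputs uniform on $[- 1, 1]$ (both choices are assumptions made here, not part of the post). It uses the fact that, for the unit-variance Gaussian regression model, $I_{j, k} (w) = \int \frac{\partial f}{\partial w_{j}} \frac{\partial f}{\partial w_{k}} q (x) d x$ , and approximates the integral with a midpoint rule.

```python
import math


def grad_f(x, w):
    """Gradient of the hypothetical one-neuron model f(x, w) = w1 * tanh(w2 * x)."""
    w1, w2 = w
    t = math.tanh(w2 * x)
    return (t, w1 * x * (1.0 - t * t))


def fisher(w, grid=2000):
    """Fisher matrix I_jk(w) = E_x[df/dw_j * df/dw_k] for x ~ Uniform(-1, 1).

    This identity holds for the unit-variance Gaussian regression model;
    the expectation is approximated with a midpoint rule.
    """
    h = 2.0 / grid
    mat = [[0.0, 0.0], [0.0, 0.0]]
    for i in range(grid):
        x = -1.0 + (i + 0.5) * h
        g = grad_f(x, w)
        for j in range(2):
            for k in range(2):
                mat[j][k] += g[j] * g[k] * 0.5 * h  # q(x) = 1/2 on [-1, 1]
    return mat


def det2(m):
    """Determinant of a 2x2 matrix given as nested lists."""
    return m[0][0] * m[1][1] - m[0][1] * m[1][0]


print(det2(fisher((1.0, 1.0))))  # generic point: derivatives are independent
print(det2(fisher((0.0, 1.0))))  # w1 = 0: the derivative in w2 vanishes identically
```

At $w_{1} = 0$ the derivative with respect to $w_{2}$ is the zero function, so the two derivatives are linearly dependent and the determinant is zero, whereas at a generic point it is strictly positive.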

In regular models, the set of true parameters $W_{0}$ consists of one point. But in singular models, the degeneracy of the Fisher information matrix means $W_{0}$ is not restricted to being one point, or even a set of isolated points - in general, these local minima of $K (w)$ are connected together in high-dimensional structures ^[8]. In strictly singular models, the true parameters are degenerate singularities ^[9] of $K (w)$ , and thus $K (w)$ cannot be approximated by a quadratic form near these points. This is the fundamental reason the classical theory of Bayesian statistics breaks down.

Figure: In singular models (left), $W_{0}$ can be a curve, but in regular models (right), $W_{0}$ is a point.

Watanabe states that "in singular statistical models, the knowledge or grammar to be discovered corresponds to singularities in general" [Wat09]. With this in mind, it is unsurprising that the following widely used models are all examples of singular models:

Layered neural networks
Gaussian, binomial, multinomial and other mixture models
Reduced rank regression
Boltzmann machines
Bayes networks
Hidden Markov models

Singular models are characterised by features like: having hierarchical structure, being made of superposition of parametric functions, containing hidden variables, etc., all in the service of obtaining hidden knowledge from random samples.

Classical Bayesian inference breaks down for singular models

There are two key properties of regular models that are critical to Bayesian inference as $n \to \infty$ :

Asymptotic normality: The posterior of regular models converges in distribution to a $d$ -dimensional normal distribution centred at the maximum likelihood estimator $w^{(0)}$ [Vaa07]:

p (w | D_{n}) \to N_{d} (w^{(0)}, \frac{1}{n} I (w^{(0)})^{- 1}) .

Bayesian Information Criterion (BIC): The free energy of regular models asymptotically looks like the BIC as $n \to \infty$ , where $L_{n} (w^{(0)}) = {min}_{w \in W} L_{n} (w)$ and $d$ is the dimension of parameter space $W \subseteq R^{d}$ :

F_{n} \approx n L_{n} (w^{(0)}) + \frac{d}{2} \log n = \mathrm{BIC} .

At the core of both of these results is an asymptotic expansion that strongly depends on the Fisher information matrix $I (w)$ being non-degenerate at true parameters $w^{(0)} \in W_{0}$ . It's instructive to see why this is, so let's derive the BIC to see where $I (w)$ shows up.

Deriving the Bayesian Information Criterion only works for regular models

For the sake of this calculation, let us assume $W = R^{d}$ . Taking our cues from [Kon08], suppose $w^{(0)} \in W_{0}$ (thus is a maximum likelihood estimator and satisfies $L_{n} (w^{(0)}) = {min}_{w \in W} L_{n} (w)$ ). We can Taylor expand the NLL as

L_{n} (w) = L_{n} (w^{(0)}) + (w - w^{(0)})^{T} \frac{\partial L_{n} (w)}{\partial w} \Big|_{w = w^{(0)}} + \frac{1}{2} (w - w^{(0)})^{T} J (w^{(0)}) (w - w^{(0)}) + \dots

where $J (w^{(0)}) = \frac{\partial^{2} L_{n} (w)}{\partial w \partial w^{T}} \Big|_{w = w^{(0)}}$ is the Hessian. Since we are analysing the asymptotic limit $n \to \infty$ , we can relate this Hessian to the Fisher information matrix,

J (w^{(0)}) = \frac{\partial^{2} L_{n} (w)}{\partial w \partial w^{T}} \Big|_{w = w^{(0)}} = \frac{\partial^{2} (K_{n} (w) + S_{n})}{\partial w \partial w^{T}} \Big|_{w = w^{(0)}} \approx \frac{\partial^{2} K (w)}{\partial w \partial w^{T}} \Big|_{w = w^{(0)}} = I (w^{(0)}) \quad \text{as } n \to \infty .

By definition $w^{(0)}$ is a minimum of $L_{n} (w)$ , so $\frac{\partial L_{n} (w)}{\partial w} \Big|_{w = w^{(0)}} = 0$ , so we can expand the partition function as

Z_{n} = \int_{W} e^{- n L_{n} (w)} φ (w) d w = \int_{W} \exp (- n L_{n} (w^{(0)}) - \frac{n}{2} (w - w^{(0)})^{T} I (w^{(0)}) (w - w^{(0)}) + \dots) \times [φ (w^{(0)}) + (w - w^{(0)})^{T} \frac{\partial φ (w)}{\partial w} \Big|_{w = w^{(0)}} + \dots] d w .

Here's the crux: if $I (w^{(0)})$ is non-degenerate (so the model is regular), then we can perform this integral in good faith knowing that it will always exist. In that case, the second term involving $\frac{\partial φ (w)}{\partial w}$ vanishes since it is the first central moment of a normal distribution, so we have

\begin{matrix} Z_{n} & \approx exp (- n L_{n} (w^{(0)})) φ (w^{(0)}) \int_{W} exp (- \frac{n}{2} (w - w^{(0)})^{T} I (w^{(0)}) (w - w^{(0)})) d w = \frac{exp (- n L_{n} (w^{(0)})) φ (w^{(0)}) (2 π)^{\frac{d}{2}}}{n^{\frac{d}{2}} \sqrt{det I (w^{(0)})}} \end{matrix}

since the integral is that of an unnormalised $d$ -dimensional multivariate Gaussian density $N_{d} (w^{(0)}, \frac{1}{n} I (w^{(0)})^{- 1})$ . Notice here that this is the same distribution that arises in the asymptotic normality result, a theorem that has the same core, but requires more rigorous probability theory to prove. If $I (w^{(0)})$ is degenerate, then it is non-invertible, meaning the above formulas cannot hold.

The free energy of this ensemble is thus

F_{n} = - \log Z_{n} = n L_{n} (w^{(0)}) + \frac{d}{2} \log n - \log φ (w^{(0)}) - \frac{d}{2} \log 2 π + \frac{1}{2} \log \det I (w^{(0)}),

and so ignoring the terms of order $O (1)$ in $n$ , we arrive at the Bayesian Information Criterion

\mathrm{BIC} = n L_{n} (w^{(0)}) + \frac{d}{2} \log n .

This quantity can be understood as an accuracy-complexity tradeoff, where the complexity of the model class is defined by $d$ . We will elaborate on this more in DSLT2 but for now, you should just believe that the Fisher information $I (w)$ is a big deal. Generalising this procedure (and therefore the BIC) for singular models, is the heart of SLT.
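The Laplace approximation behind this derivation can be sanity-checked numerically. The sketch below is an illustration under assumptions of my choosing: a deterministic stand-in for the NLL, $L (w) = \cosh (w) - 1$ (minimum $0$ and Hessian $1$ at $w = 0$ , so $d = 1$ and $\det I = 1$ ), a uniform prior on $W = [- 2, 2]$ rather than $W = R$ , and a midpoint rule for the integral. It compares the numerically integrated free energy with the full Laplace formula, including the $O (1)$ terms.

```python
import math


def free_energy(n, loss, prior, w_min, w_max, grid=40000):
    """-log of the midpoint-rule approximation to Z_n = int prior(w) exp(-n loss(w)) dw."""
    h = (w_max - w_min) / grid
    z = 0.0
    for i in range(grid):
        w = w_min + (i + 0.5) * h
        z += prior(w) * math.exp(-n * loss(w)) * h
    return -math.log(z)


def loss(w):
    """Hypothetical regular 'NLL' with minimum L(0) = 0 and Hessian 1 at w = 0."""
    return math.cosh(w) - 1.0


def prior(w):
    """Uniform prior density on W = [-2, 2]."""
    return 0.25


n = 1000
exact = free_energy(n, loss, prior, -2.0, 2.0)
# Laplace formula with d = 1, L_n(w0) = 0 and det I(w0) = 1:
laplace = 0.5 * math.log(n) - math.log(0.25) - 0.5 * math.log(2 * math.pi)
print(exact, laplace)
```

The two values agree closely at this $n$ because the quartic and higher terms of the expansion only contribute corrections that shrink as $n$ grows. The same check fails for a loss like $w^{4}$ , where the Hessian vanishes and the Gaussian integral does not exist.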

Examples of Singular Loss Landscapes

In essence, the Fisher information matrix $I (w)$ describes something about the effective dimensionality or complexity of a model $w$ . When a model class is regular, the effective dimensionality of every point is simply $d$ , the number of parameters available to the model. But in the singular case, a new notion of effective dimensionality is required to adequately describe the complexity of a model. We're now going to look at two cases of singular models ^[10] - or more precisely, loss landscapes that correspond to singular models - to motivate this generalisation. We'll start with the easier case where one or more parameters are genuinely "free".

Sometimes singularities are just free parameters

Example 1.1: Suppose we have $d = 2$ parameters afforded to a model such that $K (w_{1}, w_{2}) = w_{1}^{2}$ , which has a Hessian given by

J (w) = \frac{\partial^{2}}{\partial w^{T} \partial w} K (w) = \begin{pmatrix} 2 & 0 \\ 0 & 0 \end{pmatrix} .

Taking the critical point $w^{(0)} = (0, 0)$ , we have $I (w^{(0)}) = J (w^{(0)}) = J (w)$ and so $det I (w^{(0)}) = 0$ , thus the model is singular. In this case, since $K (0, w_{2}) = 0$ for all $w_{2}$ , we could simply throw out the free parameter $w_{2}$ and define a regular model with $d_{1} = 1$ parameters that has identical geometry $K (w_{1}) = w_{1}^{2}$ , and therefore defines the same input-output function, $f (x, (w_{1}, w_{2})) = f (x, w_{1})$ .

This example is called a minimally singular case. Suppose $W \subseteq R^{d}$ with integers $d_{1}, d_{2} > 0$ such that $d_{1} + d_{2} = d$ , and after some change of basis^[11] we may write a local expansion of $K (w)$ as the sum of $d_{1}$ squares,

K (w) = \sum_{i = 1}^{d_{1}} c_{i} w_{i}^{2},

where $c_{1}, c_{2}, \dots, c_{d_{1}} > 0$ are positive coefficients. Then the Fisher information matrix has the form

I (w^{(0)}) = \begin{pmatrix} c_{1} & \dots & 0 & 0 \\ \vdots & \ddots & \vdots & \vdots \\ 0 & \dots & c_{d_{1}} & 0 \\ 0 & 0 & 0 & 0_{d_{2}} \end{pmatrix}

where $0_{d_{2}}$ is the square $d_{2} \times d_{2}$ zero matrix. Perhaps then we could define the "effective dimensionality" of $w^{(0)}$ as being $\operatorname{rank} (I (w^{(0)})) = d_{1}$ , which is the number of directions in parameter space in which the model changes - the number of "non-free" parameters - and just discard the $d_{2}$ "free" parameters that are tangent to $W_{0} .$

Sure! We can do that, and if we did, our BIC derivation would carry out fine and we would just replace $d$ by $d_{1}$ in the final formula. So the minimally singular case is easy to handle.

But this doesn't always work.

But not all singularities are free parameters

Defining the effective dimensionality at $w^{(0)}$ as $\operatorname{rank} (I (w^{(0)}))$ seems nice in theory, but turns out to give nonsensical answers pretty quickly - it is not a full enough description of the actual geometry at play.

Example 1.2: Suppose instead that $K (w_{1}, w_{2}) = \frac{1}{2} w_{1}^{2} w_{2}^{2}$ . Then the Hessian is

J (w) = \begin{pmatrix} w_{2}^{2} & 2 w_{1} w_{2} \\ 2 w_{1} w_{2} & w_{1}^{2} \end{pmatrix} .

At the critical point $w^{(0)} = (0, 0)$ the Fisher information is

I (w^{(0)}) = J (w^{(0)}) = \begin{pmatrix} 0 & 0 \\ 0 & 0 \end{pmatrix},

which is obviously degenerate.

Zero effective dimensionality: The KL divergence for Example 1.2 with $\operatorname{rank} (I (w^{(0)})) = 0$ looks like the intersection of multiple flat valleys.

If we used our notion of effective dimensionality from before, we would say the model defined by $w^{(0)}$ had effective dimension of $\operatorname{rank} (I (w^{(0)})) = 0$ . But this would be ridiculous - clearly there are more than zero "effective dimensions" in this model, a term that would intuitively imply $K (w)$ was identically zero, which it clearly is not. Thus, we need a different way of thinking about effective dimensionality.

The Real Log Canonical Threshold (aka the Learning Coefficient)

In this section we are going to explain the key claim of this post: that effective dimensionality in singular models is measured by a positive rational number called the Real Log Canonical Threshold, also known as the learning coefficient.

Dimensionality as a volume co-dimension

Taking inspiration from Weyl's famous Volume of Tubes paper, we can reframe dimensionality in terms of a scaling exponent of the volume of "nearly" true parameters. To explain this, we will generalise the minimally singular case above. The following discussion follows [Wei, 22].

Assume we have a partition as before with $d_{1}, d_{2} \in N_{\geq 0}$ such that $d_{1} + d_{2} = d$ , where $d_{1}$ is the number of non-free parameters and $d_{2}$ is the number of free parameters. For any $ε > 0$ we can consider the set of almost true parameters centred at $w^{(0)}$ (which, without loss of generality, we will take to be $w^{(0)} = 0$ ),

W_{ε} = \{w \in W | K (w) < ε\}

and an associated volume function

V (ε) = \int_{W_{ε}} φ (w) d w .

Volume integral — $V (ε)$ for $K (w) = \frac{1}{2} w_{1}^{2} w_{2}^{2}$ for different $ε$ level sets.

As long as the prior $φ (w)$ is non-zero on $W_{0}$ it does not affect the relevant features of the volume, so we may assume that it is a constant $C$ in the first $d_{1}$ directions and is a normal distribution in the remaining $d_{2}$ . Then since $K (w) \approx \sum_{i = 1}^{d_{1}} c_{i} w_{i}^{2}$ , we can write

V (ε) = \int_{{w \in W | \sum_{i = 1}^{d_{1}} c_{i} w_{i}^{2} < ε}} C d w_{1} \dots d w_{d_{1}} \int_{R^{d_{2}}} e^{- \frac{1}{2} (w_{d_{1} + 1}^{2} + \dots + w_{d}^{2})} d w_{d_{1} + 1} \dots d w_{d} .

The right integral is some constant $A$ that doesn't depend on $ε$ , and for the left we can make the substitution $u_{i} = \sqrt{\frac{c_{i}}{ε}} w_{i}$ , hence

V (ε) = A C \int_{{u \in U | \sum_{i = 1}^{d_{1}} u_{i}^{2} < 1}} \sqrt{\frac{ε}{c_{1}}} \dots \sqrt{\frac{ε}{c_{d_{1}}}} d u_{1} \dots d u_{d_{1}} .

Recognising the integral as the volume of the $d_{1}$ -ball, a constant $B$ that does not depend on $ε$ , we see that

V (ε) \propto \frac{ε^{\frac{d_{1}}{2}}}{\sqrt{c_{1} \dots c_{d_{1}}}} .

Then the dimension $d_{1}$ arises as the scaling exponent of $ε^{\frac{1}{2}}$ , which can be extracted via the following ratio of volumes formula for some $a \in (0, 1)$ :

d_{1} = 2 \lim_{ε \to 0} \frac{\log (V (a ε) / V (ε))}{\log a} .

This scaling exponent, it turns out, is the correct way to think about dimensionality of singularities.
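The ratio-of-volumes formula can be tried out directly on the minimally singular Example 1.1. The following is a rough sketch under assumptions made here for convenience: a uniform prior on $W = [- 1, 1]^{2}$ , a midpoint grid in place of the integral, and a small but finite $ε$ instead of the limit.

```python
import math


def volume(eps, kl, grid1=2000, grid2=50):
    """Midpoint-rule estimate of V(eps) on W = [-1, 1]^2 with a uniform prior."""
    h1, h2 = 2.0 / grid1, 2.0 / grid2
    count = 0
    for i in range(grid1):
        w1 = -1.0 + (i + 0.5) * h1
        for j in range(grid2):
            w2 = -1.0 + (j + 0.5) * h2
            if kl(w1, w2) < eps:
                count += 1
    return count * h1 * h2 / 4.0  # the prior density is 1/4


def dimension_estimate(kl, eps=0.01, a=0.5):
    """2 * log(V(a * eps) / V(eps)) / log(a) at a small but finite eps."""
    return 2.0 * math.log(volume(a * eps, kl) / volume(eps, kl)) / math.log(a)


d1_hat = dimension_estimate(lambda w1, w2: w1 ** 2)  # Example 1.1: K = w1^2
print(d1_hat)
```

The estimate comes out close to $d_{1} = 1$ rather than $d = 2$ : the free direction $w_{2}$ contributes a factor to $V (ε)$ that does not depend on $ε$ , so it drops out of the ratio.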

Watanabe shows in Theorem 7.1 of [Wat09] that in general, for any singular model defined by $w^{(0)}$ , the volume integral centred at $w^{(0)}$ has the form

V (ε) = c ε^{λ} + o (ε^{λ})

where $λ \in Q_{> 0}$ is a positive rational number called the Real Log Canonical Threshold (RLCT) associated to the "most singular point" in $W_{0}$ . This is the quantity that generalises the dimensionality of a singularity. What's more, different singularities in $W_{0}$ can have different RLCT values, and thereby different "effective dimensionalities". As suggested above, the RLCT can then be defined by this volume formula:

λ = \lim_{ε \to 0} \frac{\log (V (a ε) / V (ε))}{\log a} .
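For Example 1.2, where the rank of the Fisher information suggested "zero dimensions", the scaling exponent of $V (ε)$ can be computed in closed form. The sketch below assumes a uniform prior on $W = [- 1, 1]^{2}$ (my choice); in that case the set $K (w) < ε$ is $| w_{1} w_{2} | < c$ with $c = \sqrt{2 ε}$ , and a direct integration gives $V (ε) = c (1 - \log c)$ for $c < 1$ .

```python
import math


def volume_example_1_2(eps):
    """Exact V(eps) for K = w1^2 * w2^2 / 2 on [-1, 1]^2 with a uniform prior.

    {K < eps} is {|w1 * w2| < c} with c = sqrt(2 * eps); for c < 1 its
    prior mass is c * (1 - log(c)).
    """
    c = math.sqrt(2.0 * eps)
    assert c < 1.0
    return c * (1.0 - math.log(c))


def exponent_estimate(eps, a=0.5):
    """log(V(a * eps) / V(eps)) / log(a): the local scaling exponent of V."""
    return math.log(volume_example_1_2(a * eps) / volume_example_1_2(eps)) / math.log(a)


for eps in (1e-4, 1e-8, 1e-16, 1e-32):
    print(eps, exponent_estimate(eps))
```

The exponent creeps up towards $\frac{1}{2}$ as $ε$ shrinks, so the scaling exponent is strictly positive even though the rank is zero. The slow convergence comes from the logarithmic factor in $V (ε)$ , which is related to the multiplicity discussed later in the post.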

An example of fractional dimension

Example 1.3: To build intuition for what a "fractional dimension" is, consider a model with $d = 1$ parameters with KL divergence given by $K (w) = w^{4}$ , which is singular since $\frac{\partial^{2} K}{\partial w^{2}} {∣ ∣}_{w = 0} = 0$ . A simple calculation shows that for this KL divergence,

V (ε) \propto ε^{\frac{1}{4}}

meaning $λ = \frac{1}{4}$ and so the "effective dimensionality" is $2 λ = \frac{1}{2}$ .

Meanwhile, in the $K (w) = w^{2}$ case, $V (ε) \propto ε^{\frac{1}{2}}$ , so the effective dimensionality is 1.

Effective dimension from a volume integral — Comparing the effective dimensionality as the scaling exponent of $ε$ for different models.
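The "simple calculation" for $K (w) = w^{4}$ can be written out as follows, assuming for concreteness a uniform prior $φ (w) = \frac{1}{2}$ on $W = [- 1, 1]$ and $0 < ε < 1$ (assumptions made here, not required by the general theory):

```latex
V(\epsilon) = \int_{\{w \,:\, w^{4} < \epsilon\}} \tfrac{1}{2}\, dw
            = \tfrac{1}{2} \int_{-\epsilon^{1/4}}^{\epsilon^{1/4}} dw
            = \epsilon^{\frac{1}{4}},
\qquad
\frac{\log\left(V(a\epsilon)/V(\epsilon)\right)}{\log a}
  = \frac{\tfrac{1}{4}\log a}{\log a} = \tfrac{1}{4}.
```

Replacing $w^{4}$ by $w^{2}$ changes the integration limits to $\pm ε^{1 / 2}$ and the exponent to $\frac{1}{2}$ .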

The RLCT can be read off when $K (w)$ is in normal crossing form

I may have presented the previous section suggesting that the RLCT is trivial to calculate. In general, this couldn't be further from the truth. But in one special case, it is. For this discussion we will ignore the prior, i.e. we will set it to be uniform on $W$ .

One dimensional case

As we just saw in Example 1.3, in the one dimensional case where $K (w) = w^{2 k}$ for some $k \geq 1$ , the RLCT is simply $λ = \frac{1}{2 k}$ . In fact, if we can express $K (w)$ in the form

K (w) = (w - c_{1})^{2 k_{1}} \dots (w - c_{J})^{2 k_{J}}

for positive integers $k_{1}, \dots, k_{J}$ and distinct $c_{1}, \dots, c_{J} \in R$ , then the RLCT associated to each singularity $c_{j}$ is simply $λ_{j} = \frac{1}{2 k_{j}}$ . But, Watanabe shows that it is the smallest local RLCT (and thus the highest exponent in $K (w)$ ) that dominates the free energy, thus defining the global RLCT $λ$ where

λ = \min_{j = 1, \dots, J} (\frac{1}{2 k_{j}}) .

Example 1.4 This example is going to be very relevant in DSLT2. If we have

K (w) = (w + 1)^{2} w^{4},

with true parameters $w_{- 1}^{(0)} = - 1$ and $w_{0}^{(0)} = 0$ , then the local RLCT associated to each singularity is

λ_{- 1} = \frac{1}{2} \quad \text{and} \quad λ_{0} = \frac{1}{4} .

The global RLCT is thus $λ = λ_{0} .$
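A quick numerical look at which zero of $K (w) = (w + 1)^{2} w^{4}$ dominates: the zeros are at $w = - 1$ (order 2) and $w = 0$ (order 4). The sketch below is only a heuristic illustration with simplifying assumptions of mine: a flat prior, $K$ used in place of $L_{n}$ , and the integral of $e^{- n K (w)}$ split into a window of radius $0.5$ around each zero.

```python
import math


def kl(w):
    """KL divergence of Example 1.4."""
    return (w + 1.0) ** 2 * w ** 4


def local_mass(n, centre, radius=0.5, grid=20000):
    """Midpoint-rule estimate of int exp(-n K(w)) dw over [centre - radius, centre + radius]."""
    h = 2.0 * radius / grid
    start = centre - radius
    return sum(math.exp(-n * kl(start + (i + 0.5) * h)) for i in range(grid)) * h


for n in (10**2, 10**4, 10**6):
    near_minus_one = local_mass(n, -1.0)  # zero of order 2, scales like n^(-1/2)
    near_zero = local_mass(n, 0.0)        # zero of order 4, scales like n^(-1/4)
    print(n, near_zero / near_minus_one)
```

The ratio grows with $n$ , roughly like $n^{1 / 4}$ , which is the sense in which the zero with the smaller local RLCT takes over the integral.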

Multidimensional case

Suppose now that $d > 1$ so $K (w) = K (w_{1}, \dots, w_{d})$ . Suppose without loss of generality that $w^{(0)} = 0$ is a true parameter for $K (w)$ . If we can write the KL divergence in normal crossing form near $w^{(0)}$ ,

K (w) = w_{1}^{2 k_{1}} \dots w_{d}^{2 k_{d}}

then the RLCT is given by

λ = \min_{j = 1, \dots, d} (\frac{1}{2 k_{j}}) .

The multiplicity $m_{j}$ of each coordinate is the number of elements in ${k_{1}, \dots, k_{d}}$ that equal $k_{j} .$

This generalises the above case in the following sense:

Example 1.5 Suppose now that we have a two dimensional KL divergence of the form

K (w_{1}, w_{2}) = (w_{1} + 1)^{2} w_{1}^{4} w_{2}^{2} .

Then, in a neighbourhood of the singularity $w_{0}^{(0)} = (0, 0)$ , the KL divergence is approximately

K (w) \propto w_{1}^{4} w_{2}^{2} .

Thus, the RLCT associated to $w_{0}^{(0)}$ is

λ_{0} = \frac{1}{4},

with multiplicity $m_{0} = 1.$

On the other hand, near the singularity $w_{- 1}^{(0)} = (- 1, 0)$ the KL divergence is, up to a prefactor, approximately

K (w) \approx (w_{1} + 1)^{2} w_{2}^{2}

so the RLCT associated to $w_{- 1}^{(0)}$ is

λ_{- 1} = \frac{1}{2}

with multiplicity $m_{- 1} = 2$ . So, in this case the global RLCT is $λ = λ_{0}$ , which we will see in DSLT2 means that the posterior is most concentrated around the singularity $w_{0}^{(0)}$ .

Resolution of Singularities

In Algebraic Geometry and Statistical Learning Theory, Watanabe shows that algebraic geometry plays a central role in governing the behaviour of statistical models, and a highly non-trivial one in singular models especially. This rich connection between these two deep mathematical fields is, in my eyes, both profound and extremely beautiful.

The remarkable insight of Watanabe is that in fact any KL divergence, under appropriate hypotheses (such as analyticity), can be written in normal crossing form near a singularity of $K (w)$ . To do so, he invokes one of the fundamental theorems of algebraic geometry: Hironaka's Resolution of Singularities. The content of this theorem and its implications go well beyond the scope of this sequence. But, I will briefly mention its role in the theory as it relates to the RLCT. For a more detailed introduction to this part of the story, see [Wat09, Section 1.4].

The theorem guarantees the existence of a $d$ -dimensional analytic manifold $M$ and a real analytic map

g : M ∋ u \mapsto w \in W

such that for each local coordinate chart $M_{α}$ of $M$ one can write

\begin{aligned} K (g (u)) &= u_{1}^{2 k_{1}} \dots u_{d}^{2 k_{d}} \quad \text{and} \\ φ (g (u)) | g^{'} (u) | &= ϕ (u) | u_{1}^{h_{1}} \dots u_{d}^{h_{d}} | \end{aligned}

where each $k_{1}, \dots, k_{d}$ and $h_{1}, \dots, h_{d}$ are non-negative integers, $| g^{'} (u) |$ is the Jacobian determinant of $w = g (u)$ and $ϕ (u) > 0$ is a real analytic function. The global RLCT is then defined by

λ = \min_{α} \min_{j = 1, \dots, d} (\frac{h_{j} + 1}{2 k_{j}}),

and the global multiplicity is the maximum multiplicity over $α$ .
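Once the exponents are known, reading off the RLCT is mechanical. Here is a minimal sketch; the function name `rlct_from_chart` is hypothetical, and I am assuming that the multiplicity within a chart is the number of coordinates attaining the minimum (which matches the counting rule above when every $h_{j} = 0$ ). It is applied to the two local approximations from Example 1.5.

```python
from fractions import Fraction


def rlct_from_chart(ks, hs=None):
    """RLCT and multiplicity for one chart with K = prod u_j^(2 k_j).

    hs are the Jacobian/prior exponents h_j (all zero if omitted).
    Coordinates with k_j = 0 do not contribute.
    """
    hs = hs if hs is not None else [0] * len(ks)
    ratios = [Fraction(h + 1, 2 * k) for k, h in zip(ks, hs) if k > 0]
    lam = min(ratios)
    return lam, ratios.count(lam)


# Example 1.5, using the local approximations near each singularity.
print(rlct_from_chart([2, 1]))  # near (0, 0):  K ~ w1^4 w2^2
print(rlct_from_chart([1, 1]))  # near (-1, 0): K ~ (w1 + 1)^2 w2^2
```

Exact fractions are used so that ties between coordinates are detected reliably. The hard part in practice is finding the charts and exponents in the first place, which this sketch does not attempt.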

From this point on in the sequence, when you see the word "desingularise", what you should think is "put $K (w)$ into normal crossing form near a singularity".

The RLCT measures the effective dimensionality of a model

Succinctly, the RLCT $λ \in Q_{> 0}$ of a singularity $w^{(0)} \in W_{0} \subseteq W \subseteq R^{d}$ generalises the idea of dimension because:

If a model defined by $w^{(0)}$ is regular, then

λ = \frac{d}{2} .

If a model defined by $w^{(0)}$ is minimally singular where $d_{1} < d$ is the number of non-free parameters, then

λ = \frac{d_{1}}{2} .

In general, for any singular model the RLCT satisfies (by Theorem 7.2 of [Wat09])

λ \leq \frac{d}{2} .

In particular, if there are $d_{1} < d$ non-free parameters then

λ \leq \frac{d_{1}}{2} .

In order to find the asymptotic form of the free energy $F_{n}$ as $n \to \infty$ , Watanabe desingularises $K (w)$ near each singularity using the Resolution of Singularities. The RLCT then directly substitutes into the place of $\frac{d}{2}$ in the BIC formula, which gives rise to the Widely Applicable Bayesian Information Criterion (WBIC)

$\mathrm{WBIC} := n L_{n} (w^{(0)}) + λ \log n .$
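To get a feel for the size of the penalty term, here is a minimal sketch that evaluates the formula above for a regular model ($λ = \frac{d}{2}$) and for a singular one with smaller $λ$. The function name `wbic` and all numerical values are hypothetical and chosen only for illustration.

```python
import math

def wbic(n, nll_at_w0, lam):
    """n * L_n(w0) + lambda * log(n), the formula stated above."""
    return n * nll_at_w0 + lam * math.log(n)

n, d = 10_000, 10
loss = 0.5                            # hypothetical value of L_n(w0), same for both models
regular = wbic(n, loss, d / 2)        # lambda = d/2 recovers the BIC penalty
singular = wbic(n, loss, 1.5)         # hypothetical singular model with lambda = 3/2
print(regular - singular)             # equals (d/2 - 3/2) * log(n)
```

With equal accuracy, the two values differ by $(\frac{d}{2} - λ) \log n$, so the model with the smaller RLCT has the lower value.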

In DSLT2, we will explain the implications of the WBIC and what it tells us about the profound differences between regular and singular models.

Appendix 1 - The other definition of the RLCT

In this post we have defined the RLCT as the scaling exponent of the volume integral of nearly true parameters. This result, whilst the most intuitive, is presented in the reverse order to how Watanabe originally defines the RLCT in [Wat09]. Alternatively, we can consider the zeta function

ζ (z) = \int_{W} K (w)^{z} φ (w) d w,

and show that it has a Laurent series given by

$ζ (z) = ζ_{0} (z) + \sum_{k = 1}^{\infty} \sum_{m = 1}^{m_{k}} \frac{c_{k m}}{(z + λ_{k})^{m}}$

where $ζ_{0} (z)$ is a holomorphic function, $c_{k m} \in \mathbb{C}$ are coefficients, the $λ_{k} \in \mathbb{Q}_{> 0}$ are ordered such that $0 < λ_{1} < λ_{2} < \dots$ , and $m_{k} \in \mathbb{N}$ is the largest order of the pole at $z = - λ_{k}$ .

Then the Real Log Canonical Threshold of our (model, truth, prior) triple is $λ = λ_{1}$ with multiplicity $m = m_{1}$ .

This $ζ (z)$ is a key piece of machinery in using distribution theory to expand the partition function $Z_{n}$ . In the end, the smallest $λ_{1}$ and its multiplicity $m_{1}$ are the dominant terms in the expansion, and a further calculation in [Wat09, Theorem 7.1] shows how $V (ε) \propto ε^{λ}$ .

To see why $ζ (z)$ is necessary, and why this definition of the RLCT matters to the free energy formula proof, see the sketch of the proof in [Wat09, pg31-34].
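For a concrete feel, consider a hypothetical one-dimensional example that is not taken from [Wat09]: $K (w) = w^{2 k}$ on $W = [- 1, 1]$ with uniform prior $φ (w) = \frac{1}{2}$ . Then $ζ (z) = \int_{0}^{1} w^{2 k z} d w = \frac{1}{2 k z + 1}$ for $\mathrm{Re} (z) > - \frac{1}{2 k}$ , which continues analytically to a function with a single simple pole at $z = - \frac{1}{2 k}$ , so $λ = \frac{1}{2 k}$ with multiplicity 1. The sketch below only checks the closed form numerically at a point where the integral converges; the function names are made up for this example.

```python
from fractions import Fraction

def zeta_closed_form(z, k):
    """Closed form of the zeta function for K(w) = w^(2k), uniform prior on [-1, 1]."""
    return 1.0 / (2 * k * z + 1)

def zeta_midpoint(z, k, cells=20_000):
    """Midpoint rule for the integral of w^(2kz) over [0, 1].

    By symmetry this equals the integral of K(w)^z * (1/2) over [-1, 1].
    """
    h = 1.0 / cells
    return sum(((i + 0.5) * h) ** (2 * k * z) for i in range(cells)) * h

for k in (1, 2):
    # Compare quadrature and closed form at z = 1, inside the region of convergence.
    assert abs(zeta_midpoint(1.0, k) - zeta_closed_form(1.0, k)) < 1e-6
    # The pole of the closed form sits at z = -1/(2k), i.e. lambda = 1/(2k).
    print(k, Fraction(1, 2 * k))
```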

References

[Car21] - Liam Carroll, Phase Transitions in Neural Networks (thesis)

[Wat09] - Sumio Watanabe, Algebraic Geometry and Statistical Learning Theory (book)

[Wat18] - Sumio Watanabe, Mathematical Theory of Bayesian Statistics (book)

[KK08] - Konishi, Kitagawa, Information Criteria and Statistical Modelling (book)

[Vaa07] - van der Vaart, Asymptotic Statistics (book)

[Wei22] - Susan Wei, Daniel Murfet et al., Deep Learning is Singular, and That's Good (paper)

^[1]
In the finite $n$ case, the choice of prior $φ (w)$ is a philosophical matter, as well as a mathematical tractability matter. But as $n \to \infty$ , most results in Bayesian statistics show $φ (w)$ to be irrelevant so long as it satisfies some reasonable conditions. This is also true in SLT.
^[2]
This should remind you of the Gibbs ensemble from statistical physics - not coincidentally, either.
^[3]
For theoretical, philosophical, and computational purposes, we also define the tempered posterior to be
$p^{β} (w | D_{n}) := \frac{1}{Z_{n}^{β}} φ (w) e^{- n β L_{n} (w)}, \quad \text{where} \quad Z_{n}^{β} = \int_{W} φ (w) e^{- n β L_{n} (w)} d w ,$
and $β > 0$ is the inverse temperature. This $β$ plays an important role in deriving the free energy formula and can be thought of as controlling the "skinniness" of the posterior. In our regression model below, it is actually the inverse variance of the Gaussian noise.
^[4]
By Bayes' rule we have $p (w | D_{n}) = \frac{p (D_{n} | w) φ (w)}{p (D_{n})}$ . The form written here follows from some simplification of terms and redefinitions, see page 10 of the thesis.
^[5]
We can define an expectation over the dataset $D_{n}$ for some function $g (X, Y)$ as
$\mathbb{E}_{X} [g (X, Y)] = \iint_{\mathbb{R}^{N + M}} g (x, y) q (y, x) \, d x \, d y .$
In particular, we define the entropy of the true conditional distribution to be
$S = \mathbb{E}_{X} [- \log q (y | x)] = - \iint_{\mathbb{R}^{N + M}} q (y, x) \log q (y | x) \, d x \, d y,$
and the (non-empirical) negative log loss to be
$L (w) = \mathbb{E}_{X} [- \log p (y | x, w)] = - \iint_{\mathbb{R}^{N + M}} q (y, x) \log p (y | x, w) \, d x \, d y .$
It is easy to show that $\mathbb{E}_{X} [S_{n}] = S$ and $\mathbb{E}_{X} [L_{n} (w)] = L (w)$ , and so by the law of large numbers there is almost sure convergence $S_{n} \to S$ and $L_{n} (w) \to L (w)$ . Analogous definitions show
$K_{n} (w) = L_{n} (w) - S_{n} \to L (w) - S = K (w) .$
^[6]
Though it isn't a true metric due to its asymmetry in $p$ and $q$ , and since it doesn't satisfy the triangle inequality.
^[7]
Note here that since $K (w) = L (w) - S$ , we can reasonably call both $K (w)$ and $L (w)$ the loss landscape since they differ only by a constant $S$ (as $n \to \infty$ ).
^[8]
Or more precisely, a real analytic set.
^[9]
Since $K (w^{(0)}) = 0$ , and $\nabla K (w^{(0)}) = 0$ , and $I (w^{(0)})$ is degenerate.
^[10]
Based on $K (w) = \frac{1}{2} \int_{\mathbb{R}^{N}} ∥ f (x, w) - f (x, w^{(0)}) ∥^{2} q (x) d x$ , it is relatively easy to reconstruct a model that genuinely yields a given $K (w)$ function, so we may happily pretend we have said model when we pull such a loss function from thin air.
^[11]
Which is guaranteed to exist since the Hessian is a real symmetric matrix (and thus so is $I (w^{(0)})$ ), so it can be diagonalised.


DSLT 1. The RLCT Measures the Effective Dimension of Neural Networks


[-]Leon Lang2y30

Thank you for this wonderful article! I read it fairly carefully and have a number of questions and comments.

where the partition function (or in Bayesian terms the evidence) is given by

Should I think of this as being equal to $p ((Y_{i}) | (X_{i}))$ , and would you call this quantity $p (D_{n})$ ? I was a bit confused since it seems like we're not interested in the data likelihood, but only the conditional data likelihood under model $p$ .

And to be clear: This does not factorize over $i$ because every data point informs $w$ and thereby the next data point, correct?

The learning goal is to find small regions of parameter space with high posterior density, and therefore low free energy.

But the free energy does not depend on the parameter, so how should I interpret this claim? Are you already one step ahead and thinking about the singular case where the loss landscape decomposes into different "phases" with their own free energy?

there is almost sure convergence $S_{n} \to S$ as $n \to \infty$ to a constant $S$ that doesn't depend on $n$ , ^[5]
$S = \mathbb{E}_{X} [- \log q (y | x)] = - \iint_{\mathbb{R}^{N + M}} q (y, x) \log q (y | x) \, d x \, d y,$

I think the first expression should either be an expectation over $X Y$ , or have the conditional entropy $H (Y | x)$ within the parentheses.

In the realisable case where $q (y | x) = p (y | x, w^{(0)})$ , the KL divergence is just the Euclidean distance between the model and the truth adjusted for the prior measure on inputs,
$K (w) = \frac{1}{2} \int_{\mathbb{R}^{N}} ∥ f (x, w) - f (x, w^{(0)}) ∥^{2} q (x) d x .$

I briefly tried showing this and somehow failed. I didn't quite manage to get rid of the integral over $y$ . Is this simple? (You don't need to show me how it's done, but maybe mentioning the key idea could be useful)

A regular statistical model class is one which is identifiable (so $p (y | x, w_{1}) = p (y | x, w_{2})$ implies that $w_{1} = w_{2}$ ), and has positive definite Fisher information matrix $I (w)$ for all $w \in W$ .

The rest of the article seems to mainly focus on the case of the Fisher information matrix. In particular, you didn't show an example of a non-regular model where the Fisher information matrix is positive definite everywhere.

Is it correct to assume models which are merely non-regular because the map from parameters to distributions is non-injective aren't that interesting, and so you maybe don't even want to call them singular? I found this slightly ambiguous, also because under your definitions further down, it seems like "singular" (degenerate Fisher information matrix) is a stronger condition than "strictly singular" (degenerate Fisher information matrix OR non-injective map from parameters to distributions).

It can be easily shown that, under the regression model, $I (w^{(0)})$ is degenerate if and only if the set of derivatives
$\left\{ \frac{\partial}{\partial w_{j}} f (x, w) \right\}_{j = 1}^{d}$

is linearly dependent.

What is $x$ in this formula? Is it fixed? Or do we average the derivatives over the input distribution?

Since every true parameter is a degenerate singularity^[9] of $K (w)$ , it cannot be approximated by a quadratic form.

Hhm, I thought having a singular model just means that some singularities are degenerate.

One unrelated conceptual question: when I see people draw singularities in the loss landscape, for example in Jesse's post, they often "look singular": i.e., the set of minimal points in the loss landscape crosses itself. However, this doesn't seem to actually be the case: a perfectly smooth curve of loss-minimizing points will consist of singularities because in the direction of the curve, the derivative does not change. Is this correct?

We can Taylor expand the NLL as
$L_{n} (w) = L_{n} (w^{(0)}) + (w - w^{(0)})^{T} \frac{\partial L_{n} (w)}{\partial w} + \frac{1}{2} (w - w^{(0)})^{T} J (w^{(0)}) (w - w^{(0)}) + \dots$

I think you forgot a $|_{w = w_{0}}$ in the term of degree 1.

In that case, the second term involving $\frac{\partial φ (w)}{\partial w}$ vanishes since it is the first central moment of a normal distribution

Could you explain why that is? I may have missed some assumption on $φ (w)$ or not paid attention to something.

In this case, since $K (0, w_{2}) = 0$ for all $w_{2}$ , we could simply throw out the free parameter $w_{2}$ and define a regular model with $d_{1} = 1$ parameters that has identical geometry $K (w_{1}) = w_{1}^{2}$ , and therefore defines the same input-output function, $f (x, (w_{1}, w_{2})) = f (x, w_{1})$ .

Hhm. Is the claim that if the loss of the function does not change along some curve in the parameter space, then the function itself remains invariant? Why is that?

Then the dimension $d_{1}$ arises as the scaling exponent of $ε^{\frac{1}{2}}$ , which can be extracted via the following ratio of volumes formula for some $a \in (0, 1)$ :
$d_{1} = 2 \lim_{ε \to 0} \frac{\log (V (a ε)) / \log (V (ε))}{\log a} .$

This scaling exponent, it turns out, is the correct way to think about dimensionality of singularities.

Are you sure this is the correct formula? When I tried computing this by hand it resulted in $2 / log (a)$ , but maybe I made a mistake.

General unrelated question: is the following a good intuition for the correspondence of the volume with the effective number of parameters around a singularity? The larger the number of effective parameters $λ$ around $w_{0}$ , the more $K (w)$ blows up around $w_{0}$ in all directions because we get variation in all directions, and so the smaller the region where $K (w)$ is below $ϵ$ . So $λ$ contributes to this volume. This is in fact what it does in the formulas, by being an exponent for small $ϵ$ .

So, in this case the global RLCT is $λ = λ_{0}$ , which we will see in DSLT2 means that the posterior is most concentrated around the singularity $w_{0}^{(0)}$ .

Do you currently expect that gradient descent will do something similar, where the parameters will move toward singularities with low RLCT? What's the state of the theory regarding this? (If this is answered in later posts, feel free to just refer to them)

Also, I wonder whether this could be studied experimentally even if the theory is not yet ready: one could probably measure the RLCT around minimal loss points by measuring volumes, and then just check whether gradient descent actually ends up in low-RLCT regions. Maybe this is what you do in later posts. If this is the case, I wonder whether I should be surprised or not: it seems like the lower the RLCT, the larger the number of (fractional) directions where the loss is minimal, and so the larger the basin. So for purely statistical reasons, one may end up in such a region instead of isolated loss-minimizing points of high RLCT.

[-]Liam Carroll2y60

Thanks for the comment Leon! Indeed, in writing a post like this, there are always tradeoffs in which pieces of technicality to dive into and which to leave sufficiently vague so as to not distract from the main points. But these are all absolutely fair questions so I will do my best to answer them (and make some clarifying edits to the post, too). In general I would refer you to my thesis where the setup is more rigorously explained.

Should I think of this as being equal to $p ((Y_{i}) | (X_{i}))$ , and would you call this quantity $p (D_{n})$ ? I was a bit confused since it seems like we're not interested in the data likelihood, but only the conditional data likelihood under model $p$ .

The partition function is equal to the model evidence $Z_{n} = p (D_{n})$ , yep. It isn’t equal to $p ((Y_{i}) | (X_{i})),$ (I assume $i$ is fixed here?) but is instead expressed in terms of the model likelihood and prior (and can simply be thought of as the “normalising constant” of the posterior),

$p (D_{n}) = \int_{W} φ (w) \prod_{i = 1}^{n} p (y_{i}, x_{i} | w) \, d w$

and then under this supervised learning setup where we know $q (x_{i})$ , we have $p (y_{i}, x_{i} | w) = p (y_{i} | x_{i}, w) q (x_{i})$ . Also note that this does “factor over $i$ ” (if I’m interpreting you correctly) since the data is independent and identically distributed.

But the free energy does not depend on the parameter, so how should I interpret this claim? Are you already one step ahead and thinking about the singular case where the loss landscape decomposes into different "phases" with their own free energy?

Yep, you caught me - I was one step ahead. The free energy over the whole space $W$ is still a very useful quantity as it tells you “how good” the best model in the model class is. But $F_{n}$ by itself doesn’t tell you much about what else is going on in the loss landscape. For that, you need to localise to smaller regions and analyse their phase structure, as presented in DSLT2.

I think the first expression should either be an expectation over $X Y$ , or have the conditional entropy $H (Y | x)$ within the parentheses.

Ah, yes, you are right - this is a notational hangover from my thesis where I defined $E_{X}$ to be equal to expectation with respect to the true distribution $q (y, x)$ . (Things get a little bit sloppy when you have this known $q (x)$ floating around everywhere - you eventually just make a few calls on how to write the cleanest notation, but I agree that in the context of this post it’s a little confusing so I apologise).

I briefly tried showing this and somehow failed. I didn't quite manage to get rid of the integral over $y$ . Is this simple? (You don't need to show me how it's done, but maybe mentioning the key idea could be useful)

See Lemma A.2 in my thesis. One uses a fairly standard argument involving the first central moment of a Gaussian.

The rest of the article seems to mainly focus on the case of the Fisher information matrix. In particular, you didn't show an example of a non-regular model where the Fisher information matrix is positive definite everywhere.
Is it correct to assume models which are merely non-regular because the map from parameters to distributions is non-injective aren't that interesting, and so you maybe don't even want to call them singular?

Yep, the rest of the article does focus on the case where the Fisher information matrix is degenerate because it is far more interesting and gives rise to an interesting singularity structure (i.e. most of the time it will yield an RLCT $λ < \frac{d}{2}$ ). Unless my topology is horrendously mistaken, if one has a singular model class for which every parameter has a positive definite Fisher information, then this implies the non-identifiability condition simply means you have a set of isolated points $w_{1}, \dots, w_{n}$ that all have the same RLCT $\frac{d}{2}$ . Thus the free energy will only depend on their inaccuracy $L_{n} (w)$ , meaning every optimal parameter has the same free energy - not particularly interesting! An example of this would be something like the permutation symmetry of ReLU neural networks that I discuss in DSLT3.

I found this slightly ambiguous, also because under your definitions further down, it seems like "singular" (degenerate Fisher information matrix) is a stronger condition than "strictly singular" (degenerate Fisher information matrix OR non-injective map from parameters to distributions).

I have clarified the terminology in the section where they are defined - thanks for picking me up on that. In particular, a singular model class can be either strictly singular or regular - Watanabe’s results hold regardless of identifiability or the degeneracy of the Fisher information. (Sometimes I might accidentally use the word "singular" to emphasise a model which "has non-regular points" - the context should make it relatively clear).

What is $x$ in this formula? Is it fixed? Or do we average the derivatives over the input distribution?

Refer to Theorem 3.1 and Lemma 3.2 in my thesis. The Fisher information involves an integral wrt $q (x) d x$ , so the Fisher information is degenerate iff that set is dependent as a function of $x$ , in other words, for all $x$ values in the domain specified by $q (x)$ (well, more precisely, for all non-measure-zero regions as specified by $q (x)$ ).

Hhm, I thought having a singular model just means that some singularities are degenerate.

Typo - thanks for that.

One unrelated conceptual question: when I see people draw singularities in the loss landscape, for example in Jesse's post, they often "look singular": i.e., the set of minimal points in the loss landscape crosses itself. However, this doesn't seem to actually be the case: a perfectly smooth curve of loss-minimizing points will consist of singularities because in the direction of the curve, the derivative does not change. Is this correct?

Correct! When we use the word “singularity”, we are specifically referring to singularities of $K (w)$ in the sense of algebraic geometry, so they are zeroes (or zeroes of a level set), and critical points with $\nabla K (w^{(0)}) = 0$ . So, even in regular models, the single optimal parameter is a singularity of $K (w)$ - it's just a really, really uninteresting one. In SLT, every singularity needs to be put into normal crossing form via the resolution of singularities, regardless of whether it is a singularity in the sense that you describe (drawing self-intersecting curves, looking at cusps, etc.). But for cartoon purposes, those sorts of curves are good visualisation tools.

I think you forgot a $|_{w = w_{0}}$ in the term of degree 1.

Typo - thanks.

Could you explain why that is? I may have missed some assumption on $φ (w)$ or not paid attention to something.

If you expand that term out you find that

$\int_{W} (w - w_{0})^{T} \left. \frac{\partial φ}{\partial w} \right|_{w = w_{0}} \exp \left( - (w - w_{0})^{T} I (w_{0}) (w - w_{0}) \right) d w = \left. \frac{\partial φ}{\partial w} \right|_{w = w_{0}} \int_{W} (w - w_{0})^{T} \exp \left( - (w - w_{0})^{T} I (w_{0}) (w - w_{0}) \right) d w = 0$

because the second integral is the first central moment of a Gaussian. The derivative of the prior is irrelevant.

Hhm. Is the claim that if the loss of the function does not change along some curve in the parameter space, then the function itself remains invariant? Why is that?

This is a fair question. When concerning the zeroes, by the formula for $K (w)$ when the truth is realisable one shows that

$W_{0} = \{ w \mid f (x, w) = f (x, w_{0}) \},$

so any path in the set of true parameters (i.e. in this case the set $W_{0} = \{ (w_{1}, w_{2}) \mid w_{1} = 0 \text{ and } w_{2} \in \mathbb{R} \}$ ) will indeed produce the same input-output function. In general (away from the zeroes of $K (w)$ ), I don’t think this is necessarily true but I’d have to think a bit harder about it. In this pathological case it is, but I wouldn’t get bogged down in it - I’m just saying “ $K (w)$ tells us one parameter can literally be thrown out without changing anything about the model”. (Note here that $w_{2}$ is literally a free parameter across all of $W$ ).

Are you sure this is the correct formula? When I tried computing this by hand it resulted in $2 / log (a)$ , but maybe I made a mistake.

Ah! Another typo - thank you very much. It should be

$d_{1} = 2 \lim_{ε \to 0} \frac{\log (V (a ε) / V (ε))}{\log a} .$
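As a quick numerical sanity check of the ratio-of-volumes idea, here is a sketch (not from the post or the thesis) that estimates the scaling exponent of $V (ε)$ , i.e. the limit of $\frac{\log (V (a ε) / V (ε))}{\log a}$ , for a few hypothetical one-dimensional losses with a uniform prior on $[- 1, 1]$ . The grid size, $ε$ and $a$ are arbitrary choices, and the function names are made up for this example.

```python
import math

def volume(K, eps, cells=400_000):
    """Fraction of a uniform midpoint grid on [-1, 1] where K(w) < eps."""
    hits = sum(1 for i in range(cells) if K(-1 + (2 * i + 1) / cells) < eps)
    return hits / cells

def scaling_exponent(K, eps=1e-4, a=0.5):
    """Finite-eps estimate of log(V(a*eps) / V(eps)) / log(a)."""
    return math.log(volume(K, a * eps) / volume(K, eps)) / math.log(a)

for name, K in [("w^2", lambda w: w**2),
                ("w^2/10", lambda w: w**2 / 10),
                ("w^4", lambda w: w**4)]:
    print(name, round(scaling_exponent(K), 3))
```

The estimates should come out close to $\frac{1}{2}$ , $\frac{1}{2}$ and $\frac{1}{4}$ respectively: rescaling the loss by a constant leaves the exponent unchanged, while raising the power lowers it. Doubling the exponent for $w^{2}$ recovers the single non-free parameter.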

General unrelated question: is the following a good intuition for the correspondence of the volume with the effective number of parameters around a singularity? The larger the number of effective parameters $λ$ around $w_{0}$ , the more $K (w)$ blows up around $w_{0}$ in all directions because we get variation in all directions, and so the smaller the region where $K (w)$ is below $ϵ$ . So $λ$ contributes to this volume. This is in fact what it does in the formulas, by being an exponent for small $ϵ$ .

I think that's a very reasonable intuition to have, yep! Moreover, if one wants to compare the "flatness" between $\frac{1}{10} w^{2}$ versus $w^{4}$ , the point is that within a small neighbourhood of the singularity, a higher exponent (RLCTs of $\frac{1}{2}$ and $\frac{1}{4}$ respectively here) is "much flatter" than a low coefficient (the $\frac{1}{10}$ ). This is what the RLCT is picking up.

Do you currently expect that gradient descent will do something similar, where the parameters will move toward singularities with low RLCT? What's the state of the theory regarding this?

We do expect that SGD is roughly equivalent to sampling from the Bayesian posterior and therefore that it moves towards regions of low RLCT, yes! But this is nonetheless just a postulate for the moment. If one treats $K (w)$ as a Hamiltonian energy function, then you can apply a full-throated physics lens to this entire setup (see DSLT4) and see that the critical points of $K (w)$ strongly affect the trajectories of the particles. Then the connection between SGD and SLT is really just the extent to which SGD is “acting like a particle subject to a Hamiltonian potential”. (A variant called SGLD seems to be just that, so maybe the question is under what conditions / to what extent does SGD = SGLD?). Running experiments that test whether variants of SGD end up in low RLCT regions of $K (w)$ is definitely a fruitful path forward.

[-]Leon Lang2y10

Thanks for the answer Liam! I especially liked the further context on the connection between Bayesian posteriors and SGD. Below a few more comments on some of your answers:

The partition function is equal to the model evidence $Z_{n} = p (D_{n})$ , yep. It isn’t equal to $p ((Y_{i}) | (X_{i})),$ (I assume $i$ is fixed here?) but is instead expressed in terms of the model likelihood and prior (and can simply be thought of as the “normalising constant” of the posterior),
$p (D_{n}) = \int_{W} φ (w) \prod_{i = 1}^{n} p (y_{i}, x_{i} | w) \, d w$

and then under this supervised learning setup where we know $q (x_{i})$ , we have $p (y_{i}, x_{i} | w) = p (y_{i} | x_{i}, w) q (x_{i})$ . Also note that this does “factor over $i$ ” (if I’m interpreting you correctly) since the data is independent and identically distributed.

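I think I still disagree. I think everything in these formulas needs to be conditioned on the $X$ -part of the dataset. In particular, I think the notation $p (D_{n})$ is slightly misleading, but maybe I'm missing something here.

I'll walk you through my reasoning: When I write $(X_{i})$ or $(Y_{i})$ , I mean the whole vectors, e.g., $(X_{i})_{i = 1, \dots, n}$ . Then I think the posterior computation works as follows:

$p (w ∣ D_{n}) = p (w ∣ (Y_{i}), (X_{i})) = \frac{p ((Y_{i}) ∣ (X_{i}), w) \cdot p (w ∣ (X_{i}))}{p ((Y_{i}) ∣ (X_{i}))} .$

That is just Bayes rule, conditioned on $(X_{i})$ in every term. Then, $p (w ∣ (X_{i})) = φ (w)$ because from $X$ alone you don't get any new information about the conditional $q (Y ∣ X)$ (A more formal way to see this is to write down the Bayesian network of the model and to see that $w$ and $X_{i}$ are d-separated). Also, conditioned on $w$ , $p$ is independent over data points, and so we obtain

$p (w ∣ D_{n}) = \frac{1}{p ((Y_{i}) ∣ (X_{i}))} \cdot e^{- n L_{n} (w)} \cdot φ (w) .$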

So, comparing with your equations, we must have $Z_{n} = p ((Y_{i}) ∣ (X_{i})) .$ Do you think this is correct?

Btw., I still don't think this "factors over $i$ ". I think that

$Z_{n} \neq \prod_{i = 1}^{n} p (Y_{i} ∣ X_{i}) .$
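A toy check of this non-factorisation, under loudly stated assumptions: the sketch below uses a hypothetical two-point prior and a Bernoulli likelihood with the inputs $X_{i}$ omitted entirely, so it is not the regression setup of the post. It only illustrates that marginalising over $w$ couples the data points.

```python
def evidence(ys, prior, likelihood):
    """p(y_1, ..., y_n) = sum over w of prior(w) * prod_i p(y_i | w), finite parameter set."""
    total = 0.0
    for w, pw in prior.items():
        term = pw
        for y in ys:
            term *= likelihood(y, w)
        total += term
    return total

prior = {0.2: 0.5, 0.8: 0.5}                   # hypothetical two-point prior on w
bern = lambda y, w: w if y == 1 else 1 - w     # Bernoulli likelihood, inputs omitted

joint = evidence([1, 1], prior, bern)          # 0.5 * 0.04 + 0.5 * 0.64
product = evidence([1], prior, bern) ** 2      # (0.5 * 0.2 + 0.5 * 0.8) ** 2
print(joint, product)
```

The joint evidence is about 0.34 while the product of the single-point evidences is about 0.25: conditioned on $w$ the points are independent, but after integrating out $w$ they are not.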

If you expand that term out you find that
$\int_{W} (w - w_{0})^{T} \left. \frac{\partial φ}{\partial w} \right|_{w = w_{0}} \exp \left( - (w - w_{0})^{T} I (w_{0}) (w - w_{0}) \right) d w = \left. \frac{\partial φ}{\partial w} \right|_{w = w_{0}} \int_{W} (w - w_{0})^{T} \exp \left( - (w - w_{0})^{T} I (w_{0}) (w - w_{0}) \right) d w = 0$
because the second integral is the first central moment of a Gaussian. The derivative of the prior is irrelevant.

Right, that makes sense, thank you! (I think you missed a factor of $n / 2$ , but that doesn't change the conclusion)

Thanks also for the corrected volume formula, it makes sense now :)

[-]mfar2y41

I think these are helpful clarifying questions and comments from Leon. I saw Liam's response. I can add to some of Liam's answers about some of the definitions of singular models and singularities.

1. Conditions of regularity: Identifiability vs. regular Fisher information matrix

Liam: A regular statistical model class is one which is identifiable (so $p (y | x, w_{1}) = p (y | x, w_{2})$ implies that $w_{1} = w_{2}$ ), and has positive definite Fisher information matrix $I (w)$ for all $w \in W$ .
Leon: The rest of the article seems to mainly focus on the case of the Fisher information matrix. In particular, you didn't show an example of a non-regular model where the Fisher information matrix is positive definite everywhere.
Is it correct to assume models which are merely non-regular because the map from parameters to distributions is non-injective aren't that interesting, and so you maybe don't even want to call them singular?

As Liam said, I think the answer is yes---the emphasis of singular learning theory is on the degenerate Fisher information matrix (FIM) case. Strictly speaking, all three classes of models (regular, non-identifiable, degenerate FIM) are "singular", as "singular" is defined by Watanabe. But the emphasis is definitely on the 'more' singular models (with degenerate FIM) which is the most complex case and also includes neural networks.

As for non-identifiability being uninteresting, as I understand, non-regularity arising from certain kinds of non-local non-identifiability can be easily dealt with by re-parametrising the model or just restricting consideration to some neighbourhood of (one copy of) the true parameter, or by similar tricks. So, the statistics of learning in these models is not strictly-speaking regular to begin with, but we can still get away with regular statistics by applying such tricks.

Liam mentions the permutation symmetries in neural networks as an example. To clarify, this symmetry usually creates a discrete set of equivalent parameters that are separated from each other in parameter space. But the posterior will also be reflected along these symmetries so you could just get away with considering a single 'slice' of the parameter space where every function is represented by at most one parameter (if this were the only source of non-identifiability---it turns out that's not true for neural networks).

It's worth noting that these tricks don't generally apply to models with local non-identifiability. Roughly, local non-identifiability means that there are extra true parameters in every neighbourhood of some true parameter. However, local non-identifiability implies that the FIM is degenerate at that true parameter, so again we are back in the degenerate FIM case.

2. Linear independence condition on Fisher information matrix degeneracy

Leon: What is $x$ in this formula [" $\left\{ \frac{\partial}{\partial w_{j}} f (x, w) \right\}_{j = 1}^{d}$ is linearly dependent"]? Is it fixed? Or do we average the derivatives over the input distribution?

Yeah I remember also struggling to parse this statement when I first saw it. Liam answered but in case it's still not clear and/or someone doesn't want to follow up in Liam's thesis, $x$ is a free variable, and the condition is talking about linear dependence of functions of $x$ .

Consider a toy example (not a real model) to help spell out the mathematical structure involved: Let $f (x, w) = (w_{1} + 2 w_{2}) x$ so that $\frac{\partial}{\partial w_{1}} f (x, w) = x$ and $\frac{\partial}{\partial w_{2}} f (x, w) = 2 x$ . Then let $g$ and $h$ be functions such that $g (x) = x$ and $h (x) = 2 x$ . Then the set of functions $\{ g, h \}$ is a linearly dependent set of functions because $h - 2 g = 0$ .
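To connect this toy example to the Fisher information, here is a minimal sketch. It assumes the regression setting with unit-variance Gaussian noise, where the Fisher information reduces to the average over inputs of the outer product of the gradient of $f$ with itself, and it replaces the integral against $q (x)$ with an average over a small hypothetical sample of inputs. The function names are made up for this example.

```python
from fractions import Fraction

def grad_f(x):
    """Gradient of f(x, w) = (w1 + 2*w2) * x with respect to (w1, w2); independent of w."""
    return (Fraction(x), Fraction(2 * x))

def empirical_fisher(xs):
    """Average of the outer products grad_f(x) grad_f(x)^T over the sample xs."""
    n = len(xs)
    I = [[Fraction(0)] * 2 for _ in range(2)]
    for x in xs:
        g = grad_f(x)
        for j in range(2):
            for k in range(2):
                I[j][k] += g[j] * g[k] / n
    return I

I = empirical_fisher([-2, -1, 0, 1, 2])        # hypothetical input sample
det = I[0][0] * I[1][1] - I[0][1] * I[1][0]
print(I, det)
```

Because the second partial derivative is exactly twice the first for every $x$ , the second row of the matrix is twice the first and the determinant vanishes, whatever inputs are used.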

3. Singularities vs. visually obvious singularities (self-intersecting curves)

Leon: One unrelated conceptual question: when I see people draw singularities in the loss landscape, for example in Jesse's post, they often "look singular": i.e., the set of minimal points in the loss landscape crosses itself. However, this doesn't seem to actually be the case: a perfectly smooth curve of loss-minimizing points will consist of singularities because in the direction of the curve, the derivative does not change [sic: 'derivative is zero', or 'loss does not change', right?]. Is this correct?

Right, as Liam said, often^[1] in SLT we are talking about singularities of the Kullback-Leibler loss function. Singularities of a function are defined as points where the function is zero and has zero gradient. Since $K$ is non-negative, all of its zeros are also local (actually global) minima, so they also have zero gradient. Among these singularities, some are 'more singular' than others. Liam pointed to the distinction between degenerate singularities and non-degenerate singularities. More generally, we can use the RLCT as a measure of 'how singular' a singularity is (lower RLCT = more singular).

As for the intuition about visually reasoning about singularities based on the picture of a zero set: I agree this is useful, but one should also keep in mind that it is not sufficient. These curves just show the zero set, but the singularities (and their RLCTs) are defined not just based on the shape of the zero set but also based on the local shape of the function around the zero set.

Here's an example that might clarify. Consider two functions $J, K : \mathbb{R}^{2} \to \mathbb{R}$ such that $J (x, y) = x y$ and $K (x, y) = x^{2} y^{2}$ . Then these functions both have the same zero set $\{ (x, y) : x = 0 \lor y = 0 \}$ . That set has an intersection at the origin. Observe the following:

Both $J (0, 0) = 0$ and $\nabla J (0, 0) = \vec{0}$ , so the intersection is a singularity in the case of $J$ .
The other points on the zero set of $J$ are not singular. E.g. if $y = 0$ but $x \neq 0$ , then $\nabla J (x, 0) = (0, x) \neq \vec{0}$ .
Even though $K$ has the exact same zero set, all of its zeros are singular points! Observe $\nabla K (x, y) = (2 x y^{2}, 2 x^{2} y)$ , which is zero everywhere on the zero set.

In general, it's a true intuition that intersections of lines in zero sets correspond to singular points. But this example shows that whether non-intersecting points of the zero set are singular points depends on more than just the shape of the zero set itself.
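The observations about $J$ and $K$ can be checked mechanically. The sketch below hard-codes the two gradients and evaluates them at a handful of arbitrarily chosen points on the shared zero set; the function names are made up for this example.

```python
def grad_J(x, y):
    """Gradient of J(x, y) = x * y."""
    return (y, x)

def grad_K(x, y):
    """Gradient of K(x, y) = x^2 * y^2."""
    return (2 * x * y**2, 2 * x**2 * y)

# Arbitrary sample of points on the shared zero set {x = 0 or y = 0}.
zero_set = [(0, 0), (1, 0), (-3, 0), (0, 2), (0, -5)]

for p in zero_set:
    print(p, "grad J =", grad_J(*p), "grad K =", grad_K(*p))

# Points of the sample where each gradient vanishes.
singular_J = [p for p in zero_set if grad_J(*p) == (0, 0)]
singular_K = [p for p in zero_set if grad_K(*p) == (0, 0)]
```

On this sample only the origin is a singular point of $J$ , whereas every sampled zero of $K$ is singular, even though the two zero sets coincide.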

In singular learning theory, the functions we consider are non-negative (Kullback--Leibler divergence), so you don't get functions like $J$ with non-critical zeros. However, the same argument here about existence of singularities could be extended to the danger of reasoning about the extent of singularity of singular points based on just looking at the shape of the zero set: the RLCT will depend on how the function behaves in the neighbourhood, not just on the zero set.

^[1]
One exception, you could say, is in the definition of strictly singular models. There, as we discussed, we had a condition involving the degeneracy of the Fisher information matrix (FIM) at a parameter. Degenerate matrix = non-invertible matrix = also called singular matrix. I think you could call these parameters 'singularities' (of the model).
One subtle point in this notion of singular parameter is that the definition of the FIM at a parameter $w$ involves setting the true parameter to $w$ . For a fixed true parameter, the set of singularities (zeros of KL loss wrt. that true parameter) will not generally coincide with the set of singularities (parameters where the FIM is degenerate).
Alternatively, you could consider the FIM condition in the definition of a non-regular model to be saying "if a model would have degenerate singularities at some parameter if that were the true parameter, then the model is non-regular".

[-]Leon Lang2y10

Thanks for the answer mfar!

Yeah I remember also struggling to parse this statement when I first saw it. Liam answered but in case it's still not clear and/or someone doesn't want to follow up in Liam's thesis, $x$ is a free variable, and the condition is talking about linear dependence of functions of $x$ .
Consider a toy example (not a real model) to help spell out the mathematical structure involved: Let $f (x, w) = (w_{1} + 2 w_{2}) x$ so that $\frac{\partial}{\partial w_{1}} f (x, w) = x$ and $\frac{\partial}{\partial w_{2}} f (x, w) = 2 x$ . Then let $g$ and $h$ be functions such that $g (x) = x$ and $h (x) = 2 x$ . Then the set of functions $\{g, h\}$ is a linearly dependent set of functions because $h - 2 g = 0$ .

Thanks! Apparently the proof of the thing I was wondering about can be found in Lemma 3.4 in Liam's thesis. Also thanks for your other comments!

[-]Roman Malov2mo10

and $w_{1}^{(0)} = 1$

Shouldn't the second singularity be at the point $w = 0$ ?

[-]jacob_drori10mo10

The theorem guarantees the existence of a $d$ -dimensional analytic manifold $M$ and a real analytic map

g : M ∋ u \mapsto w \in W

such that for each coordinate $M_{α}$ of $M$ one can write

\begin{matrix} K (g (u)) & = u_{1}^{2 k_{1}} \dots u_{d}^{2 k_{d}} . . . \end{matrix}

I'm a bit confused here. First, I take it that $α$ labels coordinate patches? Second, consider the very simple case with $d = 2$ and $K (w) = w_{1}^{2} + w_{2}^{2}$ . What $g$ would put $K$ into the stated form?

[-]WCargo2y10

Hi, thank you for the sequence. Do you know if there is any way to get access to Watanabe’s book for free?

[-]Daniel Murfet2y20

If the cost is a problem for you, send a postal address to daniel.murfet@gmail.com and I'll mail you my physical copy.

[-]Liam Carroll2y10

Only in the illegal ways, unfortunately. Perhaps your university has access?
