Simon Pepin Lehalleur

When Are Results from Computational Complexity Not Too Coarse?

There is another interesting connection between computation and bounded treewidth: the control flow graphs of programs written in languages "without goto instructions" have uniformly bounded treewidth (e.g. <7 for goto-free C programs). This is due to Thorup (1998):

https://www.sciencedirect.com/science/article/pii/S0890540197926973

Combined with graph algorithms for bounded-treewidth graphs, this has apparently been used in the analysis of compiler optimization and program verification problems; see the recent reference:

https://dl.acm.org/doi/abs/10.1145/3622807

which also proves a similar bound for pathwidth.
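To make the bounded-treewidth claim concrete, here is a minimal sketch that computes an upper bound on the treewidth of a tiny control flow graph using the greedy min-degree elimination heuristic. The CFG (for a loop `while (c) { if (d) A; else B; }`) and all function names are assumptions made up for illustration; this is a generic heuristic, not Thorup's construction.

```python
def undirected(edges):
    """Build an undirected adjacency map from a list of edges."""
    adj = {}
    for a, b in edges:
        adj.setdefault(a, set()).add(b)
        adj.setdefault(b, set()).add(a)
    return adj


def treewidth_upper_bound(adj):
    """Width of the greedy min-degree elimination order.

    Any elimination order yields a tree decomposition, so the result is an
    upper bound on the treewidth (not necessarily the exact value).
    """
    g = {v: set(ns) for v, ns in adj.items()}  # work on a copy
    width = 0
    while g:
        # Pick a vertex of smallest degree; break ties by name.
        v = min(g, key=lambda u: (len(g[u]), u))
        nbrs = g.pop(v)
        width = max(width, len(nbrs))
        for a in nbrs:
            g[a].discard(v)
            g[a] |= nbrs - {a}  # turn the neighbourhood of v into a clique
    return width


# Hypothetical CFG of:  while (c) { if (d) A; else B; }
cfg_edges = [
    ("entry", "head"), ("head", "cond"),
    ("cond", "then"), ("cond", "else"),
    ("then", "join"), ("else", "join"),
    ("join", "head"), ("head", "exit"),
]
cfg = undirected(cfg_edges)
bound = treewidth_upper_bound(cfg)
print("treewidth upper bound:", bound)
```

On this toy graph the heuristic stays well below the bound quoted above for goto-free C; for real programs one would extract the CFG from a compiler rather than write it by hand.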

The absolute basics of representation theory of finite groups

Simon Pepin Lehalleur2mo10

Nice!

I would add the following, which is implicit in the presentation: this phenomenon of real representations is not specific to finite groups. Real irreducible representations of a group are always neatly divided into three types: real, complex or quaternionic. This is [Schur's lemma](https://ncatlab.org/nlab/show/Schur%27s+lemma#statement) together with the fact that the real division algebras are exactly R, C and the quaternions H.
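For finite groups, the three types can be read off a complex irreducible character via the Frobenius-Schur indicator, the average of chi(g^2) over the group: it equals 1, 0 or -1 in the real, complex and quaternionic case respectively. The sketch below checks this on three small examples; the choice of examples and the helper names are assumptions for illustration, and the indicator is a tool for finite (or compact) groups only, not for the general statement above.

```python
import cmath


def matmul(a, b):
    """Product of two 2x2 matrices given as nested lists."""
    return [[sum(a[r][k] * b[k][c] for k in range(2)) for c in range(2)]
            for r in range(2)]


def neg(a):
    return [[-x for x in row] for row in a]


def trace(a):
    return a[0][0] + a[1][1]


def frobenius_schur(char_values_on_squares):
    """Average of chi(g^2) over the group, rounded to the nearest integer."""
    vals = [complex(v) for v in char_values_on_squares]
    return round((sum(vals) / len(vals)).real)


TYPE = {1: "real", 0: "complex", -1: "quaternionic"}

# Trivial character of a group of order 3: chi(g^2) = 1 for every g.
trivial = frobenius_schur([1, 1, 1])

# Faithful character of the cyclic group C_3: chi(k) = exp(2 pi i k / 3),
# evaluated on the squares 2k.
c3 = frobenius_schur(cmath.exp(2j * cmath.pi * (2 * k) / 3) for k in range(3))

# 2-dimensional irreducible representation of the quaternion group Q_8.
one = [[1, 0], [0, 1]]
qi = [[1j, 0], [0, -1j]]
qj = [[0, 1], [-1, 0]]
qk = matmul(qi, qj)
q8 = [m for g in (one, qi, qj, qk) for m in (g, neg(g))]
quat = frobenius_schur(trace(matmul(g, g)) for g in q8)

for name, nu in [("trivial", trivial), ("C_3", c3), ("Q_8", quat)]:
    print(name, nu, TYPE[nu])
```

The trivial character comes out real, the faithful character of C_3 complex, and the 2-dimensional representation of Q_8 quaternionic.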

(Should ML interpretability people care about infinite groups to begin with - unlike mathematicians, who love them all? For one thing, models as well as datasets can exhibit (exact or approximate) continuous symmetries, and these symmetries can be understood mathematically as actions of matrix Lie groups such as the group GL_n of all invertible matrices or the group O_n of n-dimensional rotations and reflections. Sometimes these actions are linear, so are themselves representations, and sometimes they can be studied by linearizing them. Using representation theory to study more general geometric group actions is one of those great tricks of mathematics which reduce complicated problems to linear algebra.)

Renormalization Redux: QFT Techniques for AI Interpretability

Simon Pepin Lehalleur2mo10

On 1., you should consider that, for people who don't know much about QFT and its relationship with SFT (like, say, me 18 months ago), it is not at all obvious that QFT can be applied beyond quantum systems!

In my case, the first time I read about "QFT for deep learning" I dismissed it automatically because I assumed it would involve some far-fetched analogies with quantum mechanics.

Renormalization Redux: QFT Techniques for AI Interpretability

Simon Pepin Lehalleur2mo10

"but in fact you can also understand the theory on a fine-grained level near an impurity by a more careful form of renormalization, where you view the nearest several impurities as discrete sources and only coarsegrain far-away impurities as statistical noise."

Where could I read about this?

Renormalization Redux: QFT Techniques for AI Interpretability

Simon Pepin Lehalleur2mo110

Thanks a lot for writing this! Some clarifying questions:

1. In this context, is QFT roughly a shorthand for "statistical field theory, studied via the mathematical methods of Euclidean QFT"? Or do you expect intuitions from specifically quantum phenomena to play a role?
2. There is a community of statistical physicists who use techniques from statistical mechanics of disordered systems and phase transitions to study ML theory, mostly for simple systems (linear models, shallow networks) and simple data distributions (Gaussian data, student-teacher model with a similarly simple teacher). What do you think of this approach? How does it relate to what you have in mind?
3. Would this approach, at least when applied to the whole network, rely on an assumption that trained DNNs inherit from their initialization a relatively high level of "homogeneity" and relatively limited differentiation, compared say to biological organisms? For instance, as a silly thought experiment, suppose you had the same view into a tiger as you have into a DNN: something like all the chemical-level data as a collection of time-series indexed by (spatially randomized) voxels, and you want to understand the behaviour of the tiger as a function of the environment. How would you expect a QFT-based approach to proceed? What observables would it encode first? Would it be able to go beyond the global thermodynamics of the tiger and say something about cell and tissue differentiation? How would it "put the tiger back together"? (Those are not gotcha questions - I don't really know if any existing interpretability method would get far in this setting!)

The Laws of Large Numbers

Simon Pepin Lehalleur2mo30

For sufficiently nice, regular, 1-dimensional Bayesian models, Edgeworth-type asymptotic expansions for the Bayesian posterior have been derived in

https://projecteuclid.org/journals/annals-of-mathematical-statistics/volume-41/issue-3/Asymptotic-Expansions-Associated-with-Posterior-Distributions/10.1214/aoms/1177696963.full

The Laws of Large Numbers

Simon Pepin Lehalleur2mo30

Q: How can I use LaTeX in these comments? I tried to follow https://www.lesswrong.com/tag/guide-to-the-lesswrong-editor#LaTeX but it does not seem to render.

Here is the simplest case I know, which is a sum of dependent identically distributed variables. In physical terms, it is about the magnetisation of the 1d Curie-Weiss (=mean-field Ising) model. I follow the notation of the paper https://arxiv.org/abs/1409.2849 for ease of reference; what follows is roughly Theorem 8 + Theorem 10:

Let $M_n=\sum_{i=1}^n \sigma(i)$ be the sum of n dependent Bernoulli random variables $\sigma(i)\in\{\pm 1\}$, where the joint distribution is given by

$$\mathbb{P}(\sigma)\sim \exp\left(\frac{\beta}{2n}M_n^2\right)$$

Then

When $\beta=1$, the fluctuations of $M_n$ are very large and we have an anomalous CLT: $\frac{M_n}{n^{3/4}}$ converges in law to the probability distribution $\sim \exp(-\frac{x^4}{12})$.
When $\beta<1$, $M_n$ satisfies a normal CLT: $\frac{M_n}{n^{1/2}}$ converges to a Gaussian.
When $\beta>1$, $M_n$ does not satisfy a limit theorem of this kind (there are two lowest-energy configurations, with opposite magnetisations).

In statistical mechanics, this is an old result of Ellis-Newman from 1978; the paper above puts it into a more systematic probabilistic framework, and proves finer results about the fluctuations (Theorems 16 and 17).
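The two scaling regimes can be checked numerically without any sampling, since the law of $M_n$ is explicit: there are $\binom{n}{k}$ configurations with $k$ spins equal to $-1$, each with magnetisation $n-2k$. The sketch below is only an illustration under a stated assumption, namely the normalisation $\exp(\frac{\beta}{2n}M_n^2)$ for the weights, for which the critical value is $\beta=1$; the function names are hypothetical. It estimates the exponent $a$ in $\mathrm{std}(M_n)\approx c\,n^{a}$ by comparing sizes $n$ and $4n$.

```python
import math


def magnetisation_law(n, beta):
    """Exact law of M_n, assuming weights exp(beta * M_n**2 / (2 n))."""
    values, log_w = [], []
    for k in range(n + 1):  # k = number of spins equal to -1
        m = n - 2 * k
        log_binom = (math.lgamma(n + 1) - math.lgamma(k + 1)
                     - math.lgamma(n - k + 1))
        values.append(m)
        log_w.append(log_binom + beta * m * m / (2 * n))
    top = max(log_w)  # subtract the maximum to avoid overflow
    w = [math.exp(x - top) for x in log_w]
    z = sum(w)
    return values, [x / z for x in w]


def std_of_magnetisation(n, beta):
    values, probs = magnetisation_law(n, beta)
    # The mean of M_n is 0 by the spin-flip symmetry.
    return math.sqrt(sum(p * m * m for m, p in zip(values, probs)))


def growth_exponent(beta, n=2000):
    """Estimate a in std(M_n) ~ c * n**a from the sizes n and 4n."""
    ratio = std_of_magnetisation(4 * n, beta) / std_of_magnetisation(n, beta)
    return math.log(ratio) / math.log(4)


exponent_subcritical = growth_exponent(0.5)
exponent_critical = growth_exponent(1.0)
print("beta = 0.5:", round(exponent_subcritical, 3))
print("beta = 1.0:", round(exponent_critical, 3))
```

One expects an exponent near 1/2 below the critical point and near 3/4 at it, up to finite-size corrections.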

The physical intuition is that $\beta=1$ is the critical inverse temperature at which the 1d Curie-Weiss model goes through a continuous phase transition. In general, one should expect such anomalous CLTs in the thermodynamic limit of continuous phase transitions in statistical mechanics, with the shape of the CLT controlled by the Taylor expansion of the microcanonical entropy around the critical parameters. Indeed Ellis and his collaborators have worked out a number of such cases for various mean-field models (which according to Meliot-Nikeghbali also fit in their mod-Gaussian framework). It is of course very difficult to prove such results rigorously outside of mean-field models, since even proving that there is a phase transition is often out of reach.
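For the Curie-Weiss case, the entropy heuristic can be written out in a few lines. This is a non-rigorous sketch, assuming the normalisation $\exp(\frac{\beta}{2n}M_n^2)$ for the weights (the one for which the critical value is $\beta=1$); $s(m)$ denotes the entropy per spin at magnetisation density $m=M_n/n$.

```latex
% Number of configurations with M_n = nm, by Stirling's formula:
\binom{n}{n(1+m)/2} \approx e^{n s(m)}, \qquad
s(m) = \log 2 - \tfrac{1+m}{2}\log(1+m) - \tfrac{1-m}{2}\log(1-m)
     = \log 2 - \frac{m^2}{2} - \frac{m^4}{12} - O(m^6).

% Combining with the Boltzmann weight exp(beta n m^2 / 2):
\mathbb{P}(M_n = nm) \;\approx\; C_n
  \exp\!\Big( n \Big[ \frac{(\beta-1)\,m^2}{2} - \frac{m^4}{12} - O(m^6) \Big] \Big).

% beta < 1: the quadratic term dominates; put m = x n^{-1/2}:
\mathbb{P} \;\propto\; \exp\!\Big( -\frac{(1-\beta)\,x^2}{2} \Big)
  \quad\text{(Gaussian, variance } 1/(1-\beta)\text{)}.

% beta = 1: the quadratic term vanishes; put m = x n^{-1/4}:
\mathbb{P} \;\propto\; \exp\!\Big( -\frac{x^4}{12} \Big).
```

The exponent $3/4$ and the constant $1/12$ are thus both read off from the first non-vanishing term of the Taylor expansion of $s(m)$ at $m=0$.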

A limitation of the Curie-Weiss result is that it is 1d and so the "singularity" is pretty limited. The Meliot-Nikeghbali paper has 2d and 3d generalisations where the singularities are a bit more interesting: see Theorem 11 and Equations (10) and (11). And here is another recent example from the stat mech literature

https://link.springer.com/article/10.1007/s10955-016-1667-9

You were actually asking about Edgeworth expansions rather than just the CLT. It may be that with this method of producing anomalous CLTs, starting with a nice mod-Gaussian convergent sequence and doing a change of measure, one could write down further terms in the expansion? I haven't thought about this.

Since the main result of SLT is roughly speaking an "anomalous CLT for the Bayesian posterior", I would love to use the results above to think of singular Bayesian statistical models as "at a continuous phase transition" (probably with quenched disorder to be more physically accurate), with the tuning to criticality coming from a combination of structure in data and hyperparameter tuning, but I don't really know what to do with this analogy!

Dmitry's Koan

Simon Pepin Lehalleur2mo10

I mentioned samples and expectations for the TLBP because it seems possible (and suggested by the role of degeneracies in SLT) that different samples can correspond to qualitatively different degradations of the model. Cartoon picture: besides the robust circuit X of interest, there are "fragile" circuits A and B, and most samples at a given loss scale degrade either A or B but not both.

I agree that there is no strong reason to overindex on the Watanabe temperature, which is derived from an idealised situation: global Bayesian inference, degeneracies exactly at the optimal parameters, "relatively finite variance", etc. The scale you propose seems quite natural but I will let LLC-practitioners comment on that.

Dmitry's Koan

Simon Pepin Lehalleur2mo50

Is the following a fair summary of the thread ~up to "Natural degradation" from the SLT perspective?

Current SLT-inspired approaches are right to consider samples of the "tempered local Bayesian posterior" provided by SGLD as natural degradations of the model.
However, they mostly only use those samples (at a fixed Watanabe temperature) to compute the expectation of the loss and the resulting LLC, because that is theoretically grounded by Watanabe's work.
You suggest instead to compute, using those sampled weights, the expectations of more complicated observables derived from other interpretability methods, and to interpret those expectations using the "natural scale" heuristics laid out in the post.
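The three points above can be sketched on a toy example. Everything here is an assumption for illustration: the singular loss $L(w)=w_1^2w_2^2$, the hyperparameters, the function names, and the "other observable" (the squared norm of the weights, standing in for something derived from an interpretability method). The sampler uses full-gradient Langevin steps as a stand-in for SGLD, targeting the localized tempered posterior $\exp(-n\beta L(w) - \frac{\gamma}{2}|w-w^*|^2)$, and the LLC is estimated as $n\beta\,(\mathbb{E}[L(w)] - L(w^*))$.

```python
import math
import random


def loss(w):
    # Toy singular loss (assumption): minimised on both coordinate axes,
    # so the minimum at the origin is degenerate.
    return (w[0] * w[1]) ** 2


def grad_loss(w):
    return [2 * w[0] * w[1] ** 2, 2 * w[0] ** 2 * w[1]]


def langevin_samples(w_star, n, beta, gamma=100.0, eps=1e-4,
                     steps=20000, burn_in=2000, seed=0):
    """Unadjusted Langevin samples from exp(-n*beta*L(w) - gamma/2*|w-w*|^2)."""
    rng = random.Random(seed)
    w = list(w_star)
    out = []
    for t in range(steps):
        g = grad_loss(w)
        w = [wi
             - 0.5 * eps * (n * beta * gi + gamma * (wi - ci))
             + math.sqrt(eps) * rng.gauss(0.0, 1.0)
             for wi, gi, ci in zip(w, g, w_star)]
        if t >= burn_in:
            out.append(list(w))
    return out


def posterior_mean(observable, samples):
    """Monte Carlo estimate of the expectation of an observable."""
    return sum(observable(w) for w in samples) / len(samples)


n = 1000
beta = 1.0 / math.log(n)  # the fixed "Watanabe temperature" choice
w_star = [0.0, 0.0]
samples = langevin_samples(w_star, n, beta)

# Standard use: expected loss, turned into an LLC estimate.
mean_loss = posterior_mean(loss, samples)
llc_hat = n * beta * (mean_loss - loss(w_star))

# Suggested use: expectation of some other observable on the same samples.
mean_sq_norm = posterior_mean(lambda w: w[0] ** 2 + w[1] ** 2, samples)

print("LLC estimate:", llc_hat)
print("E[|w|^2]:", mean_sq_norm)
```

The point of the sketch is only that the same set of sampled weights serves both purposes: `posterior_mean` does not care whether the observable is the loss or something more structured.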

The Laws of Large Numbers

Simon Pepin Lehalleur2mo100

A closely related perspective on fluctuations of sequences of random variables has been studied recently in pure probability theory under the name of "mod-Gaussian convergence" (and more generally "mod-phi convergence"). Mod-Gaussian convergence of a sequence of RVs or random vectors is just the right amount of control over the characteristic functions - or in a useful variant, the whole complex Laplace transforms - to imply a clean description of the fluctuations at various scales (CLT, Edgeworth expansion, "normality zone", local CLT, moderate deviations, sharp large deviations,...). Unsurprisingly, the theory is full of cumulants.
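For orientation, here is the definition in the Laplace-transform variant, together with the basic example, written from memory of the references below and so to be checked against them. The symbols $t_n$ (parameters), $\psi$ (limiting function) and $D$ (domain) follow what I believe is the usual notation.

```latex
% (X_n) converges mod-Gaussian on a domain D \subset \mathbb{C}, with
% parameters t_n \to +\infty and limiting function \psi, if
\mathbb{E}\big[e^{zX_n}\big]\, e^{-t_n z^2/2} \;\longrightarrow\; \psi(z)
\quad \text{locally uniformly for } z \in D.

% Basic example: Y_1, Y_2, ... i.i.d., centred, with variance \sigma^2,
% third cumulant \kappa_3 and a finite Laplace transform. Take
X_n = n^{-1/3} \sum_{i=1}^{n} Y_i .
% Expanding the cumulant generating function,
\log \mathbb{E}\big[e^{zX_n}\big]
  = n \Big( \frac{\sigma^2 z^2}{2 n^{2/3}} + \frac{\kappa_3 z^3}{6 n}
            + O(n^{-4/3}) \Big)
  = n^{1/3}\,\frac{\sigma^2 z^2}{2} + \frac{\kappa_3 z^3}{6} + o(1),
% so that
t_n = n^{1/3}\sigma^2, \qquad \psi(z) = \exp\!\Big(\frac{\kappa_3 z^3}{6}\Big).
```

The limiting function $\psi$ records the first cumulant beyond the Gaussian ones, which is the sense in which the theory is "full of cumulants".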

Here is a nice introduction with applications to statistical mechanics models:

https://arxiv.org/abs/1409.2849

and the book with the general theory (which I still have to read!)

https://link.springer.com/book/10.1007/978-3-319-46822-8

This leads for instance to a clean approach to some "anomalous" CLTs with non-Gaussian limit laws (not for the mod-Gaussian convergent sequences themselves but for modified versions thereof) for some stat mech models at continuous phase transitions, see Theorems 8 and 11 in the first reference above. As far as I know, those theorems are the simplest "SLT-like" phenomenon in probability theory!
