Alexander Gietelink Oldenziel

(...) the term technical is a red flag for me, as it is many times used not for the routine business of implementing ideas but for the parts, ideas and all, which are just hard to understand and many times contain the main novelties.
- Saharon Shelah

 

As a true-born Dutchman, I endorse Crocker's rules.

For most of my writing, see my shortforms (new shortform, old shortform).

Twitter: @FellowHominid

Personal website: https://sites.google.com/view/afdago/home

Sequences

Singular Learning Theory

Comments

God is alive and we have birthed him.

It's still wild to me that highly cited papers in this space can make such elementary errors. 

Thank you for writing this post, Dmitry. I've only skimmed it, but it clearly merits a deeper dive.

I will now describe a powerful, central circle of ideas I've been obsessed with for the past year that I suspect is very close to the way you are thinking.

Free energy functionals

There is a very powerful, very central idea whose simplicity is somehow lost in physics obscurantism, which I will call, for lack of a better word, 'tempered free energy functionals'.

Let us be given a loss function $L$ [physicists will prefer to think of this as an energy function/Hamiltonian]. The idea is that one considers a functional $F_{L,\beta}: \Delta(\Omega) \to \mathbb{R}$ taking a distribution $p$ and sending it to $F_{L,\beta}(p) = L(p) - \beta^{-1} H(p)$, where $\beta \in \mathbb{R}_{>0}$ is the inherent coolness or inverse temperature (equivalently, up to rescaling, one can minimize $\beta L(p) - H(p)$).

We are now interested in minimizers of this functional. The functional will typically be convex (e.g. if $L(p) = KL(q \| p)$, the KL-divergence, or $L(p) = \mathbb{E}_{w \sim p}[N L_N(w)]$, the expected empirical loss at $N$ data points), so it has a minimum. This minimizer is the tempered Bayesian posterior/Boltzmann distribution at inverse temperature $\beta$.
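As a quick sanity check, here is a minimal numpy sketch (the loss vector and $\beta$ are made-up toy values) verifying on a finite state space that the Boltzmann distribution $p^* \propto e^{-\beta L}$ minimizes $F(p) = \mathbb{E}_p[L] - \beta^{-1} H(p)$:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy setting: a finite state space with a made-up loss vector.
L = np.array([0.0, 0.5, 2.0, 3.0])
beta = 2.0  # inverse temperature

def free_energy(p, L, beta):
    """F(p) = E_p[L] - (1/beta) * H(p), with H the Shannon entropy."""
    H = -np.sum(p * np.log(p + 1e-12))
    return p @ L - H / beta

# Closed-form minimizer: the Boltzmann / tempered posterior p* ∝ exp(-beta * L).
p_star = np.exp(-beta * L)
p_star /= p_star.sum()

# Sanity check: no random distribution beats the Boltzmann distribution.
for _ in range(1000):
    q = rng.dirichlet(np.ones(len(L)))
    assert free_energy(p_star, L, beta) <= free_energy(q, L, beta) + 1e-9

print(p_star, free_energy(p_star, L, beta))  # F(p*) = -(1/beta) * log Z
```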

I find the physics terminology inherently confusing. So instead of the mysterious word 'temperature', just think of $\beta$ as a knob controlling the tradeoff between loss and inherent simplicity bias/noise: the smaller $\beta$, the more the entropy term dominates and the noisier the minimizer.

SLT of course describes this free energy functional, evaluated at its minimizer, as a function of $N$: this is the Watanabe free energy formula.
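For reference, Watanabe's asymptotic expansion of this free energy (stated at $\beta = 1$; I am quoting the standard formula, regularity conditions suppressed) is

$$F_N = N L_N(w_0) + \lambda \log N - (m-1) \log\log N + O_P(1),$$

where $w_0$ is an optimal parameter, $\lambda$ is the learning coefficient (RLCT), and $m$ is its multiplicity.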

Another piece of the story is that the [continuum limit of] stochastic gradient Langevin dynamics at a given noise level is equivalent to gradient flow of the free energy functional [at that noise level] in the Wasserstein metric; this is the Jordan-Kinderlehrer-Otto theorem.
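To make the sampler half of that statement concrete, here is a minimal sketch of the (unadjusted) Langevin update, with the full gradient standing in for a stochastic one and a made-up quadratic loss; its stationary distribution is the Boltzmann distribution $p \propto e^{-\beta L}$ from above:

```python
import numpy as np

rng = np.random.default_rng(0)

def grad_L(w):
    # Made-up loss for illustration: L(w) = ||w||^2 / 2, so grad L(w) = w.
    return w

def langevin(w0, beta, eta=1e-3, steps=100_000):
    """Unadjusted Langevin: w <- w - eta * grad L(w) + sqrt(2 * eta / beta) * xi.
    The stationary distribution is p(w) ∝ exp(-beta * L(w))."""
    w = np.array(w0, dtype=float)
    samples = []
    for _ in range(steps):
        w = w - eta * grad_L(w) + np.sqrt(2 * eta / beta) * rng.standard_normal(w.shape)
        samples.append(w.copy())
    return np.array(samples)

# For L(w) = ||w||^2 / 2 the stationary law is Gaussian with variance 1/beta.
samples = langevin([3.0], beta=4.0)
print(samples[50_000:].var())  # ≈ 0.25, up to discretization error
```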

Rate-distortion theory

Instead of a free energy functional, we can better think of it as a complexity-accuracy tradeoff functional.

This is the basic setup of rate-distortion theory. I note that there is a very important but little-known purely algorithmic version of this theory. See here for an expansive breakdown of more of these ideas.
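For concreteness, here is a minimal sketch of the classical Blahut-Arimoto iteration (the binary source and Hamming distortion are toy choices), which minimizes exactly such a complexity-accuracy functional $I(X; \hat{X}) + \beta \, \mathbb{E}[d(X, \hat{X})]$; sweeping $\beta$ traces out the rate-distortion curve, with $\beta$ playing the same tradeoff role as the inverse temperature above:

```python
import numpy as np

def blahut_arimoto(p_x, d, beta, iters=500):
    """Minimize I(X; Xhat) + beta * E[d(X, Xhat)] over channels q(xhat|x).
    p_x: source distribution, shape (n,); d: distortion matrix, shape (n, m)."""
    n, m = d.shape
    r = np.full(m, 1.0 / m)  # marginal over reproductions
    for _ in range(iters):
        q = r[None, :] * np.exp(-beta * d)  # unnormalized q(xhat|x)
        q /= q.sum(axis=1, keepdims=True)
        r = p_x @ q                         # induced marginal
    rate = np.sum(p_x[:, None] * q * np.log(q / (r[None, :] + 1e-16) + 1e-16))
    distortion = np.sum(p_x[:, None] * q * d)
    return rate, distortion

# Toy binary source with Hamming distortion.
p_x = np.array([0.5, 0.5])
d = 1.0 - np.eye(2)
for beta in [0.5, 2.0, 8.0]:
    R, D = blahut_arimoto(p_x, d, beta)
    print(f"beta={beta}: rate={R:.3f} nats, distortion={D:.3f}")
```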

Working in this generality, it can be shown that every phase transition diagram is possible. There are also connections with Natural Abstractions/sufficient statistics and with time complexity.

Like David Holmes, I am not an expert in tropical geometry, so I can't give the best case for why tropical geometry may be useful. Only a real expert putting in serious effort can make that case.

Let me nevertheless respond to some of your claims. 

  • PL functions are quite natural for many reasons. They are simple, and they naturally appear as minimizers of various optimization procedures; see e.g. the discussion in section 5 here.
  • Polynomials don't satisfy the padding argument, and architectures based on them will therefore typically fail to have the correct simplicity bias.

As for

1." Algebraic geometry isn't good at dealing with deep composition of functions, and especially approximate composition."  I agree a typical course in algebraic geometry will not much consider composition of functions but that doesn't seem to me a strong argument for the contention that the tools of algebraic geometry are not relevant here. Certainly, more sophisticated methods beyond classical scheme theory may be important [likely involving something like PROPs] but ultimately I'm not aware of any fundamental obstruction here. 

 

2. As for the claim about approximation: I don't agree with the contention that algebraic geometry is somehow not suited for questions of approximation. E.g. the Weil conjectures are really an approximate/average statement about points of curves over finite fields. The same objection you make could have been made about singularity theory before we knew about SLT.

I agree with you that a probabilistic perspective on ReLUs/piecewise-linear functions is probably important. It doesn't seem unreasonable to me in the slightest to consider some sort of tempered posterior on the space of piecewise-linear functions. I don't think this invalidates the potential of polytope-flavored thinking.

>> Tropical geometry is an interesting, mysterious and reasonable field in mathematics, used for systematically analyzing the asymptotic and "boundary" geometry of polynomial functions and solution sets in high-dimensional spaces, and related combinatorics (it's actually closely related to my graduate work and some logarithmic algebraic geometry work I did afterwards). It sometimes extends to other interesting asymptotic behaviors (like trees of genetic relatedness). The idea of applying this to partially linear functions appearing in ML is about as silly as trying to see DNA patterns in the arrangement of stars -- it's a total type mismatch. 

Shots fired! :D Afaik I'm the only tropical geometry stan in alignment, so let me reply to this spicy takedown here.

It's quite plausible to me that thinking in terms of polytopes and convexity is a reasonable and potentially powerful lens for understanding neural networks. Despite the hyperconfident and strong language in this post, it seems you agree.

Is it then unreasonable to think that tropical geometry may be relevant too? I don't think so.  

Perhaps your contention is that tropical geometry is more than just thinking in terms of polytopes, but specifically the algebraic-geometry-flavored techniques. Perhaps. I don't feel strongly about that. If it's matroids that are most relevant, rather than toric varieties and tropicalized Grassmannians, then so be it.

The basic tropical perspective on deep learning begins by viewing ReLU neural networks as 'tropical rational functions', i.e. decomposing the underlying map $f$ of your ReLU network as a difference $f = g - h$ of convex piecewise-linear functions. This decomposition isn't unique, but it is possibly still quite useful.

As is mentioned in the text, convex piecewise-linear functions are much easier to analyze than general piecewise-linear functions, so this decomposition may prove advantageous.
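To make this concrete, here is a minimal numpy sketch (the one-hidden-layer architecture and random weights are made-up toy choices) of the simplest such decomposition: collect the positive-coefficient hidden units into $g$ and the negative-coefficient ones into $h$; each part is then a nonnegative combination of convex functions, hence convex piecewise-linear. For deeper networks the decomposition takes more bookkeeping, but the idea is the same:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy one-hidden-layer ReLU network f(x) = sum_i c_i * relu(a_i . x + b_i).
n_hidden, dim = 8, 2
A = rng.standard_normal((n_hidden, dim))
b = rng.standard_normal(n_hidden)
c = rng.standard_normal(n_hidden)

def f(x):
    return c @ np.maximum(A @ x + b, 0.0)

# Difference-of-convex split: positive-coefficient units form g, negative ones h.
pos, neg = c > 0, c < 0

def g(x):
    return c[pos] @ np.maximum(A[pos] @ x + b[pos], 0.0)

def h(x):
    return (-c[neg]) @ np.maximum(A[neg] @ x + b[neg], 0.0)

# Check f = g - h on random inputs.
for _ in range(100):
    x = rng.standard_normal(dim)
    assert np.isclose(f(x), g(x) - h(x))
```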

Another direction that may be of interest in this context is nonsmooth calculus, especially its extension, quasidifferential calculus.

" as silly trying to see DNA patterns in the arrangement of stars -- it's a total type mismatch" 

This statement feels deeply overconfident to me. Whether or not tropical geometry is relevant to understanding real neural networks can only be resolved by having a true domain expert 'commit to the bit' and research this deeply.

This kind of idle speculation seems not so useful to me. 

You are probably aware of this, but there is indeed a mathematical theory of degeneracy/multiplicity in which degeneracy in the parameter-function map of neural networks is key to their simplicity bias. This is singular learning theory.

The connection between degeneracy [SLT] and simplicity [algorithmic information theory] is surprisingly, delightfully simple. It's given by the padding/deadcode argument.
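For readers who haven't seen it, a one-line sketch of the padding argument (standard algorithmic information theory, up to additive constants and machine-model details): if $f$ has a shortest program of length $K(f)$, then appending dead code yields, for every $n \geq K(f)$,

$$\#\{p : |p| = n, \ p \text{ computes } f\} \gtrsim 2^{n - K(f)},$$

so simpler functions are exponentially more degenerate under the program-to-function map, which is exactly the shape of the degeneracy-simplicity link that SLT exhibits in parameter space.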

Beautifully argued, Dmitry. Couldn't agree more. 

I would also note that I consider the second problem of interpretability basically the central problem of complex systems theory. 

I consider the first problem a special case of the central problem of alignment. It's very closely related to the 'no free lunch' problem.

Thanks. 

Well, 2-3 shitposters and one gwern.

Who would be so foolish as to short gwern? Gwern the farsighted, gwern the prophet, gwern for whom entropy is nought, gwern augurious augustus.

Thanks for the sleuthing.

 

The thing is, the last time I heard OpenAI rumors, it was Strawberry.

The unfortunate fact of life is that, too many times, OpenAI's shipping has surpassed all but the wildest speculations.
