Nothing is "mere." I, too, can see the stars on a desert night, and feel them. But do I see less or more? The vastness of the heavens stretches my imagination - stuck with this carousel, my little eye can catch one-million-year-old light. A vast pattern - of which I am a part - perhaps my stuff was belched from some forgotten star, as one is belching there. Or see them with the greater eye of Palomar, rushing all apart from some common starting point when they were perhaps all together. What is the pattern, or the meaning, or the why? It does not do harm to the mystery to know a little about it.
- Richard P. Feynman on The Relation of Physics to Other Sciences
Notes and reflections on the things I've learned while Doing Scholarship the last two weeks (i.e. studying math).
The past two weeks were mostly on differential geometry (Lee):
Rabbit holes that I could not afford to pursue:
Example of how reading books in parallel improves learning efficiency.
Why that long? The dimensionality reduction by projection is perhaps more nontrivial because of Sard's theorem, but the obvious gluing should have been sufficient to at least construct an immersion, albeit at the cost of an inefficient codomain dimension. Maybe the historically difficult part was the concept of a partition of unity and the fact that one always exists on a manifold?
Thanks for the recommendation! Woit's book does look fantastic (also as an introduction to quantum mechanics). I also know Sternberg's Group Theory and Physics to be a good representation theory & physics book.
I did encounter Brown's book during my search for algebraic topology books, but I passed it over in favor of Bredon's because it didn't develop homology/cohomology to the extent I was interested in. The groupoid perspective does seem very interesting and useful, though, so I might read it after completing my current set of textbooks.
Notes and reflections on the things I've learned while Doing Scholarship this week (i.e. studying math)[1].
This week, I'll start tracking the exercises I solve and the pages I cover and post them in next week's shortform (EDIT: biweekly), so that I can keep track of my progress and have additional accountability.
I am self-studying math. The purpose of this shortform is to publicly write down:
with the aim of:
I am currently reading the following textbooks:
and I plan to do most of the exercises for each of the textbooks unless I find some of them too redundant. For this week's shortform I haven't written down my progress on each of these books or the problems I've solved, because I haven't started tracking them; I'll do so starting next week.
The RLCT[1] is a function of both the model $p(x \mid w)$ and the true distribution $q(x)$. The role of $p$ is clear enough, with very intuitive examples[2] of local degeneracy arising from the structure of the parameter-function map. However, until recently the intuitive role of $q$ really eluded me.
I think I now have some intuitive picture of how structure in $q$ influences the RLCT (at least for particular instances of it). Consider the following example.
Suppose the true distribution is (1) realizable ($q(x) = p(x \mid w^*)$ for some $w^*$), and (2) invariant under some group action, $q(g \cdot x) = q(x)$ for all $g \in G$. Now, suppose that the model class is that of exponential models, i.e. $p(x \mid w) = \frac{1}{Z(w)} \exp(\langle w, \phi(x) \rangle)$. In particular, suppose that $\phi$, the fixed feature map, is $G$-equivariant, i.e. such that $\phi(g \cdot x) = \rho(g)\, \phi(x)$ for some orthogonal representation $\rho$ of $G$.
Claim: There is a degeneracy of the form $K(\rho(g)\, w) = K(w)$ for all $g \in G$ (where $K(w) = \mathrm{KL}(q(x) \,\|\, p(x \mid w))$), and in particular, if $G$ is a Lie group, the rank upper bound of the RLCT decreases by $\dim G / 2$.
Nothing about this is nontrivial. The first claim is an immediate consequence of the definitions:

$$K(\rho(g)\, w) = \mathrm{KL}\big(q(x) \,\|\, p(x \mid \rho(g)\, w)\big) = \mathrm{KL}\big(q(x) \,\|\, p(g^{-1} \cdot x \mid w)\big) = \mathrm{KL}\big(q(g \cdot x) \,\|\, p(x \mid w)\big) = K(w)$$
... (the equalities use, in order, the definition of $K$, the $G$-equivariance of $\phi$ together with orthogonality of $\rho$, a change of variables $x \mapsto g \cdot x$, and the $G$-invariance of $q$), and the latter claim on the RLCT is a consequence of the orbit of $w^*$ reducing the rank of $\nabla^2 K$ at $w^*$ by $\dim G$, together with the rank upper bound result here.
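As a quick numerical sanity check of the first claim, here is a minimal sketch in a toy discrete setting; the specific choices ($X = \mathbb{Z}_4$, $G = \mathbb{Z}_2$ acting by $x \mapsto x + 2$, one-hot features) are my own illustration rather than anything from the setup above.

```python
import numpy as np

# Toy check of K(rho(g) w) = K(w): X = Z_4, G = Z_2 acting by x -> x + 2 (mod 4),
# phi(x) = e_x (one-hot), so phi(g.x) = rho(g) phi(x) with rho(g) a permutation matrix.
rng = np.random.default_rng(0)

rho = np.zeros((4, 4))
for x in range(4):
    rho[(x + 2) % 4, x] = 1.0        # rho maps e_x to e_{(x+2) mod 4}

q = np.array([0.1, 0.4, 0.1, 0.4])   # G-invariant truth: q(x) = q(x + 2 mod 4)

def p(w):
    """Exponential model p(x|w) = exp(<w, phi(x)>) / Z(w); softmax(w) for one-hot phi."""
    z = np.exp(w - w.max())
    return z / z.sum()

def K(w):
    """K(w) = KL(q || p(.|w))."""
    return float(np.sum(q * np.log(q / p(w))))

w = rng.normal(size=4)               # an arbitrary parameter -- the symmetry is generic
print(K(w), K(rho @ w))              # agree up to floating point
```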
While this model is very toy, I think the high-level idea of which this is a concrete model is interesting: abstracting out, the proof of how data structure influences degeneracy routes through two steps:
Basically, (1) realizability imparts input-symmetry to $p(x \mid w^*)$, and (2) emulatability essentially "pushes forward" this to a symmetry in the parameters[4]. I think this is very interesting!
Going back to the exponential model, the most unrealistic part of it (even after taking into account that it is a toy instantiation of this high-level idea) is the fact that its symmetry is generic: $K(\rho(g)\, w) = K(w)$ holds for ALL $w$, since the $G$-equivariant $\phi$ is independent of $w$. A more realistic model would look something like $p(x \mid w_1, w_2) \propto \exp(\langle w_1, \phi_{w_2}(x) \rangle)$, where the feature map $\phi_{w_2}$ also depends on part of the parameter $w = (w_1, w_2)$ and, importantly, whether $\phi_{w_2}$ satisfies $G$-equivariance depends on the value of $w_2$.
Then, if $(w_1, w_2)$ and $(w_1', w_2')$ are both optimal but $w_2$ makes $\phi_{w_2}$ $G$-equivariant while $w_2'$ doesn't, then the rank upper bound of the RLCT for the former is lower than that of the latter (thus the former would be represented much more greatly in the Bayesian posterior).
This is more realistic, and I think sheds some light on why training imparts models with circuits / algorithms / internal symmetries that reflect structure in the data.
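To make the $w_2$-dependence concrete, here is a minimal extension of the sketch above (again my own illustration, not the post's model): take $\phi_{w_2}(x) = e_x + w_2\, \mathbb{1}[x = 0]\, e_1$, which is $G$-equivariant iff $w_2 = 0$, and observe that the parameter symmetry $w_1 \mapsto \rho(g)\, w_1$ survives exactly in that case.

```python
import numpy as np

# Same toy setting as before: X = Z_4, G = Z_2 acting by x -> x + 2 (mod 4).
rng = np.random.default_rng(0)
rho = np.zeros((4, 4))
for x in range(4):
    rho[(x + 2) % 4, x] = 1.0
q = np.array([0.1, 0.4, 0.1, 0.4])    # G-invariant truth

def K(w1, w2):
    """KL(q || p(.|w1, w2)) with phi_{w2}(x) = e_x + w2 * [x == 0] * e_1."""
    logits = w1.copy()
    logits[0] += w2 * w1[1]           # <w1, phi_{w2}(x)> = w1[x] + w2 * w1[1] * [x == 0]
    z = np.exp(logits - logits.max())
    return float(np.sum(q * np.log(q / (z / z.sum()))))

w1 = rng.normal(size=4)
print(K(w1, 0.0) - K(rho @ w1, 0.0))  # ~0: phi_0 is G-equivariant, symmetry holds
print(K(w1, 0.7) - K(rho @ w1, 0.7))  # generically nonzero: equivariance is broken
```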
(Thanks to Dan Murfet for various related discussions.)
Very brief SLT context: in SLT, the main quantity of interest is the RLCT, which, broadly speaking, is a measure of the degeneracy of the most degenerate point among the optimal parameters. We care about this because it directly controls the asymptotics of the Bayesian posterior. We also often care about its localized version, where we restrict the parameter space to an infinitesimal neighborhood (germ) of a particular optimal parameter whose degeneracy we want to measure.
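For reference, the sense in which the RLCT controls the posterior is the standard free energy asymptotics; this is my paraphrase of the usual SLT statement (sign conventions mine), not a quote from any particular source:

$$F_n := -\log \int_W e^{-n L_n(w)}\, \varphi(w)\, dw = n L_n(w_0) + \lambda \log n + O_p(\log \log n),$$

where $L_n(w) = -\frac{1}{n} \sum_{i=1}^n \log p(X_i \mid w)$ is the empirical negative log likelihood, $\varphi$ the prior, $w_0$ an optimal parameter, and $\lambda$ the RLCT.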
The RLCT is a particular invariant of the average log likelihood function $L(w)$, meaning it is a function of the true distribution $q(x)$ and the parametric model $p(x \mid w)$ (the choice of the prior doesn't matter under reasonable regularity conditions).
Given a two-layer feedforward network with ReLU, multiplying the first layer by $\alpha > 0$ and dividing the next by $\alpha$ implements the same function. There are many other examples, including non-generic degeneracies which occur at particular weight values, unlike the constant multiplication degeneracy, which occurs at every $w$; more examples in Liam Carroll's thesis.
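A one-line check of this rescaling degeneracy (a minimal sketch; the shapes and the value of $\alpha$ are arbitrary choices of mine):

```python
import numpy as np

# f(x) = W2 @ relu(W1 @ x); since relu(a * z) = a * relu(z) for a > 0,
# (a * W1, W2 / a) implements exactly the same function as (W1, W2).
rng = np.random.default_rng(1)
W1, W2 = rng.normal(size=(5, 3)), rng.normal(size=(2, 5))
x, a = rng.normal(size=3), 3.7

relu = lambda z: np.maximum(z, 0.0)
f  = W2 @ relu(W1 @ x)
fa = (W2 / a) @ relu(a * (W1 @ x))
print(np.allclose(f, fa))            # True at every (W1, W2) and every a > 0
```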
This reminds me of the notion of data-program equivalence (programs-as-data, Gödel numbering, UTM). Perhaps some infinitesimal version of it?
Let the input-side symmetry be trivial (i.e. $G = \{e\}$), and we recover degeneracies originating from the structure of the parameter-function map alone as a special case.
Found a proof sketch here (App. D.3); couldn't find it elsewhere in the canonical SLT references, e.g. the gray book. The idea seems simple:
There shouldn't be a negative sign here (14a).
(will edit this comment over time to collect typos as I find them)
The fourth one is great.
Conventionally is a random variable, just like how is a random variable. To be fair the conventions are somewhat inconsistent, given that (as you said) is a number.
Previous discussion, comment by johnswentworth:
Relevant slogan: Goodheart is about generalization, not approximation.
[...]
In all the standard real-world examples of Goodheart, the real problem is that the proxy is not even approximately correct once we move out of a certain regime.
This seems like a misleading example of doomers being wrong (agree denotationally, disagree connotationally), since I think it's plausible that Y2K was not a big deal (to such an extent that "most people think it was a myth, hoax, or urban legend") precisely because of the mitigation efforts spurred by the doomsayers' predictions.