math terminology as convolution
=math =linguistics =thinking =philosophy
On the one hand, this theory generalizes the Fuchsian and Bers uniformizations of complex hyperbolic curves and their moduli to nonarchimedean places. It is for this reason that we shall often refer to this theory as p-adic Teichmuller theory, for short. On the other hand, the theory under discussion may be regarded as a fairly precise hyperbolic analogue of the Serre-Tate theory of ordinary abelian varieties and their moduli.
— Shinichi Mochizuki
I know some of these words.
— Ed in Good Burger (1997)
Math research papers are
notorious for using specialized and obscure terminology. Why is that? Why
can't they describe things in terms of simpler components?
Chemists
often talk about carbon atoms. They don't say "an atom with 6 protons, 6
neutrons, and 6 electrons". Those subatomic particles are grouped together
into a single conceptual item. The power of convolutional neural networks
shows us that such grouping is not merely a matter of convenience - rather,
the selection of which things to group together is a system of thinking.
Neural network research suggests a lot about how humans think. For
example, I think the fact that massively multilingual language models work
well, with many languages ultimately sharing the same latent space, is a
refutation of the Sapir-Whorf Hypothesis. Modern neural networks have also,
I think, shown us something about what concepts are. Some linguists have
argued that a word like "dog" is a discrete package, a fixed item to which
additional information is attached. Based on my comparison of how humans
think and how neural networks operate, my view is that the concept "dog" is
3 things:
1) A region of a latent space for doglike concepts.
2) One or more prototype dog concepts, which are points in that latent space used to define the region of dog-ness.
3) A convolution-like transformation by which some data can be packaged into a point in a latent space: "this is a dog" is a way of examining some data from a photo.
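As a toy sketch of that three-part view - the 2-dimensional "latent space", the prototype coordinates, and the `embed` and `classify` helpers are all invented for illustration; real embeddings have hundreds of dimensions:

```python
import math

# Hypothetical prototype points defining each concept's region of latent space.
PROTOTYPES = {
    "dog": [(1.0, 0.2), (0.8, 0.4)],
    "cat": [(-0.9, 0.1)],
}
REGION_RADIUS = 0.5  # how far from a prototype still counts as that concept

def embed(photo_data):
    # Stand-in for the convolution-like transformation: package raw data
    # into a point in the latent space. Here, just a fixed toy mapping.
    return (sum(photo_data) / len(photo_data), max(photo_data) - min(photo_data))

def classify(point):
    # A concept's region is the union of balls around its prototype points.
    for concept, prototypes in PROTOTYPES.items():
        for prototype in prototypes:
            if math.dist(point, prototype) <= REGION_RADIUS:
                return concept
    return None
```

Saying "this is a dog" then corresponds to `classify(embed(photo_data))` landing inside the dog region.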
Math is often considered
universal, but many of the concepts are partly arbitrary. For example, some
people have suggested pi as a universal number that alien species would
recognize, but other people argue that 2*pi is a more fundamental constant.
For a slightly more "complex" example, consider imaginary numbers. The
fundamental theorem of algebra involves them, and that sounds
fundamental...but complex numbers can be considered just a special case of
replacing numbers with matrices - specifically, with a subset of 2x2
matrices that can be represented by 2 numbers and multiplied with fewer
operations. For example, Euler's formula can be written in matrix form.
There are some advantages to computation with that representation, but
arguably it's just a computational optimization with no conceptual value.
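That matrix representation can be checked numerically. Identify a + bi with the 2x2 matrix [[a, -b], [b, a]], so that J = [[0, -1], [1, 0]] plays the role of i; Euler's formula then says the matrix exponential of theta*J equals the rotation matrix [[cos theta, -sin theta], [sin theta, cos theta]]. A sketch, using a truncated Taylor series for the exponential:

```python
import math

def mat_mul(A, B):
    # multiply two 2x2 matrices stored as [[a, b], [c, d]]
    return [[A[0][0]*B[0][0] + A[0][1]*B[1][0], A[0][0]*B[0][1] + A[0][1]*B[1][1]],
            [A[1][0]*B[0][0] + A[1][1]*B[1][0], A[1][0]*B[0][1] + A[1][1]*B[1][1]]]

def mat_exp(A, terms=30):
    # matrix exponential via truncated Taylor series: I + A + A^2/2! + ...
    result = [[1.0, 0.0], [0.0, 1.0]]
    power = [[1.0, 0.0], [0.0, 1.0]]
    factorial = 1.0
    for n in range(1, terms):
        power = mat_mul(power, A)
        factorial *= n
        result = [[result[i][j] + power[i][j] / factorial for j in range(2)]
                  for i in range(2)]
    return result

theta = 0.7
J = [[0.0, -1.0], [1.0, 0.0]]  # plays the role of i, since J*J = -I
left = mat_exp([[theta * J[i][j] for j in range(2)] for i in range(2)])
right = [[math.cos(theta), -math.sin(theta)],
         [math.sin(theta), math.cos(theta)]]
# left and right agree to within floating-point error
```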
If we ask whether the concepts
used in current mathematics are "good" or "bad", the usual presumption is
that some are good and some are bad, but those are relative terms that
depend on what concepts they're compared to. Some math concepts considered
important hundreds of years ago are now considered irrelevant.
other issues
When I say the
language of current advanced math is opaque, I'm mainly talking about the
concepts, but people say that to mean other things as well:
names
Overloading of common words can be annoying, especially for technical
generalists who might go from adding matrices of composite numbers to adding
matrices of composite materials. But math isn't any worse in this regard
than various engineering fields.
A lot of mathematical terms are
called [name]'s theorem or [name]'s lemma. These are hard to remember
because they don't provide any information about the topic. (Personally, I
don't usually want to have to remember names of mathematicians unless
they're on the level of Euclid, Gauss, or Hilbert.) But math isn't any worse
in this regard than biology or medicine.
equations
Math equations can be hard to read. I think programming languages are
often clearer. Yes, I've seen mathematicians comparing compact expressions
using symbols for summation and integrals to awkward-looking equivalents in
pseudocode, but they're missing the point. The main reason complex math
equations are hard to read is that they use, e.g., 12 single-letter
variables, 7 of which were defined over the previous 3 pages, and 5 of which
are defined below the equation. Nobody sane writes code like that unless
they're entering an obfuscated programming competition. Descriptive variable
names and multi-step definitions are better for complex formulae.
And
then, if you have longer variable names, much of that customary math
notation stops working well. It's also much harder to produce that notation
by typing. The notation of math was originally developed for writing simple
equations on a chalkboard for people already familiar with related work. It
was never meant for typing, extremely complex equations, or distributing
work to people in other fields.
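As a small illustration of the descriptive-names style: take a formula a paper might write as s^2 = sum_i (x_i - xbar)^2 / (n - 1), with xbar and n defined somewhere else on the page. In code, the definitions can live next to their use:

```python
def sample_variance(values):
    # s^2 = sum((x_i - mean)^2) / (n - 1), spelled out step by step
    n = len(values)
    mean = sum(values) / n
    squared_deviations = [(v - mean) ** 2 for v in values]
    return sum(squared_deviations) / (n - 1)
```

Nothing here needs to be remembered from 3 pages earlier, which is the point.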
network optimization
The 4-color
theorem was proven with a computer-generated proof over 400 pages long.
Here are some other particularly long math proofs. Conceptual tools are
supposed to make things easy; what long proofs indicate to me isn't that
they have more insights for me to learn, but rather that the tools being
used are inadequate for the task - like people are hammering in nails with a
rock instead of using a nailgun.
Is the answer building a tower of
abstraction even higher? Or...was a wrong turn taken somewhere?
Statistically speaking, some of the turns taken were probably suboptimal.
Let's return to the metaphor of
math terminology as convolutions in a neural network. When a large neural
network is stuck in some bad local minimum, what can be done? There are
multiple options.
The most effective way to train a neural network is
by distillation, imitating another network with better performance. So,
perhaps the best option would be to find an alien civilization with
more-advanced mathematics and copy the concepts they use.
Sometimes
people training neural networks will start over with a new initialization.
(So, perhaps mathematics should all be re-developed from scratch, but that
seems like a lot of work.) Reinitialization is done less than it used to be, because
neural networks have gotten larger, and increasing dimensionality adds
connections between (what would be) local minima. These days, it's more
likely that there's a problem with the optimizer than that training is stuck
in a bad local minimum.
Let's consider how gradient descent works, and
how that compares to the development of math. A network is tried on many tasks,
and the effect that various changes would have on performance is averaged
out across those tasks. Then, the whole network is updated slightly, and the
process repeats. So, people use math to do tasks, and sometimes they notice
small changes that would improve things for them. Do people share those
possible changes, average them out, and then apply them and see how they
work out, perhaps stochastically if they're discrete changes? No; what
happens more often is that mathematicians develop private notations that
they use for their own notes. Friction is too high for the culture of
mathematics to proceed down long shallow gradients.
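The training loop described above, in miniature - a toy setup with one parameter and invented numbers, where the loss on each task t is (param - t)^2:

```python
def average_gradient(param, tasks):
    # the gradient of (param - t)^2 is 2 * (param - t);
    # average the per-task gradients across all tasks
    return sum(2.0 * (param - t) for t in tasks) / len(tasks)

def train(tasks, param=0.0, learning_rate=0.1, steps=200):
    for _ in range(steps):
        # update the whole "network" slightly, then repeat
        param -= learning_rate * average_gradient(param, tasks)
    return param

# the average loss is minimized at the mean of the tasks,
# so repeated small averaged updates converge there
tasks = [1.0, 2.0, 6.0]
```

The analogue for math would be many small notation changes, proposed by many users, averaged and applied incrementally - which is roughly what doesn't happen.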
If math involves
metaphorical convolutions, and good concepts are good because they produce
points in a well-structured latent space, that means that math doesn't
advance through new proofs and theorems per se. Rather, math advances from
new concepts and transformations, and proofs are just the means by which
they're tested. This then implies that a shorter and more elegant proof of
something already proven is just as important as a new proof, perhaps even
more so. But incentives in mathematics aren't structured around that being
the case, perhaps because elegance of proofs is harder for institutions to
measure.
As for why I'm writing this now, it's because I've been
thinking about questions like "why Transformers work better than other
neural network architectures". The tools developed by mathematicians so far
seem inadequate for that, besides trivialities like distance in
high-dimensional Euclidean spaces.