Currently, we do not have a good theoretical understanding of how or why neural networks actually work. For example, we know that large neural networks are sufficiently expressive to compute almost any kind of function. Moreover, most functions that fit a given set of training data will not generalise well to new data. And yet, if we train a neural network we will usually obtain a function that gives good generalisation. What is the mechanism behind this phenomenon?
There has been some recent research which (I believe) sheds some light on this issue. I would like to call attention to this blog post:
Neural Networks Are Fundamentally Bayesian
This post provides a summary of the research in these three papers, which provide a candidate for a theory of generalisation:
https://arxiv.org/abs/2006.15191
https://arxiv.org/abs/1909.11522
https://arxiv.org/abs/1805.08522
(You may notice that I had some involvement with this research, but the main credit should go to Chris Mingard and Guillermo Valle-Perez!)
I believe that research of this type is very relevant for AI alignment. It seems quite plausible that neural networks, or something similar to them, will be used as a component of AGI. If that is the case, then we want to be able to reliably predict and reason about how neural networks behave in new situations, and how they interact with other systems, and it is hard to imagine how that would be possible without a deep understanding of the dynamics at play when neural networks learn from data. Understanding their inductive bias seems particularly important, since this is the key to understanding everything from why they work in the first place, to phenomena such as adversarial examples, to the risk of mesa-optimisation. I hence believe that it makes sense for alignment researchers to keep an eye on what is happening in this space.
If you want some more stuff to read in this genre, I can also recommend these two posts:
Recent Progress in the Theory of Neural Networks
Understanding "Deep Double Descent"
EDIT: Here is a second post, which talks more about the "prior" of neural networks:
Deep Neural Networks are biased, at initialisation, towards simple functions
We’re not saying that discrete complexity measures fully capture what we care about for NNs! We do however think that they are sufficiently relevant to be informative for the bigger picture, even if just as a proxy for what we actually care about.
Most complexity measures give roughly similar values for the (relative) complexity of most objects, so our assumption is that if something is the case for a bunch of different tractable complexity measures, then this is also likely to be the case for whatever the “ideal” complexity measure would be in the relevant case. In particular, if P(f)≃2−K(x)+C regardless of whether K represents Boolean complexity, or LZ complexity, etc, then this is also likely to be true for the “ideal” complexity measure for neural networks.
Also: since we’re estimating various probabilities by sampling, we basically need to discretise the function space. If you have any concrete suggestions for how to get around this then we’re all ears!
As for the rest of your comment -- what you’re saying here seems true to me, but I’m not sure I see how any of this is a counterpoint to anything we’re saying?