Very hot take [I would like to have my mind changed]. I think that studying the Science of Deep Learning is one of the least impactful areas that people interested in alignment could work on. To be concrete, I think it is less impactful than: foundational problems (MIRI/Wentworth), prosaic theoretical work (ELK), studying DL systems (e.g. deep RL) for alignment failures (Langosco et al.), or mechanistic interpretability (Olah's work), off the top of my head. Some of these could involve the (very general) feedback loop mentioned here, but that wouldn't be the best description of any of these directions.
Figuring out why machine learning “works” is an important problem for several subfields of academic ML (Nakkiran et al., any paper that mentions the “bias-variance tradeoff”, the statistical learning theory literature, the neural tangent kernel literature, the lottery ticket hypothesis, …). Science of Deep Learning is an umbrella term for all of this work and more (loss-landscape work also falls under the umbrella, though with a less ambitious goal than figuring out how ML works). Why should it be a fruitful research direction when every one of these subfields remains open and unresolved? Taking an outside view on the question it asks, Science of Deep Learning does not look like a tractable research direction.
Additionally, everyone would like to understand how ML works, including both the alignment-motivated and the capabilities-motivated. This problem is not neglected, and it is very unclear how any insight into why SGD works wouldn’t be directly a capabilities contribution. This doesn’t mean the work is definitely net-negative from an alignment perspective, but a case has to be made for why the alignment gains outweigh the capabilities gains, and that case is harder to make than the corresponding case for interpretability.
This problem is not neglected, and it is very unclear how any insight into why SGD works wouldn’t be directly a capabilities contribution.
I strongly disagree! AFAICT SGD works so well for capabilities that interpretability/actually understanding models/etc. is highly neglected and there's low-hanging fruit all over the place.
To me, the label "Science of DL" is far broader than interpretability. However, I was claiming that the general goal of Science of DL is not neglected (see my middle paragraph).
Got it, I was mostly responding to the third paragraph (insight into why SGD works, which I think is mostly an interpretability question) and should have made that clearer.
I think the situation I'm considering in the quoted part is something like this: research is done on SGD training dynamics, and researcher X finds a new way of looking at model component Y which reveals that only certain parts of it matter for performance. So they remove the unimportant parts, scale the model further, and the model gets better. To me this meets the definition of "why SGD works" (the model uses the Y components to achieve low loss).
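As a purely illustrative sketch of that ablate-and-compare workflow (not from either of our comments): the snippet below assumes a PyTorch classifier and treats "removing component Y" as zeroing one named parameter tensor, then comparing validation loss before and after. All names are hypothetical.

```python
# Illustrative ablate-and-compare sketch (hypothetical names, assumes PyTorch).
import torch
import torch.nn.functional as F

@torch.no_grad()
def eval_loss(model, loader, device="cpu"):
    """Average cross-entropy of `model` over a validation loader."""
    model.eval()
    total, n = 0.0, 0
    for x, y in loader:
        x, y = x.to(device), y.to(device)
        total += F.cross_entropy(model(x), y).item() * len(x)
        n += len(x)
    return total / n

def ablate_component(model, param_name):
    """Zero one named parameter tensor -- a crude stand-in for 'removing' component Y."""
    with torch.no_grad():
        dict(model.named_parameters())[param_name].zero_()

# base = eval_loss(model, val_loader)
# ablate_component(model, "blocks.3.attn.out_proj.weight")  # placeholder name
# ablated = eval_loss(model, val_loader)
# If `ablated` is close to `base`, component Y looks removable, and the "insight
# into why SGD works" turns straightforwardly into a cheaper, more scalable model.
```

The worry in my comment is exactly that the last commented-out steps are the easy part once the understanding exists.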
I think interpretability that finds the ways models represent information (especially across models) is valuable, but that feels different from "why SGD works".
Got it, I see. I think of the two as really intertwined (e.g. a big part of my agenda at the moment is studying how biases/path-dependence in SGD affect interpretability/polysemanticity).
I'm not certain about it either but I'm less skeptical. However, I agree with you that some of this could be capabilities work and has to be treated with caution.
However, I think that to answer some of the important questions about Deep Learning, e.g. which concepts models learn and under which conditions, we simply need a much better understanding of the entire pipeline. I think it's plausible that this is very hard and that progress will be much slower than one would hope.
I have written down a long list of alignment ideas that I’d be interested in working on. The ideas roughly boil down to “To make progress on alignment, we need to understand Deep Learning models and the process by which they arrive at their final parameters in much more detail than we currently do”.
Here is the link to the full version (comments are on, please don’t abuse it): https://docs.google.com/document/d/1AyuTphQ31rLHDtpZoEwEPb4fWbZna1H3hGx_YUACxk4/edit?usp=sharing
The rest of this post is an overview copied from the doc. Feedback is welcome.
Overview - Science of Deep Learning
By Science of DL, I roughly mean "understanding DL systems and how they learn concepts" better than we currently do. The main goal is to propose a precise and testable hypothesis about a phenomenon in DL and then test and refine it until we are highly confident in its truth or falsehood. This hypothesis could be about how NNs behave at the neuron level, at the circuit level, during training, during fine-tuning, etc. This research will almost surely at some point include mechanistic interpretability, but it is not limited to it.
The refined statement after investigation can, but doesn't have to, be mathematical in form, as long as it is unambiguous and testable, i.e. two people could agree on an experiment that would provide evidence for or against the statement and then run it.
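To make this concrete, here is a purely illustrative example of such a testable statement and an experiment for it; the hypothesis, the layer name `fc1`, and the unit/class indices are all made up for the sketch, which assumes a small PyTorch classifier.

```python
# Hypothetical hypothesis-testing sketch (assumes a small PyTorch classifier `model`
# with a first linear layer `model.fc1`; names and indices are illustrative).
import torch

HYPOTHESIS = ("Hidden unit 7 of fc1 mainly supports class 3: zeroing its incoming "
              "weights should hurt accuracy on class 3 far more than on other classes.")

@torch.no_grad()
def per_class_accuracy(model, loader, num_classes, device="cpu"):
    """Accuracy of `model` on each class, evaluated over `loader`."""
    correct = torch.zeros(num_classes)
    total = torch.zeros(num_classes)
    model.eval()
    for x, y in loader:
        preds = model(x.to(device)).argmax(dim=-1).cpu()
        for c in range(num_classes):
            mask = y == c
            correct[c] += (preds[mask] == c).sum()
            total[c] += mask.sum()
    return correct / total.clamp(min=1)

# acc_before = per_class_accuracy(model, val_loader, num_classes=10)
# with torch.no_grad():
#     model.fc1.weight[7].zero_()  # ablate hidden unit 7 (rows of nn.Linear.weight are output units)
# acc_after = per_class_accuracy(model, val_loader, num_classes=10)
# drop = acc_before - acc_after
# The hypothesis gains support if drop[3] is large while the drop on the other
# classes stays small; otherwise the statement gets refined or discarded.
```

Two people who disagree about such a hypothesis could agree beforehand on how large a class-3 drop would count as evidence, which is the kind of unambiguity meant above.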
How this could look in practice
The details would obviously differ from project to project, but at a high level I imagine it looking roughly like this:
Goals
The goal of this research is to understand DL systems as well as possible. This means there is not one clear goal by which we could judge our performance. However, I think there are some ways to test whether we have actually increased our understanding of different parts of the system. These include:
Caveat
Understanding more parts of the DL pipeline can always also lead to an increase in dangerous capabilities. Essentially, whenever we understand a technology better, we can use that knowledge to make it more efficient or more powerful.
However,
Final words
I’m currently excited about this agenda and will likely explore some of the project ideas in the long doc in the near future. However, I’m still uncertain how promising I find the agenda compared to other approaches to alignment. Feedback and considerations are welcome.