Consider this abridged history of recent ML progress:
A decade or two ago, computer vision was a field that employed dedicated researchers who designed specific, increasingly complex feature recognizers (SIFT, SURF, HoG, etc.). These were usurped by deep CNNs with fully learned features in the 2010s[1], which subsequently saw success in speech recognition, various NLP tasks, and much of AI, competing with other general ANN models, namely various RNNs and LSTMs. Then SOTA in vision (CNNs) and in NLP evolved separately towards increasingly complex architectures, until the simpler/more general transformers took over NLP and quickly spread to other domains (even RL), there also often competing with newer simple/general architectures arising within those domains, such as MLP-mixers in vision. Waves of colonization in design-space.
So the pattern is: increasing human optimization power steadily pushing up architecture complexity is occasionally upset/reset by a new, simpler, more general model, where the new simple/general model substitutes automated machine optimization power for human optimization power[2], enabled by improved compute scaling, à la the bitter lesson. DL isn't just a new AI/ML technique; it's a paradigm shift.
Ok, fine, then what's next?
All of these models, from the earliest deep CNNs on GPUs up to GPT-3 and EfficientZero, generally have a few major design components that haven't much changed:
- Human designed architecture, rather than learned or SGD-learnable-at-all
- Human designed backprop SGD variant (with only a bit of evolution from vanilla SGD to Adam & friends)
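To make "only a bit of evolution" concrete, here is a rough side-by-side sketch of the two update rules in plain NumPy (standard textbook forms with the usual default hyperparameters; nothing here is specific to any particular framework):

```python
import numpy as np

def sgd_step(w, g, lr=1e-2):
    # Vanilla SGD: w <- w - lr * grad
    return w - lr * g

def adam_step(w, g, state, lr=1e-3, b1=0.9, b2=0.999, eps=1e-8):
    # Adam: the same descent direction, just rescaled by running moment estimates.
    # `state` starts as {"t": 0, "m": np.zeros_like(w), "v": np.zeros_like(w)}.
    state["t"] += 1
    state["m"] = b1 * state["m"] + (1 - b1) * g        # EMA of gradients
    state["v"] = b2 * state["v"] + (1 - b2) * g ** 2   # EMA of squared gradients
    m_hat = state["m"] / (1 - b1 ** state["t"])        # bias correction
    v_hat = state["v"] / (1 - b2 ** state["t"])
    return w - lr * m_hat / (np.sqrt(v_hat) + eps)
```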
Obviously there are research tracks in DL such as AutoML/Arch-search and Meta-learning aiming to automate the optimization of architecture and learning algorithms. They just haven't dominated yet.
So here is my hopefully-now-obvious prediction: in this new decade internal meta-optimization will take over, eventually leading to strongly recursively self optimizing learning machines: models that have broad general flexibility to adaptively reconfigure their internal architecture and learning algorithms dynamically based on the changing data environment/distribution and available compute resources[3].
If we just assume for a moment that the strong version of this hypothesis is correct, it suggests some pessimistic predictions for AI safety research:
1. Interpretability will fail - future DL descendant is more of a black box, not less
2. Human designed architectural constraint fails, as human designed architecture fails
3. IRL/Value Learning is far more difficult than first appearances suggest, see #2
4. Progress is hyper-exponential, not exponential. Thus trying to trend-predict DL superintelligence from transformer scaling is more difficult than trying to predict transformer scaling from pre-2000-ish ANN tech, long before rectifiers and deep layer training tricks.
5. Global political coordination on constraints will likely fail, due to #4 and innate difficulty.
There is an analogy here to the history-revision attack against Bitcoin. Bitcoin's security derives from the computational sacrifice invested into the longest chain. But Moore's Law leads to an exponential decrease in the total cost of that sacrifice over time, which when combined with an exponential increase in total market cap, can lead to the surprising situation where recomputing the entire PoW history is not only plausible but profitable.[4]
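As a toy model of why that works (my own illustrative numbers, assuming constant yearly mining spend and hardware that makes redoing old work steadily cheaper): the cost of recomputing all of history converges to a bound, while the prize can keep growing past it.

```python
# Toy model of the history-revision economics (illustrative numbers only).
# Assume miners spend `spend_per_year` dollars of contemporaneous hardware per
# year, and that redoing old work gets 2x cheaper every `halving_years`.
def cost_to_rewrite_history(years_of_history, spend_per_year=1.0, halving_years=2.0):
    """Today's cost of recomputing every past year's proof-of-work."""
    total = 0.0
    for age in range(1, years_of_history + 1):
        # Work done `age` years ago is cheaper to redo today by 2**(age/halving_years).
        total += spend_per_year / 2 ** (age / halving_years)
    return total

# Geometric series: the total stays bounded (~2.41x one year's spend here)
# no matter how long the history, while market cap can grow without bound.
for y in (5, 10, 20, 40):
    print(y, round(cost_to_rewrite_history(y), 3))
```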
In 2010 few predicted that computer Go would beat a human champion just 5 years hence[5], and far fewer (or none) predicted that a future successor of that system would do much better by relearning the entire history of Go strategy from scratch, essentially throwing out the entire human tech tree [6].
So it's quite possible that future meta-optimization throws out the entire human architecture/algorithm tech tree for something else substantially more effective[7]. The circuit algorithmic landscape lacks most of the complexity of the real world, and in that sense is arguably much more similar to Go or chess. Humans are general enough learning machines to do reasonably well at anything, but we can only apply a fraction of our brain capacity to such an evolutionarily novel task, and tend to lose out to more specialized, scaled-up DL algorithms long before said algorithms outcompete humans at all tasks, or even everyday tasks.
Yudkowsky anticipated recursive self-improvement would be the core thing that enables AGI/superintelligence. Reading over that 2008 essay now in 2021, I think he mostly got the gist of it right, even if he didn't foresee/bet that connectionism would be the winning paradigm. EY2008 seems to envision RSI as an explicit cognitive process where the AI reads research papers, discusses ideas with human researchers, and rewrites its own source code.
Instead, in the recursive self-optimization through DL future we seem to be careening towards, the 'source code' is the ANN circuit architecture (as or more powerful than code), and reading human papers, discussing research: all that is unnecessary baggage, as unnecessary as it was for AlphaGo Zero to discuss Go with human Go experts over tea or study their games over lunch. History-revision attack, incoming.
So what can we do? In the worst case we have near-zero control over AGI architecture or learning algorithms. So that only leaves initial objective/utility functions, compute and training environment/data. Compute restriction is obvious and has an equally obvious direct tradeoff with capability - not much edge there.
Even a super powerful recursive self-optimizing machine initially starts with some seed utility/objective function at the very core. Unfortunately it increasingly looks like efficiency strongly demands some form of inherently unsafe self-motivation utility function, such as empowerment or creativity, and self-motivated agentic utility functions are the natural strong attractor[8].
Control over training environment/data is a major remaining lever that doesn't seem to be explored much, and probably has better capability/safety tradeoffs than compute. What you get out of the recursive self optimization or universal learning machinery is always a product of the data you put in, the embedded environment; that is ultimately what separates Go bots, image detectors, story writing AI, feral children, and unaligned superintelligences.
And then finally we can try to exert control on the base optimizer, which in this case is the whole technological research industrial economy. Starting fresh with a de novo system may be easier than orchestrating a coordination miracle from the current Powers.
AlexNet is typically considered the turning point, but the transition started earlier; sparse coding and RBMs are two examples of successful feature-learning techniques pre-DL. ↩︎
If you go back far enough, the word 'computer' itself originally denoted a human occupation! This trend is at least a century old. ↩︎
DL ANNs do a form of approximate Bayesian updating over the implied circuit architecture space with every backprop update, which already is a limited form of self-optimization. ↩︎
Blockchain systems have a simple defense against history-revision attack: checkpointing, but unfortunately that doesn't have a realistic equivalent in our case - we don't control the timestream. ↩︎
I would have bet against this; AlphaGo Zero surprised me far more than AlphaGo. ↩︎
Quite possible != inevitable. There is still a learning efficiency gap vs the brain, and I have uncertainty over how quickly we will progress past that gap, and what happens after. ↩︎
Tool-AI, like GPT-3, is a form of capability constraint, but economic competition is always pressuring tool-AIs to become agent-AIs. ↩︎
I don't think the scenario you describe is as bad for interpretability as you assume. In fact, self-optimizing systems may even be more interpretable than current systems. E.g., current systems use a single channel for all their computation. This causes them to mix conceptually different types of computation together in a way that's very difficult to unravel. In contrast, the brain has sub-regions specializing in different types of computation (vision, hearing, reward calculation, etc). I expect self-optimizing systems will do something similar.
Also, I'm not sure how much being able to design the architecture/optimizer helps us in interpretability. In my recent post, I argue the brain is in many ways more interpretable than current deep learning systems. We still don't understand the brain's learning algorithm and there are many difficulties associated with studying the brain, yet brain interpretability research is surprisingly advanced.
Most ML interpretability research tends to rely on examining patterns in model internal representations, feature visualizations, gradient-based attribution of inputs/neurons, training classifiers on model internal representations, or studying black box input/output patterns. I think most of those can be adapted to self-optimizing models, though we may have to do some work to find a useful gradient-equivalent if the system messes with the optimizer too much (note that we don't have any gradient-equivalent in neuroscience).
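For concreteness, here is a minimal sketch of one technique from that list, gradient-based input attribution (input × gradient), in PyTorch. `model`, `x`, and `target_class` are stand-ins; the only real requirement is that something gradient-like survives whatever the self-optimizing system does to its optimizer.

```python
# Minimal input-x-gradient attribution sketch (assumes a differentiable model).
import torch

def input_x_gradient(model, x, target_class):
    x = x.clone().detach().requires_grad_(True)   # track gradients w.r.t. the input
    score = model(x)[0, target_class]             # scalar score for the class of interest
    score.backward()                              # autograd fills in x.grad
    return (x * x.grad).detach()                  # per-feature attribution map
```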
Finally, I think there are many options available for improving interpretability in ML systems, none of which we are using for current state of the art systems. Just switching from L2 to L1 regularization and removing dropout would probably lead to much sparser internal representations. I think there are also ways of directly training models to be more interpretable. I describe a way to use current interpretability techniques to generate an estimator for model interpretability in the same post linked above. We could then include that signal in the training objective for self-optimizing systems. Hopefully, the system learns to be more accessible to the interpretability techniques we use.
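A minimal sketch of the first suggestion, an L1 sparsity penalty on hidden activations added to the task loss; the `encoder`/`head` split and the `l1_coeff` value are illustrative assumptions, not a recipe.

```python
# Sketch: add an L1 penalty on intermediate activations so internal
# representations become sparser and, hopefully, easier to inspect.
import torch

def loss_with_l1(model, x, y, task_loss_fn, l1_coeff=1e-4):
    hidden = model.encoder(x)                # assumes the model exposes an intermediate activation
    out = model.head(hidden)
    task_loss = task_loss_fn(out, y)
    sparsity_penalty = hidden.abs().mean()   # L1 on activations -> sparser codes
    return task_loss + l1_coeff * sparsity_penalty
```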
Another option: given a large, "primary" model, you could train multiple smaller "secondary" models (with different architectures) to imitate the primary model (knowledge distillation), then train the primary model to improve the secondary models' imitation performance. This should cause the primary model to learn internal representations that are more easily learned by other models of various architectures, and are hopefully more interpretable to humans. If you then assume the primary model is a self-optimizing system, this approach becomes even more promising because now the self-optimizing system is actively looking for architectures that are easy for weaker models to understand.
This approach is even defensible from a purely profit-seeking / competitive point of view because it's a good idea to have your most powerful model be a good teacher to smaller models. That way, you can more easily distill its capabilities into a cheaper system and save money on compute.
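A rough sketch of one round of that primary/secondary scheme; the MSE imitation loss, the module names, and the 0.1 weighting are all illustrative choices, not a worked-out method.

```python
# Sketch: secondaries imitate the primary (distillation); the primary then gets
# an auxiliary loss rewarding how well the secondaries can imitate it.
import torch
import torch.nn.functional as F

def distillation_round(primary, secondaries, x, y, task_loss_fn,
                       p_opt, s_opts, imitability_weight=0.1):
    # 1) Train each secondary to match the primary's (detached) outputs.
    with torch.no_grad():
        target = primary(x)
    for sec, opt in zip(secondaries, s_opts):
        opt.zero_grad()
        F.mse_loss(sec(x), target).backward()
        opt.step()

    # 2) Train the primary on the task, plus a bonus for being easy to imitate
    #    (only the primary receives gradients from the imitation term).
    p_opt.zero_grad()
    out = primary(x)
    imitation = sum(F.mse_loss(sec(x).detach(), out) for sec in secondaries) / len(secondaries)
    loss = task_loss_fn(out, y) + imitability_weight * imitation
    loss.backward()
    p_opt.step()
```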
One of these improvements was just published: https://arxiv.org/abs/2202.03599 . Since they were able to publish already, they likely had this idea before me. What I noticed is that in the Sharpness-Aware Minimization paper (ICLR 2021, https://arxiv.org/abs/2010.01412), the first gradient is simply discarded when updating the weights, as can be seen in Figure 2 or in the pseudo-code. But that's a valuable data point the optimizer would normally use to update the weights, so why not do the update step using a value in between the two gradients? And it works.
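If I understand the idea, a hedged sketch of that interpolation on top of vanilla SAM would look roughly like this (PyTorch; `alpha` is a hypothetical mixing coefficient, not taken from either paper):

```python
# Sketch (not the published method): a SAM-style step that interpolates between
# the clean gradient g1 and the gradient g2 taken at the perturbed weights,
# instead of discarding g1. alpha=0 recovers standard SAM, alpha=1 plain SGD.
# Assumes every parameter receives a gradient.
import torch

def sam_interpolated_step(model, loss_fn, x, y, base_opt, rho=0.05, alpha=0.5):
    # First pass: clean gradient g1 at the current weights.
    base_opt.zero_grad()
    loss_fn(model(x), y).backward()
    g1 = [p.grad.detach().clone() for p in model.parameters()]

    # Perturb the weights along g1, scaled to the rho-ball, as in SAM.
    grad_norm = torch.norm(torch.stack([g.norm() for g in g1]))
    eps = [rho * g / (grad_norm + 1e-12) for g in g1]
    with torch.no_grad():
        for p, e in zip(model.parameters(), eps):
            p.add_(e)

    # Second pass: gradient g2 at the perturbed weights.
    base_opt.zero_grad()
    loss_fn(model(x), y).backward()

    # Undo the perturbation, then overwrite the grads with (1-alpha)*g2 + alpha*g1.
    with torch.no_grad():
        for p, e, g in zip(model.parameters(), eps, g1):
            p.sub_(e)
            p.grad.mul_(1 - alpha).add_(g, alpha=alpha)

    base_opt.step()
```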