Consider this abridged history of recent ML progress:
A decade or two ago, computer vision was a field that employed dedicated researchers who designed specific, increasingly complex feature recognizers (SIFT, SURF, HoG, etc.). These were usurped by deep CNNs with fully learned features in the 2010s[1], which subsequently saw success in speech recognition, various NLP tasks, and much of AI, competing with other general ANN models, namely various RNNs and LSTMs. Then SOTA in vision (CNNs) and in NLP evolved separately towards increasingly complex architectures, until the simpler/more general transformers took over NLP and quickly spread to other domains (even RL), there also often competing with newer simple/general architectures arising within those domains, such as MLP-mixers in vision. Waves of colonization in design-space.
So the pattern is: increasing human optimization power steadily pushing up architecture complexity is occasionally upset/reset by a new, simpler, more general model, where the new simple/general model substitutes automated machine optimization power for human optimization power[2], enabled by improved compute scaling, à la the bitter lesson. DL isn't just a new AI/ML technique; it's a paradigm shift.
Ok, fine, then what's next?
All of these models, from the earliest deep CNNs on GPUs up to GPT-3 and EfficientZero, generally have a few major design components that haven't much changed:
- Human designed architecture, rather than learned or SGD-learnable-at-all
- Human designed backprop SGD variant (with only a bit of evolution from vanilla SGD to Adam & friends)
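To make "only a bit of evolution" concrete, here is a rough side-by-side sketch of the two update rules in plain NumPy (standard textbook forms with the usual default hyperparameters; nothing here is specific to any particular framework):

```python
import numpy as np

def sgd_step(w, g, lr=1e-2):
    # Vanilla SGD: w <- w - lr * grad
    return w - lr * g

def adam_step(w, g, state, lr=1e-3, b1=0.9, b2=0.999, eps=1e-8):
    # Adam: the same descent direction, just rescaled by running moment estimates.
    # `state` starts as {"t": 0, "m": np.zeros_like(w), "v": np.zeros_like(w)}.
    state["t"] += 1
    state["m"] = b1 * state["m"] + (1 - b1) * g        # EMA of gradients
    state["v"] = b2 * state["v"] + (1 - b2) * g ** 2   # EMA of squared gradients
    m_hat = state["m"] / (1 - b1 ** state["t"])        # bias correction
    v_hat = state["v"] / (1 - b2 ** state["t"])
    return w - lr * m_hat / (np.sqrt(v_hat) + eps)
```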
Obviously there are research tracks in DL such as AutoML/Arch-search and Meta-learning aiming to automate the optimization of architecture and learning algorithms. They just haven't dominated yet.
So here is my hopefully-now-obvious prediction: in this new decade internal meta-optimization will take over, eventually leading to strongly recursively self optimizing learning machines: models that have broad general flexibility to adaptively reconfigure their internal architecture and learning algorithms dynamically based on the changing data environment/distribution and available compute resources[3].
If we just assume for a moment that the strong version of this hypothesis is correct, it suggests some pessimistic predictions for AI safety research:
1. Interpretability will fail - future DL descendant is more of a black box, not less
2. Human designed architectural constraint fails, as human designed architecture fails
3. IRL/Value Learning is far more difficult than first appearances suggest, see #2
4. Progress is hyper-exponential, not exponential. Thus trying to trend-predict DL superintelligence from transformer scaling is more difficult than trying to predict transformer scaling from pre-2000-ish ANN tech, long before rectifiers and deep layer training tricks.
5. Global political coordination on constraints will likely fail, due to #4 and innate difficulty.
There is an analogy here to the history-revision attack against Bitcoin. Bitcoin's security derives from the computational sacrifice invested into the longest chain. But Moore's Law leads to an exponential decrease in the total cost of that sacrifice over time, which when combined with an exponential increase in total market cap, can lead to the surprising situation where recomputing the entire PoW history is not only plausible but profitable.[4]
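As a toy model of why that works (my own illustrative numbers, assuming constant yearly mining spend and hardware that makes redoing old work steadily cheaper): the cost of recomputing all of history converges to a bound, while the prize can keep growing past it.

```python
# Toy model of the history-revision economics (illustrative numbers only).
# Assume miners spend `spend_per_year` dollars of contemporaneous hardware per
# year, and that redoing old work gets 2x cheaper every `halving_years`.
def cost_to_rewrite_history(years_of_history, spend_per_year=1.0, halving_years=2.0):
    """Today's cost of recomputing every past year's proof-of-work."""
    total = 0.0
    for age in range(1, years_of_history + 1):
        # Work done `age` years ago is cheaper to redo today by 2**(age/halving_years).
        total += spend_per_year / 2 ** (age / halving_years)
    return total

# Geometric series: the total stays bounded (~2.41x one year's spend here)
# no matter how long the history, while market cap can grow without bound.
for y in (5, 10, 20, 40):
    print(y, round(cost_to_rewrite_history(y), 3))
```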
In 2010 few predicted that computer Go would beat a human champion just 5 years hence[5], and far fewer (or none) predicted that a future successor of that system would do much better by relearning the entire history of Go strategy from scratch, essentially throwing out the entire human tech tree [6].
So it's quite possible that future meta-optimization throws out the entire human architecture/algorithm tech tree for something else substantially more effective[7]. The circuit algorithmic landscape lacks most of the complexity of the real world, and in that sense is arguably much more similar to Go or chess. Humans are general enough learning machines to do reasonably well at anything, but we can only apply a fraction of our brain capacity to such an evolutionarily novel task, and tend to lose out to more specialized, scaled-up DL algorithms long before said algorithms outcompete humans at all tasks, or even everyday tasks.
Yudkowsky anticipated recursive self-improvement would be the core thing that enables AGI/superintelligence. Reading over that 2008 essay now in 2021, I think he mostly got the gist of it right, even if he didn't foresee/bet that connectionism would be the winning paradigm. EY2008 seems to envision RSI as an explicit cognitive process where the AI reads research papers, discusses ideas with human researchers, and rewrites its own source code.
Instead, in the recursive self-optimization through DL future we seem to be careening towards, the 'source code' is the ANN circuit architecture (as or more powerful than code), and reading human papers, discussing research: all that is unnecessary baggage, as unnecessary as it was for AlphaGo Zero to discuss Go with human Go experts over tea or study their games over lunch. History-revision attack, incoming.
So what can we do? In the worst case we have near-zero control over AGI architecture or learning algorithms. So that only leaves initial objective/utility functions, compute and training environment/data. Compute restriction is obvious and has an equally obvious direct tradeoff with capability - not much edge there.
Even a super powerful recursive self-optimizing machine initially starts with some seed utility/objective function at the very core. Unfortunately it increasingly looks like efficiency strongly demands some form of inherently unsafe self-motivation utility function, such as empowerment or creativity, and self-motivated agentic utility functions are the natural strong attractor[8].
Control over training environment/data is a major remaining lever that doesn't seem to be explored much, and probably has better capability/safety tradeoffs than compute. What you get out of the recursive self optimization or universal learning machinery is always a product of the data you put in, the embedded environment; that is ultimately what separates Go bots, image detectors, story writing AI, feral children, and unaligned superintelligences.
And then finally we can try to exert control on the base optimizer, which in this case is the whole technological research industrial economy. Starting fresh with a de novo system may be easier than orchestrating a coordination miracle from the current Powers.
AlexNet is typically considered the turning point, but the transition started earlier; sparse coding and RBMs are two examples of successful feature-learning techniques pre-DL. ↩︎
If you go back far enough, the word 'computer' itself originally denoted a human occupation! This trend is at least a century old. ↩︎
DL ANNs do a form of approximate Bayesian updating over the implied circuit architecture space with every backprop update, which already is a limited form of self-optimization. ↩︎
Blockchain systems have a simple defense against history-revision attack: checkpointing, but unfortunately that doesn't have a realistic equivalent in our case - we don't control the timestream. ↩︎
I would have bet against this; AlphaGo Zero surprised me far more than AlphaGo. ↩︎
Quite possible != inevitable. There is still a learning efficiency gap vs the brain, and I have uncertainty over how quickly we will progress past that gap, and what happens after. ↩︎
Tool-AI, like GPT-3, is a form of capability constraint, but economic competition is always pressuring tool-AIs to become agent-AIs. ↩︎
I don't think the scenario you describe is as bad for interpretability as you assume. In fact, self-optimizing systems may even be more interpretable than current systems. E.g., current systems use a single channel for all their computation. This causes them to mix conceptually different types of computation together in a way that's very difficult to unravel. In contrast, the brain has sub-regions specializing in different types of computation (vision, hearing, reward calculation, etc). I expect self-optimizing systems will do something similar.
Also, I'm not sure how much being able to design the architecture/optimizer helps us in interpretability. In my recent post, I argue the brain is in many ways more interpretable than current deep learning systems. We still don't understand the brain's learning algorithm and there are many difficulties associated with studying the brain, yet brain interpretability research is surprisingly advanced.
Most ML interpretability research tends to rely on examining patterns in model internal representations, feature visualizations, gradient-based attribution of inputs/neurons, training classifiers on model internal representations, or studying black box input/output patterns. I think most of those can be adapted to self-optimizing models, though we may have to do some work to find a useful gradient-equivalent if the system messes with the optimizer too much (note that we don't have any gradient-equivalent in neuroscience).
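For concreteness, here is a minimal sketch of one technique from that list, gradient-based input attribution (input × gradient), in PyTorch. `model`, `x`, and `target_class` are stand-ins; the only real requirement is that something gradient-like survives whatever the self-optimizing system does to its optimizer.

```python
# Minimal input-x-gradient attribution sketch (assumes a differentiable model).
import torch

def input_x_gradient(model, x, target_class):
    x = x.clone().detach().requires_grad_(True)   # track gradients w.r.t. the input
    score = model(x)[0, target_class]             # scalar score for the class of interest
    score.backward()                              # autograd fills in x.grad
    return (x * x.grad).detach()                  # per-feature attribution map
```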
Finally, I think there are many options available for improving interpretability in ML systems, none of which we are using for current state of the art systems. Just switching from L2 to L1 regularization and removing dropout would probably lead to much sparser internal representations. I think there are also ways of directly training models to be more interpretable. I describe a way to use current interpretability techniques to generate an estimator for model interpretability in the same post linked above. We could then include that signal in the training objective for self-optimizing systems. Hopefully, the system learns to be more accessible to the interpretability techniques we use.
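A minimal sketch of the first suggestion, an L1 sparsity penalty on hidden activations added to the task loss; the `encoder`/`head` split and the `l1_coeff` value are illustrative assumptions, not a recipe.

```python
# Sketch: add an L1 penalty on intermediate activations so internal
# representations become sparser and, hopefully, easier to inspect.
import torch

def loss_with_l1(model, x, y, task_loss_fn, l1_coeff=1e-4):
    hidden = model.encoder(x)                # assumes the model exposes an intermediate activation
    out = model.head(hidden)
    task_loss = task_loss_fn(out, y)
    sparsity_penalty = hidden.abs().mean()   # L1 on activations -> sparser codes
    return task_loss + l1_coeff * sparsity_penalty
```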
Another option: given a large, "primary" model, you could train multiple smaller "secondary" models (with different architectures) to imitate the primary model (knowledge distillation), then train the primary model to improve the secondary models' imitation performance. This should cause the primary model to learn internal representations that are more easily learned by other models of various architectures, and are hopefully more interpretable to humans. If you then assume the primary model is a self-optimizing system, this approach becomes even more promising because now the self-optimizing system is actively looking for architectures that are easy for weaker models to understand.
This approach is even defensible from a purely profit-seeking / competitive point of view because it's a good idea to have your most powerful model be a good teacher to smaller models. That way, you can more easily distill its capabilities into a cheaper system and save money on compute.
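A rough sketch of one round of that primary/secondary scheme; the MSE imitation loss, the module names, and the 0.1 weighting are all illustrative choices, not a worked-out method.

```python
# Sketch: secondaries imitate the primary (distillation); the primary then gets
# an auxiliary loss rewarding how well the secondaries can imitate it.
import torch
import torch.nn.functional as F

def distillation_round(primary, secondaries, x, y, task_loss_fn,
                       p_opt, s_opts, imitability_weight=0.1):
    # 1) Train each secondary to match the primary's (detached) outputs.
    with torch.no_grad():
        target = primary(x)
    for sec, opt in zip(secondaries, s_opts):
        opt.zero_grad()
        F.mse_loss(sec(x), target).backward()
        opt.step()

    # 2) Train the primary on the task, plus a bonus for being easy to imitate
    #    (only the primary receives gradients from the imitation term).
    p_opt.zero_grad()
    out = primary(x)
    imitation = sum(F.mse_loss(sec(x).detach(), out) for sec in secondaries) / len(secondaries)
    loss = task_loss_fn(out, y) + imitability_weight * imitation
    loss.backward()
    p_opt.step()
```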
One of these improvements was just published: https://arxiv.org/abs/2202.03599 . Since they were able to publish already, they likely had this idea before me. What I noticed is that in the Sharpness-Aware Minimization paper (ICLR 2021, https://arxiv.org/abs/2010.01412), the first gradient is simply discarded when updating the weights, as can be seen in Figure 2 or in the pseudo-code. But that's a valuable data point the optimizer would normally use to update the weights, so why not do the update step using a value in between the two gradients? And it works.
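If I understand the idea, a hedged sketch of that interpolation on top of vanilla SAM would look roughly like this (PyTorch; `alpha` is a hypothetical mixing coefficient, not taken from either paper):

```python
# Sketch (not the published method): a SAM-style step that interpolates between
# the clean gradient g1 and the gradient g2 taken at the perturbed weights,
# instead of discarding g1. alpha=0 recovers standard SAM, alpha=1 plain SGD.
# Assumes every parameter receives a gradient.
import torch

def sam_interpolated_step(model, loss_fn, x, y, base_opt, rho=0.05, alpha=0.5):
    # First pass: clean gradient g1 at the current weights.
    base_opt.zero_grad()
    loss_fn(model(x), y).backward()
    g1 = [p.grad.detach().clone() for p in model.parameters()]

    # Perturb the weights along g1, scaled to the rho-ball, as in SAM.
    grad_norm = torch.norm(torch.stack([g.norm() for g in g1]))
    eps = [rho * g / (grad_norm + 1e-12) for g in g1]
    with torch.no_grad():
        for p, e in zip(model.parameters(), eps):
            p.add_(e)

    # Second pass: gradient g2 at the perturbed weights.
    base_opt.zero_grad()
    loss_fn(model(x), y).backward()

    # Undo the perturbation, then overwrite the grads with (1-alpha)*g2 + alpha*g1.
    with torch.no_grad():
        for p, e, g in zip(model.parameters(), eps, g1):
            p.sub_(e)
            p.grad.mul_(1 - alpha).add_(g, alpha=alpha)

    base_opt.step()
```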