Consider this abridged history of recent ML progress:
A decade or two ago, computer vision was a field that employed dedicated researchers who designed specific, increasingly complex feature recognizers (SIFT, SURF, HOG, etc.). These were usurped by deep CNNs with fully learned features in the 2010s[1], which subsequently saw success in speech recognition, various NLP tasks, and much of AI, competing with other general ANN models, namely various RNNs and LSTMs. Then SOTA in vision and NLP evolved separately towards increasingly complex architectures, until the simpler/general transformers took over NLP and quickly spread to other domains (even RL), there also often competing with newer simpler/general architectures arising within those domains, such as MLP-mixers in vision. Waves of colonization in design-space.
So the pattern is: increasing human optimization power steadily pushes up architecture complexity, until the trend is occasionally upset/reset by a new, simpler, more general model that substitutes automated machine optimization power for human optimization power[2], enabled by improved compute scaling, à la the bitter lesson. DL isn't just a new AI/ML technique, it's a paradigm shift.
Ok, fine, then what's next?
All of these models, from the earliest deep CNNs on GPUs up to GPT-3 and EfficientZero, generally have a few major design components that haven't much changed:
- Human-designed architecture, rather than learned or even SGD-learnable at all
- Human-designed backprop SGD variant (with only a bit of evolution from vanilla SGD to Adam & friends; see the sketch just below)
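To make that "bit of evolution" concrete, here is a minimal sketch (plain NumPy, purely illustrative, not any particular framework's API) of the vanilla SGD update next to the Adam update; roughly a decade of optimizer research largely amounts to bolting running moment estimates onto the same hand-designed gradient step.

```python
# Illustrative sketch: vanilla SGD vs. Adam (Kingma & Ba, 2014).
import numpy as np

def sgd_step(theta, grad, lr=1e-2):
    # Vanilla SGD: step straight down the (stochastic) gradient.
    return theta - lr * grad

def adam_step(theta, grad, state, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    # Adam: exponential moving averages of the gradient (m) and its square (v),
    # bias-corrected, then a per-parameter rescaled step.
    state["t"] += 1
    state["m"] = beta1 * state["m"] + (1 - beta1) * grad
    state["v"] = beta2 * state["v"] + (1 - beta2) * grad ** 2
    m_hat = state["m"] / (1 - beta1 ** state["t"])
    v_hat = state["v"] / (1 - beta2 ** state["t"])
    return theta - lr * m_hat / (np.sqrt(v_hat) + eps)

# Toy usage: minimize f(theta) = ||theta||^2.
theta = np.ones(3)
state = {"t": 0, "m": np.zeros(3), "v": np.zeros(3)}
for _ in range(1000):
    theta = adam_step(theta, 2 * theta, state)  # 2*theta is the exact gradient
print(theta)  # now close to zero
```

Either way, the learning rule itself is still something a human wrote down, which is the point of the list above.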
Obviously there are research tracks in DL such as AutoML/Arch-search and Meta-learning aiming to automate the optimization of architecture and learning algorithms. They just haven't dominated yet.
So here is my hopefully-now-obvious prediction: in this new decade internal meta-optimization will take over, eventually leading to strongly recursively self-optimizing learning machines: models that have broad general flexibility to adaptively reconfigure their internal architecture and learning algorithms dynamically based on the changing data environment/distribution and available compute resources[3].
If we just assume for a moment that the strong version of this hypothesis is correct, it suggests some pessimistic predictions for AI safety research:
1. Interpretability will fail - the future DL descendant is more of a black box, not less
2. Human-designed architectural constraints fail, as human-designed architecture itself fails
3. IRL/Value Learning is far more difficult than first appearances suggest, see #2
4. Progress is hyper-exponential, not exponential. Thus trying to trend-predict DL superintelligence from transformer scaling is more difficult than trying to predict transformer scaling from pre-2000-ish ANN tech, long before rectifiers and deep-layer training tricks.
5. Global political coordination on constraints will likely fail, due to #4 and the innate difficulty of such coordination.
There is an analogy here to the history-revision attack against Bitcoin. Bitcoin's security derives from the computational sacrifice invested into the longest chain. But Moore's Law leads to an exponential decrease in the total cost of that sacrifice over time, which, when combined with an exponential increase in total market cap, can lead to the surprising situation where recomputing the entire PoW history is not only plausible but profitable.[4]
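The quantitative core of that analogy fits in a few lines. The sketch below uses made-up illustrative parameters (`spend_per_year`, `cost_halving_years`, `cap_growth` are hypothetical placeholders, not real Bitcoin figures): if miners spend a roughly constant budget per year on proof-of-work while hashing cost halves on a fixed schedule and market cap compounds, the cost of redoing the entire history stays roughly bounded (a geometric series) while the prize grows exponentially, so the ratio collapses.

```python
# Back-of-the-envelope sketch of the history-revision intuition.
# All parameters are hypothetical placeholders, not real Bitcoin data.
def revision_ratio(years, spend_per_year=1.0, cost_halving_years=2.0,
                   initial_market_cap=1.0, cap_growth=1.5):
    # Redoing year y's hashes today costs that year's mining spend, scaled down
    # by how much cheaper hashing has become since year y.
    cost_to_redo_history = sum(
        spend_per_year * 0.5 ** ((years - y) / cost_halving_years)
        for y in range(years)
    )
    market_cap = initial_market_cap * cap_growth ** years
    return cost_to_redo_history / market_cap

for t in (5, 10, 15, 20):
    print(t, round(revision_ratio(t), 4))  # ratio shrinks as t grows
```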
In 2010 few predicted that computer Go would beat a human champion just 5 years hence[5], and far fewer (or none) predicted that a future successor of that system would do much better by relearning the entire history of Go strategy from scratch, essentially throwing out the entire human tech tree.[6]
So it's quite possible that future meta-optimization throws out the entire human architecture/algorithm tech tree for something else substantially more effective[7]. The circuit-algorithmic landscape lacks almost all of the complexity of the real world, and in that sense is arguably much more similar to Go or chess. Humans are general enough learning machines to do reasonably well at anything, but we can only apply a fraction of our brain capacity to such an evolutionarily novel task, and tend to lose out to more specialized, scaled-up DL algorithms long before said algorithms outcompete humans at all tasks, or even everyday tasks.
Yudkowsky anticipated that recursive self-improvement would be the core thing that enables AGI/superintelligence. Reading over that 2008 essay now in 2021, I think he mostly got the gist of it right, even if he didn't foresee/bet that connectionism would be the winning paradigm. EY2008 seems to envision RSI as an explicit cognitive process where the AI reads research papers, discusses ideas with human researchers, and rewrites its own source code.
Instead, in the recursive-self-optimization-through-DL future we seem to be careening towards, the 'source code' is the ANN circuit architecture (as powerful as code, or more so), and reading human papers, discussing research: all that is unnecessary baggage, as unnecessary as it was for AlphaGo Zero to discuss Go with human experts over tea or study their games over lunch. History-revision attack, incoming.
So what can we do? In the worst case we have near-zero control over AGI architecture or learning algorithms. That only leaves the initial objective/utility functions, compute, and the training environment/data. Compute restriction is obvious and has an equally obvious direct tradeoff with capability - not much edge there.
Even a super powerful recursive self-optimizing machine initially starts with some seed utility/objective function at the very core. Unfortunately it increasingly looks like efficiency strongly demands some form of inherently unsafe self-motivation utility function, such as empowerment or creativity, and self-motivated agentic utility functions are the natural strong attractor[8].
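For concreteness, one standard formalization of empowerment (Klyubin, Polani & Nehaniv) is the channel capacity from the agent's actions A to its resulting future states S', starting from the current state s - a purely self-motivated drive toward states where the agent's actions have the most influence over what happens next, with no externally supplied goal anywhere in the objective:

$$\mathcal{E}(s) \;=\; \max_{p(a)} I(A;\,S' \mid s) \;=\; \max_{p(a)} \sum_{a,\,s'} p(a)\,p(s' \mid s,a)\,\log\frac{p(s' \mid s,a)}{\sum_{a'} p(a')\,p(s' \mid s,a')}$$

An agent maximizing something like this has built-in reasons to preserve and expand its own optionality, which is exactly what makes such objectives efficient and unsafe at the same time.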
Control over training environment/data is a major remaining lever that doesn't seem to be explored much, and probably has better capability/safety tradeoffs than compute. What you get out of the recursive self-optimization or universal learning machinery is always a product of the data you put in, the embedded environment; that is ultimately what separates Go bots, image detectors, story-writing AI, feral children, and unaligned superintelligences.
And then finally we can try to exert control over the base optimizer, which in this case is the whole technological research industrial economy. Starting fresh with a de novo system may be easier than orchestrating a coordination miracle from the current Powers.
AlexNet is typically considered the turning point, but the transition started earlier; sparse coding and RBMs are two examples of successful feature-learning techniques pre-DL. ↩︎
If you go back far enough, the word 'computer' itself originally denoted a human occupation! This trend is at least a century old. ↩︎
DL ANNs do a form of approximate Bayesian updating over the implied circuit architecture space with every backprop update, which is already a limited form of self-optimization.
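One concrete, hedged illustration of what "approximate Bayesian updating" can mean here: stochastic gradient Langevin dynamics (Welling & Teh, 2011) shows that adding appropriately scaled Gaussian noise to an ordinary minibatch SGD step (N data points, minibatch of size n) turns the iterates into approximate samples from the posterior over parameters,

$$\theta_{t+1} \;=\; \theta_t + \frac{\epsilon_t}{2}\Big(\nabla_\theta \log p(\theta_t) + \frac{N}{n}\sum_{i=1}^{n}\nabla_\theta \log p(x_i \mid \theta_t)\Big) + \eta_t, \qquad \eta_t \sim \mathcal{N}(0,\,\epsilon_t I),$$

so the gap between a backprop update and an approximate Bayesian update is smaller than it might appear. ↩︎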
Blockchain systems have a simple defense against history-revision attacks: checkpointing. Unfortunately that doesn't have a realistic equivalent in our case - we don't control the timestream. ↩︎
I would have bet against this; AlphaGo Zero surprised me far more than AlphaGo. ↩︎
Quite possible != inevitable. There is still a learning efficiency gap vs the brain, and I have uncertainty over how quickly we will progress past that gap, and what happens after. ↩︎
Tool-AI, like GPT-3, is a form of capability constraint, but economic competition is always pressuring tool-AIs to become agent-AIs. ↩︎
I'm not exactly sure what you mean by "single channel", but I do agree that regional specialization in the brain probably makes it more interpretable. Regional specialization arises from locality optimization to minimize wiring length. But with ANNs running on von Neumann hardware we simulate circuits and don't currently optimize for locality. However, that could change with either neuromorphic hardware or harder sparsity optimization on normal hardware. So yes, those are reasons to be more optimistic.
I'm not actually convinced that interpretability is doomed - in the OP I was exploring something of a worst-case possibility. The scenarios where interpretability fails are those where the internal meta-optimization fooms in complexity well beyond that of the brain and our affordable comprehension toolkit. It's not a matter of difficulty in analyzing/decoding a static architecture; the difficulty is in analyzing a rapidly evolving architecture that may be distributed/decentralized, and the costs thereof. If we end up with something brain-like, then interpretability is promising. But interpretability becomes exponentially harder with rapid self-optimization and architectural diversity, especially when/if the systems are geographically distributed/decentralized across organizational boundaries (which certainly isn't the case now, but could be what the AI economy evolves into).
In your interesting post you mention a DM paper investigating AlphaZero's learned chess representations. The issue is how the cost of that analysis scales as we move to recursively self-optimizing systems and scale up compute. (I'm also curious about the same analysis for Go.)
Before reading your article, my initial take was that interpretability techniques for ANNs and BNNs are actually not all that different - but ANNs are naturally much easier to monitor and probe.
Yes, agreed, as discussed above.
So in summary - as you note in your post: "We currently put very little effort into making state of the art systems interpretable."
The pessimistic scenario is thus simply that this status quo doesn't change, because willingness to spend on interpretability doesn't change much, and moving towards recursive self-optimization increases the cost.