Gary Marcus is a pretty controversial figure within machine learning. He's become widely known for being the most outspoken critic of the current paradigm of deep learning, arguing that it will be insufficient to yield AGI. Two years ago he came out with Deep Learning: A Critical Appraisal, which to my knowledge remains the most in-depth critique of deep learning, and is cited heavily in AI Impacts' post about whether current techniques will lead to human-level AI.

Now, by releasing this new paper, he answers a frequent objection from deep learning specialists: that all he does is critique, without contributing anything new of his own.

Gary takes a stand against what he sees as an undue focus on leveraging ever larger amounts of computation to aid learning, a position most notably argued for by Richard Sutton in The Bitter Lesson. He sees much more promise in instilling human knowledge and using built-in causal and symbolic reasoning, defying what is likely a consensus in the field.

I personally find his approach interesting because of how it might lead to AI systems that are more transparent and align-able, though I'm fairly confident that his analysis will have no significant impact on the field.

I'll provide a few quotes from the article and explain the context. Here, he outlines his agenda.

The burden of this paper has been to argue for a shift in research priorities, towards four cognitive prerequisites for building robust artificial intelligence: hybrid architectures that combine large-scale learning with the representational and computational powers of symbol-manipulation, large-scale knowledge bases—likely leveraging innate frameworks—that incorporate symbolic knowledge along with other forms of knowledge, reasoning mechanisms capable of leveraging those knowledge bases in tractable ways, and rich cognitive models that work together with those mechanisms and knowledge bases.

Marcus is aware of the popular critique that symbolic manipulation was tried and didn't yield AGI, but he thinks this critique is mistaken. He considers it likely that the brain uses both the principles underlying deep learning and symbolic manipulation.

A great deal of psychological evidence, reviewed above in Section 2.1.1., supports the notion that symbol-manipulation is instantiated in the brain, such as the ability of infants to extend novel abstract patterns to new items, the ability of adults to generalize abstract linguistic patterns to nonnative sounds which they have no direct data for, and the ability of bees to generalize the solar azimuth function to lighting conditions they have not directly observed. Human beings can also learn to apply formal logic on externally represented symbols, and to program and debug symbolically represented computer programs, all of which shows that at least in some configurations neural wetware can indeed (to some degree, bounded partly by memory limitations) manipulate symbols. And we can understand language in essentially infinite variety, inferring an endless range of meanings from an endless range of sentences. The kind of free generalization that is the hallmark of operations over variables is widespread, throughout cognition.
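
To make the "free generalization over variables" point concrete, here is a toy illustration of my own (not from the paper): a rule stated over variables applies to inputs it has never encountered, with no training data at all.

```python
# Toy illustration (mine, not the paper's) of free generalization over variables:
# an abstract rule like ABA is defined over placeholders, so it extends to novel
# items for free, which is the behavior Marcus attributes to infants.

def aba_pattern(a, b):
    """Return the abstract ABA pattern instantiated with any two items."""
    return [a, b, a]

print(aba_pattern("wo", "fe"))  # syllables never "seen" before
print(aba_pattern(3, 7))        # even a completely different domain
```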

Another reason people think that symbolic manipulation won't work is that it doesn't scale: it takes hundreds of hours to encode knowledge into our systems, and within a few years a system based on learning will outperform it anyway. So why bother? Marcus responds:

Although there are real problems to be solved here, and a great deal of effort must go into constraining symbolic search well enough to work in real time for complex problems, Google Knowledge Graph seems to be at least a partial counterexample to this objection, as do large scale recent successes in software and hardware verification. Papers like Minervini et al (Minervini et al., 2019) and Yang et al (Yang, Yang, & Cohen, 2017) have made real progress towards building end-to-end differentiable hybrid neurosymbolic systems that work at scale. Meanwhile, no formal proof of the impossibility of adequate scaling, given appropriate heuristics, exists.
[...]
OpenAI's Rubik's solver (OpenAI et al., 2019) is (although it was not pitched as such) a hybrid of a symbolic algorithm for solving the cognitive aspects of a Rubik's cube, and deep reinforcement learning for the manual manipulation aspects. At a somewhat smaller scale Mao et al. (Mao, Gan, Kohli, Tenenbaum, & Wu, 2019) have recently proposed a hybrid neural net-symbolic system for visual question answering called NS-CL (short for the Neuro-Symbolic Concept Learner) that surpasses the deep learning alternatives they examined. Related work by Janner et al. (Janner et al., 2018) pairs explicit records for individual objects with deep learning in order to make predictions and physics-based plans that far surpass a comparable pure black box deep learning approach. Evans and Grefenstette (Evans & Grefenstette, 2017) showed how a hybrid model could better capture a variety of learning challenges, such as the game fizz-buzz, which defied a multilayer perceptron. A team of people including Smolensky and Schmidhuber have produced better results on a mathematics problem set by combining BERT with tensor products (Smolensky et al., 2016), a formal system for representing symbolic variables and their bindings (Schlag et al., 2019), creating a new system called TP-Transformer.

This response puzzles me since I imagine most would consider neurosymbolic systems like the ones he cites (and SATNet) to fit quite nicely into the current deep learning paradigm. Nevertheless, he thinks the field is not pushing sufficiently far in that direction.
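
To make the "hybrid" picture concrete, here is a minimal sketch of my own (not code from the paper, and far simpler than systems like NS-CL): a learned component, stubbed out below, maps raw input to discrete symbols, and a symbolic component then answers questions by explicit lookup over those symbols.

```python
# A toy neurosymbolic hybrid: perception produces symbols, reasoning is symbolic.
# The perceive() function is a stand-in for a trained neural classifier.

from typing import List

def perceive(pixels: List[float]) -> str:
    """Stub for a neural perception module; a real system would use a trained
    classifier (e.g. a CNN) to map raw input to a discrete symbol."""
    return "red_cube" if sum(pixels) > 1.0 else "blue_sphere"

# Symbolic knowledge base: explicit facts the reasoner can query.
FACTS = {
    "red_cube":    {"color": "red",  "shape": "cube"},
    "blue_sphere": {"color": "blue", "shape": "sphere"},
}

def answer(attribute: str, pixels: List[float]) -> str:
    """Perceive a symbol, then answer by lookup; this generalizes for free to
    any symbol the knowledge base covers."""
    return FACTS[perceive(pixels)][attribute]

print(answer("color", [0.9, 0.8]))  # -> "red"
```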

One reason many have given for believing that deep learning will yield AGI is that the brain doesn't have many built-in priors, or much innate knowledge, and thus building innate knowledge into our systems is not necessary. Marcus, by contrast, thinks that the human brain does rely on built-in knowledge.

My December 2019 debate with Yoshua Bengio was similarly revealing. He said that it was acceptable to prespecify convolution, because it would only take “three lines of code”, but worried about expanding the set of priors (bits of innate/prior knowledge) far beyond convolution, particularly if those priors would require that more than a few bits of information be specified.
As I expressed to him there, I wouldn’t worry so much about the bits. More than 90% of our genome is expressed in the development of the brain (Miller et al., 2014; Bakken et al., 2016), and a significant number of those genes are expressed selectively in particular areas, giving rise to detailed initial structure. And there are many mechanisms by which complex structures are specified using a modest number of genes; in essence the genome is a compressed way of building structure, semi-autonomously (Marcus, 2004); there’s no reason to think that biological brains are limited to just a few “small” priors.
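
For context on the "three lines of code" remark, here is how I would illustrate it (my sketch, not from the debate; it assumes PyTorch): convolution is an innate prior of locality plus weight sharing that is cheap to write down yet drastically constrains the model, which shows up directly in the parameter counts.

```python
# Convolution as a compactly specified prior: compare a small convolutional
# layer with a dense layer mapping between the same input and output sizes.
import torch.nn as nn

conv = nn.Conv2d(in_channels=1, out_channels=8, kernel_size=3)  # local, shared 3x3 filters
dense = nn.Linear(28 * 28, 8 * 26 * 26)                         # same shapes, no structural prior

def n_params(module: nn.Module) -> int:
    return sum(p.numel() for p in module.parameters())

print(n_params(conv), n_params(dense))  # 80 vs. 4,245,280
```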

Many have said that the critique Marcus gave in 2018 is outdated, since recent natural language models are much more impressive than he predicted. He spends many paragraphs of the paper explaining why he thinks the current crop of Transformer models is not as good as people say.

The range of failures in language understanding of current Transformers such as GPT-2 (see Marcus, 2019, 2020) reflects something similar: the schism between predicting general tendencies (like the likelihood of the phrase mom's house appearing in the neighborhood of the words and phrases such as drop, off, pick, up and clothing in the corpora GPT-2 is trained on) and the capacity to represent, update, and manipulate cognitive models. When BERT and GPT-2 failed to track where the dry cleaning would be, it was a direct reflection of the fact that GPT and BERT have no representation of the properties of individual entities as they evolve over time. Without cognitive models, systems like these are lost. Sometimes they get lucky from statistics, but lacking cognitive models they have no reliable foundation with which to reason over.
The lack of cognitive models is also bleak news for anyone hoping to use a Transformer as input to a downstream reasoning system. The whole essence of language comprehension is to derive cognitive models from discourse; we can then reason over the models we derive. Transformers, at least in their current form, just don't do that. Predicting word classes is impressive, but in and of itself prediction does not equal understanding.
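
To illustrate what a "cognitive model" means here, consider a toy sketch of my own (not from the paper): explicit records of entities whose state is updated as the discourse unfolds, so a question like "where is the dry cleaning?" is answered by lookup over state rather than by word co-occurrence statistics.

```python
# A toy cognitive model: track entity locations as (verb, entity, place) events arrive.

def track(events):
    """Maintain an entity -> location map that is updated event by event."""
    world = {}
    for verb, entity, place in events:
        if verb == "drop_off":
            world[entity] = place            # the entity was left at the named place
        elif verb == "pick_up":
            world[entity] = "with_speaker"   # the entity is back with the speaker
    return world

story = [("drop_off", "dry_cleaning", "the cleaners"),
         ("pick_up", "dry_cleaning", "the cleaners")]
print(track(story)["dry_cleaning"])  # -> "with_speaker"
```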

At one point he echoes concerns about future systems based on deep learning that sound faintly similar to those expressed in the Rocket Alignment Problem. [EDIT: eh, probably not that similar]

One cannot engineer a robust system out of parts with so little guarantee of reliability. One problem with trying to build a system out of parts with such little reliability is that downstream inference will inevitably suffer. The whole point of having knowledge is to use it in action and interpretation and decision-making. If you don't know what can cause a fire, or what happens when a bottle breaks, it’s hard to make inferences about what is happening around you.
Comments
At one point he echoes concerns about future systems based on deep learning that sound faintly similar to those expressed in the Rocket Alignment Problem.

The quoted paragraph does not sound like the Rocket Alignment Problem to me. It seems to me that the quoted paragraph is arguing that you need to have systems that are robust, whereas the Rocket Alignment Problem argues that you need to have a deep understanding of the systems you build. These are very different: I suspect the vast majority of AI safety researchers would agree that you need robustness, but you can get robustness without understanding; e.g., I feel pretty confident that AlphaZero robustly beats humans at Go, even though I don't understand what sort of reasoning AlphaZero is doing.

(A counterargument is that we understand how the AlphaZero training algorithm incentivizes robust gameplay, which is what rocket alignment is talking about, but then it's not clear to me why the rocket alignment analogy implies that we couldn't ever build aligned AI systems out of deep learning.)

To clarify, I had first read the "the whole point of having knowledge" sentence in light of the fact that he wants to hardcode knowledge into our systems, and from that point of view it made more sense. Re-reading it, I admit it's not the best comparison. The rest of the paper still echoes the general vibe of not doing random searches for answers, and instead leveraging our human understanding to yield some sort of robustness.

A team of people including Smolensky and Schmidhuber have produced better results on a mathematics problem set by combining BERT with tensor products (Smolensky et al., 2016), a formal system for representing symbolic variables and their bindings (Schlag et al., 2019), creating a new system called TP-Transformer.

Notable that the latter paper was rejected from ICLR 2020, partly for unfair comparison. It seems unclear at present whether TP-Transformer is better than the baseline transformer.