Summary

The many successes of Deep Learning over the past ten years have catapulted Artificial Intelligence into the public limelight in an unprecedented way. Nevertheless, both critics and advocates have expressed the opinion that deep learning alone is not sufficient for real human-like intelligence, which is filled with capabilities that require symbolic manipulation. The disagreement has now shifted to the way symbol manipulation should be construed. Critics of DL believe that classical symbolic architectures (with combinatorial syntax and semantics) should be imposed on top of deep learning systems, while advocates believe in a new kind of emergent symbolic system which can be learned through gradient descent, like any other function. This is necessary, they argue, because classical systems have already proven to be intractable and therefore cannot be used to construct systems capable of human-like intelligence. In this essay I consider these two options and show that the DL advocates are mistaken, by examining the way in which deep learning systems solve the problem of generating software code from a natural language prompt. Programming languages like Python are classical symbol systems, yet deep learning models have become remarkably good at producing syntactically correct code. What this shows is that it is possible to mimic some aspects of a classical symbolic system with a neural network, and without an explicit classical sub-system. There is no reason to think this conclusion does not apply to other tasks where deep learning models perform well, like natural language tasks. In other words, if you looked at a neural network generating Python code, you might be tempted to conclude that Python is not a classical symbolic system. You would of course be wrong. Similarly, if you looked at a neural network generating natural language, you might be tempted to conclude that natural language is not a classical symbolic system. You might also be wrong: DL networks offer no compelling reason to think that something like a non-classical symbolic reasoning system is required for human cognition, including language. DL systems can learn statistical mappings wherever a classical symbolic system produces lots of examples, as language and Python do. Where the symbol system is used for planning, creativity, and the like, DL struggles to learn. This leaves us to conclude that (a) deep learning alone won't lead to AGI, (b) nor will deep learning supplemented with non-classical symbolic systems, (c) nor, apparently, will deep learning supplemented with classical symbolic systems. So, no AGI by 2043. Certainly not AI that includes "entirely AI-run companies, with AI managers and AI workers and everything being done by AIs." I suggest that instead of bringing us AGI, modern deep learning has instead revitalized Licklider's vision of "Man-Computer Symbiosis", which is the most exciting prospect for the immediate future of AI.

The AI Promise


Artificial Intelligence has certainly had its ups and downs, from the heady days of "good old-fashioned AI" through the winter downturns, and now back on track to supposedly reach human-level cognitive abilities in the not-too-distant future. The Future Fund has estimated that there is a 20% chance that Artificial General Intelligence (AGI) will be developed by January 1, 2043, and a 60% chance by January 1, 2100 (in their "Future Fund worldview prize"). I am going to argue that this optimism is unwarranted and that both of these estimates are wildly inaccurate. In fact, the probabilities should be much closer to zero.

Why am I so certain? To understand my conviction, we have to consider the source of the overwhelming confidence that AI researchers currently possess. The primary reason can be attributed fairly straightforwardly to the sudden breakthroughs achieved by the neural network or connectionist paradigm, which has an intuitive advantage over other paradigms because of its apparent similarity to real neural networks in the brain. This alternative to "good old-fashioned" logic-based symbol-manipulating AI systems has existed since the 1940s, but has had a patchy history of success, with a cycle of hitting brick walls and then overcoming them. That is, until around 2012, when a combination of widely available compute power, massive public datasets, and advances in architectures enabled Deep Learning (DL) models to suddenly leapfrog existing state-of-the-art systems. The age of symbolic AI was quickly declared dead by some, and a view emerged that the key breakthroughs were essentially complete and we now just have to "scale up". For example, Tesla CEO Elon Musk and Nvidia CEO Jen-Hsun Huang declared in 2015 that the problem of building fully autonomous vehicles was essentially solved.

Kinds of Symbol Systems

The key difference between a classical symbol-manipulating system and a distributed neural "connectionist" system was clearly laid out by Fodor and Pylyshyn in 1988. They argued that the issue was not whether the systems were representational, or whether distributed systems can represent discrete concepts, but how the rules of the system are defined, that is, in what way representations take part in the causal processes that transform them into other representations. Consider a simple example from propositional logic. The formula A & B -> A is a tautology because there is no assignment of truth values to A and B which will make it false. The implication can only be false if the antecedent on the left-hand side is true but the consequent on the right-hand side is false. But this is not possible, because if A on the right-hand side is false, it must also be false on the left-hand side, making the conjunction false as well. Importantly, this reasoning is only possible because the A on the right-hand side is the same symbol as the A on the left-hand side. Classical systems have combinatorial syntax and semantics. I tried to see if GPT-3 has any grasp of such combinatorial semantics with an example: "Is it true that if the moon is made of green cheese, and cows chew cud, then the moon is made of green cheese?", and GPT-3 answers "No, it is not true that if the moon is made of green cheese, and cows chew cud, then the moon is made of green cheese.", which is not correct. GPT-3 appears to focus on whether or not the consequent itself is true or false.
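
The point can be checked mechanically. The short sketch below (an illustration only, not part of any model under discussion) enumerates every assignment of truth values and confirms that A & B -> A holds in all of them:

from itertools import product

# An implication P -> Q is false only when P is true and Q is false,
# i.e. it is equivalent to (not P) or Q. Here P is (A and B) and Q is A.
tautology = all((not (A and B)) or A for A, B in product([True, False], repeat=2))
print(tautology)  # True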

Perhaps the most vocal contemporary advocate for classical symbolic systems is Gary Marcus, who argues that, in order to make further progress, AI needs to combine symbolic and DL solutions into hybrid systems. As a rebuttal to Marcus, Jacob Browning and Yann LeCun argue that there is no need for such hybrids because symbolic representations can "emerge" from neural networks. They argue that "the neural network approach has traditionally held that we don’t need to hand-craft symbolic reasoning but can instead learn it: Training a machine on examples of symbols engaging in the right kinds of reasoning will allow it to be learned as a matter of abstract pattern completion. In short, the machine can learn to manipulate symbols in the world, despite not having hand-crafted symbols and symbolic manipulation rules built in." That is, symbols can be manipulated in the absence of specific rules in the classical sense. However, after making a strong case for this alternative kind of symbol manipulation, they then argue that it is not central to human cognition after all: "... most of our complex cognitive capacities do not turn on symbolic manipulation; they make do, instead, with simulating various scenarios and predicting the best outcomes." They further clarify that, to the extent that symbol manipulation is important at all, it is primarily a "cultural invention", a view which "regards symbols as inventions we used to coordinate joint activities — things like words, but also maps, iconic depictions, rituals and even social roles." "The goal, for DL, isn’t symbol manipulation inside the machine, but the right kind of symbol-using behaviors emerging from the system in the world." In other words, Browning and LeCun take the critical insight from DL to be that classical symbol manipulation is not a genuine, generative process in the human mind, and that hybrid systems therefore have no place in a cognitive agent.

Language Models and Computer Code

While Browning and LeCun's argument may have a certain (though vague) appeal, it proves extraordinarily problematic when it comes to explaining the ever-increasing success that large neural language models are showing at generating computer code. While language models were originally conceived for modeling natural language, it was discovered that if the training data included some computer code from sources such as GitHub, then they could generate computer code from natural language specifications, sometimes at a level higher than humans. Language Models (LMs) in fact developed independently of neural models and are simply joint probability distributions over sequences of words. Large neural LMs are a subsequent advancement in that they learn probability distributions over sequences of real-valued, continuous vector representations of words rather than discrete lexical items. The probability distribution is learned through a form of language modeling, where the task is to "predict the next word given the previous words" in word strings drawn from a corpus. Essentially, LMs learn complex statistical properties of language and can perform at exceptional levels on a large number of language tasks, including translation, inference, and even storytelling. The models are very large indeed, with billions of parameters or more. For example, Nvidia has proposed Megatron, a parallel training architecture intended to scale to 1 trillion parameters.
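
To make "predict the next word given the previous words" concrete, here is a minimal sketch of a language model as a joint probability over word sequences, factored by the chain rule into next-word predictions. The tiny corpus and the bigram counts are invented for illustration; a neural LM learns an analogous mapping, but over continuous vectors rather than discrete counts:

from collections import Counter

# A toy corpus; every probability below is crudely estimated from these few tokens.
corpus = "the moon is made of green cheese . the moon is bright .".split()
bigrams = Counter(zip(corpus, corpus[1:]))
unigrams = Counter(corpus)

def p_next(word, prev):
    # P(word | prev), estimated from bigram counts
    return bigrams[(prev, word)] / unigrams[prev]

def p_sequence(words):
    # Chain rule: P(w1..wn) is approximated as the product of P(w_i | w_{i-1})
    p = 1.0
    for prev, word in zip(words, words[1:]):
        p *= p_next(word, prev)
    return p

print(p_sequence("the moon is made of green cheese .".split()))  # 0.5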

It turns out that LMs are competent learners of complex statistical distributions outside natural language. As previously mentioned, LMs trained on software code have become competent at generating syntactically well-formed, functional code for relatively advanced programming problems. While there is a long way to go before they can write entire program implementations, it is clear that they already excel at generating syntactically well-formed productions: they almost never write code that contains syntax errors. But this is a real problem for the claim that neural architectures present an alternative model of symbol manipulation, because well-formed software code is defined by classically understood symbolic, rule-based grammars. We need to understand how a distributed neural network with an alternative method of manipulating symbols can perform so well on a straightforwardly classical symbolic task.

In fact, high-level programming languages for digital computers and theories of natural language have a curious historical connection. John W. Backus, who led the Applied Science Division of IBM's Programming Research Group, took inspiration from Noam Chomsky's work on phrase structure grammars and conceived a meta-language that could specify the syntax of computer languages that were easier for programmers to write than assembler languages. The meta-language later became known as Backus-Naur form (BNF), so called partly because it was refined by Peter Naur in the 1963 revised report on the ALGOL 60 programming language. BNF is a notation for context-free grammars, consisting of productions over terminal and nonterminal symbols, and it defines the grammar of a programming language as required for writing compilers and interpreters.

BNF grammars can be invaluable for computer programmers. When a programmer is uncertain about the form of a programming construct, they can consult documentation which specifies the allowable syntax of expressions. The most complete reference is the syntax specification typically written in some form of BNF. For example, the syntax for the "if" statement in Python can be found in the reference guide as shown below:


if_stmt ::=  "if" assignment_expression ":" suite
            ("elif" assignment_expression ":" suite)*
            ["else" ":" suite]
assignment_expression ::=  [identifier ":="] expression
expression             ::=  conditional_expression | lambda_expr
conditional_expression ::=  or_test ["if" or_test "else" expression]
or_test  ::=  and_test | or_test "or" and_test
and_test ::=  not_test | and_test "and" not_test
not_test ::=  comparison | "not" not_test
comparison    ::=  or_expr (comp_operator or_expr)*
comp_operator ::=  "<" | ">" | "==" | ">=" | "<=" | "!="
                  | "is" ["not"] | ["not"] "in"
identifier   ::=  xid_start xid_continue*
id_start     ::=  <all characters in general categories Lu, Ll, Lt, Lm, Lo, Nl, the underscore, and characters with the Other_ID_Start property>
id_continue  ::=  <all characters in id_start, plus characters in the categories Mn, Mc, Nd, Pc and others with the Other_ID_Continue property>
xid_start    ::=  <all characters in id_start whose NFKC normalization is in "id_start xid_continue*">
xid_continue ::=  <all characters in id_continue whose NFKC normalization is in "id_continue*">
suite         ::=  stmt_list NEWLINE | NEWLINE INDENT statement+ DEDENT
statement     ::=  stmt_list NEWLINE | compound_stmt
stmt_list     ::=  simple_stmt (";" simple_stmt)* [";"]
simple_stmt ::=  expression_stmt
                | assert_stmt
                | assignment_stmt
                | augmented_assignment_stmt
                | annotated_assignment_stmt
                | pass_stmt
                | del_stmt
                | return_stmt
                | yield_stmt
                | raise_stmt
                | break_stmt
                | continue_stmt
                | import_stmt
                | future_stmt
                | global_stmt
                | nonlocal_stmt
The Unicode category codes mentioned above stand for:
   Lu - uppercase letters
   Ll - lowercase letters
   Lt - titlecase letters
   Lm - modifier letters
   Lo - other letters
   Nl - letter numbers
   Mn - nonspacing marks
   Mc - spacing combining marks
   Nd - decimal numbers
   Pc - connector punctuations
   Other_ID_Start - explicit list of characters in PropList.txt to support backwards compatibility
   Other_ID_Continue - likewise

      

It is, however, not normally necessary to consult the reference, as it is generally sufficient to simply provide an abstract template for legal expressions:

if boolean_expression:
   statement(s)
else:
   statement(s)


or even a typical example, as in the Python tutorial:

>>> if x < 0:
...     x = 0
...     print('Negative changed to zero')
... elif x == 0:
...     print('Zero')
... elif x == 1:
...     print('Single')
... else:
...     print('More')         

 

However, there are cases where the less formal documentation is not sufficient. For example, notice the definition of "if_stmt" in the BNF, which requires an "assignment_expression" following the "if". In turn, "assignment_expression" is defined as "[identifier ":="] expression". The "expression" part can be identified in the less formal documentation, corresponding to the "boolean_expression" and "x < 0" in the other two examples. However, the optional "[identifier ":="]" does not appear in those other definitions. This construct is in fact the "assignment expression" introduced in PEP 572 (dated 28-Feb-2018) for Python 3.8. The assignment expression can be used to simplify code in some cases. For example, by assigning the value of len(a) to the variable n inside the condition, len(a) only needs to be calculated once in the following code fragment:

if (n := len(a)) > 10:
   print(f"List is too long ({n} elements, expected <= 10)")
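
For comparison, here is how the same check must be written without the assignment expression (a hypothetical rewrite for illustration): either len(a) is called twice, or a separate assignment statement is needed before the "if".

# Option 1: repeat the call to len(a)
if len(a) > 10:
   print(f"List is too long ({len(a)} elements, expected <= 10)")

# Option 2: assign on a separate line before the test
n = len(a)
if n > 10:
   print(f"List is too long ({n} elements, expected <= 10)")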


"If" statements containing an assignment expression will not be found in code written prior to February 2018, and any LM trained on a corpus consisting mostly of code written before that date will not be able to generate such statements and will have no information that the statement is legal. A human programmer, on the other hand, can simply consult the BNF and see that it is a legal production. Perhaps a powerful LM could learn about these statements after a few exposures through few-shot learning, but this has its own difficulties. For example, consider what happens if the new code includes code from students who have made the (very understandable) mistake of using the wrong assignment operator, as in the modified code below:

if (n = len(a)) > 10:
   print(f"List is too long ({n} elements, expected <= 10)")

Consulting the BNF instantly shows that this is not a well-formed statement, but a machine learning model that does not have access to the BNF cannot make this determination. The power to generalize from few or even no examples, while constraining the generalization to only legal productions, is the power of classical symbolic systems that non-symbolic systems cannot replicate. It is a power that the human mind appears to possess in abundance.
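
As a concrete illustration, Python's own standard-library parser, which embodies the language's grammar, makes exactly this determination instantly and without any training examples. The snippet below is a minimal sketch, assuming a Python 3.8 or later interpreter:

import ast

# The grammar accepts the assignment-expression ("walrus") form ...
ast.parse("if (n := len(a)) > 10:\n    print(n)")

# ... and rejects the plain "=" form as ill-formed, no examples required.
try:
    ast.parse("if (n = len(a)) > 10:\n    print(n)")
except SyntaxError as err:
    print("Rejected by the grammar:", err.msg)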

Eliminative connectionism eliminates connectionism


We must then consider how an LM can generate code which conforms to the syntax of a phrase structure language. One possibility is that the LM learns representations and operations that are isomorphic to the symbols and rules of the programming language and uses these representations to generate well-formed lines of code. Pinker and Prince describe this possibility as implementational connectionism, since in this case the network acts as a physical implementation of the classical algorithm, in the sense of Marr's implementation level. This is a stronger claim than Browning and LeCun's claim that the network learns a non-classical type of symbol manipulation. More importantly, this strong version of implementational connectionism is of little concern to classical theorists because it does not change any knowledge we already had. Simply put, if we have already defined a language like Python completely through its BNF, a neural network cannot reveal anything new about the language if all it does is implement the rules in its neural hardware.

A second option is that the LMs have learned unpredictable, complex non-linear mappings and latent variables which can generate well-formed code. Pinker and Prince call this possibility eliminative connectionism, and it poses a more serious challenge for classical theories because it eliminates the need for rules and symbol manipulation. In eliminative (neural) systems it is impossible to find a principled mapping between the components of the distributed (vector) processing model and the steps involved in a symbol-processing theory. It is clear that the current generation of deep learning models is presented as eliminative: Browning and LeCun's neural symbol manipulations are specifically claimed to eliminate traditional rules of symbolic logic, and their rules of symbol manipulation are specifically not meant to be isomorphic to traditional rules. Further evidence of eliminative intent comes from Bengio, LeCun and Hinton, who argued in their Turing lecture that continuous representations in Deep Learning models fundamentally differentiate neural LMs from traditional symbolic systems such as grammars, because they enable computations based on non-linear transformations of the representing vectors themselves.


While the computational core of neural LMs is vector based, they can perform tasks involving symbols because they use a symbolic representation in the input and output layers. Kautz enumerates different classes of neuro-symbolic hybrid systems which combine neural and symbolic approaches in different ways. He identifies the default paradigm in neural language models as the Symbolic Neuro symbolic (SNS) architecture, where sequences of words (symbolic) are converted to vectors which are passed to a neural network (neuro) whose output is computed by a softmax operation over the final layer of the network (symbolic). In the case of models trained for code generation, both natural language and code tokens are converted to and from vector representations and processed by the network. SNS systems can accurately generate well-formed productions of a rule-governed symbolic system without access to the rules themselves, because they are extremely competent at generalizing the properties of observed productions. The problem is that we know for certain that this is not the right model for Python, because the right model for Python is a classical symbolic system with a generative phrase structure grammar. If the LM can mimic a classical symbolic Python, then why should we believe that it isn't mimicking a classical symbolic natural language?
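
A minimal sketch of the SNS shape may help make this concrete. Everything below (the tiny vocabulary, the random weights, the single averaging "network") is an invented stand-in rather than a trained model; the point is only the pipeline: symbols in, vector computation in the middle, a softmax over symbols out.

import numpy as np

# Toy Symbolic -> Neuro -> symbolic (SNS) pipeline; weights are random, for illustration only.
vocab = ["if", "(", "n", ":=", "len", "a", ")", ">", "10", ":"]
token_to_id = {tok: i for i, tok in enumerate(vocab)}

rng = np.random.default_rng(0)
E = rng.normal(size=(len(vocab), 8))   # embedding table: symbol -> vector
W = rng.normal(size=(8, len(vocab)))   # output projection: vector -> logits

def next_token_distribution(prefix):
    ids = [token_to_id[tok] for tok in prefix]   # symbolic in
    h = E[ids].mean(axis=0)                      # "neuro": computation over vectors
    logits = h @ W
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()                         # softmax over the vocabulary
    return {tok: round(float(p), 3) for tok, p in zip(vocab, probs)}  # symbolic out

print(next_token_distribution(["if", "(", "n"]))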

We have now arrived at the real problem with neural networks as models of real or artificial intelligence. Judging by their performance on coding problems, we could claim that they have essentially solved the problem and all they need is more scale or a "world model" to improve even further. But we would be wrong. What they really need is a symbol manipulation system. We are in a privileged position when it comes to knowing how Python works because "we" wrote it. The same is of course not the case for cognitive abilities such as language. Chomsky hypothesized that language involves a phrase structure grammar of some sort, but this is challenged by the impressive success of neural models. The problem is that there is no independent way to know whether the challenges are valid, or whether instead the non-linear symbol transformations inherent in the SNS paradigm are sufficient to yield the results without "really" doing language, the same way they aren't "really" doing Python.

The most likely hypothesis is that the brain contains a number of classical symbolic systems which generate highly structured productions and correlated features, which can in turn be exploited by statistical methods to construct models that reproduce similar productions in novel circumstances. There is no doubt humans make use of this kind of statistical mechanism: programmers don't all consult the BNF; many of them simply look at a few example statements and copy the form. But we have seen that humans can also use classical symbolic structures in their reasoning, as when they consult the BNF (or an equivalent defining syntax). Some of these classical structures might operate at a level not accessible to conscious introspection, for example generative grammar. Neural language models have proven unable to illuminate the formal structure of Python, and it is not reasonable to claim that they are any more able to enlighten us about English. DL models can learn the mappings wherever the symbolic system produces lots of examples, as language does. Where the symbol system is used for planning, creativity, and the like, DL struggles to learn.

Conclusion

Deep Learning is hardly the foundation for a general artificial intelligence. It is, however, a powerful foundation for systems that can exploit the productions of complex rule-based systems in order to solve problems in the domain of those systems. Combined with symbolic reasoning, deep learning has the potential for very powerful systems indeed. In particular, the technology has provided a new set of tools to realize the vision Licklider laid out in 1960, when he similarly grappled with the future of AI. In 1960 the Air Force estimated that "... it would be 1980 before developments in artificial intelligence make it possible for machines alone to do much thinking or problem solving of military significance". That is, 20 years to something like AGI. Licklider suggested that those 20 years would be well spent developing "man-computer symbiosis", an approach to programming computers in a way that maximally augments human reasoning rather than replaces it. He estimated 5 years to develop the systems and 15 years to use them. Wryly, he added "... the 15 may be 10 or 500, but those years should be intellectually the most creative and exciting in the history of mankind." I think we are no closer now than Licklider was to predicting whether AGI is 10 or 500 years away. But I do think that we are much closer to achieving symbiotic systems. DL networks can compress the world's knowledge into a manageable collection of vectors, but without symbolic systems they don't know what to do with them. Humans have the creativity and insight to interact with those vectors and unleash breathtaking discoveries.


 

Comments

I haven't read this post, I just wanted to speculate about the downvoting, in case it helps.

Assigning "zero" probability is an infinite amount of error.  In practice you wouldn't be able to compute the log error.  More colloquially, you're infinitely confident about something, which in practice and expectation can be described as being infinitely wrong.  Being mistaken is inevitable at some point.  If someone gives 100% or 0%, that's associated with them being very bad at forecasting.

I expect a lot of the downvotes are people noticing that you gave it 0%, and that's strong evidence you're very uncalibrated as a forecaster.  For what it's worth, I'm in the highscores on Metaculus, and I'd interpret that signal the same way.

Skimming a couple seconds more, I suspect the overall essay's writing style doesn't really explain how the material changes our probability estimate.  This makes the essay seem indistinguishable from confused/irrelevant arguments about the forecast.  For example if I try skim reading the Conclusion section, I can't even tell if the essay's topics really change the probability that human jobs can be done by some computer for $25/hr or less (that's the criteria from the original prize post).

I have no reason not to think you were being genuine, and you are obviously knowledgeable.  I think a potential productive next step could be if you consulted someone with a forecasting track record, or read Philip Tetlock's stuff.  The community is probably reacting to red flags about calibration, and (possibly) a writing style that doesn't make it clear how this updates the forecast.

Thanks! I guess I didn't know the audience very well and I wanted to come up with an eye catching title. It was not meant to be literal. I should have gone with "approximately Zero" but I thought that was silly. Maybe I can try and change it.

That's a really good idea, changing the title.  You can also try adding a little paragraph in italics, as a brief note for readers clarifying which probability you're giving.

Thank you for changing it to be less clickbaity. Downvotes removed.

Also I was more focused on the sentence following the one where your quote comes from:

"This includes entirely AI-run companies, with AI managers and AI workers and everything being done by AIs."

and "AGI will be developed by January 1, 2100"

I try and argue that the answer to these two proposals is approximately Zero.

Heavy down votes in less than a day, most likely from the title alone. I can't bring myself to vote up or down because I'm not clear on your argument. Most of what you are saying supports AGI as sooner rather than later (that DL just needs that little bit extra, which should have won this crowd over). I don't see any arguments to support the main premise. 

*I stated that the AGI clock doesn't even start as long as ML/DL remains the de facto method, but that isn't what they want to hear either.  

Probably because a probability of zero is a red flag, as nothing should be given a probability of zero outside of mathematics.

[This comment is no longer endorsed by its author]

Thanks for your comment. I was baffled by the downvotes because as far as I could tell most people hadn't read the paper. Your comment that maybe it was the title is profoundly disappointing to me. I do not know this community well, but it sounds from your comment that they are not really interested in hearing arguments that contradict their point of view.
As for my argument, it was not supporting AGI at all. Basically, I was pointing out that every serious researcher now agrees that we need DL+symbols. The disagreement is in what sort of symbols. Then I argue that none of the current proposals for symbols is any good for AGI. So that kills AGI.