I’m a bit annoyed that Hassabis is giving neuroscience credit for the idea of episodic memory.
That’s not my understanding. To me he is giving neuroscience credit for the ideas that made it possible to implement a working memory in LLMs. I guess he didn’t want to use words like thalamocortical, but from a neuroscience point of view transformers indeed look inspired by the isocortex, e.g. by the idea that a general distributed architecture can process any kind of information relevant to a human cognitive architecture.
Yeah, he's talking about neuroscience. I get that. But "episodic memory" is a term of art and the idea behind it didn't come from neuroscience. It's quite possible that he just doesn't know the intellectual history and is taking "episodic memory" as a term that's in general use, which it is. But he's also making claims about intellectual history.
Because he's using that term in that context, I don't know just what claim he's making. Is he also (implicitly) claiming that neuroscience is the source of the idea? If he thinks that, then he's wrong. If he's just saying that he got the idea from neuroscience, OK.
But, the idea of a "general distributed architecture" doesn't have anything to do with the idea of episodic memory. They are orthogonal notions, if you will.
Your point is « Good AIs should have a working memory, a concept that comes from psychology ».
DH’s point is « Good AIs should have a working memory, and the way to implement it was based on concepts taken from neuroscience ».
Those are indeed orthogonal notions, if you will.
I did a little checking. It's complicated. In 2017 Hassabis published an article entitled "Neuroscience-Inspired Artificial Intelligence" in which he attributes the concept of episodic memory to a review article that Endel Tulving published in 2002, "EPISODIC MEMORY: From Mind to Brain." That article has quite a bit to say about the brain. In the 2002 article Tulving dates the concept to an article he published in 1972. That article is entitled "Episodic and Semantic Memory." As far as I know, while there are precedents (everything can be fobbed off on Plato if you've a mind to do it), that's where the notion of episodic memory enters into modern discussions.
Why do I care about this kind of detail? First, I'm a scholar and it's my business to care about these things. Second, a lot of people in contemporary AI and ML are dismissive of symbolic AI from the 1950s through the 1980s and beyond. While Tulving was not an AI researcher, he was very much in the cognitive science movement, which included philosophy, psychology, linguistics, and AI (later on, neuroscientists would join in). I have no idea whether or not Hassabis is himself dismissive of that work, but many are. It's hypocritical to write off the body of work while using some of the ideas. These problems are too deep and difficult to write off whole bodies of research in part because they happened before you were born – FWIW Hassabis was born in 1976.
I have no idea whether or not Hassabis is himself dismissive of that work
Well that’s a problem, don’t you think?
but many are.
Yes, speaking as a cognitive neuroscientist myself, you’re right that many within my generation tend to dismiss symbolic approaches. We were students during a winter that many of us thought was caused by the overpromising and underdelivering of the symbolic approach, with Minsky as the main reason for the slow start of neural networks. I bet you have a different perspective. What are your three best points for changing the view of my generation?
I'll get back to you tomorrow. I don't think it's a matter of going back to the old ways. ANNs are marvelous; they're here to stay. The issue is one of integrating some symbolic ideas. It's not at all clear how that's to be done. If you wish, take a look at this blog post: Miriam Yevick on why both symbols and networks are necessary for artificial minds.
Fascinating paper! I wonder how much they would agree that holography means sparse tensors and convolution, or that intuitive versus reflexive thinking basically amounts to visuo-spatial versus phonological loop. Can’t wait to hear which other ideas you’d like to import from this line of thought.
Miriam Lipshutz Yevick was born in 1924 and died in 2018, so we can't ask her these questions. She fled Europe with her family in 1940 for the same reason many Jews fled Europe and ended up in Hoboken, NJ. Seven years later she got a PhD in math from MIT; she was only the 5th woman to get that degree from MIT. But, as both a woman and a Jew, she had almost no chance of an academic post in 1947. She eventually got an academic gig, but it was at a college oriented toward adult education. Still, she managed to do some remarkable mathematical work.
The two papers I mention in that blog post were written in the mid-1970s. That was the height of classic symbolic AI and the cognitive science movement more generally. Newell and Simon got their Turing Award in 1975, the same year Yevick wrote her remarkable paper on holographic logic, which deserves to be more widely known. She wrote as a mathematician interested in holography (an interest she developed while corresponding with physicist David Bohm in the 1950s), not as a cognitive scientist. Of course, in arguing for holography as a model for (one kind of) thought, she was working against the tide. Very few were thinking in such terms at that time. Rosenblatt's work was in the past, and had been squashed by Minsky and Papert, as you've noted. The West Coast connectionist work didn't jump off until the mid-1980s.
So there really wasn't anyone in the cognitive science community at the time to investigate the line of thinking she initiated. While she wasn't thinking about real computation, you know, something you actually do on computers, she thought abstractly in computational terms, as Turing and others did (though Turing also worked with actual computers). It seems to me that her contribution was to examine the relationship between a computational regime and the objects over which it was asked to compute. She's quite explicit about that. If the object tends toward geometrical simplicity – she was using identification of visual objects as her domain – then a conventional, sequential, computational regime was most effective. That's what cognitive science was all about at the time. If the object tends toward geometrical complexity then a different regime was called for, what she called holographic or Fourier logic. I don't know about sparse tensors, but convolution, yes.
Later on, in the 1980s, as you may know, Hans Moravec would talk about a paradox (which became named after him). In the early days of AI, researchers worked on abstract domains, like chess and theorem proving, domains that take a high level of cognitive ability. Things went pretty well, though the extravagant predictions had yet to pan out. When they turned toward vision and language in the late 1960s and into the 70s and 80s, things fell apart. Those were things that young kids could do. The paradox, then, was that AI was most effective at cognitively difficult things, and least effective with cognitively simple things.
The issue was in fact becoming visible in the 1970s. I read about it in David Marr, and he died in 1980. Had it been explicitly theorized when Yevick wrote? I don't know. But she had an answer to the paradox. The computational regime favored by AI and the cognitive sciences at the time simply was not well-suited to complex visual objects, though they presented no problems to 2-year-olds, or to language, with all those vaguely defined terms anchored in physically complex phenomena. They needed a different computational regime, and eventually we got one, though not really until GPUs were exploited.
More later, perhaps.
Thanks, I didn’t know this perspective on the history of our science. The stories I most heard were indeed more about the HH model, Hebb's rule, Kohonen maps, RL, and then connectionism became deep learning.
If the object tends toward geometrical simplicity – she was using identification of visual objects as her domain – then a conventional, sequential, computational regime was most effective.
…but neural networks did refute that idea! I feel like I’m missing something here, especially since you then mention GPU. Was sequential a typo?
When I hear « conventional, sequential, computational regime », my understanding is « the way everyone was trying before parallel computation revolutionized computer vision ». What’s your definition so that using GPUs feels sequential?
Oh, I didn't mean to imply that using GPUs was sequential, not at all. What I meant was that the connectionist alternative didn't really take off until GPUs were used, making massive parallelism possible.
Going back to Yevick, in her 1975 paper she often refers to holographic logic as 'one-shot' logic, meaning that the whole identification process takes place in one operation, the illumination of the hologram (i.e. the holographic memory store) by the reference beam. The whole memory 'surface' is searched in one unitary operation.
In an LLM, I'm thinking of the generation of a single token as such a unitary or primitive process. That is to say, I think of the LLM as a "virtual machine" (I first saw the phrase in a blog post by Chris Olah) that is running an associative memory machine. Physically, yes, we've got a massive computation involving every parameter and (I'm assuming) there's a combination of massive parallel and sequential operations taking place in the GPUs. Complete physical parallelism isn't possible (yet). But there are no logical operations taking place in this virtual operation, no transfer of control. It's one operation.
Obviously, though, considered as an associative memory device, an LLM is capable of much more than passive storage and retrieval. It performs analytic and synthetic operations over the memory based on the prompt, which is just a probe ('reference beam' in holographic terms) into an associative memory. We've got to understand how the memory is structured so that that is possible.
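To make the 'one-shot' idea a little more concrete, here is a minimal sketch of the digital cousin of that optical lookup: correlating a probe against a whole memory at every position through a single elementwise product in the Fourier domain. This is only a loose analogy and not Yevick's formalism. The function names (`dft`, `correlate_all_shifts`) and the toy data are hypothetical, and the naive transform is only meant for tiny inputs.

```python
import cmath

def dft(x, inverse=False):
    """Naive O(n^2) discrete Fourier transform (fine for tiny toy inputs)."""
    n = len(x)
    sign = 1 if inverse else -1
    out = []
    for k in range(n):
        s = sum(x[t] * cmath.exp(sign * 2j * cmath.pi * k * t / n) for t in range(n))
        out.append(s / n if inverse else s)
    return out

def correlate_all_shifts(memory, probe):
    """Circular cross-correlation of probe against memory at every shift.

    Uses the correlation theorem: the match scores for all shifts come from
    one elementwise product of the two spectra, with no per-position loop
    that tests and branches.
    """
    n = len(memory)
    padded = list(probe) + [0.0] * (n - len(probe))
    mem_spec = dft(memory)
    probe_spec = dft(padded)
    product = [m * p.conjugate() for m, p in zip(mem_spec, probe_spec)]
    return [c.real for c in dft(product, inverse=True)]

# Toy "memory surface" with a pattern stored at position 3 (made-up data).
memory = [0.0, 0.0, 0.0, 1.0, 2.0, 3.0, 0.0, 0.0]
probe = [1.0, 2.0, 3.0]  # plays the role of the "reference beam"

scores = correlate_all_shifts(memory, probe)
best_shift = max(range(len(scores)), key=lambda s: scores[s])
print(best_shift)
```

The point of the sketch is only that the whole memory is interrogated at once. Where the best match sits falls out of the result instead of being found by stepping through candidates one at a time.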
More later.
A few comments before later. 😉
What I meant was that the connectionist alternative didn't really take off until GPUs were used, making massive parallelism possible.
Thanks for the clarification! I guess you already noticed how research centers in cognitive science seem to have a failure mode over a specific value question: Do we seek excellence at the risk of overfitting funding agency criteria, or do we seek fidelity to our interdisciplinary mission at the risk of compromising growth?
I certainly agree that, before the GPUs, the connectionist approach had a very small share of the excellence tokens. But it was already instrumental in providing a common conceptual framework beyond cognitivism. As an example, even the first PCs were enough to run toy examples of double dissociation using networks structured by sensory type rather than by cognitive operation. From a neuropsychological point of view, that was already a key result. And for the neuroscientist in me, toy models like Kohonen maps were already key to make sense of why we need so many short inhibitory neurons in grid-like cortical structures.
Going back to Yevick, in her 1975 paper she often refers to holographic logic as 'one-shot' logic, meaning that the whole identification process takes place in one operation, the illumination of the hologram (i.e. the holographic memory store) by the reference beam. The whole memory 'surface' is searched in one unitary operation.
Like a refresh rate? That would fit the evidence for a 3-7 Hz refresh rate of our cartesian theater, or the way LLMs go through prompt/answer cycles. Do you see other potential uses for this concept?
We've got to understand how the memory is structured so that that is possible.
What’s wrong with « the distributed way »?
In a paper I wrote awhile back I cite the late Walter Freeman as arguing that "consciousness arises as discontinuous whole-hemisphere states succeeding one another at a "frame rate" of 6 Hz to 10 Hz" (p. 2). I'm willing to speculate that that's your 'one-shot' refresh rate. BTW, Freeman didn't believe in a Cartesian theater and neither do I; the imagery of the stage 'up there' and the seating area 'back here' is not at all helpful. We're not talking about some specific location or space in the brain; we're talking about a process.
Well, of course, "the distributed way." But what is that? Prompt engineering is about maneuvering your way through the LLM; you're attempting to manipulate the structure inherent in those weights to produce a specific result you want.
That 1978 comment of Yevick's that I quote in that blog post I mentioned somewhere up there was in response to an article by John Haugeland evaluating cognitivism. He wondered whether or not there was an alternative and suggested holography as a possibility. He didn't make a very plausible case and few of the commentators took it as a serious alternative.
People were looking for alternatives. But it took awhile for connectionism to build up a record of interesting results, on the one hand, and for cognitivism to begin seeming stale on the other. It's the combination of the two that brought about significant intellectual change. Or that's my speculation.
I'm willing to speculate that [6 Hz to 10 Hz] that's your 'one-shot' refresh rate.
It’s possible. I don’t think there was relevant human data in Walter Freeman's time, so I’m willing to speculate that’s indeed the frame rate in mice. But I didn’t check the literature he had access to, so just a wild guess.
the imagery of the stage 'up there' and the seating area 'back here' is not at all helpful
I agree there’s no seating area. I still find the concept of a cartesian theater useful. For example, it allows knowing where to plant electrodes if you want to access the visual cartesian theater for rehabilitation purposes. I guess you’d agree that can be helpful. 😉
We're not talking about some specific location or space in the brain; we're talking about a process.
I have friends who believe that, but they can’t explain why the brain needs that much ordering in the sensory areas. What’s your own take?
But what is [the distributed way] that?
You know the backprop algorithm? That’s a mathematical model for the distributed way. It was recently shown that it produces networks that explain (statistically speaking) most of the properties of the BOLD cortical response in our visual systems. So, whatever the biological cortices actually do, it turns out to be equivalent for the « distributed memory » aspect.
Or that's my speculation.
I wonder if that’s too flattering for connectionism, which mostly stalled until the early breakthrough in computer vision suddenly attracted every lab. BTW
Is accessing the visual cartesian theater physically different from accessing the visual cortex? Granted, there's a lot of visual cortex, and different regions seem to have different functions. Is the visual cartesian theater some specific region of visual cortex?
I'm not sure what your question about ordering in sensory areas is about.
As for backprop, that gets the distribution done, but that's only part of the problem. In LLMs, for example, it seems that syntactic information is handled in the first few layers of the model. Given the way texts are structured, it makes sense that sentence-level information should be segregated from information about collections of sentences. That's the kind of structure I'm talking about. Sure, backprop is responsible for those layers, but it's responsible for all the other layers as well. Why do we seem to have different kinds of information in different layers at all? That's what interests me.
Actually, it just makes sense to me that that is the case. Given that it is, what is located where? As for why things are segregated by location, that does need an answer, doesn't it. Is that what you were asking?
Finally, here's an idea I've been playing around with for a long time: Neural Recognizers: Some [old] notes based on a TV tube metaphor [perceptual contact with the world].
Is accessing the visual cartesian theater physically different from accessing the visual cortex? Granted, there's a lot of visual cortex, and different regions seem to have different functions. Is the visual cartesian theater some specific region of visual cortex?
In my view: yes, no. To put some flesh on the bone, my working hypothesis is: what’s conscious is gamma activity within an isocortex connected to the claustrum (because that’s the information which will get selected for the next conscious frame/can be considered as in working memory)
I'm not sure what your question about ordering in sensory areas is about.
You said: what matters is temporal dynamics. I said: why so many maps if what matters is timing?
Why do we seem to have different kinds of information in different layers at all? That's what interests me.
The closer to the input, the more sensory. The closer to the output, the more motor. The closer to the restrictions, the easier to interpret activity as latent space. Is there any regularity that you feel hard to interpret this way?
Finally, here's an idea I've been playing around with for a long time:
Thanks, I’ll go read. Don’t hesitate to add other links that can help understand your vision.
"You said: what matters is temporal dynamics"
You mean this: "We're not talking about some specific location or space in the brain; we're talking about a process."
If so, all I meant was a process that can take place pretty much anywhere. Consciousness can pretty much 'float' to wherever it's needed.
Since you asked for more, why not this: Direct Brain-to-Brain Thought Transfer: A High Tech Fantasy that Won't Work.
You mean this: "We're not talking about some specific location or space in the brain; we're talking about a process."
You mean there’s some key difference in meaning between your original formulation and my reformulation? Care to elaborate and formulate some specific prediction?
As an example, I once had a go at interpreting data from the olfactory system for a friend who was wondering if we could find signs of a chaotic attractor. If you've ever toyed with the Lorenz model, one key feature is that you can see the attractor either by plotting x vs y vs z, or by plotting just one of these variables at t against itself at t+delta and at t+2*delta (for many deltas). In other words, that gives a precise feature you can look for (I didn’t find any, and nowadays it seems accepted that odors are location specific, like every other sense). Do you have a better idea, or is that more or less what you’d have tried?
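For readers who haven't played with this, here is a minimal sketch of the two views just described: the full (x, y, z) trajectory of the Lorenz system, and a delay embedding built from the x variable alone. It assumes the textbook parameter values (sigma = 10, rho = 28, beta = 8/3). The step size, the delay of 10 samples, and all function names are illustrative choices, not anything taken from the olfactory analysis mentioned above.

```python
def lorenz_step(state, dt, sigma=10.0, rho=28.0, beta=8.0 / 3.0):
    """Advance the Lorenz system by one fourth-order Runge-Kutta step."""
    def deriv(s):
        x, y, z = s
        return (sigma * (y - x), x * (rho - z) - y, x * y - beta * z)

    def shifted(s, k, h):
        return tuple(si + h * ki for si, ki in zip(s, k))

    k1 = deriv(state)
    k2 = deriv(shifted(state, k1, dt / 2))
    k3 = deriv(shifted(state, k2, dt / 2))
    k4 = deriv(shifted(state, k3, dt))
    return tuple(
        s + dt / 6 * (a + 2 * b + 2 * c + d)
        for s, a, b, c, d in zip(state, k1, k2, k3, k4)
    )

def simulate(n_steps, dt=0.01, start=(1.0, 1.0, 1.0)):
    """Return the trajectory as a list of (x, y, z) tuples."""
    trajectory = [start]
    for _ in range(n_steps):
        trajectory.append(lorenz_step(trajectory[-1], dt))
    return trajectory

def delay_embed(series, delay, dim=3):
    """Build points (s[t], s[t+delay], ..., s[t+(dim-1)*delay]) from one series."""
    last = len(series) - (dim - 1) * delay
    return [tuple(series[t + j * delay] for j in range(dim)) for t in range(last)]

trajectory = simulate(5000)           # view 1: plot x vs y vs z
xs = [point[0] for point in trajectory]
embedded = delay_embed(xs, delay=10)  # view 2: x(t) vs x(t+delta) vs x(t+2*delta)
```

Either list of 3D points can be handed to whatever plotting tool you like. With recorded data only the second view is available, and in practice one would try a range of delays and not just one.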
I've lost the thread entirely. Where have I ever said or implied that odors are not location specific or that anything else is not location specific. And how specific are you about location? Are we talking about centimeters (or more), millimeters, individual cortical columns?
What's so obscure about the idea that consciousness is a process that can take place pretty much anywhere? Though maybe it's confined to interaction within the cortex and between subcortical areas; I've not given that one much thought. BTW, I take my conception of consciousness from William Powers, who didn't speculate about its location in the brain.
Nothing at all. I’m a big fan of these kinds of ideas and I’d love to present yours to some friends, but I’m afraid they’ll get dismissive if I can’t translate your thoughts into their usual frame of reference. But I get that you didn’t work on this aspect specifically; there are many fields in cognitive sciences.
About how much specificity, it’s up to interpretation. A (1k by 1k by frame by cell type by density) tensor representing the cortical columns within the granular cortices is indeed a promising interpretation, although it’d probably be short of an extrapyramidal tensor (and maybe an agranular one).
Well, when Walter Freeman was working on the olfactory cortex of rodents he was using a surface mounted 8x8 matrix of electrodes. I assume that was measured in millimeters. In his 1999 paper Consciousness, Intentionality, and Causality (paragraphs 36 - 43) he proposes a hemisphere-wide global operator (42):
I propose that the globally coherent activity, which is an order parameter, may be an objective correlate of awareness through preafference, comprising expectation and attention, which are based in prior proprioceptive and exteroceptive feedback of the sensory consequences of previous actions, after they have undergone limbic integration to form Gestalts, and in the goals that are emergent in the limbic system. In this view, awareness is basically akin to the intervening state variable in a homeostatic mechanism, which is both a physical quantity, a dynamic operator, and the carrier of influence from the past into the future that supports the relation between a desired set point and an existing state.
Later (43):
What is most remarkable about this operator is that it appears to be antithetical to initiating action. It provides a pervasive neuronal bias that does not induce phase transitions, but defers them by quenching local fluctuations (Prigogine, 1980). It alters the attractor landscapes of the lower order interactive masses of neurons that it enslaves. In the dynamicist view, intervention by states of awareness in the process of consciousness organizes the attractor landscape of the motor systems, prior to the instant of its next phase transition, the moment of choosing in the limbo of indecision, when the global dynamic brain activity pattern is increasing its complexity and fine-tuning the guidance of overt action. This state of uncertainty and unreadiness to act may last a fraction of a second, a minute, a week, or a lifetime. Then when a contemplated act occurs, awareness follows the onset of the act and does not precede it.
He goes on from there. I'm not sure whether he came back to that idea before he died in 2016. I haven't found it, didn't do an exhaustive search, but I did look.
Cross-posted from New Savanna.
MIT Center for Minds, Brains, and Machines (CBMM), a panel discussion: CBMM10 - A Symposium on Intelligence: Brains, Minds, and Machines.
On which critical problems should Neuroscience, Cognitive Science, and Computer Science focus now? Do we need to understand fundamental principles of learning -- in the sense of theoretical understanding like in physics -- and apply this understanding to real natural and artificial systems? Similar questions concern neuroscience and human intelligence from the society, industry and science point of view.
Panel Chair: T. Poggio
Panelists: D. Hassabis, G. Hinton, P. Perona, D. Siegel, I. Sutskever
Quick Comments
1.) I’m a bit annoyed that Hassabis is giving neuroscience credit for the idea of episodic memory. As far as I know, the term was coined by a cognitive psychologist named Endel Tulving in the early 1970s, who stood it in opposition to semantic memory. That distinction was all over the place in the cognitive sciences in the 1970s and it’s second nature to me. When ChatGPT places a number of events in order to make a story, that’s episodic memory.
2.) Rather than theory, I like to think of what I call speculative engineering. I coined the phrase in the preface to my book about music (Beethoven’s Anvil), where I said:
3.) On Chomsky (Hinton & Hassabis): Yes, Chomsky is fundamentally wrong about language. Language is primarily a tool for conveying meaning from one person to another and only derivatively a tool for thinking. And he’s wrong that LLMs can learn any language and therefore they are useless for the scientific study of language. Another problem with Chomsky’s thinking is that he has no interest in process, which is in the realm of performance, not competence.
Let us assume for the sake of argument that the introduction of a single token into the output stream requires one primitive operation of the virtual system being emulated by an LLM. By that I mean that there is no logical operation within the process, no AND or OR, no shift of control; all that’s happening is one gigantic calculation involving all the parameters in the system. That means that the number of primitive operations required to produce a given output is equal to the number of tokens in that output. I suggest that that places severe constraints on the organization of the LLM’s associative memory.
Contrast that with what happens in a classical symbolic system. Let us posit that each time a word (not quite the same as a token in an LLM, but the difference is of no consequence) is emitted, that itself requires a single primitive operation in the classical system. Beyond that, however, a classical system has to execute numerous symbolic operations in order to arrive at each word. Regardless of just how those operations resolve into primitive symbolic operations, the number has to be larger, perhaps considerably larger, than the number of primitive operations an LLM requires. I suggest that this process places fewer constraints on the organization of a symbolic memory system.
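The contrast in operation counts can be sketched with two toy generators. This is only an illustration of the counting argument, under loud assumptions: the lookup table standing in for "one gigantic calculation", the tiny grammar, and the rule that each rewrite and each emission costs one operation are all hypothetical, and neither function models a real LLM or a real symbolic system.

```python
# Regime 1: every emitted token costs exactly one opaque operation.
# NEXT is a stand-in for the single big calculation over all parameters.
NEXT = {"the": "shark", "shark": "bites", "bites": "the"}

def generate_one_shot(n_tokens, start="the"):
    ops, out, tok = 0, [], start
    for _ in range(n_tokens):
        tok = NEXT[tok]  # one primitive operation, no branching inside it
        ops += 1
        out.append(tok)
    return out, ops

# Regime 2: words are reached by applying rewrite rules first.
GRAMMAR = {
    "S": ["NP", "VP"],
    "NP": ["Det", "N"],
    "VP": ["V", "NP"],
    "Det": ["the"],
    "N": ["shark"],
    "V": ["bites"],
}

def generate_symbolic(symbol="S"):
    if symbol not in GRAMMAR:
        return [symbol], 1   # emitting a word: one operation
    words, ops = [], 1       # applying a rewrite rule: one operation
    for child in GRAMMAR[symbol]:
        child_words, child_ops = generate_symbolic(child)
        words += child_words
        ops += child_ops
    return words, ops

llm_words, llm_ops = generate_one_shot(5)
sym_words, sym_ops = generate_symbolic()
print(len(llm_words), llm_ops)   # operations equal tokens
print(len(sym_words), sym_ops)   # operations exceed words
```

In the first regime the operation count equals the output length by construction. In the second, every word drags along the rule applications that led to it, so the count is necessarily larger.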
At this point I’ve reached 45:11 in the video, but I have to stop and think. Perhaps I’ll offer some more comments later.
LATER: Creativity
4.) Near the end (01:20:00 or so) the question of creativity comes up. Hassabis says AIs aren't there yet. Hinton brings up analogy, pointing out that, with all the vast knowledge LLMs have ingested, they've got opportunities for coming up with analogy after analogy after analogy. I've got experience with ChatGPT that's directly relevant to those issues, analogy and creativity.
One of the first things I did once I started playing with ChatGPT was have it undertake a Girardian interpretation of Steven Spielberg's Jaws. To do that it has to determine whether or not there is an analogy between events in the film and the phenomena that Girard theorizes about. It did that fairly well. So I wrote that up and published it in 3 Quarks Daily, Conversing with ChatGPT about Jaws, Mimetic Desire, and Sacrifice. Near the end I remarked:
These observations are informal and are only about a single example. Given those limitations it's difficult to imagine a generalization. But I didn't hear anything from those experts that was comparably rich. Hinton gave one example of an analogy that he posed to GPT-4 (01:18:30): “What has a compost heap got in common with an atom bomb?” It got the answer he was looking for, chain reaction, albeit at different energy levels and different rates. That's interesting. Why wasn't the panel ready with 20 such examples among them?
Do they not have more such examples from their own work? Don't they think about their own work process, all the starts and stops, the wandering around, the dead ends and false starts, the open-ended exploration, that came before final success? And even then, no success is final, but only provisional pending further investigation. Can they not see the difference between what they do and what their machines do? Do they think all the need for exploration will just vanish in the face of machine superintelligence? Do they really believe that the universe is that small?
STILL LATER: Hinton and Hassabis on analogies
Hinton continues with analogies and Hassabis weighs in:
Well, yeah, sure, GPT-4 has all this stuff in its model, way more topics than any one human. But where’s GPT-4 going to “stand” so it can “look over” all that stuff and spot the analogies? That requires some kind of procedure. What is it?
For example, it might partition all that knowledge into discrete bits and then set up a 2D matrix with a column and a row for each discrete chunk of knowledge. Then it can move systematically through the matrix, checking each cell to see whether or not the pair in that cell is a useful analogy. What kind of tests does it apply to make that determination? I can imagine there might be a test or tests that allows a quick and dirty rejection for many candidates. But those that remain, what can you do but see if any useful knowledge follows from trying out the analogy. How long will that determination take? And so forth.
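To get a feel for the scale of that brute-force search, here is a minimal sketch. The chunk list, the feature sets, and the `shares_feature` quick-and-dirty test are all made up for illustration. The only real content is the pair count, n(n-1)/2 for n chunks, and the fact that every surviving pair still needs the slow "try it and see" evaluation.

```python
from itertools import combinations

def count_candidate_pairs(n_chunks):
    """Number of distinct unordered pairs among n_chunks items."""
    return n_chunks * (n_chunks - 1) // 2

def brute_force_analogies(chunks, looks_promising):
    """Walk every cell above the diagonal of the chunk-by-chunk matrix."""
    return [(a[0], b[0]) for a, b in combinations(chunks, 2) if looks_promising(a, b)]

# Hypothetical chunks: (name, set of coarse features).
chunks = [
    ("compost heap", {"chain reaction", "decay"}),
    ("atom bomb", {"chain reaction", "fission"}),
    ("Jaws", {"sacrifice", "shark"}),
    ("Girard", {"sacrifice", "mimetic desire"}),
]

def shares_feature(a, b):
    """Quick-and-dirty filter: keep pairs with at least one feature in common."""
    return bool(a[1] & b[1])

survivors = brute_force_analogies(chunks, shares_feature)
print(survivors)
print(count_candidate_pairs(1_000_000))
```

Four chunks give six cells to check. A million chunks give roughly half a trillion, and that is before any of the surviving candidates has been examined for whether useful knowledge actually follows from it.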
That’s absurd on the face of it. What else is there? I just explained what I went through to come up with an analogy between Jaws and Girard. But that’s just my behavior, not the mental process that’s behind the behavior. I have no trouble imagining that, in principle, having these machines will help speed up the process, but in the end I think we’re going to end up with a community of human investigators communicating with one another while they make sense of the world. The idea, which, judging from remarks he’s made elsewhere, Hinton seems to hold, that one of these days we’ll have a machine that takes humans out of the process altogether, that’s an idle fantasy.