Models with long-term memory are very hard to train. Instead of being able to compute a weight update after seeing a single input, you have to run through a long loop of "put thing in memory, take thing out, compute with it, etc." before you can compute a weight update (rough sketch below). It's not a priori impossible, but nobody's managed to get it to work. Evolution has figured out how to do it because it's willing to waste an entire lifetime to get a single noisy update.
People have been working on this for years. It's remarkable (in retrospect, to me) that we've gotten as far as we have without long-term memory.
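To make the "put thing in memory, take thing out" loop concrete, here's a toy sketch (PyTorch, made-up dimensions, not any particular published architecture): the gradient for a single weight update has to flow back through every read and write in the whole episode.

```python
import torch
import torch.nn as nn

# Toy memory-augmented model: an external memory of N slots, with
# content-based (softmax attention) reads and writes. Dimensions are made up.
class ToyMemoryModel(nn.Module):
    def __init__(self, d=32, slots=16):
        super().__init__()
        self.slots, self.d = slots, d
        self.read_query = nn.Linear(d, d)
        self.write_key = nn.Linear(d, d)
        self.write_val = nn.Linear(d, d)
        self.out = nn.Linear(2 * d, d)

    def forward(self, xs):                        # xs: (T, d), one episode
        memory = torch.zeros(self.slots, self.d)  # memory starts empty every episode
        h = torch.zeros(self.d)
        for x in xs:                               # sequential: step t needs step t-1's memory
            # read: attend over memory slots
            attn = torch.softmax(memory @ self.read_query(x), dim=0)
            read = attn @ memory
            # compute with the retrieved value
            h = torch.tanh(self.out(torch.cat([x, read])))
            # write: blend the new value into the most relevant slot
            w = torch.softmax(memory @ self.write_key(h), dim=0).unsqueeze(1)
            memory = (1 - w) * memory + w * self.write_val(h)
        return h                                   # only now can we score the episode

model = ToyMemoryModel()
opt = torch.optim.SGD(model.parameters(), lr=1e-2)
episode = torch.randn(1000, 32)                    # a long "lifetime" of inputs
loss = model(episode).pow(2).mean()                # some terminal objective
loss.backward()                                    # gradient flows through all 1000 reads/writes
opt.step()                                         # ...for a single weight update
```

A thousand sequential reads and writes for one noisy gradient step is basically the "waste a lifetime to get a single update" problem in miniature.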
Isn't that the point of the original transformer paper? I haven't actually read it, just going by summaries I've read here and there.
If I don't misremember, RNNs should be especially difficult to train in parallel.
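For concreteness, a toy comparison (PyTorch; causal masking and everything else omitted) of why the recurrent computation is hard to parallelize over time while self-attention isn't:

```python
import torch
import torch.nn as nn

T, d = 512, 64
xs = torch.randn(T, d)

# RNN: each step needs the previous hidden state, so the T steps
# must be computed one after another (hard to parallelize over time).
cell = nn.RNNCell(d, d)
h = torch.zeros(1, d)
for t in range(T):
    h = cell(xs[t:t+1], h)       # h_t = f(x_t, h_{t-1})

# Self-attention: all T positions come out of one batched matmul,
# so they can be computed in parallel (causal mask omitted here).
q = k = v = xs
attn = torch.softmax(q @ k.T / d**0.5, dim=-1)
out = attn @ v                    # (T, d), no loop over time
```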
I suspect much of the reason we haven't needed much long-term memory is that the context window can be increased pretty cheaply, so long-term memory gets deprioritized.
There is an architecture called RWKV that claims an 'infinite' context window (since it is similar to an RNN) and performance competitive with GPT-3. I have no idea whether this is worth taking seriously or not.
I don't think it's fair for them to claim that the model has an infinite context length. It appears that they can train the model as a transformer, but can turn the model into an RNN at inference time. While the RNN doesn't have a context length limit as the transformer does, I doubt it will perform well on contexts longer than it has seen during training. There may also be limits to how much information can be stored in the hidden state, such that the model has a shorter effective context length than current SOTA LLMs.
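For intuition (this is not the actual RWKV formulation, just a toy decay-weighted average in the same spirit): at inference time the model only carries a fixed-size state forward, which is why there's no hard context limit, but also why "infinite context" doesn't mean unlimited recall.

```python
import numpy as np

# Toy sketch of the idea behind RWKV-style "infinite" context (NOT the real
# RWKV equations): the model carries a fixed-size running state, so there is
# no hard context limit -- but everything it remembers has to fit into that
# state, which is the catch the parent comment points at.
d = 64
decay = np.full(d, 0.95)        # per-channel forgetting factor (made up)
num = np.zeros(d)               # running weighted sum of values
den = np.zeros(d)               # running sum of weights

def step(k, v):
    """Consume one token; memory use stays O(d) no matter how long the stream is."""
    global num, den
    w = np.exp(k)               # positive weight for this token
    num = decay * num + w * v
    den = decay * den + w
    return num / (den + 1e-8)   # elementwise decay-weighted average of past values

for t in range(10_000):         # arbitrarily long input, constant memory
    k, v = np.random.randn(d), np.random.randn(d)
    out = step(k, v)
```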
Two links related to RWKV, for anyone who wants to know more:
Given that LLMs can use tools, it sounds like a traditional database could be used. The data would still have to fit inside the context window, along with the generated continuation prompt, but that might work for a lot of cases.
I could also imagine this working without explicit tool use. There are already systems for querying corpuses (using embeddings to query vector databases, from what I've seen). Perhaps the corpus could be past chat transcripts, chunked.
I suspect the trickier part would be making this useful enough to justify the additional computation.
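The retrieval mechanics themselves are cheap; the cost is mostly the extra embedding calls plus the context-window space the recalled chunks take up. A rough, self-contained sketch (the `embed` here is just a stand-in for whatever embedding model or API you'd actually call):

```python
import numpy as np

def embed(text: str) -> np.ndarray:
    """Stand-in embedding (character counts); swap in a real embedding model/API."""
    v = np.zeros(256)
    for ch in text.lower():
        v[ord(ch) % 256] += 1.0
    return v / (np.linalg.norm(v) + 1e-8)

# Past chat transcripts, chunked (here just a toy list of strings).
past_transcripts = [
    "User asked about RNN training.\n\nWe discussed why backprop through time is sequential.",
    "User mentioned their project uses a 4k context window.",
]
chunks = [p for t in past_transcripts for p in t.split("\n\n")]
index = np.stack([embed(c) for c in chunks])        # (num_chunks, dim), unit-norm rows

def recall(query: str, k: int = 2) -> list[str]:
    """Return the k past chunks most similar to the query (cosine similarity)."""
    scores = index @ embed(query)
    return [chunks[i] for i in np.argsort(-scores)[:k]]

# Whatever recall() returns gets pasted into the prompt -- so it still spends
# context-window space, which is where the "is it worth the extra compute"
# question comes in.
print(recall("what did we say about context windows?"))
```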
One thing that comes to mind is DeepMind's Adaptive Agents team using Transformer-XL, which can attend to data outside the current context window. I think there was speculation that GPT-4 may also be a Transformer-XL, but I'm not sure how to verify that.
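For anyone who hasn't looked at it, the core trick in Transformer-XL is roughly this (toy single-head sketch, no relative positional encodings, sizes made up): the previous segment's hidden states are cached and attended to, but not backpropagated through.

```python
import torch
import torch.nn as nn

# Toy sketch of Transformer-XL-style segment-level recurrence. The hidden
# states of the previous segment are cached and attended to -- but gradients
# don't flow into the cache -- which is how the model "sees" past the
# current window.
d, seg_len = 64, 128
wq, wk, wv = (nn.Linear(d, d) for _ in range(3))

def attend(h_current, h_cache):
    # keys/values span the cached segment plus the current one
    ctx = torch.cat([h_cache.detach(), h_current], dim=0)  # stop-gradient on the cache
    q, k, v = wq(h_current), wk(ctx), wv(ctx)
    attn = torch.softmax(q @ k.T / d**0.5, dim=-1)
    return attn @ v                                         # (seg_len, d)

cache = torch.zeros(seg_len, d)
for segment in torch.randn(10, seg_len, d):                 # stream of segments
    out = attend(segment, cache)
    cache = out                                             # becomes the next segment's "memory"
```

Whether that counts as real long-term memory is debatable; the cached states only carry information a bounded distance back (roughly layers times segment length), so it reads more like an extended window than an arbitrary-horizon store.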
I briefly read a ChatGPT description of Transformer-XL. Is this essentially long-term memory? Are there computations an LSTM could do that a Transformer-XL couldn't?
On mobile but FYI langchain implements some kind of memory.
Also, this other post might interest you. It's about asking GPT to decide when to call a memory module to store data: https://www.lesswrong.com/posts/bfsDSY3aakhDzS9DZ/instantiating-an-agent-with-gpt-4-and-text-davinci-003
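For reference, minimal usage of LangChain's conversation memory looks roughly like this (based on the API around early 2023, so imports and class names may have moved since):

```python
from langchain.llms import OpenAI
from langchain.chains import ConversationChain
from langchain.memory import ConversationBufferMemory

# The buffer memory keeps the running transcript and re-injects it into
# every prompt, so it's "memory" only up to the context window.
llm = OpenAI(temperature=0)
conversation = ConversationChain(llm=llm, memory=ConversationBufferMemory())

conversation.predict(input="My project is a Haskell parser for chess PGN files.")
conversation.predict(input="Remind me, what language is my project in?")
```

If I remember correctly there are also summary-based and vector-store-backed memory classes aimed at getting past the transcript-must-fit-in-context limitation.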
Given that we know LLMs can use tools, can traditional databases be used for long-term memory?
I think there has been a lot of research in the past in this space. The first thing that popped into my mind was https://huggingface.co/docs/transformers/model_doc/rag
Currently, there are some approaches that use langchain to persist the history of a conversation into an embeddings database and retrieve the relevant parts when performing a similar query or task.
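Something like this, if I understand the approach (again langchain's ~early-2023 API; FAISS and OpenAI embeddings are just one possible choice of backends here):

```python
from langchain.embeddings import OpenAIEmbeddings
from langchain.vectorstores import FAISS

# Persist pieces of earlier conversations as embeddings...
history_chunks = [
    "2023-03-01: user prefers answers with code examples",
    "2023-03-04: user's project targets an ARM Cortex-M4 board",
]
store = FAISS.from_texts(history_chunks, OpenAIEmbeddings())
store.save_local("chat_memory_index")   # reusable across sessions

# ...then, for a new query, pull back the most relevant bits and prepend
# them to the prompt before calling the LLM.
relevant = store.similarity_search("what hardware am I building for?", k=1)
prompt_context = "\n".join(doc.page_content for doc in relevant)
```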
One thing I'm slightly puzzled by: an obvious improvement to LLMs would be adding some kind of long-term memory that allows them to retain more information than fits in their context window. Naively, I would imagine that even just throwing some recurrent neural net layers in there would be better than nothing?
But while I've seen LLM papers that talk about how they're multimodal or smarter than before, I don't recall seeing any widely publicized model that extends memory beyond the immediate context window, and that confuses me.