One thing that I'm slightly puzzled by is that an obvious improvement to LLMs would be adding some kind of long-term memory that would allow them to retain more information than fits in their context window. Naively, I would imagine that even just throwing some recurrent neural net layers in there would be better than nothing?
But while I've seen LLM papers that talk about how they're multimodal or smarter than before, I don't recall seeing any widely publicized model that extends memory beyond the immediate context window, and that confuses me.
Transformers take O(n^2) computation for a context window of size n, because every attention layer compares every token in the window with every other token. That gives the benefits of a small working memory, but it doesn't scale. And it has no way of remembering things from before the context window, so it's like a human with a busted hippocampus (Korsakoff's syndrome) who can't form new memories.
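For concreteness, here's a minimal sketch of single-head scaled dot-product attention in plain numpy (the names and shapes are illustrative, not taken from any particular model); the n-by-n score matrix is where the quadratic cost shows up:

```python
# Minimal sketch of single-head scaled dot-product attention,
# to show where the O(n^2) cost comes from: the score matrix has
# one entry per *pair* of positions in the context window.
import numpy as np

def attention(Q, K, V):
    """Q, K, V: arrays of shape (n, d) for a window of n tokens."""
    n, d = Q.shape
    scores = Q @ K.T / np.sqrt(d)           # shape (n, n): the quadratic term
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)   # softmax over each row
    return weights @ V                       # each output mixes all n values

# Doubling n quadruples the size of `scores`, which is why long
# contexts get expensive quickly.
rng = np.random.default_rng(0)
n, d = 8, 4
Q, K, V = (rng.standard_normal((n, d)) for _ in range(3))
print(attention(Q, K, V).shape)  # (8, 4)
```

Nothing outside Q, K, and V enters the computation, which is the point: once a token falls out of the window, no trace of it remains.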