One thing I'm slightly puzzled by: an obvious improvement to LLMs would be adding some kind of long-term memory that lets them retain more information than fits in their context window. Naively, I'd imagine that even just throwing some recurrent neural net layers in there would be better than nothing?
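To make the naive idea concrete, here's a toy sketch (all names, shapes, and weights are made up for illustration, not from any real model): a small recurrent cell whose hidden state is carried across context windows, so that a fixed-size vector summarizes arbitrarily many earlier chunks.

```python
import numpy as np

# Toy illustration of "recurrent layers as long-term memory":
# a hidden state of fixed size d is updated once per chunk, so
# information from far outside the context window can persist.
rng = np.random.default_rng(0)
d = 8                                   # hidden size (illustrative)
W_h = rng.normal(scale=0.1, size=(d, d))
W_x = rng.normal(scale=0.1, size=(d, d))

def rnn_step(h, x):
    """One Elman-style recurrence: new state mixes old state and input."""
    return np.tanh(h @ W_h + x @ W_x)

def summarize_chunks(chunks):
    """Fold an arbitrarily long sequence of chunk embeddings into one state."""
    h = np.zeros(d)
    for x in chunks:
        h = rnn_step(h, x)
    return h

# The state stays size d no matter how many chunks we feed in -- that's
# the appeal (unbounded "memory") and also the catch (fixed capacity).
long_input = [rng.normal(size=d) for _ in range(1000)]
h = summarize_chunks(long_input)
```

Of course, this says nothing about whether such a state can be trained to store anything useful at scale; it just shows why the idea looks cheap to bolt on.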
But while I've seen LLM papers that talk about being multimodal or smarter than before, I don't recall seeing any widely-publicized model that extends memory beyond the immediate context window, and that confuses me.
I don't think it's fair for them to claim that the model has an infinite context length. It appears that they train the model as a transformer but can run it as an RNN at inference time. While the RNN doesn't have a hard context-length limit the way the transformer does, I doubt it will perform well on contexts longer than those it saw during training. There may also be limits to how much information can be stored in the fixed-size hidden state, such that the model has a shorter effective context length than current SOTA LLMs.
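For anyone curious what "train as a transformer, run as an RNN" means mechanically, here's a minimal sketch of the standard linear-attention version of the trick (my own toy reconstruction, not the paper's code; the feature map and dimensions are arbitrary). With a positive feature map phi, causal attention output sum_i phi(q_t)·phi(k_i) v_i can be accumulated into a fixed-size state S = sum_i phi(k_i) v_i^T, giving a recurrence with no hard context limit — but S never grows, which is exactly the capacity concern above.

```python
import numpy as np

rng = np.random.default_rng(1)
T, d = 64, 4                              # sequence length, head dim (illustrative)
phi = lambda x: np.maximum(x, 0) + 1e-6   # simple positive feature map (assumed)

Q, K, V = (rng.normal(size=(T, d)) for _ in range(3))

# Parallel ("transformer") form: explicit T x T causal attention matrix.
A = np.tril(phi(Q) @ phi(K).T)            # mask out future tokens
out_parallel = (A @ V) / A.sum(axis=1, keepdims=True)

# Recurrent ("RNN") form: one fixed-size state, updated per token.
S = np.zeros((d, d))                      # running sum of phi(k_t) v_t^T
z = np.zeros(d)                           # running sum of phi(k_t), for normalization
out_recurrent = np.empty_like(V)
for t in range(T):
    S += np.outer(phi(K[t]), V[t])
    z += phi(K[t])
    out_recurrent[t] = (phi(Q[t]) @ S) / (phi(Q[t]) @ z)

# Both forms compute the same outputs; only the recurrent one has a
# state whose size is independent of T.
```

The two forms agree token for token, so the "infinite context" claim is really a claim about the d x d state S being able to hold whatever matters from the past, which is where I'd expect it to fall short of full attention.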