
Let's say that we are training an LLM and have 3 selections of writing, A, B, and C, each of which is then broken down into 3 parts. Call them A1, A2, etc.

Is there a benefit to structuring the batches like this:

Batch 1: {A1,B1,C1}

Batch 2: {A2,B2,C2}

Batch 3: {A3,B3,C3}

such that the most recent batches can provide information for the current prediction? 
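For concreteness, here is a minimal Python sketch of the batch layout described above. The function name `interleaved_batches` and the dictionary layout are assumptions made for illustration; the only point is that batch k holds the k-th part of every selection, so consecutive batches continue the same texts.

```python
from typing import Dict, List


def interleaved_batches(docs: Dict[str, List[str]]) -> List[List[str]]:
    """Return batches where batch k holds the k-th segment of every document."""
    segment_counts = {len(parts) for parts in docs.values()}
    if len(segment_counts) != 1:
        # Assumption of this sketch: all documents are split into equally many parts.
        raise ValueError("every document must have the same number of segments")
    # zip(*...) transposes per-document lists into per-step batches.
    return [list(step) for step in zip(*docs.values())]


docs = {
    "A": ["A1", "A2", "A3"],
    "B": ["B1", "B2", "B3"],
    "C": ["C1", "C2", "C3"],
}
batches = interleaved_batches(docs)
for i, batch in enumerate(batches, start=1):
    print(f"Batch {i}: {batch}")
```

Whether the weight update from one batch actually helps on the next is the open question; the sketch only fixes the ordering.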

My understanding is that human episodic memory works something like this, but that current neural networks have a learning rate that is too low for this to be useful. Have there been experiments run on this? I feel like this is an obvious idea and has been examined exhaustively, but I just don't know the right vocabulary. 


There has been work structuring batches like this! But as far as I know only with deliberately provided external memory, rather than trying to rely on the sort of innate-recent-recall Transformers might have.

Specifically, if you look at page 4 of Memorizing Transformers, it has pretty much exactly your chart. Memorizing Transformers uses an approximation to KNN as a non-differentiable substitute for attention in a handful of layers, and gets a much, much longer effective context length with this.

This (or something like this) might be behind Anthropic's 100k context length, particularly because you can add this to a pre-trained Transformer and have it just work. Or it might not; there are a bunch of ways to try to extend effective context length.

(I don't thiiiink this would work very well without some kind of addition to the Transformer architecture, because I don't think the training process in batch 2 will teach it how to access whatever was changed in the weights by batch 1.)
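To make the external-memory idea in the comment above a bit more tangible, here is a toy Python sketch of a key/value memory with a top-k lookup. It is not the Memorizing Transformers implementation: it uses exact (not approximate) nearest-neighbour search, plain dot-product scores, and a hypothetical class name `KNNMemory`. It only shows the shape of the operation, namely storing keys and values from earlier segments and softmax-weighting the k best matches for a later query.

```python
import math
from typing import List

Vector = List[float]


class KNNMemory:
    """Toy external memory: stores (key, value) pairs from earlier segments."""

    def __init__(self) -> None:
        self.keys: List[Vector] = []
        self.values: List[Vector] = []

    def add(self, key: Vector, value: Vector) -> None:
        self.keys.append(key)
        self.values.append(value)

    def lookup(self, query: Vector, k: int) -> Vector:
        """Softmax-weighted average of the values whose keys best match the query."""
        if not self.keys:
            raise ValueError("memory is empty")
        # Dot-product similarity between the query and every stored key.
        scores = [sum(q * x for q, x in zip(query, key)) for key in self.keys]
        # Exact top-k selection; a real system would use an approximate index.
        top = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]
        best = max(scores[i] for i in top)
        weights = [math.exp(scores[i] - best) for i in top]  # stable softmax
        total = sum(weights)
        dim = len(self.values[0])
        return [
            sum(w * self.values[i][d] for w, i in zip(weights, top)) / total
            for d in range(dim)
        ]


memory = KNNMemory()
memory.add([1.0, 0.0], [10.0, 0.0])   # e.g. written while processing A1
memory.add([0.0, 1.0], [0.0, 10.0])   # e.g. written while processing B1
memory.add([-1.0, 0.0], [-10.0, 0.0])
nearest = memory.lookup([5.0, 0.0], k=1)
print(nearest)
```

The top-k selection step is the non-differentiable part: gradients can flow through the softmax weights, but not through the choice of which entries were retrieved.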

Thanks, that was very informative. I'll be tinkering with it as I upskill on LLMs.

Depends on what you want to do. Look at "dynamic evaluation" (bibliography) for an approach that adapts the weights with a learning rate at evaluation time, without using an external memory such as a neural cache.
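As a rough illustration of the dynamic-evaluation idea (score a segment, then take a gradient step on it before scoring the next one), here is a toy Python sketch. Everything in it is an assumption for illustration: the "model" is just a character-level unigram softmax, the update is plain SGD, and the function names are made up. It is not how any published dynamic-evaluation system is implemented.

```python
import math
from collections import Counter
from typing import Dict, List


def nll(logits: Dict[str, float], segment: str) -> float:
    """Mean negative log-likelihood of a segment under a unigram softmax."""
    log_z = math.log(sum(math.exp(v) for v in logits.values()))
    return -sum(logits[ch] - log_z for ch in segment) / len(segment)


def sgd_step(logits: Dict[str, float], segment: str, lr: float) -> None:
    """One gradient step on the mean NLL of the segment (updates logits in place)."""
    z = sum(math.exp(v) for v in logits.values())
    probs = {ch: math.exp(v) / z for ch, v in logits.items()}
    counts = Counter(segment)
    for ch in logits:
        # d(mean NLL)/d(logit) = model probability - empirical frequency
        grad = probs[ch] - counts[ch] / len(segment)
        logits[ch] -= lr * grad


def evaluate(segments: List[str], vocab: str, lr: float) -> List[float]:
    """Score each segment in order; with lr > 0, adapt on it after scoring."""
    logits = {ch: 0.0 for ch in vocab}
    losses = []
    for segment in segments:
        losses.append(nll(logits, segment))  # score before adapting
        if lr > 0:
            sgd_step(logits, segment, lr)
    return losses


segments = ["aaab", "aaba", "abaa"]  # stand-ins for A1, A2, A3
static = evaluate(segments, vocab="ab", lr=0.0)
dynamic = evaluate(segments, vocab="ab", lr=1.0)
print("static: ", static)
print("dynamic:", dynamic)
```

In this toy setup the later segments get a lower loss with adaptation than without, because the statistics picked up from earlier segments carry over through the weights. How much of that survives in a real Transformer at realistic learning rates is exactly what the thread is asking about.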

I'm mostly just curious about how difficult it is for a transformer to learn to effectively access information from recent backprops, without using outside structures. Can it pull an essay title? General topic? And how well does this work for stochastic vs. batch processing? Thanks a lot btw.