Thanks for thinking about this, I think this is an important topic!
Inside the AI's chain-of-thought, each forward pass can generate many English tokens instead of one, allowing more information to pass through the bottleneck.
I wonder how one would do this; do you mean allowing the model to output a distribution over tokens for each output position (and then also read that distribution back in)? I could imagine this being somewhere between normal CoT and latent (neuralese) CoT!
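If it helps, here's the kind of thing I'm picturing, as a very rough sketch (none of these names are a real API; it's just an illustration of the "soft token" idea):

```python
import torch

# Hypothetical "soft token" step: instead of sampling one hard token, feed back
# the probability-weighted mixture of token embeddings, so the next forward pass
# sees the whole distribution rather than a single ≈17-bit choice.
def soft_token_step(logits: torch.Tensor, embedding_matrix: torch.Tensor) -> torch.Tensor:
    probs = torch.softmax(logits, dim=-1)       # distribution over the vocabulary
    soft_embedding = probs @ embedding_matrix   # expected embedding under that distribution
    return soft_embedding                       # fed back in place of a one-hot token embedding
```

That would keep far more than ≈17 bits per position while still being anchored to the English vocabulary, which is why it feels like a halfway point to me.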
After the chain-of-thought ends, and the AI is giving its final answer, it generates only one English token at a time, to make each token higher quality. The architecture might still generate many tokens in one forward pass, but a simple filter repeatedly deletes everything except its first token from the context window.
If my interpretation of your idea above is correct, then I imagine this part would look just like top-k / top-p generation as it is done currently, which seems sensible.
I'm only ~30% certain that I correctly understood your idea, though, so I'd love it if you could clarify what this generating-many-tokens idea looks like!
Haha, my idea was just "maybe to solve this information bottleneck problem, we should solve the generate-many-tokens-in-one-pass problem."
I haven't really thought of any solution to the generate-many-tokens-in-one-pass problem yet :/
I'll edit the post to mention this.
One stupid attempt to solve the "generate-many-tokens-in-one-pass" problem is to start off with the main LLM outputting 1 token at a time, and a small, cheap LLM outputting the next 5 tokens. You then let the small LLM eavesdrop on the residual stream of the main LLM, and use reinforcement learning on both the main LLM and the small LLM.
The hope is that the main LLM will eventually learn to use part of its residual stream to communicate to the small LLM, and tell the small LLM what the next 5 tokens should be, so the computations in the main LLM can directly influence 6 tokens of output.
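To make the shape of this more concrete, here's a very rough sketch (the drafter architecture, the `return_hidden` flag, and everything else below are assumptions I'm making up for illustration, and the RL training loop is left out entirely):

```python
import torch
import torch.nn as nn

# Rough sketch, not a real API: the main LLM produces one token plus its final
# residual-stream vector; a small "drafter" conditions only on that vector and
# proposes the next 5 tokens. Both models would then be trained with RL on the
# full 6-token output.
class Drafter(nn.Module):
    def __init__(self, d_model: int, vocab_size: int, n_draft: int = 5):
        super().__init__()
        self.proj = nn.Linear(d_model, d_model)
        self.heads = nn.ModuleList([nn.Linear(d_model, vocab_size) for _ in range(n_draft)])

    def forward(self, residual_stream: torch.Tensor) -> list[torch.Tensor]:
        # residual_stream: (batch, d_model), the main LLM's last-layer hidden state
        h = torch.tanh(self.proj(residual_stream))
        return [head(h) for head in self.heads]   # one set of logits per drafted token

def generate_six_tokens(main_llm, drafter, context_ids):
    # main_llm is assumed to return logits plus its final hidden states
    logits, residual = main_llm(context_ids, return_hidden=True)
    first_token = logits[:, -1].argmax(dim=-1)            # the main LLM's 1 token
    draft_logits = drafter(residual[:, -1])               # "eavesdrop" on the residual stream
    draft_tokens = [dl.argmax(dim=-1) for dl in draft_logits]
    return [first_token] + draft_tokens                   # 6 tokens from one main forward pass
```

The drafter only ever sees the residual stream, so any information about the next 5 tokens has to flow through that channel, which is exactly the channel RL would hopefully teach the main LLM to use.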
I guess the "simple filter repeatedly deletes everything except its first token from the context window" was a bit unclear. I'll rewrite that.
What I wanted to say was, when the AI is talking to you (rather than talking to itself in its chain-of-thought), we want the AI to slow down, and do more computation for each token it outputs. In this case, we don't want it outputting many tokens for each forward pass. We want to only keep the first "high quality" token and delete the rest.
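Something like this, as a rough sketch (the `generate_chunk` method is hypothetical, standing in for whatever the many-tokens-per-pass architecture would expose):

```python
# Keep-only-the-first-token filter for the final answer: the model may still
# emit a chunk of several tokens per forward pass, but only the first one ever
# enters the context window; the rest are deleted.
def answer_one_token_at_a_time(model, context_ids, max_len=256, eos_id=0):
    while len(context_ids) < max_len:
        chunk = model.generate_chunk(context_ids)  # e.g. 6 tokens from one forward pass
        first = chunk[0]                           # the "high quality" token we keep
        context_ids.append(first)                  # the other tokens are thrown away
        if first == eos_id:
            break
    return context_ids
```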
I don't think this is related to top-k/top-p generation, because that refers to how an LLM samples one token from its distribution. k is the number of tokens considered, not the number of tokens generated at once.
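For contrast, top-k sampling looks roughly like this; k is how many candidate tokens are considered, but exactly one token comes out per forward pass:

```python
import torch

# Standard top-k sampling: restrict to the k most likely tokens, renormalize,
# then still sample just one token.
def top_k_sample(logits: torch.Tensor, k: int = 50) -> int:
    top_logits, top_ids = torch.topk(logits, k)        # the k candidates considered
    probs = torch.softmax(top_logits, dim=-1)          # renormalize over those k
    choice = torch.multinomial(probs, num_samples=1)   # sample one of them
    return int(top_ids[choice])                        # a single emitted token
```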
Thank you so much for reading and for the reply :)
One downside of an English chain-of-thought is that each token contains only ≈17 bits of information, creating a tight information bottleneck.
Don't take my word for it; look at this section from a story by Daniel Kokotajlo, Thomas Larsen, elifland, Scott Alexander, Jonas V, and romeo:
Each forward pass of the transformer may compute 1000 times more information than one token. But all of that information is "deleted" from the point of view of the next forward pass (and the forward passes after it); the only information kept is a single token, or ≈17 bits of information.
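(For reference, the ≈17 bits is just the base-2 log of the vocabulary size; the exact vocabulary size below is an assumed round number in the range of modern tokenizers.)

```python
import math

vocab_size = 130_000            # assumed vocabulary size
print(math.log2(vocab_size))    # ≈ 17.0 bits: the most one sampled token can carry
```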
A lot of information is stored by KV Caching, but that doesn't change the story.[1]
Neuralese recurrence gets around the information bottleneck by letting the AI access the raw thoughts it had in previous computations, without collapsing each thought into a single token.
If the AI can access all the bits from its previous computations, there may be far fewer wasted computations. This assumes that more than ≈17 of those ≈10000 bits computed become useful, which is plausible given that reinforcement learning might improve their usefulness over time. It also assumes the AI can keep the faster-growing context window organized.
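As a cartoon of what "access the raw thoughts" could mean mechanically (this is not any real architecture; the `extra_vectors` interface is something I'm assuming for illustration):

```python
import torch

# Neuralese-recurrence cartoon: instead of collapsing the final hidden state
# into one sampled token, keep the raw vector itself in a growing list of
# "thought vectors" that future forward passes can attend to.
def neuralese_step(model, token_ids, thought_vectors):
    hidden = model(token_ids, extra_vectors=thought_vectors)  # final hidden state, (d_model,)
    thought_vectors.append(hidden)   # all the bits are kept, nothing is collapsed to ≈17 bits
    return thought_vectors
```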
Avoiding neuralese
Letting the AI use all the bits without collapsing them into an English token may avoid the information bottleneck and improve efficiency, but it is terrifying from the point of view of interpretable chains-of-thought and scalable oversight.
We won't be able to read what the AI is thinking in plain English anymore.
So surely, there has to be some way to at least reduce the bottleneck, without converting the chain-of-thought into an utterly alien language?
Here is one idea:
Admittedly, even this method would reduce interpretability by making the chain of thought longer. But that trade-off seems more or less inevitable, whereas the trade-off where the AI has to learn some non-English alien language seems unnecessary.
Disclaimer: I have no formal computer science education.
Edit: my idea only states, "maybe to solve this information bottleneck problem, we should solve the generate-many-tokens-in-one-pass problem." It doesn't solve the generate-many-tokens-in-one-pass problem.
Transformers aren't allowed to think about information saved in their KV Caches. Transformers have ≈ the exact same thoughts they would have had without KV Caches, only faster: the cache just stores keys and values that the model would otherwise recompute from the same tokens.
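A quick way to convince yourself of this, using a small Hugging Face model (gpt2 is just an arbitrary choice for the check):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Greedy decoding with the KV cache gives the same tokens as recomputing the
# whole prefix at every step; the cache only saves time, not "thoughts".
tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()
ids = tok("The cat sat on the", return_tensors="pt").input_ids

def greedy(use_cache):
    with torch.no_grad():
        return model.generate(ids, max_new_tokens=10, do_sample=False, use_cache=use_cache)

print(torch.equal(greedy(True), greedy(False)))  # True (barring tiny floating-point noise)
```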