Mario Schlosser

Agree. If GPT-4 can solve 3-dim matrix multiplication with chain-of-thought, then doesn't that mean you could just take the last layer's output (before you generate a single token from it) and send it into other instances of GPT-4, and then chain their outputs together? That should by definition be enough "internal state-keeping" that you wouldn't need the "note-keeping" of chain-of-thought. And that's precisely bayesed's point - because from the outside, that kind of construct would just look like a bigger LLM. I think this is a clever post, but the bottlenecking created by token generation is too arbitrary a way to assess LLM complexity.
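
For concreteness, here's a minimal toy sketch of the distinction I have in mind, in PyTorch. These are tiny stand-in modules, not GPT-4, and the `forward_hidden` entry point is something I'm making up for illustration; the point is only the contrast between squeezing one model's state through discrete tokens versus handing its last-layer hidden states to the next model directly:

```python
import torch
import torch.nn as nn

VOCAB, D_MODEL, SEQ = 100, 64, 8

class TinyLM(nn.Module):
    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(VOCAB, D_MODEL)
        self.block = nn.TransformerEncoderLayer(D_MODEL, nhead=4, batch_first=True)
        self.head = nn.Linear(D_MODEL, VOCAB)

    def forward(self, tokens):
        h = self.block(self.embed(tokens))   # last-layer hidden states
        return h, self.head(h)               # hidden states + token logits

    def forward_hidden(self, hidden):
        # Hypothetical entry point: accept another model's hidden states
        # directly, instead of embedded tokens.
        h = self.block(hidden)
        return h, self.head(h)

model_a, model_b = TinyLM(), TinyLM()
prompt = torch.randint(0, VOCAB, (1, SEQ))

# (a) Chain-of-thought style: collapse A's state into discrete tokens,
#     then feed those tokens to B. Everything B sees has to fit through
#     this narrow token channel.
hidden_a, logits_a = model_a(prompt)
tokens_for_b = logits_a.argmax(dim=-1)          # lossy token bottleneck
_, logits_b_via_tokens = model_b(tokens_for_b)

# (b) The construct described above: hand A's last-layer hidden states
#     to B directly, so the "note-keeping" stays internal.
_, logits_b_via_hidden = model_b.forward_hidden(hidden_a)

print(logits_b_via_tokens.shape, logits_b_via_hidden.shape)
```

Route (b) is exactly what "a bigger LLM" does internally between its own layers, which is why, viewed from the outside, chaining instances this way would be indistinguishable from scaling up a single model.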