I don't follow your argument here; it doesn't track. I am not an expert in transformer systems or the in-depth architecture of LLMs, but I know enough to feel that your argument is very off.
You argue that training is different from inference, as part of your argument that LLM inference has a global plan. While training is different from inference, it seems to me that you may not have a clear picture of how they actually differ.
You quote the accurate statement that "LLMs are produced by a relatively simple training process (minimizing loss on next-token prediction, using a large training set from the internet..."
Training intrinsically involves inference. Training USES inference. Training is simply optimizing the inference result: as the quote above implies, it is "minimizing loss on [inference result]", and next-token prediction IS the inference result.
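To make that concrete, here is a rough sketch of a single training step (assuming a PyTorch-style causal language model whose forward pass returns next-token logits; the function and argument names are just illustrative, not any particular library's API):

```python
import torch
import torch.nn.functional as F

def training_step(model, optimizer, token_ids):
    # token_ids: [batch, seq_len] integer tokens taken from the training corpus.
    # Shift by one position: the model sees the prefix, the target is the real next token.
    inputs, targets = token_ids[:, :-1], token_ids[:, 1:]

    logits = model(inputs)                  # inference: predict the next token at every position
    loss = F.cross_entropy(                 # measure how wrong those predictions were
        logits.reshape(-1, logits.size(-1)),
        targets.reshape(-1),
    )

    optimizer.zero_grad()
    loss.backward()                         # adjust the weights to make the predictions less wrong
    optimizer.step()
    return loss.item()
```

The forward pass inside the loop is the same computation inference runs; training just wraps it in a loss and a weight update.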
You can always ask the LLM, without having it store any state, to produce the next token, then do that again, and again, etc. It doesn't have any plans; it just takes the provided input, performs statistical calculations on it, and produces the next token. That IS prediction. It doesn't have a plan and doesn't store a state. It just uses weights and biases (denoting the statistically significant ways of combining the input to produce a hopefully near-optimal output) and numbers such as the query, key, and value (denoting the statistical significance of the input text in relation to itself), and through that statistical process it predicts the next token. It doesn't have a global plan.
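Here is roughly what that token-by-token loop looks like in code (a sketch using the Hugging Face transformers API with greedy decoding; "gpt2" is just an illustrative model choice):

```python
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

input_ids = tokenizer("The capital of France is", return_tensors="pt").input_ids

with torch.no_grad():
    for _ in range(10):
        logits = model(input_ids).logits           # one stateless forward pass over the current text
        next_id = logits[:, -1, :].argmax(dim=-1)  # pick the most likely next token
        input_ids = torch.cat([input_ids, next_id.unsqueeze(-1)], dim=-1)

print(tokenizer.decode(input_ids[0]))
```

Every iteration is a fresh forward pass over the text so far; nothing persists between steps except the tokens already emitted.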