Thanks, I think I get it now. At least one of my confusions was conflating a "transformer run" with the "number of FLOPs".
And I get the point about cost; that's what I meant, but I articulated it poorly.
Got it, thanks!
But to process the 1001st input token, you also need to load all 1,000 preceding tokens into memory, forming the cache (though that does happen in one step). And for each new output token, you surely don't dump the existing KV cache after each generation only to load it again and append the KV vectors for the last generated token. So isn't the extra work for output tokens just that the KV cache is accessed, grown, and read one token at a time, and that's where the "more work" comes from?
Is there any reason why this would imply the output:input token pricing ratio commonly being something like 3:1?
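A back-of-the-envelope sketch of why the two kinds of tokens cost differently at the hardware level. Everything here is an assumption, not a measurement: a hypothetical 70B-parameter dense model in 16-bit weights on hardware with ~1e15 FLOP/s of usable compute and ~2e12 bytes/s of memory bandwidth.

    PARAMS = 70e9           # assumed parameter count
    BYTES_PER_PARAM = 2     # fp16/bf16 weights
    FLOPS = 1e15            # assumed sustained compute, FLOP/s
    MEM_BW = 2e12           # assumed memory bandwidth, bytes/s

    flops_per_token = 2 * PARAMS  # rough forward-pass FLOPs per token

    # Prefill: the whole prompt goes through in one batch, so the weights
    # are read from memory once and reused across tokens; the limit is compute.
    prefill_s_per_token = flops_per_token / FLOPS

    # Decode (batch size 1): each step emits one token but still streams all
    # the weights (plus the KV cache, ignored here) from memory; the limit
    # is memory bandwidth.
    decode_s_per_token = PARAMS * BYTES_PER_PARAM / MEM_BW

    print(f"prefill: {prefill_s_per_token * 1e6:.0f} us/token")   # ~140
    print(f"decode:  {decode_s_per_token * 1e6:.0f} us/token")    # ~70000
    print(f"ratio:   {decode_s_per_token / prefill_s_per_token:.0f}x")

Under these made-up numbers the raw single-stream ratio is in the hundreds. Providers batch many concurrent decode streams, which amortizes the weight reads across requests, so my guess is that the billed 3:1-ish ratio reflects batched economics plus pricing simplicity rather than the raw per-stream cost.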
Thanks for the answer, I appreciate it!
Intuitively, it seems that output tokens should be more expensive. The autoregressive model has to run once for each output token, and as these runs progress, output tokens gradually become part of the input (so the last token is generated with the full input and almost all of the output as context).
I agree with the intuition, but I think that's exactly where I'm confused. Thanks to the KV cache, we do not run the whole new sequence (previous sequence + last generated token) through the transformer layers, as we do for the input sequence during prefill; everything except the newest token is already cached (from prefill plus from each previous generation step). So it doesn't feel like output tokens are more expensive in this case: you run "once" per output token, the same way you run "once" for every input token?
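Here is a toy numpy sketch of what I mean: a single attention head with made-up shapes, nothing like a real model, just to make the prefill/decode asymmetry concrete.

    import numpy as np

    d = 64
    rng = np.random.default_rng(0)
    Wq, Wk, Wv = (rng.standard_normal((d, d)) / np.sqrt(d) for _ in range(3))

    def prefill(x):
        # x: (n, d) -- all n input tokens go through in ONE batched matmul each.
        q, k, v = x @ Wq, x @ Wk, x @ Wv
        att = np.tril(np.exp(q @ k.T / np.sqrt(d)))   # causal mask
        att /= att.sum(axis=1, keepdims=True)
        return att @ v, (k, v)                        # outputs + KV cache

    def decode_step(x_new, cache):
        # x_new: (1, d) -- one fresh token; old K/V are read from the cache
        # and appended to, never recomputed.
        k_old, v_old = cache
        q = x_new @ Wq
        k = np.vstack([k_old, x_new @ Wk])
        v = np.vstack([v_old, x_new @ Wv])
        att = np.exp(q @ k.T / np.sqrt(d))
        att /= att.sum(axis=1, keepdims=True)
        return att @ v, (k, v)

    x = rng.standard_normal((1000, d))
    _, cache = prefill(x)                    # 1000 input tokens: one pass
    for _ in range(50):                      # 50 output tokens: 50 serial passes
        x_new = rng.standard_normal((1, d))  # stand-in for the next token's embedding
        _, cache = decode_step(x_new, cache)

FLOPs-wise the per-token work looks similar, which is exactly my confusion; the differences I can see are that the 1000 prefill tokens share one big parallel pass while each output token needs its own serial pass, and that every decode step re-reads the whole (growing) cache.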
I think they do amortize their costs among all uses. The number of runs (number of output tokens) multiplied by the (varying) cost of each run is unlikely to be close to linear.
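Toy arithmetic for what I mean, with arbitrary sizes:

    # Each decode step t attends over t cached tokens, so cache reads grow
    # linearly per step and quadratically in total. Numbers are placeholders.
    n_in, n_out = 1000, 500
    kv_reads = sum(range(n_in, n_in + n_out))  # sum of context lengths per step
    print(kv_reads)  # ~ n_in*n_out + n_out**2 / 2 -- not linear in n_out

So a flat per-output-token price is already an average over very different context lengths.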
Do you mind saying more about this? I am not sure what you mean. Is it that some pay more and some pay less (e.g. heavy hitters pay less, while small prompters pay comparatively more per token)?
Even though some commenters mentioned issues with the article, I really want to appreciate the attempt and the upfront treatment of the estimates. It's very relevant for the thing I am now trying to figure out. As I have almost no intuitions about this beyond some raw FLOPs, it pointed to important flaws my analysis would have had. There aren't many public sources that explain this [that aren't a book or don't require reading one-to-many books to understand].
Yes, but to defend (hehe) OP, he seems to be fully aware of that and addresses it explicitly in the linked article (which is also excellent, like this one):
In part because of those aforementioned stats on the frequency of guilty pleas, public defenders have garnered a reputation for being trial-averse, for pressuring clients to cop a plea just to keep the machine humming along. I think this reputation is ill-deserved. It’s completely counter to my own experience, at least, as few things are talked about with as much awed respect among one’s public-defender peers as the number of trials you have accumulated. It’s the functional equivalent of an attorney’s XP level.
Thanks for the feedback and the encouragement, I will incorporate these.
Btw, for questions 2-4 there is an intentional redundancy.
(slightly tangential) I think people make a terrible bucket error with competency: they overestimate how competent others are across all dimensions. I.e. it's often enough for a person to show great competency in providing vision, and we assume that person must also be great at leadership or management, and people are shocked when that's not the case. Other examples:
Hey. I decided on a private school with more of a "democratic approach". Unfortunately, I wasn't able to find suitable tutors etc.
I am also trying to process what ChatGPT-like platforms will do to the landscape. E.g. my partner now codes almost exclusively with ChatGPT and it's outstanding. Kids are gonna follow, IMHO.
Interesting, thanks!