kotrfa

Comments
What the cost difference in processing input vs. output tokens with LLMs?

Interesting, thanks!

What the cost difference in processing input vs. output tokens with LLMs?

Thanks. I think I get it now. (At least one of) my confusions was conflating a "transformer run" with the "number of FLOPs".

And I get the point about cost; that's what I meant, but I articulated it poorly.

What the cost difference in processing input vs. output tokens with LLMs?

Heh, I actually think it's answered here.

What the cost difference in processing input vs. output tokens with LLMs?

Got it, thanks!

But to process the 1001st input token, you also need to load all 1000 preceding tokens into memory, forming the cache (though that happens in one step). And for each new output token, you surely don't dump the existing KV cache after each generation only to load it again and append the extra KV vectors for the last generated token. So isn't the extra work for output tokens just that the KV cache is accessed, generated, and expanded one token at a time, and that's where the "more work" comes from?

Is there any reason why this would imply an output:input token pricing ratio that is commonly something like 3:1?
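
For my own intuition, here is a rough back-of-the-envelope sketch of the explanation I keep seeing (all numbers are assumptions, e.g. a hypothetical ~70B-parameter model with 16-bit weights): the matmul FLOPs per token are roughly the same for prefill and decode, but decode re-reads the weights for every single generated token, so it is much more memory-bandwidth-bound and needs heavy batching, which plausibly feeds into something like the 3:1 price ratio:

```python
# Toy prefill-vs-decode comparison. All numbers are assumptions
# (hypothetical dense ~70B-parameter model, 16-bit weights), not real pricing inputs.

PARAMS = 70e9          # assumed parameter count
BYTES_PER_PARAM = 2    # fp16/bf16 weights
PROMPT_TOKENS = 1000   # input tokens handled in a single prefill pass

# Rough rule of thumb: ~2 * params matmul FLOPs per token, for prefill and decode alike.
flops_per_token = 2 * PARAMS

# Each forward pass has to stream the full weight matrix from HBM at least once.
weight_bytes = PARAMS * BYTES_PER_PARAM

# Prefill: one pass over the weights serves all prompt tokens at once.
prefill_flops_per_byte = (flops_per_token * PROMPT_TOKENS) / weight_bytes

# Decode: one pass over the weights yields a single new token (per sequence in the batch).
decode_flops_per_byte = flops_per_token / weight_bytes

print(f"prefill: {prefill_flops_per_byte:,.0f} FLOPs per weight byte")
print(f"decode:  {decode_flops_per_byte:,.0f} FLOPs per weight byte")
# Decode does ~PROMPT_TOKENS times less arithmetic per byte of weights moved,
# so it tends to be memory-bandwidth-bound unless many sequences are batched,
# which is one common explanation for output tokens being priced higher.
```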

What the cost difference in processing input vs. output tokens with LLMs?

Thanks for the answer, I appreciate it!

Intuitively, it seems that output tokens should be more expensive. The autoregressive model has to run once for each output token, and as these runs progress, output tokens gradually become a part of the input (so the last token is generated with context being all input and almost all output).

I agree with the intuition, but I think that's where I am confused. Thanks to the KV cache, we do not run the whole new input sequence (previous sequence + last generated token) through the layers again (as we do for the input sequence during prefill); it's all cached (from prefill plus from the generation of the last token). So it doesn't feel like output tokens are more expensive in this case: you run "once" per output token, the same way you run "once" for every input token?
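
To make my own question concrete, here is a minimal single-head attention toy (NumPy only, made-up sizes; the real stack of course has many layers, heads and MLPs): prefill builds the K/V cache for the whole prompt in one pass, and each decode step computes Q/K/V only for the newest token and appends one row to the cache:

```python
import numpy as np

# Minimal single-head attention toy with made-up sizes, just to illustrate
# prefill vs. cached decode. Not how any production stack is actually written.
d = 8
rng = np.random.default_rng(0)
Wq, Wk, Wv = (rng.standard_normal((d, d)) for _ in range(3))

def attend(q, K, V):
    """Attention of a single query vector over cached keys/values."""
    scores = q @ K.T / np.sqrt(d)
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()
    return weights @ V

# Prefill: process the whole prompt in one pass and build the KV cache.
prompt = rng.standard_normal((1000, d))      # embeddings of 1000 input tokens
K_cache, V_cache = prompt @ Wk, prompt @ Wv  # shape (1000, d) each

# Decode: one token at a time; each step adds exactly one K row and one V row.
x = rng.standard_normal(d)                   # embedding of the newest (first generated) token
for _ in range(5):
    q, k, v = x @ Wq, x @ Wk, x @ Wv         # computed only for the newest token
    K_cache = np.vstack([K_cache, k])
    V_cache = np.vstack([V_cache, v])
    x = attend(q, K_cache, V_cache)          # crude stand-in for the full block output
```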

I think they do amortize their costs among all uses. A number of runs (number of output tokens) multiplied by a (varying) cost of each run is unlikely to be close to linear.

Do you mind saying more about this? I am not sure what you mean. I.e., that some pay more and some pay less (e.g. heavy hitters pay less, while small prompters pay comparatively more per token)?
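
If I read the non-linearity point right, it could be that the per-step cost grows with context length, so a flat per-token price has to average over positions (and over users with very different prompt lengths). A toy calculation of just the attention term, with an assumed hidden size:

```python
# Toy illustration: only the attention cost term, which grows with context length.
# All numbers are assumptions; weight matmul costs (flat per token) are omitted.
d = 4096            # assumed hidden size
prompt_len = 1000   # input tokens
out_len = 500       # generated tokens

attention_flops = 0
for step in range(out_len):
    context = prompt_len + step           # positions the new token attends over
    attention_flops += 2 * context * d    # ~2*d FLOPs per cached position (rough)

# If every step cost as much as the first one, the total would be:
flat_estimate = out_len * 2 * prompt_len * d

print(attention_flops / flat_estimate)    # > 1: later output tokens attend over more context
```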

How much AI inference can we do?

Even though some commenters mentioned issues with the article, I really want to appreciate the attempt and the upfront estimates. It's very relevant for something I am now trying to figure out. As I have almost no intuitions about this beyond some raw FLOPs numbers, it pointed to important flaws my analysis would have had. There are not many public sources that explain this [that aren't a book, or that don't require me to read one-to-many books to understand it].

My Clients, The Liars

Yes, but to defend (hehe) OP, he seems to be fully aware of that and addresses that explicitly in the linked article (which is also excellent, like this one):

In part because of those aforementioned stats on the frequency of guilty pleas, public defenders have garnered a reputation for being trial-averse, for pressuring clients to cop a plea just to keep the machine humming along. I think this reputation is ill-deserved. It’s completely counter to my own experience, at least, as few things are talked about with as much awed respect among one’s public-defender peers as the number of trials you have accumulated. It’s the functional equivalent of an attorney’s XP level.

Help Needed: Crafting a Better CFAR Follow-Up Survey

Thanks for the feedback and the encouragement, I will incorporate these.

Btw. for questions 2-4 there is an intentional redundancy.

The Competence Myth

(Slightly tangential.) I think people make a terrible bucket error with competency: they overestimate how competent others are across all dimensions. I.e. it's often enough for a person to show great competency in providing vision, and we assume that person must also be great at leadership or management, and people are shocked when that's not the case. Other examples:

  1. a scientist is great at research, therefore they must be a great teacher or college administrator
  2. a doctor is great at diagnosing, therefore they must be a great surgeon
  3. a programmer is great at programming, therefore they must be great at leading the rest of the team or mentoring ...
Effective children education

Hey. I decided on a private school with more of a "democratic approach". I unfortunately wasn't able to find suitable tutors etc.

I am also trying to process what ChatGPT-like platforms will do to the landscape. E.g. my partner now does almost all of her coding with ChatGPT and it's outstanding. Kids are gonna follow, IMHO.

Posts
Unit economics of LLM APIs (43 karma, 1y, 0 comments)
What the cost difference in processing input vs. output tokens with LLMs? [Question] (3 karma, 1y, 10 comments)
Discord space for people with FTX clawbacks/claims request (1 karma, 2y, 0 comments)
Help Needed: Crafting a Better CFAR Follow-Up Survey [Question] (9 karma, 2y, 2 comments)
Effective children education [Question] (49 karma, 5y, 31 comments)
Epistea Workshop Series: Epistemics Workshop, May 2020, UK (14 karma, 6y, 1 comment)
Epistea Summer Experiment (ESE) (50 karma, 6y, 3 comments)
Tabletop Role Playing Game or interactive stories for my daughter [Question] (2 karma, 6y, 10 comments)
Meetup : Prague Less Wrong Meetup (2 karma, 10y, 0 comments)
Meetup : Lund Meetup (2 karma, 10y, 1 comment)