From Simon Willison’s Weblog it seems to be about 260 tokens per frame, where each frame comes from one second of a video, and each of these frames is being processed the same as any image:

it looks like it really does work by breaking down the video into individual frames and processing each one as an image.

And at the end:

The image input was 258 tokens, the total token count after the response was 410 tokens—so 152 tokens for the response from the model. Those image tokens pack in a lot of information!

But these 152 tokens are just the titles and authors of the books. Information about the order, size, colors, textures, etc.... (read more)

5

0

Replying toVote on Interesting Disagreements

Nathan Simons2y

Vote on Interesting Disagreements

Irrefutable evidence of extraterrestrial life would be a good thing.

37

-1

•••

66

•••

66