Do multimodal LLMs (like 4o) use OCR under the hood to read dense text in images?
SOTA multimodal LLMs can read text from images (e.g. signs, screenshots, book pages) remarkably well. Are they actually running an internal OCR system, or do they learn to "read" purely through pretraining (e.g. contrastive learning on image-text pairs)?
Jun 15, 2025