- Until now ChatGPT dealt with audio through a pipeline of 3 models: audio transcription, then GPT-4, then text-to-speech. GPT-4o is apparently trained on text, voice and vision so that everything is done natively. You can now interrupt it mid-sentence.
- It has GPT-4 level intelligence according to benchmarks. 16-shot GPT-4o is somewhat better at transcription than Whisper (that's a weird comparison to make), and 1-shot GPT-4o is considerably better at vision than previous models.
- It's also somehow been made significantly faster at inference time. Might be mainly driven by an improved tokenizer. Edit: Nope, English tokenizer is only 1.1x.
- It's confirmed it was the "gpt2" model found at LMSys arena these past weeks, a marketing move. It has the highest ELO as of now.
- They'll be gradually releasing it for everyone, even free users.
- Safety-wise, they claim to have run it through their Preparedness framework and the red-team of external experts, but have published no reports on this. "For now", audio output is limited to a selection of preset voices (addressing audio impersonations).
- The demos during the livestream still seemed a bit clanky in my opinion. Still far from naturally integrating in normal human conversation, which is what they're moving towards.
- No competitor of Google search, as had been rumored.
I would be shocked if this is the main driver, they claim that English only has 1.1x fewer tokens, but seem to claim much bigger speed-ups
Specifically, their claim is "2x faster, half the price, and has 5x higher rate limits". For voice, "232 milliseconds, with an average of 320 milliseconds" down from 2.8 seconds (GPT-3.5) and 5.4 seconds (GPT-4) on average. I think there are people with API access who are validating this claim on their workloads, so more data should trickle in soon. But I didn't like seeing Whisper v3 being compared to 16-shot GPT-4o, that's not a fair comparison for WER, and I hope it doesn't catch on.
If you want to try it yourself you can use ELAN, which is the too... (read more)