- Until now ChatGPT dealt with audio through a pipeline of 3 models: audio transcription, then GPT-4, then text-to-speech. GPT-4o is apparently trained on text, voice and vision so that everything is done natively. You can now interrupt it mid-sentence.
- It has GPT-4 level intelligence according to benchmarks. 16-shot GPT-4o is somewhat better at transcription than Whisper (that's a weird comparison to make), and 1-shot GPT-4o is considerably better at vision than previous models.
- It's also somehow been made significantly faster at inference time. Might be mainly driven by an improved tokenizer. Edit: Nope, English tokenizer is only 1.1x.
- It's confirmed it was the "gpt2" model found at LMSys arena these past weeks, a marketing move. It has the highest ELO as of now.
- They'll be gradually releasing it for everyone, even free users.
- Safety-wise, they claim to have run it through their Preparedness framework and the red-team of external experts, but have published no reports on this. "For now", audio output is limited to a selection of preset voices (addressing audio impersonations).
- The demos during the livestream still seemed a bit clanky in my opinion. Still far from naturally integrating in normal human conversation, which is what they're moving towards.
- No competitor of Google search, as had been rumored.
Specifically, their claim is "2x faster, half the price, and has 5x higher rate limits". For voice, "232 milliseconds, with an average of 320 milliseconds" down from 2.8 seconds (GPT-3.5) and 5.4 seconds (GPT-4) on average. I think there are people with API access who are validating this claim on their workloads, so more data should trickle in soon. But I didn't like seeing Whisper v3 being compared to 16-shot GPT-4o, that's not a fair comparison for WER, and I hope it doesn't catch on.
If you want to try it yourself you can use ELAN, which is the tool used in the paper they cite for human response times. I think if you actually ran this test, you would find a lot of inconsistency with large differences between min vs max response time, average hides a lot vs a latency profile generated by e.g HdrHistogram. Auditory signals reach central processing systems within 8-10ms, but visual stimulus can take around 20-40ms, so there's still room for 1-2 OOM of latency improvement.
LLM inference is not as well studied as training, so there's lots of low hanging fruit when it comes to optimization (at first bottlenecked on memory bandwidth, post quantization, on throughput and compute within acceptable latency envelopes), plus there's a lot of pressure to squeeze out extra efficiency given constraints on hardware.
Llama-2 came out in July 2023, by September there were so many articles coming out on inference tricks I created a subreddit to keep track of high quality ones, though I gave up by November. At least some of the improvement is from open source code making it back into the major labs. The trademark for GPT-5 was registered in July (and included references to audio being built in), updated in February, and in March they filed to use "Voice Engine" which seems about right for a training run. I'm not aware of any publicly available evidence which contradicts the hypothesis that GPT-5 would just be a scaled up version of this architecture.