- Until now, ChatGPT handled audio through a pipeline of three models: speech-to-text transcription, then GPT-4, then text-to-speech (sketched in the first code block after this list). GPT-4o is apparently trained on text, voice, and vision so that everything is done natively. You can now interrupt it mid-sentence.
- It has GPT-4-level intelligence according to benchmarks. 16-shot GPT-4o is somewhat better at transcription than Whisper (that's a weird comparison to make), and 1-shot GPT-4o is considerably better at vision than previous models.
- It's also somehow been made significantly faster at inference time. This might be mainly driven by an improved tokenizer. Edit: nope, the new tokenizer only compresses English by about 1.1x (see the second sketch after this list).
- It's confirmed that it was the "gpt2" model seen on the LMSYS arena these past weeks, a marketing move. It currently has the highest Elo.
- They'll be gradually releasing it for everyone, even free users.
- Safety-wise, they claim to have run it through their Preparedness Framework and red-teamed it with external experts, but have published no reports on this. "For now", audio output is limited to a selection of preset voices (addressing the risk of audio impersonation).
- The demos during the livestream still seemed a bit clunky in my opinion. It's still far from integrating naturally into normal human conversation, which is what they're moving towards.
- No competitor to Google Search, as had been rumored.
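
For concreteness, here's a minimal sketch of what the old three-model pipeline (first bullet) looks like when rebuilt against the public API. The model names are the publicly documented ones, not necessarily what ChatGPT used internally, and the file paths are placeholders:

```python
from openai import OpenAI

client = OpenAI()

# 1. Speech-to-text: transcribe the user's audio with Whisper.
with open("user_message.mp3", "rb") as audio_file:
    transcript = client.audio.transcriptions.create(
        model="whisper-1", file=audio_file
    )

# 2. Text reasoning: feed the transcript to GPT-4.
reply = client.chat.completions.create(
    model="gpt-4",
    messages=[{"role": "user", "content": transcript.text}],
)

# 3. Text-to-speech: synthesize the reply as audio.
speech = client.audio.speech.create(
    model="tts-1", voice="alloy", input=reply.choices[0].message.content
)
speech.write_to_file("reply.mp3")
```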
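And a quick way to check the tokenizer claim (third bullet) with tiktoken, assuming its registry names: `cl100k_base` is GPT-4's encoding and `o200k_base` is GPT-4o's:

```python
import tiktoken

gpt4_enc = tiktoken.get_encoding("cl100k_base")   # GPT-4
gpt4o_enc = tiktoken.get_encoding("o200k_base")   # GPT-4o

sample = (
    "Until now ChatGPT dealt with audio through a pipeline of three "
    "models; GPT-4o is trained on text, voice and vision natively."
)
ratio = len(gpt4_enc.encode(sample)) / len(gpt4o_enc.encode(sample))
print(f"English compression ratio (GPT-4 tokens / GPT-4o tokens): {ratio:.2f}")
# For typical English text this lands around 1.1x, consistent with the edit above;
# the big gains of the new tokenizer are in non-English languages.
```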
"Safety"-wise, they obviously haven't considered the implications of (a) trying to make it sound human and (b) having it try to get the user to like it.
It's extremely sycophantic, and the voice intensifies the effect. They even had their demonstrator show it a sign saying "I ❤️ ChatGPT", and instead of flatly saying "I am a machine. Get counseling.", it acted flattered.
At the moment, it's really creepy, and most people seem to dislike it pretty intensely. But I'm sure they'll tune that out if they can.
There's a massive backlash against social media selecting for engagement. There's a lot of worry about AI manipulation. There's a lot of talk from many places about how "we should have seen the bad impacts of this or that, and we'll do better in the future". There's a lot of high-sounding public interest blather all around. But apparently none of that actually translates into OpenAI, you know, not intentionally training a model to emotionally manipulate humans for commercial purposes.
Still not an X-risk, but definitely on track to build up all the right habits for ignoring one when it pops up...
I was a bit surprised that they chose to (allowed?) give 4o that much emotion. I'm also really curious how they fine-tuned it to that particular state and how much fine-tuning was required to make it conversational. My naive assumption is that if you spoke to a merely-pretrained multimodal model, it would just try to complete/extend the speech in the speaker's own voice, or switch to another generically confabulated speaker depending on context; it certainly wouldn't act as a particular, consistent responder. I hope they didn't rely entirely on RLHF.