OpenAI previously argued that they called ChatGPT "ChatGPT," rather than a name like Siri, Cortana, or Alexa, to help the user stay aware that they are talking to an AI rather than a regular human. Sam Altman framed this as a safety measure. It likely reduces people falling in love with the AI and then doing things the AI tells them to do that aren't a good idea.
Choosing a celebrity voice like Scarlett Johansson's violated that safety principle that OpenAI previously professed to have.
While this isn't the most important safety principle, if they violate safety principles they profess to have for shallow reasons, that makes it unlikely they will stick to any more important safety principle when there is actually a huge incentive to break it.
Scarlett says:
He told me that he felt that by my voicing the system, I could bridge the gap between tech companies and creatives and help consumers to feel comfortable with the seismic shift concerning humans and AI.
If that's true, then OpenAI wants to essentially emotionally manipulate people to be less cautious with AI than they would be naturally inclined to be.
That’s very true. I remember seeing Sam talk in Melbourne a year ago when he was on the “world tour”. Talking about people getting emotionally attached to or using GPT for therapy made him clearly uncomfortable. Or, that’s what he seemed to be signaling. I really did believe that it made him incredibly squeamish.
The Big Rule Adjustment
This section is really the timeless part of the post and comes much too late or maybe rather deserves its own post. I'd like to see a link to where Zvi has "said this before."
I'd like to see Zvi go through all his updates from the last ~year and pull out all the timeless bits into a themed post or two.
Seconded. I feel like much more of what I've seen before has taken the form of "no, we're not trying to target AI with ad-hoc changes to liability law/copyright, we're just trying to consistently apply the rules that already apply to people," which is rather in tension with this section.
Great article.
The second rule is to ask for permission.
Is this supposed to be "The second rule is to ask for forgiveness."?
I repeat. Do not mess with Scarlett Johansson.
You would think her movies, and her suit against Disney, would make this obvious.
Apparently not so.
You see, there was this voice they created for GPT-4o, called ‘Sky.’
People noticed it sounded suspiciously like Scarlett Johansson, who voiced the AI in the movie Her, which Sam Altman says is his favorite movie of all time, which he says inspired OpenAI ‘more than a little bit,’ which he referenced by tweeting “Her” on its own right before the GPT-4o presentation, and which was the comparison point for many people reviewing the GPT-4o debut?
Quite the Coincidence
I mean, surely that couldn’t have been intentional.
Oh, no.
OpenAI reports on how it created and selected its five GPT-4o voices.
Again: Do not mess with Scarlett Johansson. She is Black Widow. She sued Disney.
Several hours after compiling the above, I was happy to report that they did indeed mess with Scarlett Johansson.
She is pissed.
NPR then published her statement, which follows.
Scarlett Johansson’s Statement
Sure Looks Like OpenAI Lied
This seems like a very clear example of OpenAI, shall we say, lying its ass off?
They say “we believe that AI voices should not deliberately mimic a celebrity’s distinctive voice,” after Sam Altman twice personally asked the most distinctive celebrity possible to be the very public voice of ChatGPT, and she turned them down. They then went with a voice this close to hers while Sam Altman tweeted ‘Her,’ two days after being turned down again. Mira Murati went on stage and said it was all a coincidence.
Uh huh.
And it seems they’re doubling down on this, with carefully worded statements that don’t really get to the heart of the matter:
I assume not that fourth one. Heaven help OpenAI if they did that.
Here is one comparison of Scarlett talking normally, Scarlett’s voice in Her, and the Sky voice. The Sky voice sample there was plausibly chosen to be dissimilar, so here is another, longer sample in context, from this OpenAI demo, that is a lot closer to my ears. I do think you can still tell the difference between Scarlett Johansson and Sky, but it is then not so easy. Opinions differ on exactly how close the voices were. To my ears, the sample in the first clip sounds more robotic, but in the second clip it is remarkably close.
No one is buying that this is a coincidence.
Another OpenAI exec seems to have misled Nitasha Tiku.
Even if we take OpenAI’s word for absolutely everything, the following facts do not appear to be in dispute:
So, yeah.
Sure Seems Like OpenAI Violated Their Own Position
On March 29, 2024, OpenAI put out a post entitled Navigating the Challenges and Opportunities of Synthetic Voices (Hat tip).
They said this, under ‘Building Voice Engine safely.’ Bold mine:
If I was compiling a list of voices to check in this context that were not political figures, Scarlett Johansson would not only have been on that list.
She would have been the literal first name on that list.
For exactly the same reason we are having this conversation.
GPT-4o did not factor in Her, so it put her in the top 100 but not top 50, and even with additional context would only have put her in the 10-20 range with the Pope, the late Queen and Taylor Swift (who at #15 was the highest non-CEO non-politician).
Remember that in September 2023, a journalist asked an OpenAI executive about Sky and why it sounded so much like Scarlett Johansson.
Even if this somehow was all an absurd coincidence, there is no excuse.
Altman’s Original Idea Was Good, Actually
Ultimately, I think that the voices absolutely should, when desired by the user, mimic specific real people’s voices, with of course that person’s informed consent, participation and financial compensation.
I should be able to buy or rent the Scarlett Johansson voice package if I want that and she decides to offer one. She ideally gets most or all of that money. Everybody wins.
If she doesn’t want that, or I don’t, I can go with someone else. You could buy any number of them and swap between them, have them in dialogue, whatever you want.
You can include a watermark in the audio for deepfake detection. Even without that, it is not as if this makes deepfaking substantially harder. If you want to deepfake Scarlett Johansson’s voice without her permission, there are publicly available tools you can already use to do that.
This Seems Like a Really Bad Set of Facts for OpenAI?
One could even say the facts went almost maximally badly, short of an outright deepfake.
The second rule is to ask for forgiveness.
Whoops, on both counts.
Also it seems they lied repeatedly about the whole thing.
That’s the relatively good scenario, where there was no outright deepfake, and her voice was not directly used in training.
Does Scarlett Johansson Have a Case?
I am not a lawyer, but my read is: Oh yes. She has a case.
A jury would presumably conclude this was intentional, even if no further smoking guns are found in discovery. They asked Scarlett Johansson twice to participate. There were the references to ‘Her.’
There is no fully objective way to present the facts to an LLM, your results may vary, but when I gave GPT-4o a subset of the evidence that would be presented by Scarlett’s lawyers, plus OpenAI’s claims it was a coincidence, GPT-4o put the probability of a coincidence at under 10%.
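(For what it’s worth, here is a minimal sketch of how such a query might look, using the standard OpenAI Python SDK; the one-line evidence summary and the prompt wording below are stand-ins for illustration, not the actual material presented.)

```python
# Minimal sketch of posing the coincidence question to GPT-4o via the
# standard OpenAI Python SDK. The evidence summary is a stand-in; the
# real query used a fuller subset of the evidence plus OpenAI's denials.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

evidence = (
    "Altman asked Johansson twice to voice the assistant and was refused; "
    "he tweeted 'Her' right before the GPT-4o demo; the Sky voice sounds "
    "very similar to Johansson; OpenAI says the resemblance is a coincidence."
)

response = client.chat.completions.create(
    model="gpt-4o",
    messages=[
        {
            "role": "user",
            "content": (
                "Given the following facts, estimate the probability that "
                f"the resemblance was a pure coincidence:\n\n{evidence}"
            ),
        },
    ],
)
print(response.choices[0].message.content)
```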
It all seems like far more than enough for a civil case, especially given related public attitudes. This is not going to be a friendly jury for OpenAI.
If the voice actress was using her natural voice (or the ‘natural robotization’ thereof) without any instructions or adjustments that increased the level of resemblance, and everyone was careful not to ever say anything beyond what we already know, and the jury is in a doubting mood? Even then I have a hard time seeing it.
If you intentionally imitate someone’s distinctive voice and style? That’s a paddlin.
Then there’s the classic case Midler v. Ford Motor Company. It sure sounds like a direct parallel to me, down to asking for permission, getting refused, doing it anyway.
If it has come to this, so be it.
Genius. Also, I’d take it. A win is a win.
What Would It Mean For There Not To Be a Case?
There are some people asking what the big deal is, ethically, practically or legally.
In legal terms, my most central observation is that those who don’t see the legal issue mostly are unaware of the relevant prior case law listed above due to being unwilling to Google for it or ask an LLM.
I presume everyone agrees that an actual direct deepfake, trained on the voice of Scarlett Johansson without consent, would be completely unacceptable.
The question some ask is, if it is only a human that was ‘training on the voice of Scarlett Johansson,’ similar to the imitators in the prior cases, why should we care? Or, alternatively, if OpenAI searched for the closest possible match, how is that different from when Padme is not available for a task so you send out a body double?
The response ‘I never explicitly told people this was you, fine this is not all a coincidence, but I have a type I wanted and I found an uncanny resemblance and then heavily dropped references and implications’ does not seem like it should work here? At least, not past some point?
Obviously, you are allowed to (even if it is kind of creepy) date someone who looks and sounds suspiciously like your ex, or (also creepy) like someone who famously turned you down, or to recast a voice actor while prioritizing continuity or with an idea of what type of voice you are looking for.
It comes down to whether you are appropriating someone’s unique identity, and especially whether you are trying to fool other observers.
The law must also adjust to the new practicalities of the situation, in the name of the ethical and practical goals that most of us agree on here. As technology and affordances change, so must the rules adjust.
In ethical and practical terms, what happens if OpenAI is allowed to do this while its motivations and source are plain as day, so long as the model did not directly train on Scarlett Johansson’s voice?
You do not need to train an AI directly on Scarlett’s voice to get arbitrarily close to Scarlett’s voice. You can get reasonably close even if all you have is selection among unaltered and uncustomized voices, if you have enough of a sample to choose from.
If you auditioned women of similar age and regional accent, your chances of finding a close soundalike are remarkably good. Even if that is all OpenAI did to filter initial applications, and then they selected the voice of Sky to be the best fit among them, auditioning 400 voices for 5 slots is more than enough.
I asked GPT-4o what would happen if you also assume professional voice actresses were auditioning for this role, and they understood who the target was. How many would you have to test before you were a favorite to find a fit that was all but indistinguishable?
One. It said 50%-80% chance. If you audition five, you’re golden.
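That math checks out as a back-of-the-envelope matter. A minimal sketch, assuming each candidate who knows the target independently has that 50%-80% chance of being an all-but-indistinguishable match:

```python
# Probability that at least one of n deliberate soundalike auditions produces
# an all-but-indistinguishable match, assuming each candidate independently
# succeeds with probability p (the 50%-80% range GPT-4o suggested).
for p in (0.50, 0.80):
    for n in (1, 5):
        at_least_one = 1 - (1 - p) ** n
        print(f"p = {p:.0%}, n = {n}: P(at least one match) = {at_least_one:.1%}")
```

Even at the low end of that range, five auditions give you better than a 95% chance of at least one match.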
Then the AI allows this voice to have zero marginal cost to reproduce, and you can have it saying absolutely anything, anywhere. That, alone, obviously cannot be allowed.
Remember, that is before you do any AI fine-tuning or digital adjustments to improve the match. And that means, in turn, that if you did use an outright deepfake, or you did fine-tune on the closeness of match, or used it to alter parameters in post, then unless they can retrace your steps, who is to say you did any of that?
If Scarlett Johansson does not have a case here, where OpenAI did everything in their power to make it obvious and she has what it takes to call them on it, then there effectively are very close to no rules and no protections, for creatives or otherwise, except for laws against outright explicitly claimed impersonations, scams and frauds.
The Big Rule Adjustment
As I have said before:
Many of our laws and norms will need to adjust to the AI era, even if the world mostly ‘looks normal’ and AIs do not pose or enable direct existential or catastrophic risks.
Our existing laws rely on friction, and on human dynamics of norm enforcement. They and their consequences are designed with the expectation of uneven enforcement, often with rare enforcement. Actions have practical costs and risks, most of them very different from zero, and people only have so much attention and knowledge and ability to execute and we don’t want to stress out about all this stuff. People and corporations have reputations to uphold and they have to worry about unknown unknowns where there could be (metaphorical) dragons. One mistake can land us or a company in big trouble. Those who try to break norms and laws accumulate evidence, get a bad rep and eventually get increasingly likely to be caught.
In many places, fully enforcing the existing laws via AI and AI-enabled evidence would grind everything to a halt or land everyone involved in prison. In most cases that is a bad result. Fully enforcing the strict versions of verbally endorsed norms would often have a similar effect. In those places, we are going to have to adjust.
Often we are counting on human discretion to know when to enforce the rules, including to know when a violation indicates someone who has broken similar rules quite a lot in damaging ways versus someone who did it this once because of pro-social reasons or who can learn from their mistake.
If we do adjust our rules and our punishments accordingly, we can get to a much better world. If we don’t adjust, oh no.
Then there are places (often overlapping) where the current rules let people get away with quite a lot, often involving getting free stuff, often in a socially damaging way. We use a combination of ethics and shame and fear and reputation and uncertainty and initial knowledge and skill costs and opportunity costs and various frictions to keep this at an acceptable level, and restricted largely to when it makes sense.
Breaking that equilibrium is known as Ruining It For Everyone.
A good example would be credit card rewards. If you want to, you can exploit various offers to make remarkably solid money opening and abusing various cards in various ways, and keep that going for quite a while. There are groups for this. Same goes for sportsbook deposit bonuses, or the return policies at many stores, and so on.
The main reason that often This Is Fine is that if you are sufficiently competent to learn and execute on such plans, you mostly have better things to do, and the scope of any individual’s actions is usually self-limiting (when it isn’t, you get rules changes and hilarious news stories). And what is lost to such tricks is made up for elsewhere. But if you could automate these processes, then the scope goes to infinity, and you get rules changes and ideally hilarious (but often instead sad) news articles. You also get mode collapses when the exploits become common knowledge or too easy to do, and norms against using them go away.
Another advantage is this is often good price discrimination gated by effort and attention, and an effective subsidy for the poor. You can ‘work the job’ of optimizing such systems, which is a fallback if you don’t have better opportunities, and you are short on money but long on time or want to train optimization or pull one over.
AI will often remove such frictions, and the barriers preventing rather large scaling.
AI voice imitation is one of those cases. Feature upgrades, automation, industrialization and mass production change the nature of the beast. This particular case was one that was already illegal without AI because it is so brazen and clear cut, but we are going to have to adjust our rules to the general case.
The good news is this is a case where the damage is limited, so ‘watch for where things go wrong and adjust’ should work fine. This is the system working.
The bad news is that this adjustment cannot involve ‘stop the proliferation of technology that allows voice cloning from remarkably small samples.’ That technology is essentially mature already, and open solutions available. We cannot unring the bell.
In other places, where the social harms can scale to a very high level, and the technological bell once rung cannot be easily unrung, we have a much harder problem. That is a discussion for another post.
The Internet Reacts
As noted above, there was a faction that said this was no big deal, or even totally fine.
Most people did not see it that way. The internet is rarely as united as this.
Do not underestimate the power of a beloved celebrity that is on every level a total badass, horrible publicity and a united internet.
There’s even some ethics people out there to explain other reasons this is problematic.
Finally, for your moment of zen: The Daily Show has thoughts on GPT-4o’s voice.