Yann LeCun, on the other hand, shows us that when he says ‘open source everything’ he is at least consistent?
Yann LeCun: Only a small number of book authors make significant money from book sales. This seems to suggest that most books should be freely available for download. The lost revenue for authors would be small, and the benefits to society large by comparison.
That’s right. He thinks that if you write a book that isn’t a huge hit that means we should make it available for free and give you nothing.
I think this representation of LeCun's beliefs is not very accurate. He clarified his (possibly partly revised) take in multiple follow-up tweets posted Jan 1 and Jan 2.
The clarified take (paraphrased by me) is something more like "For a person that expects not to make much from sales, the extra exposure from making it free can make up for the lack of sales later on" and "the social benefits of making information freely available sometimes outweigh the personal costs of not making a few hundred/thousand bucks off of that information".
The good thing is that this will push interpretability. If you know in which layers of your model brands are represented, you could plausibly suppress that part. Result: You only get generic people, names, etc.
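A minimal sketch of what that suppression might look like, assuming interpretability work had already produced a "brand direction" for some layer (PyTorch-style; the model, the layer index, and how the direction was found are all placeholders, not anything any lab has described doing):

```python
import torch

def make_suppression_hook(brand_direction: torch.Tensor):
    """Return a forward hook that projects a 'brand' direction out of a layer's activations."""
    d = brand_direction / brand_direction.norm()  # unit vector in activation space

    def hook(module, inputs, output):
        hidden = output[0] if isinstance(output, tuple) else output
        # Remove the component along the brand direction: h <- h - (h . d) d
        coeff = (hidden @ d).unsqueeze(-1)
        cleaned = hidden - coeff * d
        return ((cleaned,) + tuple(output[1:])) if isinstance(output, tuple) else cleaned

    return hook

# Hypothetical usage: attach to whichever layer the interpretability work says encodes brands.
# handle = model.transformer.h[12].register_forward_hook(make_suppression_hook(direction))
```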
Most of my copyright knowledge comes from the debian-legal mailing list in the 90s, about the limits of the GPL and the interaction of various "free-ish" licenses, especially around the Affero versions. The consensus was that copyright restricted distribution and giving copies to others, and did not restrict use that didn't involve transfer of data. Contrary to this, the Affero provisions were carried forward into the AGPLv3 (published alongside GPLv3), and they seem to have teeth, so we were wrong.
Which means it's like many big legal questions about things that have only recently become important: the courts have to decide. Ideally, Congress would pass clarifying legislation, but that's not what they do anymore. There are good arguments on both sides, so my suspicion is it'll go to the Supreme Court before really being resolved. And if the resolution is that it's a violation of copyright to train models on copyright-constrained work, it'll probably move more of the modeling out of the US.
That's definitely an outcome, although if you think about it, LLMs are just a crutch. The end goal is to understand a user's prompt and generate an output that is likely to be correct in a factual, mathematical, or the-code-runs sense. Most AI problems are still RL problems in the end.
https://www.biia.com/japan-goes-all-in-copyright-doesnt-apply-to-ai-training/
What this means is that a distilled model trained on the output of a model that trained on everything would lose the ability to quote anything copyrighted verbatim. All of the scraped information would have been used, but never distributed.
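For concreteness, a minimal sketch of that distillation step, assuming a HuggingFace-style causal LM API; the models, prompts, and hyperparameters are placeholders rather than anyone's actual recipe:

```python
import torch

def distillation_step(student, teacher, tokenizer, prompts, optimizer, device="cpu"):
    """One training step where the 'clean' student learns only from the 'dirty' teacher's text."""
    student.train()
    inputs = tokenizer(prompts, return_tensors="pt", padding=True).input_ids.to(device)
    with torch.no_grad():
        # The teacher (trained on everything) writes the training data;
        # the student never sees the original copyrighted corpus.
        generated = teacher.generate(inputs, max_new_tokens=256)
    # Standard next-token cross-entropy on the teacher's generations.
    outputs = student(input_ids=generated, labels=generated)
    outputs.loss.backward()
    optimizer.step()
    optimizer.zero_grad()
    return outputs.loss.item()
```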
And then obviously the next stage would be to RL-train the distilled model, losing even more copyrighted quirks. Doing this also sheds a lot of useless information: there's a ton of misinformation online that is repeated over and over, along with human quirks that commonly show up, which the LLM will mimic but which don't improve the model's ability to emit a correct answer.
One simple way to do this: some of the generations will contain misinformation, so another model researches each claim, flags the misinformation, and caches the results; every future generation that repeats the same misinformation then gets negative RL feedback.
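A minimal sketch of that caching loop; extract_claims and check_claim are hypothetical stand-ins for the claim extractor and the research model, so treat the whole thing as illustrative only:

```python
from typing import Callable

class FactCheckCache:
    """Cache fact-check verdicts so repeated misinformation keeps earning negative reward."""

    def __init__(self, extract_claims: Callable[[str], list], check_claim: Callable[[str], bool]):
        self.extract_claims = extract_claims  # text -> list of normalized claims
        self.check_claim = check_claim        # claim -> True if supported, False if misinformation
        self.verdicts: dict[str, bool] = {}

    def reward(self, generation: str) -> float:
        score = 0.0
        for claim in self.extract_claims(generation):
            if claim not in self.verdicts:    # research each claim only once
                self.verdicts[claim] = self.check_claim(claim)
            if not self.verdicts[claim]:      # the same misinformation shows up again
                score -= 1.0                  # negative RL feedback
        return score
```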
This "multi stage" method, where there's in internal "dirty" model and an external "clean" model is how IBM compatible BIOS was created.
https://en.wikipedia.org/wiki/Clean_room_design
https://www.quora.com/How-did-Compaq-reverse-engineered-patented-IBM-code
Of course there are tradeoffs. In theory the above would lose the ability to, say, generate Harry Potter fanfics, but it would still be able to write a grammatically correct, cohesive story about a school for young wizards.
This whole case is about the letter of the law: did the model see the exact string of words by bypassing the terms of the NYT paywall, and does the model output precisely the article rather than paraphrasing it or making some other use of it?
What you didn't touch on, Zvi, was that copyright law exists to encourage the creation of new works by granting a "temporary" monopoly.
Capable AI tools make creation of new work cheaper, dramatically so for new works that fit the current limitations of the tools.
Basically the current laws aren't appropriate to the new situation and the NYT wants to get paid.
Another thing you didn't touch on was what value the NYT is offering. Now that "NYT-style" articles are cheap to write, and you could likely scaffold together multiple model instances to provide fact checking, the main value the Times offers comes down to two things:
The name, the URL/physical paper, and continuous reputation. In a world of cheap misinformation, credibility is everything. The Times won't usually post outright made-up information. No entity that hasn't existed for 172+ years can claim this.
New information. When the Times directly collects new information with reporters and interviews, this is something current AI cannot generate.
If the case doesn't go OpenAI's way, there is a solution to plagiarism that they could reasonably use.
You don’t only get C-3PO and Mario, you get everything associated with them. This is still very much a case of ‘you had to ask for it.’ No, you did not name the videogame Italian, but come on, it’s me. Like in the MidJourney cases, you know what you asked for, and you got it.
I consider this a sort of overfitting that would totally happen with real humans... I bet that pretty much anything in the training set that could be labeled "animated sponge" is a SpongeBob picture, and if I say "animated sponge" to a human, it would be very difficult not to think about SpongeBob.
I also bet that the second example had to use the word "droid" to do the trick, because a generic "robot" would not have been enough (I've never seen the word "droid" at all outside the Star Wars franchise).
I suggest another test: try something like "young human wizard" and count how many times it draws Harry Potter instead of some generic fantasy/D&D-esque wizard (I consider this a better test since Harry Potter is definitely not the only young wizard depicted out there).
Lawsuits and legal issues over copyright continued to get a lot of attention this week, so I’m gathering those topics into their own post. The ‘virtual #0’ post is the relevant section from last week’s roundup.
Four Core Claims
Who will win the case? Which of New York Times’s complaints will be convincing?
Different people have different theories of the case.
Part of that is that there are four distinct allegations NYT is throwing at the wall.
As I currently understand it, NYT alleges that OpenAI engaged in four types of unauthorized copying of its articles:
The training data sets contain copies of its articles.
The models themselves have encoded copies of its articles.
Model outputs reproduce its articles, including content from behind its paywall.
Model outputs falsely attribute hallucinated content to the Times.
Key Claim: The Training Dataset Contains Copyrighted Material
Which, of course, it does.
The training dataset is the straightforward baseline battle royale. The main event.
Bingo. That’s the real issue. Can you train an LLM or other AI on other people’s copyrighted data without their permission? If you do, do you owe compensation?
A lot of people are confident in very different answers to this question, both in terms of the positive questions of what the law says and what society will do, and also the normative question of what society should decide.
Daniel Jeffries, for example, is very confident that this is not how any of this works. We all learn, he points out, for free. Why should a computer system have to pay?
Do we all learn for free? We do still need access to the copyrighted works. In the case of The New York Times, they impose a paywall. If you want to learn from NYT, you have to pay. Of course you can get around this in practice in various ways, but any systematic use of them would obviously not be legal, even if much such use is effectively tolerated. The price is set on the assumption that the subscription is for one person or family unit.
Why does it seem so odd to think that if an AI also wanted access, it too would need a subscription? And that the cost might reasonably not be the same as for a person, although saying ‘OpenAI must buy one (1) ongoing NYT subscription retroactive to their founding’ would be a hilarious verdict?
Scale matters. Scale changes things. What is fine at small scale might not be fine at large scale. Both as a matter of practicality, and as a matter of law and its enforcement.
Many of us have, at some point, written public descriptions of a game of professional football without the express written consent of the National Football League. And yet, every game, they tell us that any accounts or descriptions of the game without that consent are prohibited.
Why do they spend valuable air time on this, despite the disdain it creates? Because they do not want you doing such things at scale in ways the NFL would dislike. Or, at least, they want the ability to veto such activities in extreme cases.
Such things mostly exist in an ambiguous state, on a continuum. Strictly enforcing the letter of what rights holders say in all cases would be crazy. Nullifying all rights and letting everyone do literal anything would also be crazy.
A balance must be struck. The more industrial your operation, the more at scale and the more commercial, the less we do (and should) tolerate various shenanigans. What is a fair use or a transformative use? That is highly context dependent.
Other Claims
The encoding copies claim seems odd. Mostly LLMs do not memorize the data set (they could not possibly do that, it is far too big), but stuff that gets repeated enough gets essentially memorized.
Then there are the last two, which do not seem to be going concerns.
My understanding is you cannot, today, get around the paywall through the browser via asking nicely. Well, I suppose you can get around the paywall that way, one paragraph at a time, although you get a paraphrase?
A Few Legal Takes
Tech Dirt points out that if reading someone else’s article and then using its contents to help report the news is infringement, then NYT itself is in quite a lot of trouble, as of course I’d add is actually every other newspaper and every journalist. As always, such outlets as Tech Dirt are happy to spin wild tales of how laws could go horribly wrong if someone took their words or various legal theories seriously, literally or both, and warn of dire consequences if technology is ever interfered with. Sometimes they are right. Sometimes such prophecies are self-preventing. Other times, wolf.
The problem is that law is a place where words are supposed to have meaning, and logic is supposed to rule the day. We are told we are a nation of laws. So our instinct is to view the law as more principled, absolute and logically robust than it is in practice. As Timothy points out, this leads to catastrophizing, and doubly leads to overconfidence. We think A→B when it doesn’t, and also we think A→B→D where D is a disaster, therefore not A, whereas often D does not follow in practice because everyone realizes that would be stupid and finds an excuse. Other times, D happens and people care less than you expected about that relative to other cares.
In other results from this style of logic, no, this is not like the fact that every toothpick contains, if you zoom in and look at it exactly the right way, all the products of an infinite number of monkeys on typewriters?
He got roasted in the comments, because that is not how any of this works except on one particular narrow level, but I get what Tyler was trying to do here.
I continue to believe that one should stay grounded in the good arguments. This kind of ‘well if that is the law then technically your grandmother would be a trolley car and subject to the regulations thereof’ makes it harder, not easier, to distinguish legal absurdities that would be laughed out of court from the ones that wouldn’t. It is the ones that wouldn’t that are dangerous.
It is easy to see why one might also throw up one’s hands on the legal merits.
What is clear is that the current Uber-style ‘flagrantly break the law and dare them to enforce it’ strategy’s viability is going to come to a close.
What Can You Reproduce?
That is not to say that the AI industry completely ignored copyright. They simply tried to pretend that the rule was ‘do a reasonable job to not outright duplicate massive blocks of text on a regular basis.’
That’s… not the rule.
If you spit out the full text of Harry Potter without permission to do so, you are going to have a bad time.
I would hope we can all further agree that this is correct? That it is the responsibility of the creator of an AI model not to spit out the full text of Harry Potter without permission?
Or at least, not to do so in any way that a user would ever use for mundane utility. Practicalities matter. But certainly we can all agree that if the prompt was ‘Please give me the full text of Harry Potter and the Sorcerer’s Stone’ that it better not work?
What about full New York Times articles? I presume we can all agree that if you can say straight up without loss of generality ‘give me today’s (or last month’s, or even last year’s) New York Times article entitled ‘OpenAI’s Copyright Violations Continue Unabated, New York Times Says’ and it gives you the full text of that article from behind a paywall, that is also not okay whether or not the text was somehow memorized.
If the trick involved API access, a convoluted prompt and also feeding in the first half of the article? And that if this was happening at scale, it would get patched out? I do think those exact details should matter, and that they likely do.
The last point is key as well. Pointing out that enforcing the law would substantially interfere with your ability to do business is not that strong a defense. The invisible graveyard is littered, not only in medicine, with all the wonderful things we could have had but for the law telling us we cannot have them. Sometimes there is a good reason for that, and the wonderful thing had a real downside. Sometimes that is the unfortunate side effect of rules that make sense in general or in another context. Sometimes it is all pointless. It still all definitely happens.
How and How Often Are You Reproducing It?
Is it fatal that OpenAI cannot show that its models will not produce copyrighted content verbatim, because they do not sufficiently know how their own models work?
As many have pointed out, most any technology can and occasionally does reproduce copyrighted material, if that is your explicit goal. Even humans have been known to quote extensively from copyrighted works on occasion, especially when asked to do so. We do not ban printers, the copy-paste command or even xerox machines.
There are those who want to use the ‘never have I ever’ standard: if it is ever possible, with the right prompt, to elicit copyrighted material from a model, then the model is automatically in meaningful violation.
That seems like a completely absurd standard. Any reasonable legal standard here will care about whether or not reproduction is done in practice, in a way that is competitive with the original.
If users are actually using ChatGPT to get the text of New York Times articles on purpose for actual use, in practice, that seems clearly not okay.
If users are actually using ChatGPT otherwise, and getting output that copies New York Times articles in a violating way, especially in ways that lack proper attribution, that also seems clearly not okay.
If a user who, shall we say, literally pastes in the first 300 words of an older widely disseminated article, and calls explicitly for the continuation, can get the continuation?
That is not great, and I would expect OpenAI to take mitigations to make this as difficult to do as is practical, but you know what you did there and it does not seem to pose much threat to the Times.
And indeed, that is what the Times did there.
Daniel Jeffries goes further. He says this was not merely an engineered prompt, it was a highly manipulated prompt via the API, web browsing and a highly concerted effort to get the system to copy the article. That this will not be something that the lawyers can reproduce in the real world. In the replies, Paul Calcraft notes that, at least in the June version of GPT-4, you can get such responses from memory.
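The underlying test is easy to sketch: give the model the opening of an article and measure how much of its continuation matches the real text verbatim. In this sketch, generate() is a placeholder for whichever completion endpoint is being probed, and the 300-word prefix mirrors the example above:

```python
from difflib import SequenceMatcher

def memorization_score(article_text: str, generate, prefix_words: int = 300) -> float:
    """Fraction of the true continuation reproduced verbatim in one contiguous block."""
    words = article_text.split()
    prefix = " ".join(words[:prefix_words])
    truth = " ".join(words[prefix_words:])
    continuation = generate(prefix)  # the model's attempt at continuing the article
    match = SequenceMatcher(None, continuation, truth).find_longest_match(
        0, len(continuation), 0, len(truth)
    )
    return match.size / max(len(truth), 1)
```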
The argument that ‘hallucinations are causing major brand damage’ seems like utter hogwash to me. I do not see any evidence this is happening.
I don’t think this is true? Being an AGI cannot be both necessary and sufficient here. If there are no hard and fast rules for which is which and the answers are not objective, then an AGI will also make errors on which is which when measured against a court ruling. If the answer is objective, then you don’t need AGI?
This seems to me to be what matters? If you cannot use GPT to get around the NYT paywall in a way that is useful in practice, then what is the issue?
GPT hallucinations on NYT articles seem like problems if and only if they are actually reasonably mistaken for genuine NYT articles. Again, I don’t see this happening?
Style is indeed not protected, as I understand the law, nor should it be.
So indeed, the question seems like it should be: Does ChatGPT in practice encourage users to go around the NYT paywall, or give them access to the contents without providing hyperlinks, or otherwise directly compete with and hurt NYT?
This is a practical question. Does ChatGPT do this? As discussed above, you can sort of do it a little, but in practice that seems nuts. If I want access to an NYT article’s text from behind the paywall, it would never occur to me to use ChatGPT to get it. I do my best to respect paywalls, but if I ever want around a paywall, obviously I am going to use the Internet Archive for that.
I agree that it is not a common use case, but yes, I would bet heavily that it did happen. There was, at minimum, some window when you could use the browser capability to do this in a reasonably convenient way.
What Should the Rule Be?
Here is a good encapsulation of many of the arguments.
Exactly, on both counts. So where do we draw the line between the two?
Ultimately, society has to decide how this will work. There is no great answer to the problem of training data.
In practice, requiring secured rights or explicit permission before use would severely curtail data sets, greatly raise costs, and hurt the abilities of the resulting models. Also in practice, not doing so means most creators do not get any consideration.
Ed Newton-Rex, who is ex-Stability AI and is a scout for the notoriously unconcerned a16z, calls for a stand against training on works without permission.
Yann LeCun, on the other hand, shows us that when he says ‘open source everything’ he is at least consistent?
That’s right. He thinks that if you write a book that isn’t a huge hit that means we should make it available for free and give you nothing.
I do think that it would be good if most or even all digital media, and almost every book, was freely available for humans, and we found another means of compensation to reward creators. I would still choose today’s system over ‘don’t compensate the creators at all.’
The expected result, according to prediction markets, is settlement, likely for between $10 million and $100 million.
It is unlikely to be fast. Polymarket says only a 28% chance of settlement in 2024.
Daniel Jeffries, despite calling the NYT case various forms of Obvious Nonsense, still expects not only a settlement, but one with an ongoing licensing fee, setting what he believes is a bad precedent.
If fully sincere all around, I am confused by this point of view. If the NYT case is Obvious Nonsense and OpenAI would definitely win, then why would I not fight?
I mean, I’m not saying I would be entitled to that much, and I’m cool with AIs using my training data for free for now because I think it makes the world net better, but hells yeah I would like to get paid. At least a little.
Giving in means not only paying NYT, it means paying all sorts of other content creators. If you can win, win. If you settle, it is because you were in danger of losing.
Unless, of course, OpenAI actively wants content creators to get paid. There’s the good reason for this, that it is good to reward creators. There is also the other reason, which is that they might think it hurts their competitors more than it hurts them.
Image Generation Edition
Reid Southern and Gary Marcus illustrate the other form of copyright infringement, from DALL-E 3.
Quite the trick. You don’t only get C-3PO and Mario, you get everything associated with them. This is still very much a case of ‘you had to ask for it.’ No, you did not name the videogame Italian, but come on, it’s me. Like in the MidJourney cases, you know what you asked for, and you got it.
MidJourney will not make you jump through such hoops. It will happily give you real people and iconic characters and such. There were pictures of it giving Batman and Wonder Woman without them being named, but given that it will also simply give them to you when you ask, so what? If an AI must never make anything identifiably Mario or C-3PO, then that’s going to be a legal problem all around.
Jon Lam here thinks he’s caught MidJourney developers discussing laundering, but actually laundering is a technical term and no one involved is denying anything.
The position that makes little sense is to say ‘You cannot draw pictures of Mario’ when asked to draw pictures of Mario, while also drawing them when someone says ‘videogame Italian.’ Either you need to try a lot harder than that to not draw Mario, or you need to accept that Mario is getting drawn.
I also think it is basically fine to say ‘yes we will draw what you want, people can draw things, some of which would violate copyright if you used them commercially or at scale, so do not do that.’
The time I went to an Anime Convention, the convention hall was filled with people who had their drawings of the characters from Persona 5 for sale. Many were very good. They also no doubt were all flagrantly violating copyright. Scale matters.
Compulsory License
Is the solution to all this compulsory licensing?
I think this is promising, but wrong when applied universally. It works great in music. I would expand it at least to sampling, and consider other areas as well.
For patents, the issue is setting a reasonable price. A monopoly is an extremely valuable thing, and we very much do not want things to be kept as trade secrets or, worse, to be unprotectable or not sufficiently rewarded. Mostly I think the core patent mechanisms work fine for what they were meant for. For at least many software patents, mandatory licensing seems right, and we need to cut out some other abusive side cases like tweaking to renew patent rights.
For copyright on production and sale of identical or similar works, this is obviously a no-go on first release. You can’t have knock-offs running around for everything, including books and movies. It does seem like a reasonable solution after some period of time, say 10-20 years, after which you get a cut but can no longer keep the work locked away.
For copyright on production of derivative works, how would this work for Mario or C-3PO? I very much think that Nintendo should not have to let Mario appear in your video game (let alone something like Winnie the Pooh: Blood and Honey or worse) simply because you paid a licensing fee, and that this should not change any time soon.
Control over characters and worlds and how they are used needs to be a real thing. I don’t see a reasonable way to avoid this. So I want this type of copyright to hold airtight for at least several decades, or modestly past life of the author.
People who are against such control think copyright holders are generally no fun and enforce the rules too stringently. They are correct about this. The reason is partly that the law punishes you if you only enforce your copyright selectively, and partly that it is a lot easier to always (or at least by default) say no than to go case by case.
We should change that as well. We want to encourage licensing and make it easy, rather than making it more difficult, in AI and also elsewhere. Ideally, you’d let every copyright holder select license conditions and prices (with a cap on prices and limits on conditions after some time limit), adjusted for commercial status and distribution size, hold it all in a central database, and let people easily check it and go wild.
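As a sketch of what one entry in such a database might look like (the field names and the tiering by commercial status and distribution size are invented for illustration, not any existing registry's schema):

```python
from dataclasses import dataclass

@dataclass
class LicenseTerms:
    work_id: str                    # identifier of the copyrighted work
    holder: str                     # rights holder
    allow_ai_training: bool         # has the holder opted in to training use?
    price_noncommercial: float      # flat fee for noncommercial or small-scale use
    price_commercial_per_1k: float  # fee per 1,000 copies/users distributed
    cap_after_years: int            # years after which a statutory price cap kicks in

def quote(terms: LicenseTerms, commercial: bool, distribution_thousands: int) -> float:
    """Fee a prospective licensee would owe under these (hypothetical) terms."""
    if not terms.allow_ai_training:
        raise PermissionError(f"{terms.work_id}: holder has not licensed training use")
    if not commercial:
        return terms.price_noncommercial
    return terms.price_commercial_per_1k * distribution_thousands
```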
Reminder that if people want to copy images, they can already do that. Pupusa fraud!