I'm not sure exactly where I land on this, but I think it's important to consider that restricting the data companies can train on could influence the architectures they use. Self-supervised autoregressive models à la GPT-3 seem a lot more benign than full-fledged RL agents, and the latter are a lot less data-hungry than the former (especially in terms of copyrighted data). There are enough other factors at play that I'm not completely confident in this analysis, but it's worth thinking about.
I'm leaning toward the current paradigm being preferable to a full-fledged RL one, but I want to add a point: one of my best guesses for proto-AGI involves massive LLMs hooked up to some RL system (a rough sketch of what I mean is below). This might not require RL capabilities on the same level of complexity as pure RL agents, and RL is still being actively worked on today.
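To make "an LLM hooked up to some RL system" concrete, here is a toy sketch of the shape I have in mind: the LLM proposes candidate actions and an RL-style outer loop selects among them based on a reward signal. Every name in it (`llm_propose`, `toy_reward`, `rl_outer_loop`) is a hypothetical stand-in invented for this comment, not any real system or API, and a real setup would use learned policies and value estimates rather than greedy selection over random rewards.

```python
# Toy sketch of "an LLM hooked up to an RL-style outer loop".
# All names here are hypothetical stand-ins, not any real system or API.
import random

def llm_propose(prompt: str, n: int = 4) -> list[str]:
    """Stand-in for a large language model proposing candidate actions."""
    return [f"{prompt} -> candidate action {i}" for i in range(n)]

def toy_reward(action: str) -> float:
    """Stand-in for an environment / reward model scoring an action."""
    return random.random()

def rl_outer_loop(task: str, steps: int = 3) -> str:
    """Greedily pick the highest-reward LLM proposal at each step."""
    state = task
    for _ in range(steps):
        candidates = llm_propose(state)
        state = max(candidates, key=toy_reward)  # select by reward signal
    return state

print(rl_outer_loop("summarize the document"))
```

The point of the sketch is only that the language model can be mostly pre-trained on text, while the RL component sits on top and needs comparatively little (copyrighted) data.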
I'm not a lawyer and this is not legal advice, but I don't think the current US legal framework provides a viable way to challenge training on publicly available data.
One argument that something is fair use is that it is transformative [1]. And taking an image or text and using it to slightly influence a giant matrix of numbers, in such a way that the original is not recoverable, and which allows new kinds of expression, seems likely to count as transformative.
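To make the "slightly influence a giant matrix of numbers" intuition concrete, here is a toy numpy sketch of a single training update. The matrix size, encoding, and squared-error loss are arbitrary assumptions for illustration, not how any real model is actually trained; the only point is that one example nudges the weights a tiny amount rather than being stored verbatim.

```python
# Toy illustration: one training example applies a small gradient nudge to a
# large weight matrix; the example itself is not stored in the weights.
# Sizes, loss, and learning rate are arbitrary assumptions for this sketch.
import numpy as np

rng = np.random.default_rng(0)
W = rng.normal(size=(512, 512))   # stand-in for "a giant matrix of numbers"
x = rng.normal(size=512)          # stand-in encoding of one training example
y = rng.normal(size=512)          # stand-in training target

lr = 1e-4
pred = W @ x
grad = np.outer(pred - y, x)      # gradient of squared error w.r.t. W
W_new = W - lr * grad             # a single small update

relative_change = np.linalg.norm(W_new - W) / np.linalg.norm(W)
print(f"relative change in weights from one example: {relative_change:.6f}")
```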
So if you think that restricting access to public data for training purposes is a promising approach [2], you should probably focus on trying to create a new regulatory framework.
Having said that, this is all US analysis. Other countries have other frameworks and may not have exact analogs of fair use. Perhaps in the EU legal challenges are more viable.
[1] https://www.nolo.com/legal-encyclopedia/fair-use-what-transformative.html
[2] You should think about what the side effects would be like. For instance, this will advantage giant companies that can pay to license or create data, and places that have low respect for law. Whether that's desirable is worth thinking through.
From the overview, it seems like there's one Supreme Court case and the other cases are from lower courts.
Transformativeness is only one of the factors in the Supreme Court case.
The Supreme Court case does say:
...As to the music, this Court expresses no opinion whether repetition of the bass riff is excessive copying, but remands to permit evaluation of the amount taken, in light of the song's parodic purpose and character, its transformative elements, and considerations of the potential for market substitution.
(f) The Court of Appeals erred in resolving the ...
It's already happening: https://githubcopilotinvestigation.com/ (which I learned about yesterday from the is-github-copilot-in-legal-trouble post)
I think it would be an interesting plot twist: humanity saved from AI FOOM by the big IT companies having to obey the intellectual property rights they themselves defended for so many years :)
If you want to ban or monopolize such models, push for that directly. Indirectly banning them is evil.
They're already illegal. GPT-3 is based in large part on what appear to be pirated books. (I wonder if Google's models are covered by its settlements with publishers.)
GPT-3 is based in large part on what appear to be pirated books.
Which is argued to be "transformative" and thus not illegal.
If you want to ban or monopolize such models, push for that directly.
How would you do that? How would you write the laws?
Yes, but only because I want to make sure that anyone who runs a copy of my soul gets a license for it from me first. It's not going to slow down capabilities, except by differentially moving capabilities to other groups.
Too late. Much too late. Bird has flown the coop. No use shutting the barn doors after the cows have run away. Also, since the majority of the training data is just 'the internet', you can't get rid of this except by monitoring everyone's use of the internet to make sure they aren't scraping the web for the purpose of later training an AI on that data. If all you do is cause some bureaucratic hassle for US companies, you give an edge to non-US groups and probably also encourage US groups to go elsewhere.
The usual question is... will the US cede the AI research advantage to countries which do not care about US copyright?
I don't especially think AI capability increases are bad on the margin, but if I did, I would think of this as a multilateral disarmament problem where those who have the most capabilities (relative to something like population or economy) and the worst technological coordination should disarm first, similar to nukes; that would currently point to the US and UK over China etc. China has more precedent for government control over the economy than the West, so it could more easily coordinate an AI slowdown.
Export embargoes of GPUs can be used against those countries, as the US is currently attempting to do with China. "Country X unfairly competes by violating US copyright" is likely a very good argument to motivate the US government to pursue embargoes like that.
It's also worth noting that this doesn't hamper all AI fields. AlphaFold, for example, works well on bioinformatics data that is licensed in a way that everyone can use it freely.
In addition to just slowing everything down, it also creates pressure for better curation of data, as it makes it easier for companies that sell curated data to operate. Having people make more explicit decisions about which data to include in models might be beneficial for safety.
It is not clear to me that export controls have been very successful so far, in pretty much any area.
Export controls don't completely eliminate exports, but they do make them harder and more expensive. That's especially true if someone needs a lot of GPUs.
I think the mistake here is thinking that just because something gets harder and more expensive, it will actually slow or materially reduce some stated outcome. The reason is that the costs and burdens may not fall on the targeted locus. In the GPU case, that may just mean that none of those GPUs end up in uses such as consumer markets or gaming systems/cards.
While not perfect comparisons, in your view just how effective (say, in time to build, time to improve, or ability to pursue an activity) have export controls been regarding Russia's military, North Korea's missile program, China's tech development, and Iran's nuclear program?
For me, I can only see that the export efforts have imposed costs on the general population (to differing degrees) rather than materially impacting the actual target activity. Perhaps I'm missing something, and each of the above would have been much more of a problem had the sanctions/export restrictions not been put in place.
That said, I would agree that from a pure "on principle" basis, not aiding and abetting something or someone you think is a danger to you is the right thing to do. However, I do think in this case it comes much closer to virtue signalling than to actions producing some binding constraint on the efforts.
Putin's military seems to be running out of high-precision ammunition and is not doing well on the battlefield currently. It's hard to say how much of this is due to export controls and how much is due to other factors.
North Korea doesn't seem to have a lot of reliable intercontinental missiles. Their tech development seems to have slowed considerably; it isn't zero, but it also isn't fast.
China's tech advances are a result of the West outsourcing a lot of tech production to China and not having embargoes. I don't know much about Chinese military tech.
Iran still doesn't have nuclear weapons. It might still get them, but there has certainly been a slowdown in development.
Nuclear and ballistic technology has been pretty well understood for decades; developing AGI without existing designs to copy will be much harder.
Epistemic status: I'd give >10% on Metaculus resolving the following as conventional wisdom[1] in 2026.
In the sense that, e.g., "chain of thought improves capabilities" is conventional wisdom in 2022.
In the sense of x-safety. I have no confident insight either way into how abstaining from very large human-generated datasets influences capabilities long-term. If someone does, please refrain from discussing it publicly, of course.
Discussion by Jason Calacanis and Molly Wood about why they think that artists likely have rights that can be violated by DALL-E 2:
We frequently speak about AI capability gains being bad because they shorten the timeframe for AI safety research. By that logic, taking steps to decrease AI capabilities would be worthwhile.
At the moment, large language models are trained on a lot of data that the company training the model does not license. If there were a requirement to license the training data, that would severely reduce the data available for language models and reduce their capabilities.
It's expensive to fight lawsuits in the United States. Currently, there are artists who feel that their rights are being violated by DALL-E 2 using their art as training data. Similar to how Thiel funded the Gawker lawsuits, it would be possible to support artists in a suit against OpenAI to require OpenAI to license images for training DALL-E 2. If such a lawsuit is well funded, it becomes much more likely that a precedent for requiring data licensing gets set, which would slow down AI development.
I'm curious about what people who think more about AI safety than myself think about such a move. Would it be helpful?