I'm not sure exactly where I land on this, but I think it's important to consider that restricting the data companies can train on could influence the architectures they use. Self-supervised autoregressive models à la GPT-3 seem a lot more benign than full-fledged RL agents, and the latter are a lot less data-hungry than the former (especially in terms of copyrighted data). There are enough other factors at play that I'm not completely confident in this analysis, but it's worth thinking about.
I'm leaning toward the current paradigm being preferable to a full-fledged RL one, but I want to add a point: one of my best guesses for proto-AGI involves massive LLMs hooked up to some RL system (a rough sketch of what I mean is below). This might not require RL capabilities on the same level of complexity as pure RL agents, and RL is still being actively worked on today.
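To make "an LLM hooked up to some RL system" concrete, here is a toy sketch of the shape I have in mind: the LLM proposes candidate actions and an RL-style outer loop selects among them based on a reward signal. Every name in it (`llm_propose`, `toy_reward`, `rl_outer_loop`) is a hypothetical stand-in invented for this comment, not any real system or API, and a real setup would use learned policies and value estimates rather than greedy selection over random rewards.

```python
# Toy sketch of "an LLM hooked up to an RL-style outer loop".
# All names here are hypothetical stand-ins, not any real system or API.
import random

def llm_propose(prompt: str, n: int = 4) -> list[str]:
    """Stand-in for a large language model proposing candidate actions."""
    return [f"{prompt} -> candidate action {i}" for i in range(n)]

def toy_reward(action: str) -> float:
    """Stand-in for an environment / reward model scoring an action."""
    return random.random()

def rl_outer_loop(task: str, steps: int = 3) -> str:
    """Greedily pick the highest-reward LLM proposal at each step."""
    state = task
    for _ in range(steps):
        candidates = llm_propose(state)
        state = max(candidates, key=toy_reward)  # select by reward signal
    return state

print(rl_outer_loop("summarize the document"))
```

The point of the sketch is only that the language model can be mostly pre-trained on text, while the RL component sits on top and needs comparatively little (copyrighted) data.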
I'm not a lawyer and this is not legal advice, but I don't think the current US legal framework provides a viable way to challenge training on publicly available data.
One argument that something is fair use is that it is transformative [1]. And taking an image or text and using it to slightly influence a giant matrix of numbers, in such a way that the original is not recoverable, and which allows new kinds of expression, seems likely to count as transformative.
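To make the "slightly influence a giant matrix of numbers" intuition concrete, here is a toy numpy sketch of a single training update. The matrix size, encoding, and squared-error loss are arbitrary assumptions for illustration, not how any real model is actually trained; the only point is that one example nudges the weights a tiny amount rather than being stored verbatim.

```python
# Toy illustration: one training example applies a small gradient nudge to a
# large weight matrix; the example itself is not stored in the weights.
# Sizes, loss, and learning rate are arbitrary assumptions for this sketch.
import numpy as np

rng = np.random.default_rng(0)
W = rng.normal(size=(512, 512))   # stand-in for "a giant matrix of numbers"
x = rng.normal(size=512)          # stand-in encoding of one training example
y = rng.normal(size=512)          # stand-in training target

lr = 1e-4
pred = W @ x
grad = np.outer(pred - y, x)      # gradient of squared error w.r.t. W
W_new = W - lr * grad             # a single small update

relative_change = np.linalg.norm(W_new - W) / np.linalg.norm(W)
print(f"relative change in weights from one example: {relative_change:.6f}")
```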
So if you think that restricting access to public data for training purposes is a promising approach [2], you should probably focus on trying to create a new regulatory framework.
Having said that, this is all US analysis. Other countries have other frameworks and may not have exact analogs of fair use. Perhaps in the EU legal challenges are more viable.
[1] https://www.nolo.com/legal-encyclopedia/fair-use-what-transformative.html
[2] You should think about what the side effects would be like. For instance, this will advantage giant companies that can pay to license or create data, and places that have low respect for law. Whether that's desirable is worth thinking through.
From the overview, it seems like there's one Supreme Court case and the other cases are from lower courts.
Transformativeness is only one of the factors in the Supreme Court case.
The Supreme Court case does say:
...As to the music, this Court expresses no opinion whether repetition of the bass riff is excessive copying, but remands to permit evaluation of the amount taken, in light of the song's parodic purpose and character, its transformative elements, and considerations of the potential for market substitution.
(f) The Court of Appeals erred in resolving the ...
It's already happening: https://githubcopilotinvestigation.com/ (which I learned about yesterday from the is-github-copilot-in-legal-trouble post)
I think it would be an interesting plot twist: humanity saved from AI FOOM by the big IT companies having to obey the intellectual property rights they themselves defended for so many years :)
If you want to ban or monopolize such models, push for that directly. Indirectly banning them is evil.
They're already illegal. GPT-3 is based in large part on what appear to be pirated books. (I wonder if Google's models are covered by its settlements with publishers.)
GPT-3 is based in large part on what appear to be pirated books.
Which is argued to be "transformative" and thus not illegal.
If you want to ban or monopolize such models, push for that directly.
How would you do that? How would you write the laws?
Yes, but only because I want to make sure that anyone who runs a copy of my soul gets a license for it from me first. It's not going to slow down capabilities, except by differentially moving capabilities to other groups.
Too late. Much too late. Bird has flown the coop. No use shutting the barn doors after the cows have run away. Also, since the majority of the training data is just 'the internet', you can't get rid of this except by monitoring everyone's use of the internet to make sure they aren't scraping the web for the purpose of later training an AI on that data. If all you do is cause some bureaucratic hassle for US companies, you give an edge to non-US groups and probably also encourage US groups to go elsewhere.
The usual question is... will the US cede the AI research advantage to countries which do not care about US copyright?
I don't especially think AI capability increases are bad on the margin, but if I did, I would think of this as a multilateral disarmament problem where those who have the most capabilities (relative to something like population or economy) and the worst technological coordination should disarm first, similar to nukes; that would currently point to the US and UK over China etc. China has more precedent for government control over the economy than the West, so it could more easily coordinate an AI slowdown.
Export embargoes of GPUs can be used against those countries, as the US is currently attempting to do with China. "Country X unfairly competes by violating US copyright" is likely a very good argument to motivate the US government to pursue embargoes like that.
It's also worth noting that this doesn't hamper all AI fields. AlphaFold, for example, works well on bioinformatics data that is licensed in a way that everyone can use it freely.
In addition to just slowing everything down, it also creates pressure for better curation of data, as it makes it easier for companies that sell curated data to operate. Having people make more explicit decisions about which data to include in models might be beneficial for safety.
It is not clear to me that export controls have been very successful so far, in pretty much any area.
Export controls don't completely eliminate exports, but they do make them harder and more expensive. That's especially true if someone needs a lot of GPUs.
I think the mistake here is thinking that just because something gets harder and more expensive, it will actually slow or materially reduce some stated outcome. The reason is that the costs and burdens may not fall on the targeted locus. In the GPU case, that may just mean that none of those GPUs end up in uses such as consumer markets or gaming systems/cards.
While not perfect comparisons, in your view just how effective (say, in time to build, time to improve, or ability to pursue an activity) have export controls been regarding Russia's military, North Korea's missile program, China's tech development, and Iran's nuclear program?
For me, I can only see that the export efforts have imposed costs on the general population (to differing degrees) rather than materially impacting the actual target activity. Perhaps I'm missing something, and each of the above would have been much more of a problem had the sanctions/export restrictions not been put in place.
That said, I would agree that from a pure "on principle" basis, not aiding and abetting something or someone you think is a danger to you is the right thing to do. However, I do think in this case it comes much closer to virtue signalling than to actions producing some binding constraint on the efforts.
Putin's military seems to be running out of high-precision ammunition and is not doing well on the battlefield currently. It's hard to say how much of this is due to export controls and how much is due to other factors.
North Korea doesn't seem to have a lot of reliable intercontinental missiles. Their tech development seems to have slowed considerably; it isn't zero, but it also isn't fast.
China's tech advances are a result of the West outsourcing a lot of tech production to China and not having embargoes. I don't know much about Chinese military tech.
Iran still doesn't have nuclear weapons. It might still get them, but there has certainly been a slowdown in development.
Nuclear and ballistic technology has been pretty well understood for decades; developing AGI without existing designs to copy will be much harder.
Epistemic status: I'd give >10% on Metaculus resolving the following as conventional wisdom[1] in 2026.
In the sense that, e.g., "chain of thought improves capabilities" is conventional wisdom in 2022.
In the sense of x-safety. I have no confident insight either way into how abstaining from very large human-generated datasets influences capabilities long-term. If someone does, please refrain from discussing it publicly, of course.
Discussion by Jason Calacanis and Molly Wood about why they think that artists likely have rights that can be violated by DALL-E 2:
We frequently speak about AI capability gains being bad because they shorten the timeframe for AI safety research. By that logic, taking steps to decrease AI capabilities would be worthwhile.
At the moment, large language models are trained on a lot of data that the company training the model does not license. If there were a requirement to license the training data, that would severely reduce the data available for language models and reduce their capabilities.
It's expensive to fight lawsuits in the United States. Currently, there are artists who feel that their rights are being violated by DALL-E 2 using their art as training data. Similar to how Thiel funded the Gawker lawsuits, it would be possible to support artists in a suit against OpenAI to require OpenAI to license images for training DALL-E 2. If such a lawsuit is well funded, it becomes much more likely that a precedent for requiring data licensing gets set, which would slow down AI development.
I'm curious about what people who think more about AI safety than myself think about such a move. Would it be helpful?