This is my first post on here so please be lenient if I fail to follow any norms.
With the explosion of AI-generated images and text from DALL-E 2, Midjourney, and GPT-3, it does not seem unreasonable to assume that a non-negligible part of the internet's content might become AI-generated. This in itself is not problematic; I actually look forward to an internet where everyone has the means to create awesome media for almost nothing.
But since most large language models and multimodal AIs are trained on datasets that consist of essentially "all of the internet", we could end up with a feedback loop: if a lot of an old model's output ends up in the training data for a new model, progress will slow and the new models will essentially emulate the old ones. Maybe I'm underestimating the amount of content on the internet or overestimating the uptake of AI, and this would of course only happen if AI development is slow enough for the internet to "fill up" with AI-generated content first.
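To make the feedback-loop intuition concrete, here is a minimal toy sketch (my own illustration, nothing from a real training pipeline): the "model" is just a Gaussian fitted to its data, and each generation is retrained purely on samples from the previous generation's model. Estimation error compounds, so the fitted distribution drifts and its spread tends to shrink over generations.

```python
import numpy as np

rng = np.random.default_rng(seed=0)

# "Human" data: 200 samples from a standard normal distribution.
data = rng.normal(loc=0.0, scale=1.0, size=200)

for generation in range(31):
    # "Train" the model: fit a Gaussian to whatever data we currently have.
    mu, sigma = data.mean(), data.std()
    if generation % 5 == 0:
        print(f"generation {generation:2d}: mu = {mu:+.3f}, sigma = {sigma:.3f}")
    # The next generation's training data is purely the current model's output.
    data = rng.normal(loc=mu, scale=sigma, size=200)
```

With real models the training data would only be partially AI-generated, so any such effect would be much slower, but the direction of the worry is the same: each generation can only learn what the previous one emitted, plus noise.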
And sure, we could say that we only use human-generated content as training data, but since we're literally trying to make AI-generated content indistinguishable from human-generated content, we are actively working against ourselves: any success at generation undermines our ability to filter.
Essentially my question is "is training data dilution going to be a thing?"
I wouldn't be surprised if training data becomes hard to come by for one reason or another, dilution included.