Your title question and your closing question are not the same. Oil is valuable when demand exceeds supply and when easy-to-access supply declines and needs to be replaced with harder-to-access supply. Demand grows when there are more things we can do with oil and falls when we find better ways to do those things. When and where oil is abundant and easy to access, it's cheap. Why should data be different?
But metaphors aside, your closing question is still interesting. More data should never have zero or negative value. And if the key is choosing the right data, then I'd expect the heavily-trained-on-big-data LLMs (or their equivalents) to be part of the solution for generating, validating, and curating data for training more efficient, targeted models in the future.
Consider the limiting cases. Right now we start with a randomly initialized model and use lots of data. The far end of the intelligence curve may or may not look like Deep Thought's "even before its data banks had been connected up it had started from I think therefore I am and got as far as deducing the existence of rice pudding and income tax before anyone managed to turn it off" level. But even for Deep Thought, more data is still better. It can make better use of any individual piece of data than any less capable system could. It just may not need it. So the value can be high, even if the price-per-unit it's willing to pay is low because of its excellent BATNA. Just like how the value of a tiny amount of precious metal is extremely high to me for what it does in my electronics and in catalysts used to make products I buy (compared to the ancients, who could only use it for decoration, jewelry, and money), but that doesn't mean I'll pay an exorbitant amount for lots of gold.
That's as far as my speculation can go here, but that's how I think about it.
In the age of big data and machine learning, it was common to say that data was the new oil. Every big company spent a lot of money building Hadoop clusters and hiring data scientists to discover new things in the data they had. It's easy to see how big data failed to live up to its promise.
Today, when companies train their LLMs, they can use publicly available datasets: Common Crawl supplied 82% of the raw tokens used to train GPT-3, and its latest version is 428TB. The new term of art is "high-value tokens": in this world, adding some data generated by humans specifically for the dataset (e.g. RLHF data) helps far more than just adding lots of unrelated data.
But I keep seeing tech executives talk a lot about how important data is, which runs opposite to my understanding of AI. Here's Larry Ellison in the latest Oracle earnings call, replying to a question about whether the data gravity in his systems of record would change with the advent of AI:
This trend is common among other companies' executives too (obviously Ellison is biased toward answers that put Oracle in a good spot).
But what Ellison is saying goes contrary to my intuition about AI. Companies like Nuance or Cerner may need data to train their medical LLMs, but a hospital, or even an insurer like UnitedHealth or a pharma company like Pfizer, has no edge whatsoever from having more data. Most of these tokens will be bought, generated by hired humans (you can literally hire 500 doctors to generate them), or taken from what is publicly available.
In this world, some big companies will spend a lot to generate data (it's speculated that Google already has a data budget in the billions); they'll buy it from other companies and pay human beings to generate it (which makes total sense when there's a real possibility the world will spend $100B on GPUs in 2024). But what you want is high-quality data, and data that isn't repetitive.
Does Less Wrong agree that data is less valuable in the new world of AI?