Open source software has long differentiated between “free as in speech” (libre) and “free as in beer” (gratis). In the first case, libre software has a license that allows the user freedom to view the source and modify it, understand it, and remix it. In the second case, gratis software does not need to be paid for, but the user doesn’t necessarily have access to the pieces, can’t make new versions, and cannot remix or change it.
...
If Open Source AI is neither gratis or libre, then those calling free model weights “Open Source,” should figure out what free means to them. Perhaps it’s “free as in oxygen” (dangerous due to reactions it can cause), or “free as in birds” (wild, without any person responsible).
I’m not necessarily opposed to judicious release of model weights, though as with any technology, designers and developers should consider the impact of their work before making or releasing it, as LeCun has recently agreed. But calling this new competitive strategy by Facebook “Open Source” without insisting on the actual features of open source is an insult to the name.
I agree that the reasons someone wants the dataset generally aren't the same reasons they'd want to compile from source code. But there's a lot of utility for research in having access to the dataset even if you don't recompile. Checking whether there was test-set leakage for metrics, for example, or assessing how much of LLM ability is stochastic parroting of specific passages versus recombination. And if it was actually open, these would not be hidden from researchers.
And supply chain is a reasonable analogy - but many open-source advocates make sure that their code doesn't depend on closed / proprietary libraries. It's not actually "libre" if you need to have a closed source component or pay someone to make the thing work. Some advocates, those who built or control quite a lot of the total open source ecosystem, also put effort into ensuring that the entire toolchain needed to compile their code is open, because replicability shouldn't be contingent on companies that can restrict usage or hide things in the code. It's not strictly required, but it's certainly relevant.