Open source software has long differentiated between “free as in speech” (libre) and “free as in beer” (gratis). In the first case, libre software has a license that allows the user freedom to view the source and modify it, understand it, and remix it. In the second case, gratis software does not need to be paid for, but the user doesn’t necessarily have access to the pieces, can’t make new versions, and cannot remix or change it.
...
If Open Source AI is neither gratis nor libre, then those calling free model weights “Open Source” should figure out what free means to them. Perhaps it’s “free as in oxygen” (dangerous due to reactions it can cause), or “free as in birds” (wild, without any person responsible).
I’m not necessarily opposed to judicious release of model weights, though as with any technology, designers and developers should consider the impact of their work before making or releasing it, as LeCun has recently agreed. But calling this new competitive strategy by Facebook “Open Source” without insisting on the actual features of open source is an insult to the name.
Yeah. I think it is fairly arguable that open datasets aren't required for open source: they're not the form you'd prefer to modify a model in as a programmer, and they're not exactly code to start with. Shakespeare didn't write his plays as programming instructions for algorithms to generate Shakespeare-like plays. No one wants a trillion tokens that take ~$200k to 'compile' as their starting point when someone else has already done that and made the result available. (Hyperbolic; but the reasons someone wants that generally aren't the same reasons they'd want to compile from source code.) Open datasets are nice information to have, but they lack a reasonable reproduction cost or much direct utility beyond explanation.
The case that Llama is not open source is made much more strongly by the restrictions on usage, as noted in the prior discussion. There is a meaningful distinction between open and closed datasets, but I'd describe the jump from Mistral to Llama as going from 'open source with hidden supply chain' to 'open weights with restrictive licensing', whereas the jump from Mistral to RedPajama is more like going from 'open source with hidden supply chain' to 'open source with revealed supply chain'.