Open source software has long differentiated between “free as in speech” (libre) and “free as in beer” (gratis). In the first case, libre software has a license that gives the user the freedom to view the source, understand it, modify it, and remix it. In the second case, gratis software costs nothing, but the user doesn’t necessarily have access to the pieces, can’t make new versions, and can’t remix or change it.
...
If Open Source AI is neither gratis nor libre, then those calling free model weights “Open Source” should figure out what “free” means to them. Perhaps it’s “free as in oxygen” (dangerous due to reactions it can cause), or “free as in birds” (wild, without any person responsible).
I’m not necessarily opposed to the judicious release of model weights, though, as with any technology, designers and developers should consider the impact of their work before making or releasing it, as LeCun has recently agreed. But calling this new competitive strategy by Facebook “Open Source” without insisting on the actual features of open source is an insult to the name.
Oh, I do; they're just generally not quite the best available or most popular with hobbyists. Some I can find quickly enough are Pythia and OpenLLaMA, and some of the RedPajama models Together.ai trained on their own RedPajama dataset (which is freely available and described). (Also the mentioned Falcon and MPT, as well as StableLM. You might have to get into the weeds to find out how much of the data processing step is replicable.)
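For anyone curious, here's a minimal sketch of pulling down one of those fully documented models; it assumes the Hugging Face `transformers` library (with PyTorch installed) and picks the smallest Pythia checkpoint purely to keep the download tiny:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Pythia checkpoints are published along with their training code, data
# ordering, and intermediate snapshots, so the whole recipe is inspectable.
name = "EleutherAI/pythia-70m"  # smallest size, chosen only for illustration

tokenizer = AutoTokenizer.from_pretrained(name)
model = AutoModelForCausalLM.from_pretrained(name)

prompt = "Open source means"
inputs = tokenizer(prompt, return_tensors="pt")
output = model.generate(**inputs, max_new_tokens=20)
print(tokenizer.decode(output[0]))
```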
(It's going to be expensive to replicate any big pretrained model though, and possibly not deterministic enough to do it perfectly; especially since datasets sometimes adjust due to removing unsafe data, the recipes for data processing included random selection and shuffling from the datasets, etc. Smaller examples where people have fine-tuned using the same recipe coincidentally or intentionally have gotten identical model weights though.)
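To illustrate why exact replication is fragile, here's the usual kind of seeding boilerplate a run needs before it's even a candidate for bit-for-bit repeatability; a sketch assuming PyTorch, and even then the dataset, its ordering, the hardware, and the library versions all still have to match:

```python
import os
import random

import numpy as np
import torch

def seed_everything(seed: int = 0) -> None:
    """Pin every source of randomness we control for a training or fine-tuning run."""
    random.seed(seed)                 # Python-level shuffling and sampling
    np.random.seed(seed)              # NumPy-based data processing
    torch.manual_seed(seed)           # weight init, dropout, CPU ops
    torch.cuda.manual_seed_all(seed)  # GPU ops
    # Required for deterministic cuBLAS kernels; then fail loudly if any
    # nondeterministic CUDA operation gets used anyway.
    os.environ["CUBLAS_WORKSPACE_CONFIG"] = ":4096:8"
    torch.use_deterministic_algorithms(True)

seed_everything(42)
```

Even with all of that pinned, two runs only match if the data and the software stack are identical, which is exactly what shifting datasets make hard.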