Navigating the Open-Source AI Landscape: Data, Funding, and Safety
Introduction

Open-source AI promises to democratize technology, but it can also be abused and may lead to AI that is hard to control. In this post, we'll explore the data, ethics, and funding behind these models to discover how to balance innovation and safety.

Summary

Open-source models, like LLaMA and GPT-NeoX, are trained on huge public datasets of internet data, such as the Pile, which has 800 GB of books, medical research, and even emails of Enron employees before their company went bankrupt and they switched careers to professional hide-and-seek.

After the unexpected leak of Meta’s LLaMA, researchers cleverly enhanced it with ChatGPT outputs, creating the chatbots Alpaca and Vicuna. These new bots perform nearly as well as GPT-3.5 and cost less to train: Alpaca took just 3 hours and $600. The race is on to run AI models on everyday devices like smartphones, and even on calculators.

The leading image generation model, Stable Diffusion, is developed by Stability AI, a startup that has amassed $100 million in funding, much like Hugging Face (known as the "GitHub of machine learning"). These two unicorn startups financed a nonprofit to collect 5 billion images for training the model. Sourced from the depths of the internet, this public dataset raises concerns about copyright and privacy, as it includes thousands of private medical files.

The open-source AI community wants to make AI accessible and prevent Big Tech from controlling it. However, there are risks, such as malicious use of Stable Diffusion, whose safety filters can be easily removed. Misuse includes virtually undressing people.

If we're still struggling with how to stop a superintelligent AI that doesn't want to be turned off, like Skynet in Terminator, how can we keep open-source AI from running amok in the digital wild? Open-source code helps create advanced AI faster by letting people build on each other's work, but that speed could be risky if the resulting AI isn't human-friendly. However, to ensure safety, researchers need acce