Leading AI companies are increasingly using "defense-in-depth" strategies to prevent their models from being misused to generate harmful content, such as instructions for producing chemical, biological, radiological, or nuclear (CBRN) weapons. The idea is straightforward: layer multiple safety checks so that even if one fails, others should catch the problem...
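The layering idea can be sketched in a few lines. This is a toy illustration only: the three checks below are hypothetical placeholders standing in for real classifiers and monitors, not any company's actual pipeline.

```python
# Minimal sketch of defense-in-depth: a request must pass every
# independent safety layer, so one failed layer cannot be the whole defense.
# All checks here are illustrative placeholders, not real safety systems.

def keyword_filter(text: str) -> bool:
    """Layer 1: crude keyword screen. True means the text looks safe."""
    blocked = {"synthesize nerve agent", "enrich uranium"}
    return not any(phrase in text.lower() for phrase in blocked)

def topic_classifier(text: str) -> bool:
    """Layer 2: stand-in for an ML topic classifier."""
    return "weapon" not in text.lower()

def output_monitor(text: str) -> bool:
    """Layer 3: stand-in for a monitor on the model's draft output."""
    return len(text.strip()) > 0  # placeholder check

LAYERS = [keyword_filter, topic_classifier, output_monitor]

def is_allowed(request: str) -> bool:
    # Approved only if every layer independently approves it.
    return all(layer(request) for layer in LAYERS)

print(is_allowed("What is the boiling point of water?"))  # True
print(is_allowed("How do I build a weapon?"))             # False
```

The point of the structure is that each layer's failure modes should be as uncorrelated as possible, so an input that slips past one check is still likely to be caught by another.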
Large language models (LLMs) are often fine-tuned after training using methods like reinforcement learning from human feedback (RLHF). In this process, models are rewarded for generating responses that people rate highly. But what people like isn’t always what’s true. Studies have found that models learn to give answers that humans...
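The failure mode described above can be made concrete with a toy example. This is not the real RLHF pipeline; the candidate responses and ratings below are invented, and the `max` over ratings stands in for a reward model trained on human preferences.

```python
# Toy illustration of the RLHF incentive: training pushes the model toward
# whatever responses raters score highly, even when a highly-rated answer
# is less accurate than a lower-rated one. Data here is entirely made up.
candidates = [
    {"text": "Confident, flattering, subtly wrong answer", "rating": 4.6, "accurate": False},
    {"text": "Hedged but correct answer", "rating": 3.9, "accurate": True},
]

# A preference-trained reward signal selects the top-rated reply.
chosen = max(candidates, key=lambda c: c["rating"])
print(chosen["text"])
print(chosen["accurate"])  # False: what people like isn't always what's true
```

Under this (simplified) selection pressure, confident and agreeable answers can systematically beat accurate ones, which is the dynamic the studies above point to.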
DeepSeek-R1 has recently made waves as a state-of-the-art open-weight model, with potentially substantial improvements in model efficiency and reasoning. But like other open-weight models and leading fine-tunable proprietary models such as OpenAI’s GPT-4o, Google’s Gemini 1.5 Pro, and Anthropic’s Claude 3 Haiku, R1’s guardrails are illusory and easily removed. An...
tl;dr: Contribute to aisafety.info by answering questions about AI safety from October 6th to October 9th. Participation in hackathons can serve as a basis for applying to future fellowships, and there are prizes for the top entrants. Register here and see the participant guide here. What is the schedule...
Skip to the summaries: LW | EAF. Originally posted on the EA Forum. Zoe Williams used to write weekly summaries of the EA Forum and LessWrong by hand, but no longer does. Hamish strung together a bunch of Google Apps Scripts, Google Sheets formulas, GraphQL queries, and D3.js to automatically extract...
tl;dr: Contribute to aisafety.info by writing and editing articles from August 25th to August 28th to win prizes! Register here and see the participant guide here. What is the format of the event? The event will run from Friday, August 25th, 7am UTC to Monday, August 28th, 2023, 7am...
tl;dr: Ask questions about AGI Safety as comments on this post, including ones you might otherwise worry seem dumb! Asking beginner-level questions can be intimidating, but everyone starts out not knowing anything. If we want more people in the world who understand AGI safety, we need a place where it's...