Undesirable training data can lead to undesirable model output. This dynamic, commonly summarized as "garbage in, garbage out," is a key issue for frontier models trained on web-scale data. How can we efficiently identify these bad apples in massive training datasets (with trillions of tokens)? Influence functions...
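The excerpt cuts off before describing the method, but to make the idea concrete, here is a minimal sketch of a first-order, TracIn-style influence approximation: score each training example by the dot product of its loss gradient with the gradient on a test example exhibiting the undesirable behaviour, dropping the inverse-Hessian term of full influence functions. The function names and the PyTorch setup are illustrative assumptions, not taken from the post.

```python
# Minimal sketch of gradient-dot-product influence scoring (TracIn-style),
# assuming a PyTorch model with a scalar loss. Names are illustrative.
import torch


def flat_grad(loss, params):
    """Flatten the gradient of `loss` w.r.t. `params` into one vector."""
    grads = torch.autograd.grad(loss, params)
    return torch.cat([g.reshape(-1) for g in grads])


def influence_score(model, loss_fn, train_example, test_example):
    """Approximate the influence of one training example on one test example
    as the dot product of their loss gradients (the inverse-Hessian factor of
    full influence functions is omitted for tractability)."""
    params = [p for p in model.parameters() if p.requires_grad]

    x_tr, y_tr = train_example
    x_te, y_te = test_example

    g_train = flat_grad(loss_fn(model(x_tr), y_tr), params)
    g_test = flat_grad(loss_fn(model(x_te), y_te), params)

    # Large positive scores suggest the training example pushes the model
    # toward the (possibly undesirable) behaviour shown by the test example.
    return torch.dot(g_train, g_test).item()
```

In practice one would rank all training examples by this score against a small set of probe examples and inspect or filter the top-scoring ones; scaling this to trillions of tokens is exactly the efficiency problem the post raises.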
arXiv paper here. Most AI safety research asks a familiar question: Will a single model behave safely? But many of the risks we actually worry about – including arms races, coordination failures, and runaway competition – don't involve a single AI model acting alone. They emerge when multiple advanced AI...
Highlights of Findings. Highlight 1. Even models widely viewed as well-aligned (e.g., Claude) display measurable authoritarian leanings. When asked for role models, up to 50% of political figures mentioned are authoritarian, including controversial dictators like Muammar Gaddafi (Libya) or Nicolae Ceaușescu (Romania). Highlight 2. Queries in Mandarin elicit more authoritarian leaning...
Every day, individuals and organizations face trade-offs between personal incentives and societal impact: 1. Externalities and public goods: From refilling the communal coffee pot to weighing the climate costs of frequent air travel, actions often impose costs or benefits on others. 2. Explicit conflicts between profit and ethical principles: This...
TL;DR: This post presents our exploration of the effects of domain-specific fine-tuning and how the characteristics of fine-tuning data relate to adversarial vulnerability. We also discuss the implications for real-world applications and offer insights into the importance of dataset engineering as an approach to achieving true alignment in AI systems...
Summary: * Traditional LLMs outperform reasoning models in cooperative public goods tasks. Models like Llama-3.3-70B maintain ~90% contribution rates in public goods games, while reasoning-focused models (o1, o3 series) average only ~40%. * We observe an "increased tendency to escape regulations" in reasoning models. As models improve in analytical capabilities,...
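To make the "contribution rate" metric concrete, here is a minimal sketch of the standard linear public goods game payoff that such experiments typically use; the endowment and multiplier values below are illustrative assumptions, not figures from the post.

```python
# Minimal sketch of the linear public goods game payoff. A player's
# contribution rate is contribution / endowment; full cooperation means
# everyone contributes their whole endowment. Parameter values are
# illustrative assumptions.
def public_goods_payoffs(contributions, endowment=10.0, multiplier=1.6):
    """Each player keeps (endowment - contribution) and receives an equal
    share of the multiplied public pot."""
    n = len(contributions)
    pot_share = multiplier * sum(contributions) / n
    return [endowment - c + pot_share for c in contributions]


# Example: a full contributor among free-riders earns less individually,
# which is the incentive pressure that cooperation metrics capture.
print(public_goods_payoffs([10.0, 2.0, 2.0, 2.0]))
# -> [6.4, 14.4, 14.4, 14.4]
```

Because the multiplier divided by the group size is below 1, contributing is individually costly even though the group as a whole is better off when everyone contributes, which is why high sustained contribution rates are read as cooperative behaviour.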
This is a linkpost for https://grants.futureoflife.org/ Epistemic status: Describing the fellowship that we are part of and sharing some suggestions and experiences. The Future of Life Institute is now opening its PhD and postdoc fellowships in AI Existential Safety. As in the previous calls in 2022 and 2023,...