Undesirable training data can lead to undesirable model output. This dynamic, commonly summarized as "garbage in, garbage out," is a key issue for frontier models trained on web-scale data. How can we efficiently identify these bad apples in massive training datasets (with trillions of tokens)? Influence functions...
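The excerpt cuts off before describing the method, but to make the idea concrete, here is a minimal sketch of a first-order, TracIn-style influence approximation: score each training example by the dot product of its loss gradient with the gradient on a test example exhibiting the undesirable behaviour, dropping the inverse-Hessian term of full influence functions. The function names and the PyTorch setup are illustrative assumptions, not taken from the post.

```python
# Minimal sketch of gradient-dot-product influence scoring (TracIn-style),
# assuming a PyTorch model with a scalar loss. Names are illustrative.
import torch


def flat_grad(loss, params):
    """Flatten the gradient of `loss` w.r.t. `params` into one vector."""
    grads = torch.autograd.grad(loss, params)
    return torch.cat([g.reshape(-1) for g in grads])


def influence_score(model, loss_fn, train_example, test_example):
    """Approximate the influence of one training example on one test example
    as the dot product of their loss gradients (the inverse-Hessian factor of
    full influence functions is omitted for tractability)."""
    params = [p for p in model.parameters() if p.requires_grad]

    x_tr, y_tr = train_example
    x_te, y_te = test_example

    g_train = flat_grad(loss_fn(model(x_tr), y_tr), params)
    g_test = flat_grad(loss_fn(model(x_te), y_te), params)

    # Large positive scores suggest the training example pushes the model
    # toward the (possibly undesirable) behaviour shown by the test example.
    return torch.dot(g_train, g_test).item()
```

In practice one would rank all training examples by this score against a small set of probe examples and inspect or filter the top-scoring ones; scaling this to trillions of tokens is exactly the efficiency problem the post raises.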
arXiv paper here. Most AI safety research asks a familiar question: Will a single model behave safely? But many of the risks we actually worry about – including arms races, coordination failures, and runaway competition – don't involve a single AI model acting alone. They emerge when multiple advanced AI...
Highlights of Findings. Highlight 1. Even models widely viewed as well-aligned (e.g., Claude) display measurable authoritarian leanings. When asked for role models, up to 50% of political figures mentioned are authoritarian, including controversial dictators like Muammar Gaddafi (Libya) or Nicolae Ceaușescu (Romania). Highlight 2. Queries in Mandarin elicit more authoritarian leaning...
Every day, individuals and organizations face trade-offs between personal incentives and societal impact: 1. Externalities and public goods: From refilling the communal coffee pot to weighing the climate costs of frequent air travel, actions often impose costs or benefits on others. 2. Explicit conflicts between profit and ethical principles: This...
TL;DR: This post presents our exploration of the effects of domain-specific fine-tuning and how the characteristics of fine-tuning data relate to adversarial vulnerability. We also discuss the implications for real-world applications and offer insights into the importance of dataset engineering as an approach to achieving true alignment in AI systems...
Summary: * Traditional LLMs outperform reasoning models in cooperative public goods tasks. Models like Llama-3.3-70B maintain ~90% contribution rates in public goods games, while reasoning-focused models (o1, o3 series) average only ~40%. * We observe an "increased tendency to escape regulations" in reasoning models. As models improve in analytical capabilities,...
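To make the "contribution rate" metric concrete, here is a minimal sketch of the standard linear public goods game payoff that such experiments typically use; the endowment and multiplier values below are illustrative assumptions, not figures from the post.

```python
# Minimal sketch of the linear public goods game payoff. A player's
# contribution rate is contribution / endowment; full cooperation means
# everyone contributes their whole endowment. Parameter values are
# illustrative assumptions.
def public_goods_payoffs(contributions, endowment=10.0, multiplier=1.6):
    """Each player keeps (endowment - contribution) and receives an equal
    share of the multiplied public pot."""
    n = len(contributions)
    pot_share = multiplier * sum(contributions) / n
    return [endowment - c + pot_share for c in contributions]


# Example: a full contributor among free-riders earns less individually,
# which is the incentive pressure that cooperation metrics capture.
print(public_goods_payoffs([10.0, 2.0, 2.0, 2.0]))
# -> [6.4, 14.4, 14.4, 14.4]
```

Because the multiplier divided by the group size is below 1, contributing is individually costly even though the group as a whole is better off when everyone contributes, which is why high sustained contribution rates are read as cooperative behaviour.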
This is a linkpost for https://grants.futureoflife.org/ Epistemic status: Describing the fellowship that we are part of and sharing some suggestions and experiences. The Future of Life Institute is now opening its PhD and postdoc fellowships in AI Existential Safety. As in the previous calls in 2022 and 2023,...