AdamGleave

Avoiding AI Deception: Lie Detectors can either Induce Honesty or Evasion

Large language models (LLMs) are often fine-tuned after training using methods like reinforcement learning from human feedback (RLHF). In this process, models are rewarded for generating responses that people rate highly. But what people like isn’t always what’s true. Studies have found that models learn to give answers that humans...

Jun 5, 2025

Illusory Safety: Redteaming DeepSeek R1 and the Strongest Fine-Tunable Models of OpenAI, Anthropic, and Google

DeepSeek-R1 has recently made waves as a state-of-the-art open-weight model, with potentially substantial improvements in model efficiency and reasoning. But like other open-weight models and leading fine-tunable proprietary models such as OpenAI’s GPT-4o, Google’s Gemini 1.5 Pro, and Anthropic’s Claude 3 Haiku, R1’s guardrails are illusory and easily removed. An...

Feb 7, 2025

GPT-4o Guardrails Gone: Data Poisoning & Jailbreak-Tuning

Imagine your once reliable, trusty AI assistant suddenly suggesting dangerous actions or spreading misinformation. This is a growing threat as large language models (LLMs) become more capable and pervasive. The culprit? Data poisoning, where LLMs are trained on corrupted or harmful data, potentially turning powerful tools into dangerous liabilities. Our...

Nov 1, 2024

Pacing Outside the Box: RNNs Learn to Plan in Sokoban

Work done at FAR AI. There has been a lot of conceptual work on mesa-optimizers: neural networks that develop internal goals that may differ from their training objectives (the inner alignment problem). There is an abundance of good ideas for empirical work (find search in a NN, interpret it), but...

Jul 25, 2024

Does robustness improve with scale?

Adversarial vulnerabilities have long been an issue in various ML systems. Large language models (LLMs) are no exception, suffering from issues such as jailbreaks: adversarial prompts that bypass model safeguards. At the same time, scale has led to remarkable advances in the capabilities of LLMs, leading us to ask: to...

Jul 25, 2024

Beyond the Board: Exploring AI Robustness Through Go

Last year, we showed that supposedly superhuman Go AIs can be beaten by human amateurs playing specific “cyclic” patterns on the board. Vulnerabilities have previously been observed in a wide variety of sub- or near-human AI systems, but this result demonstrates that even far superhuman AI systems can fail catastrophically...

Jun 19, 2024

More people getting into AI safety should do a PhD

Doing a PhD is a strong option to get great at developing and evaluating research ideas. These skills are necessary to become an AI safety research lead, one of the key talent bottlenecks in AI safety, and are helpful in a variety of other roles. By contrast, my impression is...

Mar 14, 2024

Even Superhuman Go AIs Have Surprising Failure Modes

AI Safety in a World of Vulnerable Machine Learning Systems

Introducing the Fund for Alignment Research (We're Hiring!)

More people getting into AI safety should do a PhD
