Automating AI Safety: What we can do today
There have been multiple recent calls for the automation of AI safety and alignment research. Many people would likely contribute to this space but would benefit from clear directions on how to do so. Drawing on a recent SPAR project, and in light of the limitations of current systems, we provide a brief list of concrete projects for improving the ability of current and near-future agentic coding LLMs to execute technical AI safety experiments. We expect each of these could be meaningfully developed as a short-term (1 week to 3 months) project. This is in no way intended to be a comprehensive list, and we strongly welcome additional project ideas in the comments.

Note: Due to our background and current research areas, the examples in this post focus on mechanistic interpretability research. However, the general techniques here should be applicable to other sub-areas of technical alignment and safety research.

Concrete Projects

These projects are largely focused on improving LLM usage of current software packages, and are roughly in order of increasing scope. We include initial pilot versions of some of these ideas.

Improving LLM Usage of Relevant Software Packages

Compiled Monofiles

As noted in a recent paper by METR, current AI systems can often struggle due to a lack of sufficient context, particularly for large and/or complex codebases. One way to provide extensive contextual information about a package to a coding agent is to convert it into a single large file. As suggested in Building AI Research Fleets, "More generally, consider migrating to monorepos and single sprawling Google Docs to make it easier for your AI systems to load in the necessary context." However, while actually migrating research code to monorepos may improve LLM comprehension to a degree, it also destroys organization that is useful both to human coders and to AIs. Alternatively, existing repositories can be converted to single large files, e.g. llms.txt, which can
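As a rough illustration of the idea, the following is a minimal sketch of such a conversion: it walks a repository and concatenates matching source files into one large text file, prefixing each file's contents with its relative path so an agent can still locate where code lives. The function name, path headers, and file-extension filter are our own choices for illustration, not part of any standard llms.txt tooling.

```python
from pathlib import Path

def build_monofile(repo_root: str, out_path: str, exts=(".py", ".md")) -> None:
    """Concatenate all matching files under repo_root into one large file,
    with a path header before each file so context stays navigable."""
    root = Path(repo_root)
    with open(out_path, "w", encoding="utf-8") as out:
        for path in sorted(root.rglob("*")):
            if path.is_file() and path.suffix in exts:
                rel = path.relative_to(root)
                out.write(f"\n===== {rel} =====\n")
                out.write(path.read_text(encoding="utf-8", errors="replace"))
                out.write("\n")

# Example usage (hypothetical paths):
# build_monofile("my_research_repo", "llms.txt")
```

Keeping the per-file path headers preserves some of the organizational structure that a naive concatenation would discard.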
