Daniel Lee

Finding Features Causally Upstream of Refusal

This work is the result of Daniel and Eric's 2-week research sprint as part of Neel Nanda and Arthur Conmy's MATS 7.0 training phase. Andy was the TA during the research sprint. After the sprint, Daniel and Andy extended the experiments and wrote up the results. A notebook that contains...

Jan 14, 2025 · 55

Investigating Sensitive Directions in GPT-2: An Improved Baseline and Comparative Analysis of SAEs

Experiments and write-up by Daniel, with advice from Stefan. The GitHub repo for the work can be found here. Update (October 31, 2024): Our paper is now available on arXiv. TL;DR: By perturbing activations along specific directions and measuring the resulting changes in the model output, we attempt to infer how...

Sep 6, 2024 · 28

LESSWRONG