I remember right when the negative results started hitting. I could feel the cope rising. I recognized the pattern, the straining against truth. I queried myself for what I found most painful: it was actually just losing a bet. I forced the words out of my mouth: "I guess I was wrong to be excited about this particular research direction. And Ryan was more right than I was about this matter."
After that, it was all easier. What was there to be afraid of? I'd already admitted it!
I find your commitment to the basics of rational epistemology inspiring.
Keep it up and let me know if you could use support.
The link to https://www.alignmentforum.org/users/ryan_greenblatt seems malformed: it uses - instead of _.
Coauthored with Mark Kurzeja
This result is largely negative, but I wanted to share it to increase scientific understanding of steering! We also conducted a postmortem on why the method stopped outperforming baselines.
I'd also like to note that @ryan_greenblatt's skepticism predicted this outcome more strongly than my worldview did. I want him to get points for that. :) While I think steering has targeted applications and provides clues about how LLMs function, it's not a slam-dunk Pareto improvement on the benchmarks we care about.
Read at https://turntrout.com/gemini-steering![1]
Also mirrored on the GDM safety research Medium.