Steering Gemini with BiDPO

by TurnTrout

31st Jan 2025

AI Alignment Forum

This is a linkpost for https://turntrout.com/gemini-steering

Coauthored with Mark Kurzeja

A while back, we explored the “BiDPO” method for training steering vectors. In Gemini 1.5v1 Flash and Pro, BiDPO steering vectors boosted TruthfulQA scores by >10% while mostly retaining capabilities. When we updated to Gemini 1.5v2, prompt-based steering baselines became significantly stronger. BiDPO did not beat the stronger baselines, ending the project.
...
BiDPO seems effective and sample-efficient but does not currently exceed more standard baselines. It’s hard to draw firm conclusions about BiDPO because TruthfulQA might not be measuring truthfulness/factuality. However, we remain excited about DPO-driven Conditional Activation Steering, which has additional advantages, particularly for targeted loss mitigation.

This result is largely negative. I wanted to share it to increase scientific understanding around steering! We also conducted a postmortem on why the method stopped outperforming baselines.

I'd also like to note that @ryan_greenblatt's skepticism predicted this outcome more strongly than my worldview did. I want him to get points for that. :) While I think steering has targeted applications and provides clues about how LLMs function, it's not a slam-dunk Pareto improvement on benchmarks we care about.

Read at https://turntrout.com/gemini-steering!^[1]

^[1] Also mirrored on the GDM safety research Medium.

Activation Engineering, Postmortems & Retrospectives, AI

5 comments

TurnTrout, 2mo

I remember right when the negative results started hitting. I could feel the cope rising. I recognized the pattern, the straining against truth. I queried myself for what I found most painful - it was actually just losing a bet. I forced the words out of my mouth: "I guess I was wrong to be excited about this particular research direction. And Ryan was more right than I was about this matter."

After that, it was all easier. What was there to be afraid of? I'd already admitted it!