Kyle O’Brien

Comments


What are your views on open-weights models? My takeaway from this post is that if closed models are not actually significantly safer with respect to these risks, it may not be worth giving up the many benefits of open models.

Great work, folks. This further highlights a challenge that wasn't obvious to me when I first began studying SAEs: which features are learned is heavily contingent on the SAE's size, sparsity, and training data. Ablations like this one are important.
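To make that concrete, here is a minimal sketch of the kind of setup I mean, assuming the standard ReLU autoencoder with an L1 sparsity penalty; the sizes and the l1_coeff name are illustrative, not taken from any particular codebase:

```python
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    """Toy ReLU SAE over model activations (illustrative only)."""

    def __init__(self, d_model: int, d_hidden: int, l1_coeff: float = 1e-3):
        super().__init__()
        # d_hidden is the "SAE size": how many candidate features exist at all.
        self.encoder = nn.Linear(d_model, d_hidden)
        self.decoder = nn.Linear(d_hidden, d_model)
        # l1_coeff is the sparsity knob: higher values -> fewer active features.
        self.l1_coeff = l1_coeff

    def forward(self, x: torch.Tensor):
        f = torch.relu(self.encoder(x))        # nonnegative feature activations
        x_hat = self.decoder(f)                # reconstruction
        recon = ((x_hat - x) ** 2).mean()      # reconstruction loss
        sparsity = f.abs().sum(dim=-1).mean()  # L1 penalty on activations
        return x_hat, f, recon + self.l1_coeff * sparsity

# Which features the dictionary learns depends on d_hidden, l1_coeff,
# and the distribution of activations x it is trained on.
sae = SparseAutoencoder(d_model=512, d_hidden=4096, l1_coeff=1e-3)
x = torch.randn(32, 512)  # stand-in for a batch of model activations
x_hat, features, loss = sae(x)
```

Change any one of those three knobs and you can end up with a noticeably different feature dictionary, which is exactly why ablations like this matter.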

I agree with this suggestion. EleutherAI's alignment channels have been invaluable for my understanding of the alignment problem. I typically get insightful responses and explanations the same day I post. Answering other folks' questions has also helped deepen my own inside view.

There is an alignment-beginners channel and an alignment-general channel. Your questions seem similar to what I see in alignment-general. For example, I received helpful answers when I asked this question about inverse reinforcement learning there yesterday.

Question: When I read Human Compatible a while back, my takeaway was that Stuart Russell was very bullish on Inverse Reinforcement Learning as an important alignment research direction. However, I don't see much mention of IRL on EleutherAI or the Alignment Forum; I see much more content about RLHF. Are IRL and RLHF the same thing? If not, what are folks' thoughts on IRL?