Amirali Abdullah


Steering Language Models in Multiple Directions Simultaneously

Narmeen developed, ideated and validated K-steering at Martian. Luke generated the baselines, figures and wrote this blog post. Amir proposed the research direction and supervised the project. The full interactive blog will be available closer to the publication of the complete paper on the Martian website. TL;DR: We introduce K-steering,...

May 2, 2025

Backdoors have universal representations across large language models

by Narmeen Oozeer, Dhruv Nathawani, Nirmalendu Prakash, Amirali Abdullah. This work was done by Narmeen Oozeer as a research fellow at Martian, under an AI safety grant supervised by PIs Amirali Abdullah and Dhruv Nathawani. Special thanks to Sasha Hydrie, Chaithanya Bandi and Shriyash Upadhyay at Martian for suggesting researching...

Dec 6, 2024

Early Experiments in Reward Model Interpretation Using Sparse Autoencoders

This research was performed by Luke Marks, Amirali Abdullah, nothoughtsheadempty and Rauno Arike. Special thanks to Fazl Barez from Apart Research for overseeing the project and contributing greatly to direction and oversight throughout. We'd also like to thank Logan Riggs for feedback and suggestions regarding autoencoder architecture and experiment design....

Oct 3, 2023
