Early Experiments in Reward Model Interpretation Using Sparse Autoencoders
This research was performed by Luke Marks, Amirali Abdullah, nothoughtsheadempty and Rauno Arike. Special thanks to Fazl Barez from Apart Research for overseeing the project and contributing greatly to its direction throughout. We'd also like to thank Logan Riggs for feedback and suggestions regarding autoencoder architecture and experiment design....