How model editing could help with the alignment problem
Preface

This article explores the potential of model editing techniques for aligning future AI systems. I was initially skeptical of model editing's efficacy for this purpose, especially given the objectives of current model editing methods. I argue that merely editing "facts" is not an adequate alignment strategy, and I end with suggestions for research avenues focused...
For interpretability research, one effort currently underway is a set of tutorials that replicate results from recent papers in NNsight: https://nnsight.net/applied_tutorials/
What I find cool about this particular effort is that, because the implementations are done with NNsight, it becomes easier both to adapt the experiments to new models and to run them remotely.
(Disclaimer: I work on the NDIF/NNsight project, though not on this initiative, so take my enthusiasm with a grain of salt.)