If you told an AI alignment researcher in 2018 about an alignment plan that involved collecting trajectory information from moral experts at scale and training an AI to copy it, they would point out that this would not scale to superintelligence. This is essentially what major AI labs and most...
Epistemic Status: I wrote this for an application, then realized it might be of interest to others or spark a conversation. Yoshua Bengio and LawZero are important players in AI Safety, so I think we should have a conversation about their ideas. I have two substantial concerns with Yoshua Bengio’s...
Status: Very early days of the research. No major proof of concept yet. The aim of this research update is to solicit feedback and encourage others to experiment with my code. GitHub repository Main Idea: Researchers are using SAE latents to steer model behaviors, yet human-designed selection algorithms are unlikely...
Gulsimo Osimi, Matthew Khoriaty. Executive Summary: This research investigates the potential of collaborative AI systems to solve complex tasks through the orchestration of multiple specialized models. By leveraging Microsoft's AutoGen framework, we coordinate DeepSeek and GPT-4o Mini to tackle complex challenges. Beyond performance enhancement, our work makes a novel contribution...
EDIT: The folks at EleutherAI have taken and improved my work, and extended it to work with the Sparsify framework in addition to my use of SAE-Lens! I'm super proud that I contributed to that :) That's me! I haven't used the new system, but if you want to use...
This is a shortened version of "Preventing AI from Overthrowing the Government" from my Substack, trimmed to focus on the things that would be interesting to LessWrong. The full version with citations can be found here. Introduction: Many AI labs, including OpenAI, are forthright in saying that their goal is to create...