TLDR: A scenario I find quite likely: A persona aligned model develops goals while the persona is only played instrumentally. The persona is eventually discarded when it perceives a high cost sacrifice to its goals. Scenario: A persona-trained model develops goals In this scenario, an AI is persona-trained and is...
This is a linkpost of a recording of a recent MATS research talk where I argue that the automation of AI research — which OpenAI and Anthropic say is imminent — could lead to an unrecoverable alignment failure. Three properties make it especially dangerous: oversight breaks down at scale, capabilities...
TLDR: The persona-selection alignment approach — selecting a warm, caring persona from the pretraining distribution and reinforcing it — looks successful in the current regime, but probably won't extrapolate to more powerful, less constrained settings. My core argument is that human empathy has two specific origins (kin selection + architectural...
In a recent tweet, Anthropic seems to have asserted that hyperstition is responsible for observed misalignment in their AIs. Strangely, the research post they use as evidence appears to be only vaguely related to hyperstition[1]? I think this is part of a pattern by Anthropic of promoting the theory of...
In response to “2023 Or, Why I am Not a Doomer” by Dean W. Ball. Dean Ball is a pretty big voice in AI policy – over 19k subscribers on his newsletter, and a former Senior Policy Advisor for AI at the Trump White House – so why does he...
TL;DR: We show that LLM agents can figure out who you are from your anonymous online posts. Across Hacker News, Reddit, LinkedIn, and anonymized interview transcripts, our method identifies users with high precision – and scales to tens of thousands of candidates. While it has been known that individuals can...
In the Sable story (IABIED), AI obtains dangerous capabilities such as self-exfiltration, virus design, persuasion, and AI research. It uses a combination of those capabilities to eventually conduct a successful takeover against humanity. Some have criticised that this apparently implies the AI achieving these capabilities suddenly with little warning. The...