Evan Hubinger (he/him/his) (evanjhub@gmail.com)
Head of Alignment Stress-Testing at Anthropic. My posts and comments are my own and do not represent Anthropic's positions, policies, strategies, or opinions.
Previously: MIRI, OpenAI
See: “Why I'm joining Anthropic”
Selected work:
I wouldn't do any fine-tuning like you're currently doing. Seems too syntactic. The first thing I would try is just writing a principle in natural language about self-other overlap and doing CAI.
Imo the fine-tuning approach here seems too syntactic. My suggestion: just try standard CAI with some self-other-overlap-inspired principle. I'd more impressed if that worked.
Actually, I'd be inclined to agree with Janus that current AIs probably do already have moral worth—in fact I'd guess more so than most non-human animals—and furthermore I think building AIs with moral worth is good and something we should be aiming for. I also agree that it would be better for AIs to care about all sentient beings—biological/digital/etc.—and that it would probably be bad if we ended up locked into a long-term equilibrium with some sentient beings as a permanent underclass to others. Perhaps the main place where I disagree is that I don't think this is a particularly high-stakes issue right now: if humanity can stay in control in the short-term, and avoid locking anything in, then we can deal with these sorts of long-term questions about how to best organize society post-singularity once the current acute risk period has passed.
redesigned
What did it used to look like?
Some random thoughts on CEV:
I'm generally skeptical of scenarios where you have a full superintelligence that is benign enough to use for some tasks but not benign enough to fully defer to (I do think this could happen for more human-level systems, though). ↩︎
A lot of this stuff is very similar to the automated alignment research agenda that Jan Leike and collaborators are working on at Anthropic. I'd encourage anyone interested in differentially accelerating alignment-relevant capabilities to consider reaching out to Jan!
We use "alignment" as a relative term to refer to alignment with a particular operator/objective. The canonical source here is Paul's 'Clarifying “AI alignment”' from 2018.
I can say now one reason why we allow this: we think Constitutional Classifiers are robust to prefill.
I wish the post more strongly emphasized that regulation was a key part of the picture
I feel like it does emphasize that, about as strongly as is possible? The second step in my story of how RSPs make things go well is that the government has to step in and use them as a basis for regulation.
It just seems too token-based to me. E.g.: why would the activations on the token for "you" actually correspond to the model's self representation? It's not clear why the model's self representation would be particularly useful for predicting the next token after "you". My guess is that your current results are due to relatively straightforward token-level effects rather than any fundamental change in the model's self representation.