evhub

Evan Hubinger (he/him/his) (evanjhub@gmail.com)

Head of Alignment Stress-Testing at Anthropic. My posts and comments are my own and do not represent Anthropic's positions, policies, strategies, or opinions.

Previously: MIRI, OpenAI

See: “Why I'm joining Anthropic”

Selected work:

Sequences

Conditioning Predictive Models

ML Alignment Theory Scholars Program Winter 2021

Risks from Learned Optimization

Posts

Sorted by New

9evhub's Shortform

Ω

3y

Ω

159

137Auditing language models for hidden objectives

Ω

16d

Ω

7

131Training on Documents About Reward Hacking Induces Reward Hacking

Ω

2mo

Ω

14

482Alignment Faking in Large Language Models

Ω

2mo

Ω

74

92Catastrophic sabotage as a major threat model for human-level AI systems

Ω

4mo

Ω

13

94Sabotage Evaluations for Frontier Models

Ω

5mo

Ω

56

19Automating LLM Auditing with Developmental Interpretability

7mo

0

161Sycophancy to subterfuge: Investigating reward tampering in large language models

Ω

9mo

Ω

22

79Reward hacking behavior can generalize across tasks

Ω

10mo

Ω

5

95Uncovering Deceptive Tendencies in Language Models: A Simulated Company AI Assistant

Ω

11mo

Ω

13

133Simple probes can catch sleeper agents

Ω

1y

Ω

21

Wikitag Contributions

Comments

Sorted by

Newest

Reducing LLM deception at scale with self-other overlap fine-tuning

evhub12d*Ω453

It just seems too token-based to me. E.g.: why would the activations on the token for "you" actually correspond to the model's self representation? It's not clear why the model's self representation would be particularly useful for predicting the next token after "you". My guess is that your current results are due to relatively straightforward token-level effects rather than any fundamental change in the model's self representation.

Reply

Reducing LLM deception at scale with self-other overlap fine-tuning

evhub12dΩ220

I wouldn't do any fine-tuning like you're currently doing. Seems too syntactic. The first thing I would try is just writing a principle in natural language about self-other overlap and doing CAI.

Reply

Reducing LLM deception at scale with self-other overlap fine-tuning

evhub13dΩ8163

Imo the fine-tuning approach here seems too syntactic. My suggestion: just try standard CAI with some self-other-overlap-inspired principle. I'd more impressed if that worked.

Reply

1

Six Thoughts on AI Safety

evhub24d*53

Actually, I'd be inclined to agree with Janus that current AIs probably do already have moral worth—in fact I'd guess more so than most non-human animals—and furthermore I think building AIs with moral worth is good and something we should be aiming for. I also agree that it would be better for AIs to care about all sentient beings—biological/digital/etc.—and that it would probably be bad if we ended up locked into a long-term equilibrium with some sentient beings as a permanent underclass to others. Perhaps the main place where I disagree is that I don't think this is a particularly high-stakes issue right now: if humanity can stay in control in the short-term, and avoid locking anything in, then we can deal with these sorts of long-term questions about how to best organize society post-singularity once the current acute risk period has passed.

Reply

nikola's Shortform

evhub2mo110

redesigned

What did it used to look like?

Reply

evhub's Shortform

evhub2moΩ1322-3

Some random thoughts on CEV:

To get the obvious disclaimer out of the way: I don't actually think any of this matters much for present-day alignment questions. I think we should as much as possible try to defer questions like this to future humans and AIs. And in fact, ideally, we should mostly be deferring to future AIs, not future people—if we get to the point where we're considering questions like this, that means we've reached superintelligence, and we'll either trust the AIs to be better than us at thinking about these sorts of questions, or we'll be screwed regardless of what we do.^[1]
Regardless, imo the biggest question that standard CEV leaves unanswered is what your starting population looks like that you extrapolate from. The obvious answer is "all the currently living humans," but I find that to be a very unsatisfying answer. One of the principles that Eliezer talks about in discussing CEV is that you want a procedure such that it doesn't matter who implements it—see Eliezer's discussion under "Avoid creating a motive for modern-day humans to fight over the initial dynamic." I think this is a great principle, but imo it doesn't go far enough. In particular:
1. The set of all currently alive humans is hackable in various ways—e.g. trying to extend the lives of people whose values you like and not people whose values you dislike—and you don't want to incentivize any of that sort of hacking either.
2. What about humans who recently died? Or were about to be born? What about humans in nearby Everett branches? There's a bunch of random chance here that imo shouldn't be morally relevant.
3. More generally, I worry a lot about tyrannies of the present where we enact policies that are radically unjust to future people or even counterfactual possible future people.
So what do you do instead? I think my current favorite solution is to do a bit of bootstrapping: first do some CEV on whatever present people you have to work with just to determine a reference class of what mathematical objects should or should not count as humans, then run CEV on top of that whole reference class to figure out what actual values to optimize for.
1. It is worth pointing out that this could just be what normal CEV does anyway if all the humans decide to think along these lines, but I think there is real benefit to locking in a procedure that starts with a reference class determination first, since it helps remove a lot of otherwise perverse incentives.

I'm generally skeptical of scenarios where you have a full superintelligence that is benign enough to use for some tasks but not benign enough to fully defer to (I do think this could happen for more human-level systems, though). ↩︎

Reply

1

Anti-Slop Interventions?

evhub2moΩ7104

A lot of this stuff is very similar to the automated alignment research agenda that Jan Leike and collaborators are working on at Anthropic. I'd encourage anyone interested in differentially accelerating alignment-relevant capabilities to consider reaching out to Jan!

Reply

1

Alignment Faking in Large Language Models

evhub2moΩ6110

We use "alignment" as a relative term to refer to alignment with a particular operator/objective. The canonical source here is Paul's 'Clarifying “AI alignment”' from 2018.

Reply

evhub's Shortform

evhub2moΩ5100

I can say now one reason why we allow this: we think Constitutional Classifiers are robust to prefill.

Reply

RSPs are pauses done right

evhub2moΩ44-4

I wish the post more strongly emphasized that regulation was a key part of the picture

I feel like it does emphasize that, about as strongly as is possible? The second step in my story of how RSPs make things go well is that the government has to step in and use them as a basis for regulation.

Reply