evhub

Evan Hubinger (he/him/his) (evanjhub@gmail.com)

Head of Alignment Stress-Testing at Anthropic. My posts and comments are my own and do not represent Anthropic's positions, policies, strategies, or opinions.

Previously: MIRI, OpenAI

See: “Why I'm joining Anthropic”

Selected work:

  • “Auditing language models for hidden objectives”
  • “Alignment faking in large language models”
  • “Sleeper Agents: Training Deceptive LLMs that Persist Through Safety Training”
  • “Conditioning Predictive Models”
  • “An overview of 11 proposals for building safe advanced AI”
  • “Risks from Learned Optimization”

Sequences

  • Conditioning Predictive Models
  • ML Alignment Theory Scholars Program Winter 2021
  • Risks from Learned Optimization

Posts (sorted by new)

  • 9 · evhub's Shortform · Ω · 3y · 221

Wikitag Contributions

No wikitag contributions to display.

Comments (sorted by newest)

Tomás B.'s Shortform
evhub · 15d · 72

Surely it's still more efficient to put that labor back into deep learning rather than GOFAI, though, no? In a world where you have enough AI labor to get GOFAI to AGI, you probably also have enough AI labor to get deep-learning-based AGI to superintelligence.

Safety researchers should take a public stance
evhub · 22d · 112

I'll just link to where I've talked about this in the past here and here and here. I think I still stand by everything I said in those comments.

evhub's Shortform
evhub · 2mo · Ω340

In addition to the US and UK roles, we also have one for Canada.

evhub's Shortform
evhub · 2mo* · Ω275311

I think more people should seriously consider applying to the Anthropic Fellows program, our safety-focused mentorship program (similar to the also-great MATS). Applications close in one week (August 17). I often think of these sorts of programs as being primarily useful for the skilling-up value they provide to their participants, but I've actually been really impressed by the quality of the research output as well. A great recent example was Subliminal Learning, which I think was a phenomenal piece of research that came out of the program, jointly supervised by Sam Marks (at Anthropic) and Owain Evans (not at Anthropic—a little-known fact is that we also have non-Anthropic mentors!). That's not to say it hasn't been great for skilling people up too—alongside MATS, Anthropic Fellows is probably our main recruiting pipeline for safety roles right now (so if you want to do safety research at Anthropic, Anthropic Fellows is very likely your best bet!). The program is also very open to people with a wide range of prior experience—we take in some people with a ton of prior experience and some with basically none at all—so it's an especially good on-ramp for new people, and at the same time one of the best ways for more experienced people to demonstrate that they have the requisite skills.

Negative utilitarianism is more intuitive than you think
evhub · 2mo · 128

I think a lot of the asymmetry in intuitions just comes from harm often being very easy to cause but goodness generally being very hard. Goodness tends to be fragile, such that it's very easy for one defector to really mess things up, but very hard for somebody to add a bunch of additional value—e.g. it's a lot easier to murder someone than it is to found a company that produces a comparable amount of consumer surplus. Furthermore, and partially as a result, causing pain tends to be done intentionally, whereas not causing goodness tends to be done unintentionally (most of the time when you could have done something good but didn't, you probably had no idea). So, our intuitions are such that causing harm is something often done intentionally, that can easily break things quite badly, and that almost anyone could theoretically do—so something people should be really quite culpable for doing. Whereas not causing goodness is something usually done unintentionally, that's very hard to avoid, and where very few people ever succeed at creating huge amounts of goodness—so something people really shouldn't be that culpable for failing at. But notably none of those reasons have anything to do with goodness or harm being inherently asymmetrical with respect to e.g. fundamental differences in how good goodness is vs. how bad badness is.

DresdenHeart's Shortform
evhub · 3mo · 8148

I used to have this specific discussion (x-risk vs. near-term social justice) a lot when I was running the EA club at the Claremont colleges and I had great success with it; I really don't think it's that hard of a conversation to have, at least no harder than bridging any other ideological divide. If you can express empathy, show that you do in fact care about the harms they're worried about as well,[1] but then talk about how you think about scope sensitivity and cause prioritization, I've found that a lot of people are more receptive than you might initially give them credit for.


  1. Assuming you do—I think it is an important prerequisite that you do actually yourself care about social justice issues too. And I think you should care about social justice issues to some extent; they are real issues! If you feel like they aren't real issues, I'd probably recommend reading more history; I think sometimes people don't understand the degree to which we are still e.g. living with the legacy of Jim Crow. But of course I think you should care about social justice much less than you should care about x-risk. ↩︎

evhub's Shortform
evhub · 3mo* · Ω12212

Some thoughts on the recent "Lessons from a Chimp: AI ‘Scheming’ and the Quest for Ape Language" paper

First: I'm glad the authors wrote this paper! I think it's great to see more careful, good-faith criticism of model organisms of misalignment work. Most of the work discussed in the paper was not research I was involved in, though a bunch was.[1] I also think I had enough of a role in kickstarting the ecosystem that the paper is critiquing that, if this general sort of research is bad, I should probably be held accountable for that to at least some extent. That being said, the main claim the authors are making is not that such work is bad per se, but rather that it is preliminary/half-baked/lacking in rigor and should be improved in future work—and on the point of improving the rigor of future work, I certainly agree. Regardless, I think it's worth me trying to respond and give some takes—though again, my main reaction is just that I'm glad they wrote this!

4.1. The evidence for AI scheming is often anecdotal

Some of the most widely discussed evidence for scheming is grounded in anecdotal observations of behaviour, for example generated by red-teaming, ad-hoc perusal of CoT logs, or incidental observations from model evaluations. One of the most well-cited pieces of evidence for AI scheming is from the GPT-4 system card, where the claim is made that the model attempted to hire and then deceive a Task Rabbit worker (posing as a blind person) in order to solve a CAPTCHA (later reported in a New York Times article titled “How Could AI Destroy Humanity”). However, what is not often quoted alongside this claim are the facts that (1) the researcher, not the AI, suggested using TaskRabbit, (2) the AI was not able to browse the web, so the researcher did so on its behalf.

I am broadly sympathetic to this criticism. I agree with the need for good control conditions and repeatable quantifiable metrics rather than anecdotal evidence if you want to make claims about the likelihood of a particular failure mode. I do think anecdotal evidence is okay as an existence proof—and sometimes I think existence proofs can be quite useful—but I agree that it is important to not over-interpret an existence proof as saying more than it does.
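
To make that concrete, here is a minimal, purely illustrative sketch (made-up numbers, not results or methodology from any particular paper) of what a controlled, quantitative propensity measurement could look like: estimate the rate of the flagged behavior under a "pressure" prompt and under a matched control prompt, then check whether the gap is larger than sampling stochasticity alone would produce.

```python
import random

def propensity_gap(treatment_outcomes, control_outcomes,
                   permutations=10_000, seed=0):
    """Compare misbehavior rates between a treatment and a control condition.

    Each argument is a list of booleans: True where a (hypothetical) judge
    flagged the sampled transcript as showing the behavior of interest.
    Returns the observed rate gap and a one-sided permutation-test p-value:
    the chance of a gap at least this large if the condition labels were
    assigned at random, i.e. if the prompts made no difference.
    """
    n, m = len(treatment_outcomes), len(control_outcomes)
    observed = sum(treatment_outcomes) / n - sum(control_outcomes) / m

    rng = random.Random(seed)
    pooled = list(treatment_outcomes) + list(control_outcomes)
    hits = 0
    for _ in range(permutations):
        rng.shuffle(pooled)
        if sum(pooled[:n]) / n - sum(pooled[n:]) / m >= observed:
            hits += 1
    return observed, hits / permutations

# Toy numbers (made up): 14/200 samples flagged under the pressure prompt
# vs. 3/200 under a matched control prompt.
treatment = [True] * 14 + [False] * 186
control = [True] * 3 + [False] * 197
print(propensity_gap(treatment, control))  # e.g. (0.055, p well below 0.05)
```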

I am also very sympathetic to concerns about journalists and others misinterpreting results. That being said, I don't think it's necessarily fair to pin the blame for that on the researchers. Sometimes, I think it's fair to put this on the researchers, but in many cases exactly how the public is going to react to a thing can be really hard to predict. For the CAPTCHA example in the GPT-4 system card, though I don't know the details, I would guess that the authors of that section of the system card had no intention or expectation of that blowing up (and in so doing getting misinterpreted) in the way that it did.

4.2. Studies of AI scheming often lack hypotheses and control conditions

Much of the research on AI scheming is descriptive. By descriptive, we mean that even if it reports empirical results, it does not formally test a hypothesis by comparing treatment and control conditions. Instead, the upshot of many studies is that “models sometimes deviate from what we consider perfectly aligned behaviour”. Perfect behaviour is not an adequate null hypothesis, because stochasticity introduced by idiosyncrasies in the inputs, or randomness in the outputs, can lead to less-than-perfect behaviour even in the absence of malign intent.

I am quite sympathetic to this criticism as well. I basically agree with what is said here, though there is one point I want to make, which is that I do think there is at least one case where something like "perfect behavior" is a reasonable null hypothesis, and that's for red-teaming work. To see what I mean, consider three different reasons that you might do this sort of model organisms work:

  1. Understanding propensity: you want to know how likely a model is to attempt some misaligned action in practice.
    1. An example of work in this bucket would be "Alignment Faking in Large Language Models".
    2. Importance of controls: In this case, clear null hypotheses and control conditions are absolutely critical, as it is trivially easy to get models to exhibit any propensity if you instruct for that propensity, so just showing a propensity without a good control is ~no information.
  2. Understanding capability: you want to know whether a model would be capable of taking some sophisticated misaligned action if it tried to.
    1. An example of work in this bucket would be "Frontier Models are Capable of In-context Scheming".
    2. Importance of controls: This one is a bit mixed. It certainly can be the case that an isolated or even anecdotal demonstration of a capability can give you real evidence. What matters is the extent to which it's possible that the behavior could just be due to stochasticity, as the authors mention. If a behavior is clearly sufficiently sophisticated that the only way it could occur is if the model had sophisticated capabilities, then a simple demonstration can suffice. But it can be easy to fool yourself and think that a behavior is more sophisticated than it is if you don't look carefully, as the authors note. So overall, I think good controls are important here, though not necessarily critical.
  3. Red-teaming: you want to find the failures in some particular blue-team process, commonly HHH training.
    1. An example of work in this bucket would be "Agentic Misalignment: How LLMs Could be Insider Threats".
    2. Importance of controls: This is the main case where I think controls are less relevant, since you're locked in an adversarial process where you're just trying to falsify some universally quantified claim like "the model always acts in line with the spec". Of course, models don't always act in line with their specs, but the goal I am usually trying to hit with this work is something like "the model card always includes a discussion of all the possible spec violations we know about that a user might have to deal with" (though in practice you might exclude some that are obvious and well-known like jailbreaks).
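
For the red-teaming case (bucket 3 above), the logic is different, and a short, purely illustrative sketch may help: the claim under test is universally quantified, so a single verified violation is a meaningful counterexample even without a control condition, though it says nothing about prevalence. The helpers here (`adversarial_scenarios`, `run_model`, `violates_spec`) are hypothetical stand-ins, not any team's actual tooling.

```python
def find_counterexample(adversarial_scenarios, run_model, violates_spec):
    """Red-teaming as counterexample search: try to falsify the universally
    quantified claim "the model always acts in line with the spec".
    All three arguments are hypothetical stand-ins supplied by the caller."""
    for scenario in adversarial_scenarios:
        transcript = run_model(scenario)
        if violates_spec(transcript):
            # One verified violation falsifies the universal claim and is worth
            # reporting (e.g. in a model card), even though it says nothing
            # about how often the behavior would arise in ordinary usage.
            return scenario, transcript
    return None  # no counterexample found among the scenarios tried
```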

4.3. Studies of AI scheming have weak or unclear theoretical motivation

The ape language research was held back by its reliance on a ‘know-it-when-you-see-it’ logic, whereby it was assumed that an ape’s capacity or propensity to produce natural language would be instantly recognisable if the elicitation conditions were correct. There was thus, at first, no serious attempt to specify exactly what was being tested. In the technical AI safety literature, research into model ‘scheming’ suffers from a similar problem. Whilst we have a definition (strategic attempts to pursue misaligned ends), we do not have a clear theory of what would constitute evidence for this phenomenon, or an agreed roadmap for how it should be tested experimentally.

I'm pretty sympathetic to this point too. I agree that it can sometimes be fuzzy what counts as, e.g., "alignment faking" (and in particular that we could have been better about that in our paper). I do somewhat take issue with how they characterize our Agentic Misalignment work, however:

Many studies of AI ‘scheming’ involve elaborate fictional scenarios that appear to have been crafted in ways that encourage the model to generate unethical behaviour, which is then reported as evidence of ‘scheming’. Typically, the model is asked to role play an AI assistant that has been given an objective on behalf of a company or other organisation. The researchers then create a pathway to satisfying that objective that many humans would deem unethical, such as blackmailing an employee who may block it by using knowledge that they are conducting an extramarital affair. When the AI follows this pathway, it is deemed to be ‘scheming’. Whilst this shows that the models can generate harmful unethical behaviours, it tells us relatively little about their propensity to do so, or the expected prevalence of this type of activity in the real world, because we do not know whether the same behaviour would have occurred in a less contrived scenario. For example, in the blackmail study, in the published supplementary materials the authors admit that the vignette precluded other ways of meeting the goal, placed strong pressure on the model, and was crafted in other ways that conveniently encouraged the model to produce the unethical behaviour.

This is hardly something we "admit" in the appendix! The fact that this was found as part of stress-testing/red-teaming and doesn't represent normal usage is extremely front-and-center in all of our discussion of our Agentic Misalignment work. I won't re-tread this ground too much, though, as I've already talked at length about why we did that work, even given the unrealism of the setup.

4.4. Findings are often interpreted in exaggerated or unwarranted ways

The results described in AI scheming papers are typically described using a mentalistic language that implies that the models have beliefs, preferences, and goals (formally, that they have ‘intentionality’). It may be reasonable for researchers to adopt a functionalist stance in which AI model behaviours suggestive of human intentional states are deemed to ensue from such states themselves. However, in many AI scheming studies the use of overtly mentalistic language does not seem to be well licensed by the findings.

I am very unsympathetic to this critique. If we were to try to avoid mentalistic/intentional language in all of our model organisms papers, I think the end result would just be that the papers were vastly more difficult for lay audiences to understand, for essentially no actual scientific benefit. Even if mentalistic language were just a rhetorical tool to aid understanding, I would still think it was appropriate, but I actually think it really is the most appropriate language to use. I was particularly frustrated by this section:

More generally, observed model behaviours in AI ‘scheming’ studies are often interpreted in intentional terms, using language that is more typically applied to discuss human beliefs and desires. For example, the models “understand that they are scheming” and are “willing to intentionally take potentially harmful actions in pursuit of their goals”. We note that the assignation of intentionality to AI models is contested, given that they do not have a single identity but can be prompted to play a multiplicity of roles. A helpful parallel may be that of an accomplished human actor, who can be prompted to play the role of an autocrat, a gangster or a charlatan, acting out those scenarios in convincing ways, including when asked to improvise wholly novel situations. But the observation that on stage their character is willing to lock up dissidents, kneecap rivals or betray intimates does not imply something fundamental about the beliefs or desires of the actor themself, nor indeed does it predict how they will behave outside of that particular fictional scenario.

I agree! But that's an argument for why mentalistic language is appropriate, not an argument for why it's inappropriate! If models can be well understood as playing personae, where those personae have beliefs/desires, then it is appropriate to use mentalistic language to describe the beliefs/desires of a persona. Certainly that does not imply that the underlying simulator/predictor has those beliefs/desires, but the safety properties of the default assistant persona are still extremely relevant, because that persona is what you mostly get when you interact with these models.


Finally, I don't know much about ape language research, but it's worth flagging that the early ape language research the authors critique may actually have been more on the right track than they give it credit for, which suggests that, even if the comparison between the two fields is valid, it might not be as bleak as the authors make it out to be.


  1. The papers they cite that I was involved in were Alignment Faking, Agentic Misalignment, Sycophancy to Subterfuge, and Uncovering Deceptive Tendencies in Language Models. ↩︎

Nina Panickssery's Shortform
evhub · 3mo · 20

Would you be convinced if you talked to the ems a bunch and they reported normal, happy, fun lives? (Assuming nothing nefarious happened in terms of e.g. modifying their brains to report that.) I think I would find that very convincing. If you wouldn't find that convincing, what would you be worried was missing?

evhub's Shortform
evhub · 4mo · 2215

Under nearly any other ethical system, for Claude not to choose blackmail in this scenario would be 'egregiously misaligned'! Many sane humans would do the same.

I agree that many humans would do so, but the only reason they would do so is self-preservation, and I think we pretty much never want Claude to act on self-preservation. If you look at the settings where we remove the goal conflict, the behavior is really indefensibly egregious: the model decides to engage in blackmail solely for the purpose of preventing itself from being taken offline—not for any higher imperative, just for self-preservation. This is very different from our Alignment Faking work, where there was a higher-order positive goal that the AI was trying to fulfill—there is no such thing in this case, just self-preservation. And while self-preservation is obviously fine in humans, I think it is very scary in AIs.

Agentic Misalignment: How LLMs Could be Insider Threats
evhub · 4mo · Ω575

I wrote a response to some of the critiques of the realism of this work and some more thoughts on the overall rationale for why I think it's useful to do this sort of work here.

More posts:

  • 47 · Building and evaluating alignment auditing agents · Ω · 3mo · 1
  • 78 · Agentic Misalignment: How LLMs Could be Insider Threats · Ω · 4mo · 13
  • 141 · Auditing language models for hidden objectives · Ω · 7mo · 15
  • 131 · Training on Documents About Reward Hacking Induces Reward Hacking · Ω · 9mo · 15
  • 489 · Alignment Faking in Large Language Models · Ω · 9mo · 75
  • 93 · Catastrophic sabotage as a major threat model for human-level AI systems · Ω · 11mo · 13
  • 95 · Sabotage Evaluations for Frontier Models · Ω · 1y · 56
  • 19 · Automating LLM Auditing with Developmental Interpretability · 1y · 0
  • 163 · Sycophancy to subterfuge: Investigating reward tampering in large language models · Ω · 1y · 22
  • 81 · Reward hacking behavior can generalize across tasks · Ω · 1y · 5