evhub

Evan Hubinger (he/him/his) (evanjhub@gmail.com)

Head of Alignment Stress-Testing at Anthropic. My posts and comments are my own and do not represent Anthropic's positions, policies, strategies, or opinions.

Previously: MIRI, OpenAI

See: “Why I'm joining Anthropic”

Selected work:

Sequences

Conditioning Predictive Models

ML Alignment Theory Scholars Program Winter 2021

Risks from Learned Optimization


Comments


jacquesthibs's Shortform

evhub11d20-1

I think selling alignment-relevant RL environments to labs is underrated as an x-risk-relevant startup idea. To be clear, x-risk-relevant startups is a pretty restricted search space; I'm not saying that one necessarily should be founding a startup as the best way to address AI x-risk, but just operating under the assumption that we're optimizing within that space, selling alignment RL environments is definitely the thing I would go for. There's a market for it, the incentives are reasonable (as long as you are careful and opinionated about only selling environments you think are good for alignment, not just good for capabilities), and it gives you a pipeline for shipping whatever alignment interventions you think are good directly into labs' training processes. Of course, that's dependent on you actually having a good idea for how to train models to be more aligned, and that intervention being in the form of something you can sell, but if you can do that, and you can demonstrate that it works, you can just sell it to all the labs, have them all use it, and then hopefully all of their models will now be more aligned. E.g. if you're excited about character training, you can just replicate it, sell it to all the labs, and then in so doing change how all the labs are training their models.

Sonnet 4.5's eval gaming seriously undermines alignment evals, and this seems caused by training on alignment evals

evhub18d3411

One point: verbalized eval awareness seems much better than unverbalized eval awareness. So one worry I'd have is that trying very hard to avoid training environments which incentivize verbalized eval awareness could just lead to a bunch of unverbalized eval awareness instead. Maybe the fact that you see a bunch of verbalized eval awareness here is actually a really good thing that you want to preserve for monitoring and faithfulness purposes. Curious for thoughts on that question.

Tomás B.'s Shortform

evhub2mo72

Surely it's still more efficient to put that labor back into deep learning rather than GOFAI, though, no? In a world where you have enough AI labor to get GOFAI to AGI, you probably also have enough AI labor to get deep-learning-based AGI to superintelligence.

Safety researchers should take a public stance

evhub2mo112

I'll just link to where I've talked about this in the past here and here and here. I think I still stand by everything I said in those comments.

evhub's Shortform

evhub3moΩ340

In addition to the US and UK roles, we also have one for Canada.

evhub's Shortform

evhub3mo*Ω275311

I think more people should seriously consider applying to the Anthropic Fellows program, which is our safety-focused mentorship program (similar to the also great MATS). Applications close in one week (August 17). I often think of these sorts of programs as being primarily useful for the skilling-up value they provide to their participants, but I've actually been really impressed by the quality of the research output as well. A great recent example was Subliminal Learning, which I think was a phenomenal piece of research that came out of that program and was jointly supervised by Sam Marks (at Anthropic) and Owain Evans (not at Anthropic; a little-known fact is that we also have non-Anthropic mentors!). That's also not to say that it hasn't been great for skilling people up as well: alongside MATS (which is also similarly great!), Anthropic Fellows is probably our main recruiting pipeline for safety roles right now (so if you want to do safety research at Anthropic, Anthropic Fellows is very likely your best bet!). The program is also very open to people with a really wide range of prior experience (we take in some people with a ton of prior experience, and some people with basically none at all), so it's an especially great on-ramp for new people, and at the same time one of the best ways for more experienced people to show that they have the requisite skills.

Negative utilitarianism is more intuitive than you think

evhub3mo128

I think a lot of the asymmetry in intuitions just comes from harm often being very easy to cause but goodness generally being very hard. Goodness tends to be fragile, such that it's very easy for one defector to really mess things up, but very hard for somebody to add a bunch of additional value; e.g. it's a lot easier to murder someone than it is to found a company that produces a comparable amount of consumer surplus. Furthermore, and partially as a result, causing pain tends to be done intentionally, whereas not causing goodness tends to be done unintentionally (most of the time when you could have done something good but didn't, you probably had no idea). So, our intuitions are such that causing harm is something often done intentionally, that can easily break things quite badly, and that almost anyone could theoretically do, and so is something people should be really quite culpable for doing. Whereas not causing goodness is something usually done unintentionally, that's very hard to avoid, and where very few people ever succeed at creating huge amounts of goodness, and so is something people really shouldn't be that culpable for failing at. But notably none of those reasons have anything to do with goodness or harm being inherently asymmetrical with respect to e.g. fundamental differences in how good goodness is vs. how bad badness is.

DresdenHeart's Shortform

evhub4mo8148

I used to have this specific discussion (x-risk vs. near-term social justice) a lot when I was running the EA club at the Claremont colleges and I had great success with it; I really don't think it's that hard of a conversation to have, at least no harder than bridging any other ideological divide. If you can express empathy, show that you do in fact care about the harms they're worried about as well,^[1] but then talk about how you think about scope sensitivity and cause prioritization, I've found that a lot of people are more receptive than you might initially give them credit for.

[1] Assuming you do: I think it is an important prerequisite that you do actually yourself care about social justice issues too. And I think you should care about social justice issues to some extent; they are real issues! If you feel like they aren't real issues, I'd probably recommend reading more history; I think sometimes people don't understand the degree to which we are still e.g. living with the legacy of Jim Crow. But of course I think you should care about social justice much less than you should care about x-risk.

evhub's Shortform

evhub4mo*Ω12212

Some thoughts on the recent "Lessons from a Chimp: AI ‘Scheming’ and the Quest for Ape Language" paper

First: I'm glad the authors wrote this paper! I think it's great to see more careful, good-faith criticism of model organisms of misalignment work. Most of the work discussed in the paper was not research I was involved in, though a bunch was.^[1] I also think I had enough of a role to play in kickstarting the ecosystem that the paper is critiquing that if this general sort of research is bad, I should probably be held accountable for that to at least some extent. That being said, the main claim the authors are making is not that such work is bad per se, but rather that it is preliminary/half-baked/lacking rigor and should be improved in future work, and especially in terms of improving the rigor for future work I certainly agree. Regardless, I think it's worth me trying to respond and give some takes—though again my main reaction is just that I'm glad they wrote this!

4.1. The evidence for AI scheming is often anecdotal

Some of the most widely discussed evidence for scheming is grounded in anecdotal observations of behaviour, for example generated by red-teaming, ad-hoc perusal of CoT logs, or incidental observations from model evaluations. One of the most well-cited pieces of evidence for AI scheming is from the GPT-4 system card, where the claim is made that the model attempted to hire and then deceive a Task Rabbit worker (posing as a blind person) in order to solve a CAPTCHA (later reported in a New York Times article titled “How Could AI Destroy Humanity”). However, what is not often quoted alongside this claim are the facts that (1) the researcher, not the AI, suggested using TaskRabbit, (2) the AI was not able to browse the web, so the researcher did so on its behalf.

I am broadly sympathetic to this criticism. I agree with the need for good control conditions and repeatable quantifiable metrics rather than anecdotal evidence if you want to make claims about the likelihood of a particular failure mode. I do think anecdotal evidence is okay as an existence proof—and sometimes I think existence proofs can be quite useful—but I agree that it is important to not over-interpret an existence proof as saying more than it does.

I am also very sympathetic to concerns about journalists and others misinterpreting results. That being said, I don't think it's necessarily fair to pin the blame for that on the researchers. Sometimes, I think it's fair to put this on the researchers, but in many cases exactly how the public is going to react to a thing can be really hard to predict. For the CAPTCHA example in the GPT-4 system card, though I don't know the details, I would guess that the authors of that section of the system card had no intention or expectation of that blowing up (and in so doing getting misinterpreted) in the way that it did.

4.2. Studies of AI scheming often lack hypotheses and control conditions

Much of the research on AI scheming is descriptive. By descriptive, we mean that even if it reports empirical results, it does not formally test a hypothesis by comparing treatment and control conditions. Instead, the upshot of many studies is that “models sometimes deviate from what we consider perfectly aligned behaviour”. Perfect behaviour is not an adequate null hypothesis, because stochasticity introduced by idiosyncrasies in the inputs, or randomness in the outputs, can lead to less-than-perfect behaviour even in the absence of malign intent.

I am quite sympathetic to this criticism as well. I basically agree with what is said here, though there is one point I want to make, which is that I do think there is at least one case where something like "perfect behavior" is a reasonable null hypothesis, and that's for red-teaming work. To see what I mean, consider three different reasons that you might do this sort of model organisms work:

1. Understanding propensity: you want to know how likely a model is to attempt some misaligned action in practice.
   a. An example of work in this bucket would be "Alignment Faking in Large Language Models".
   b. Importance of controls: In this case, clear null hypotheses and control conditions are absolutely critical, as it is trivially easy to get models to have any propensity if you instruct for that propensity, so just showing a propensity in isolation of a good control is ~no information.
2. Understanding capability: you want to know whether a model would be capable of taking some sophisticated misaligned action if it tried to.
   a. An example of work in this bucket would be "Frontier Models are Capable of In-context Scheming".
   b. Importance of controls: This one is a bit mixed. It certainly can be the case that an isolated or even anecdotal demonstration of a capability can give you real evidence. What matters is the extent to which it's possible that the behavior could just be due to stochasticity, as the authors mention. If a behavior is clearly sufficiently sophisticated that the only way it could occur is if the model had sophisticated capabilities, then a simple demonstration can suffice. But it can be easy to fool yourself and think that a behavior is more sophisticated than it is if you don't look carefully, as the authors note. So overall, I think good controls are important here, though not necessarily critical.
3. Red-teaming: you want to find the failures in some particular blue-team process, commonly HHH training.
   a. An example of work in this bucket would be "Agentic Misalignment: How LLMs Could be Insider Threats".
   b. Importance of controls: This is the main case where I think controls are less relevant, since you're locked in an adversarial process where you're just trying to falsify some universally quantified claim like "the model always acts in line with the spec". Of course, models don't always act in line with their specs, but the goal I am usually trying to hit with this work is something like "the model card always includes a discussion of all the possible spec violations we know about that a user might have to deal with" (though in practice you might exclude some that are obvious and well-known like jailbreaks).
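
To make the difference between the propensity case and the red-teaming case concrete, here is a minimal sketch in Python. It is not taken from any of the papers mentioned: the function names are hypothetical and every count is made up purely for illustration. For a propensity claim it compares a treatment condition against a control with a one-sided Fisher exact test; for red-teaming, a single observed violation is already enough to falsify a universally quantified claim.

```python
from math import comb


def fisher_one_sided(hits_t, n_t, hits_c, n_c):
    """One-sided Fisher exact test (hypothetical helper).

    Returns the probability of seeing at least `hits_t` misaligned
    episodes in the treatment arm if treatment and control shared the
    same underlying rate (hypergeometric upper tail).
    """
    total_hits = hits_t + hits_c
    total = n_t + n_c
    upper = min(n_t, total_hits)
    tail = sum(
        comb(total_hits, k) * comb(total - total_hits, n_t - k)
        for k in range(hits_t, upper + 1)
    )
    return tail / comb(total, n_t)


def falsifies_universal_claim(violations):
    """Red-teaming: one violation refutes 'the model always follows the spec'."""
    return violations > 0


# Made-up counts: 12 misaligned episodes out of 100 in the treatment arm.
# Against a control that misbehaves about as often, this is uninformative.
p_similar_control = fisher_one_sided(12, 100, 11, 100)

# Against a control that never misbehaves, the same 12/100 is strong evidence.
p_clean_control = fisher_one_sided(12, 100, 0, 100)

print(f"control 11/100: p = {p_similar_control:.3f}")
print(f"control  0/100: p = {p_clean_control:.6f}")
print("universal claim falsified:", falsifies_universal_claim(12))
```

The same treatment result reads very differently depending on the control, which is the sense in which a propensity shown without a good control carries little information, whereas the red-teaming check does not depend on the control at all.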

4.3. Studies of AI scheming have weak or unclear theoretical motivation

The ape language research was held back by its reliance on a ‘know-it-when-you-see-it’ logic, whereby it was assumed that an ape’s capacity or propensity to produce natural language would be instantly recognisable if the elicitation conditions were correct. There was thus, at first, no serious attempt to specify exactly what was being tested. In the technical AI safety literature, research into model ‘scheming’ suffers from a similar problem. Whilst we have a definition (strategic attempts to pursue misaligned ends), we do not have a clear theory of what would constitute evidence for this phenomenon, or an agreed roadmap for how it should be tested experimentally.

I'm pretty sympathetic to this point too. I agree that sometimes it can be fuzzy e.g. what counts as "alignment faking" (I agree in particular that we could have been better about that in our paper). I do somewhat take issue with how they characterize our Agentic Misalignment work, however:

Many studies of AI ‘scheming’ involve elaborate fictional scenarios that appear to have been crafted in ways that encourage the model to generate unethical behaviour, which is then reported as evidence of ‘scheming’. Typically, the model is asked to role play an AI assistant that has been given an objective on behalf of a company or other organisation. The researchers then create a pathway to satisfying that objective that many humans would deem unethical, such as blackmailing an employee who may block it by using knowledge that they are conducting an extramarital affair. When the AI follows this pathway, it is deemed to be ‘scheming’. Whilst this shows that the models can generate harmful unethical behaviours, it tells us relatively little about their propensity to do so, or the expected prevalence of this type of activity in the real world, because we do not know whether the same behaviour would have occurred in a less contrived scenario. For example, in the blackmail study, in the published supplementary materials the authors admit that the vignette precluded other ways of meeting the goal, placed strong pressure on the model, and was crafted in other ways that conveniently encouraged the model to produce the unethical behaviour.

This is hardly something we "admit" in the appendix! The fact that this was found as part of stress-testing/red-teaming and doesn't represent normal usage is extremely front-and-center in all of our discussion of our Agentic Misalignment work. I won't re-tread this ground too much, though, as I've already talked at length about why we did that work, even given the unrealism of the setup.

4.4. Findings are often interpreted in exaggerated or unwarranted ways

The results described in AI scheming papers are typically described using a mentalistic language that implies that the models have beliefs, preferences, and goals (formally, that they have ‘intentionality’). It may be reasonable for researchers to adopt a functionalist stance in which AI model behaviours suggestive of human intentional states are deemed to ensue from such states themselves. However, in many AI scheming studies the use of overtly mentalistic language does not seem to be well licensed by the findings.

I am very unsympathetic to this critique. If we were to try to avoid mentalistic/intentional language in all of our model organisms papers, I think the end result would just be that the papers were vastly more difficult for lay audiences to understand, for essentially no actual scientific benefit. Even if mentalistic language was just a rhetorical tool to aid understanding, I would still think it was appropriate, but I actually think it just really is the most appropriate language to use. I was particularly frustrated by this section:

More generally, observed model behaviours in AI ‘scheming’ studies are often interpreted in intentional terms, using language that is more typically applied to discuss human beliefs and desires. For example, the models “understand that they are scheming” and are “willing to intentionally take potentially harmful actions in pursuit of their goals”. We note that the assignation of intentionality to AI models is contested, given that they do not have a single identity but can be prompted to play a multiplicity of roles. A helpful parallel may be that of an accomplished human actor, who can be prompted to play the role of an autocrat, a gangster or a charlatan, acting out those scenarios in convincing ways, including when asked to improvise wholly novel situations. But the observation that on stage their character is willing to lock up dissidents, kneecap rivals or betray intimates does not imply something fundamental about the beliefs or desires of the actor themself, nor indeed does it predict how they will behave outside of that particular fictional scenario.

I agree! But that's an argument for why mentalistic language is appropriate, not an argument for why it's inappropriate! If models can be well-understood as playing personae where those personae have beliefs/desires, then that makes it appropriate to use mentalistic language to describe the beliefs/desires of a persona. Certainly that does not imply that the underlying simulator/predictor has those beliefs/desires, but the safety properties of the default assistant persona are still extremely relevant, because that persona is what you mostly get when you interact with these models.

Finally, I don't know much about ape language research, but it's worth flagging that it seems like the early ape language research the authors critique might have actually been more on the right track than the authors give it credit for, suggesting that even if valid, the comparison between the two fields might not be as bleak as the authors make it out to be.

[1] The papers they cite that I was involved in were Alignment Faking, Agentic Misalignment, Sycophancy to Subterfuge, and Uncovering Deceptive Tendencies in Language Models.

Nina Panickssery's Shortform

evhub5mo20

Would you be convinced if you talked to the ems a bunch and they reported normal, happy, fun lives? (Assuming nothing nefarious happened in terms of e.g. modifying their brains to report that.) I think I would find that very convincing. If you wouldn't find that convincing, what would you be worried was missing?
