This is why we introduced X-Risk Sheets, a questionnaire that researchers should include in their paper if they claim it reduces AI x-risk. That way, researchers need to explain their thinking and collect evidence that they're not just advancing capabilities.
We now include these x-risk sheets in our papers. For example, here is the x-risk sheet included in an arXiv paper we put up yesterday.
At first glance, this reminds me of the safety questionnaires I had to fill out to run a study in my experimental psychology classes in undergrad. They were mostly an annoying box-ticking exercise. Everyone mostly did what they wanted to do anyway, and then hurriedly gerrymandered the questionnaire right before the deadline so the faculty would allow them to proceed. Except the very conscientious students, who saw this as an excellent opportunity to prove their box-ticking diligence.
As a case in point, consider the arXiv paper you cite, which has an x-risk sheet at the end.
In short, this could help resolve matters of fact that influence policies and decisions made by political leaders in an increasingly complex modern world, putting humanity in a better place to deal with the global turbulence and uncertainty created by AI systems when they rapidly reshape society. A fuller motivation for “ML for Improving Epistemics” is described in Hendrycks and Mazeika (2022).
I think that for most AI papers that come out, you can find a corner case at this level of abstraction where the paper helps with x-risk, basically regardless of whether it actually does. For what it's worth, I spent about 2 years of my life working full-time on building a platform where top forecasters attacked AI questions, and wrote what I believed at the time was the best-written database of forecasting questions on AI. I eventually stopped working on that, because I came to believe the work didn't really matter for alignment.
(So, overall, I'm pretty worried that, if used frequently, these kinds of sheets will mostly increase safetywashing: they'll get lots of researchers talking about safety even when they have nothing to say, and there will be pressure toward claiming that their work helps with safety.)
[I work for Dan Hendrycks but he hasn't reviewed this.]
It seems to me like your comment roughly boils down to "people will exploit safety questionnaires." I agree with that. However, I think they are much more likely to exploit social influence, blog posts, and vagueness than specific questionnaires. The biggest strengths of the x-risk sheet, in my view, are:
(1) It requires a specific explanation of how the paper is relevant to x-risk, which cannot be tuned depending on the audience one is talking to. You give the example from the forecasting paper and suggest it's unconvincing. The counterfactual is that the forecasting paper is released, the authors tell people and funders that it's relevant to safety, and there isn't even anything explicitly written for you to find unconvincing and argue against. The sheets can help resolve this problem (though in this case, you haven't really said why you find it unconvincing). Part of the reason I was motivated to write Pragmatic AI Safety (which covers many of these topics) was so that the ideas in it would be staked out clearly. That way people have something concrete to criticize, and it also forces their criticisms to be more specific.
(2) There is a clear trend of claiming that papers that are mostly about capabilities are about safety. The sheet forces authors to directly address this in their paper, and either admit that they are doing capabilities work or attempt to construct a contorted and falsifiable argument otherwise.
(3) The standardized form allows for people to challenge specific points made in the x-risk sheet, rather than cherrypicked things the authors feel like mentioning in conversation or blog posts.
Your picture of faculty simply looking at the boxes being checked and approving is, I hope, not actually how funders in the AI safety space are operating (if they are, then yes, no x-risk sheet can save them). I would hope that reviewers and evaluators of papers will directly address the evidence for each piece of the x-risk sheet and challenge incorrect assertions.
I'd be a bit worried if x-risk sheets were included in every conference, but if you instead just make them a requirement for "all papers that want AI safety money" or "all papers that claim to be about AI safety" I'm not that worried that the sheets themselves would make any researchers talk about safety if they were not already talking about it.
Another issue with greenwashing and safetywashing is that it gives people who earnestly care a false impression that they are meaningfully contributing.
Despite thousands of green initiatives, we're likely to blow way past the 1.5°C mark because the vast majority of those initiatives failed to address the core causes of climate change. Each plastic-straw ban and reusable diaper gives people an incorrect impression that they are doing something meaningful to improve the climate.
Similarly, I worry that many people will convince themselves that they are doing something meaningful to improve AI safety, but because they failed to address the core issues they end up contributing nothing. I am not saying this as a pure hypothetical; I think it is already happening to a large extent.
I quit a well-paying job to become a policy trainee working on AI in the European Parliament because I was optimizing for "do something which looks like contributing to AI safety", with a tenuous-at-best model of how my work would actually lead to a world that creates safe AI. What horrified me was that a majority of the people I spoke to in the field of AI policy seemed to be making errors similar to mine.
Many of us justify this work by pointing to second-order benefits such as "policy work is field-building", "this policy will help create better norms", or "I'm skilling up / getting myself to a place of influence". While these second-order effects are real and important, we should be very sceptical of interventions whose first-order effects aren't promising.
I apologize that this became a bit of a rant about AI policy, but I have been annoyed with myself for making such basic errors, and this post helped me put a name to what I was doing.
I’m worried about this too, especially since I think it’s surprisingly easy here (relative to most fields/goals) to accidentally make the situation even worse. For example, my sense is people often mistakenly conclude that working on capabilities will help with safety somehow, just because an org's leadership pays lip service to safety concerns—even if the org only spends a small fraction of its attention/resources on safety work, actively tries to advance SOTA, etc.
we should be very sceptical of interventions whose first-order effects aren't promising.
This seems reasonable, but I think this suspicion is currently applied too liberally. In general, it seems like second-order effects are often very large. For instance, some AI safety research is currently funded by a billionaire whose path to impact on AI safety was to start a cryptocurrency exchange. I've written about the general distaste for diffuse effects and how that might be damaging here; if you disagree I'd love to hear your response.
In general, I don't think it makes sense to compare policy to plastic straws, because I think you can easily crunch the numbers and find that plastic straws are a very small part of the problem. I don't think it's even remotely the case that "policy is a very small part of the problem" since in a more cooperative world I think we would be far more prudent with respect to AI development (that's not to say policy is tractable, just that it seems very much more difficult to dismiss than straws).
This is why we need to say "existential safety" or "x-safety" more often. "Long-term risk" is no longer an appropriate synonym for existential risk, though "large-scale risk" is still fitting.
I've used the term "safetywashing" at least once every week or two over the last year. I don't know whether I picked it up from this post, but it still seems good to have an explanation people can be pointed to for a term this useful and this common.
I just want to point out that safety-washing is a term I heard a lot when I was working on AI Ethics in 2018. It seemed like a pretty well-known term at the time, at least to the people I talked to in that community. Not sure how widespread it is in other disciplines.
I think most of our conversations about it were on Twitter and maybe Slack so maybe that makes a difference?
I like this. It adds a perception lens and an easy way to name it (and thus cognitively and socially notice it).
I imagine the following is obvious to you and to many readers, but for those for whom it's not, I think the link is important: Both greenwashing and safetywashing are downstream of Goodhart's Law.
(Occasionally I wonder how much of the Art of Rationality amounts to noticing & adjusting for Goodhart drift. It seems like a stunningly enormous proportion each time I check.)
A tongue-in-cheek suggestion for noticing this phenomenon: when you encounter professions of concern about alignment, ask yourself whether it seems like the person making those claims is hoping you’ll react like the marine mammals in this DuPont advertisement, dancing to Beethoven’s “Ode to Joy” over the release of double-hulled oil tankers.
I think of greenwashing as something that works on people who are not paying much attention, not very smart, or incentivized to accept the falsehoods, or some combination of these. Similarly, safetywashing looks to me like something that will present an obstacle to any attempt to use politicians or the general public to exert pressure, and that will help some AI capabilities researchers manage their cognitive dissonance. Looking at, e.g., the transformers-to-APIs example, I have a hard time imagining a smart person being fooled on the object level.
But it looks different at simulacrum level 3. On that level, safetywashing is "affiliating with AI safety", and the absurdity of the claim doesn't matter unless there's actual backlash; and there aren't many people with the time to critique the strategies of second- and third-tier AI companies.
"Safewashing" would be more directly parallel to "greenwashing" and sounds less awkward to my ears than "safetywashing", but on the other hand the relevant ideas are more often called "AI safety" than "safe AI", so I'm not sure if it's a better or worse term.
I prefer "safewashing", but vote that we make a huge divisive issue over it, ultimately splitting the community in two.
Safetywashing describes a phenomenon that is real, inevitable, and profoundly unsurprising (I am still surprised whenever I see it, but that's my fault for knowing something is probable and being surprised anyway). Things like this are fundamental to human systems; people who read the Sequences know this.
This post doesn't prepare people, at all, for the complexity of how this would play out in reality. It's possible that most posts would fail to prepare people, because posts like this shift the goalposts; in the mundane process of following their incentives, both adversaries and wishful thinkers (and everyone in between) automatically adapt around whatever cultural expectations are set. However, it is a critical first step and vastly superior to nothing at all.
Anticipating ways for Goodhart's law to play out in reality isn't a nerdy hobby, it isn't even a way of life, it's being an adult/agent in the real world.
In Southern California there’s a two-acre butterfly preserve owned by the oil company Chevron. They spend little to maintain it, but many millions on television advertisements featuring it as evidence of their environmental stewardship.[1]
Environmentalists have a word for behavior like this: greenwashing. Greenwashing is when companies misleadingly portray themselves, or their products, as more environmentally-friendly than they are.
Greenwashing often does cause real environmental benefit. Take the signs in hotels discouraging you from washing your towels.
My guess is that the net environmental effect of these signs is in fact mildly positive. And while the most central examples of greenwashing involve deception, I’m sure some of these signs are put up by people who earnestly care. But I suspect hotels might tend to care less about water waste if utilities were less expensive, and that Chevron might care less about El Segundo Blue butterflies if environmental regulations were less expensive.
The field of AI alignment is growing rapidly. Each year it attracts more resources, more mindshare, more people trying to help. The more it grows, the more people will be incentivized to misleadingly portray themselves or their projects as more alignment-friendly than they are.
I think some of this is happening already. For example, a capabilities company launched recently with the aim of training transformers to use every API in the world, which they described as the “safest path to general intelligence.” As I understand it, their argument is that this helps with alignment because it involves collecting feedback about people’s preferences, and because humans often wish AI systems could more easily take actions in the physical world, which is easier once you know how to use all the APIs.[2]
It’s easier to avoid things that are easier to notice, and easier to notice things with good handles. So I propose adopting the handle “safetywashing.”
[1] From what I can tell, the original source for this claim is the book “The Corporate Planet: Ecology and Politics in the Age of Globalization,” which from my samples seems about as pro-Chevron as you’d expect from the title. So I wouldn’t be stunned if the claim were misleading, though the numbers passed my sanity check, and I did confirm the preserve and advertisements exist.
[2] I haven’t talked with anyone who works at this company, and all I know about their plans is from the copy on their website. My guess is that their project harms, rather than helps, our ability to ensure AGI remains safe, but I might be missing something.