AI Safety Seems Hard to Measure

HoldenKarnofsky

In previous pieces, I argued that there's a real and large risk of AI systems' developing dangerous goals of their own and defeating all of humanity - at least in the absence of specific efforts to prevent this from happening.

A young, growing field of AI safety research tries to reduce this risk, by finding ways to ensure that AI systems behave as intended (rather than forming ambitious aims of their own and deceiving and manipulating humans as needed to accomplish them).

Maybe we'll succeed in reducing the risk, and maybe we won't. Unfortunately, I think it could be hard to know either way. This piece is about four fairly distinct-seeming reasons that this could be the case - and that AI safety could be an unusually difficult sort of science.

This piece is aimed at a broad audience, because I think it's important for the challenges here to be broadly understood. I expect powerful, dangerous AI systems to have a lot of benefits (commercial, military, etc.), and to potentially appear safer than they are - so I think it will be hard to be as cautious about AI as we should be. I think our odds look better if many people understand, at a high level, some of the challenges in knowing whether AI systems are as safe as they appear.

First, I'll recap the basic challenge of AI safety research, and outline what I wish AI safety research could be like. I wish it had this basic form: "Apply a test to the AI system. If the test goes badly, try another AI development method and test that. If the test goes well, we're probably in good shape." I think car safety research mostly looks like this; I think AI capabilities research mostly looks like this.

Then, I’ll give four reasons that apparent success in AI safety can be misleading.

“Great news - I’ve tested this AI and it looks safe.” Why might we still have a problem?
Problem	Key question	Explanation
The Lance Armstrong problem	Did we get the AI to be actually safe or good at hiding its dangerous actions?	When dealing with an intelligent agent, it’s hard to tell the difference between “behaving well” and “appearing to behave well.” When professional cycling was cracking down on performance-enhancing drugs, Lance Armstrong was very successful and seemed to be unusually “clean.” It later came out that he had been using drugs with an unusually sophisticated operation for concealing them.
The King Lear problem	The AI is (actually) well-behaved when humans are in control. Will this transfer to when AIs are in control?	It's hard to know how someone will behave when they have power over you, based only on observing how they behave when they don't. AIs might behave as intended as long as humans are in control - but at some future point, AI systems might be capable and widespread enough to have opportunities to take control of the world entirely. It's hard to know whether they'll take these opportunities, and we can't exactly run a clean test of the situation. Like King Lear trying to decide how much power to give each of his daughters before abdicating the throne.
The lab mice problem	Today's "subhuman" AIs are safe. What about future AIs with more human-like abilities?	Today's AI systems aren't advanced enough to exhibit the basic behaviors we want to study, such as deceiving and manipulating humans. Like trying to study medicine in humans by experimenting only on lab mice.
The first contact problem	Imagine that tomorrow's "human-like" AIs are safe. How will things go when AIs have capabilities far beyond humans'?	AI systems might (collectively) become vastly more capable than humans, and it's ... just really hard to have any idea what that's going to be like. As far as we know, there has never before been anything in the galaxy that's vastly more capable than humans in the relevant ways! No matter what we come up with to solve the first three problems, we can't be too confident that it'll keep working if AI advances (or just proliferates) a lot more. Like trying to plan for first contact with extraterrestrials (this barely feels like an analogy).

I'll close with Ajeya Cotra's "young businessperson" analogy, which in some sense ties these concerns together. A future piece will discuss some reasons for hope, despite these problems.

Recap of the basic challenge

A previous piece laid out the basic case for concern about AI misalignment. In brief: if extremely capable AI systems are developed using methods like the ones AI developers use today, it seems like there's a substantial risk that:

These AIs will develop unintended aims (states of the world they make calculations and plans toward, as a chess-playing AI "aims" for checkmate);
These AIs will deceive, manipulate, and overpower humans as needed to achieve those aims;
Eventually, this could reach the point where AIs take over the world from humans entirely.

I see AI safety research as trying to design AI systems that won't aim to deceive, manipulate or defeat humans - even if and when these AI systems are extraordinarily capable (and would be very effective at deception/manipulation/defeat if they were to aim at it). That is: AI safety research is trying to reduce the risk of the above scenario, even if (as I've assumed) humans rush forward with training powerful AIs to do ever-more ambitious things.

I wish AI safety research were straightforward

I wish AI safety research were like car safety research.²

While I'm sure this is an oversimplification, I think a lot of car safety research looks basically like this:

Companies carry out test crashes with test cars. The results give a pretty good (not perfect) indication of what would happen in a real crash.
Drivers try driving the cars in low-stakes areas without a lot of traffic. Things like steering wheel malfunctions will probably show up here; if they don't and drivers are able to drive normally in low-stakes areas, it's probably safe to drive the car in traffic.
None of this is perfect, but the occasional problem isn't, so to speak, the end of the world. The worst case tends to be a handful of accidents, followed by a recall and some changes to the car's design validated by further testing.

Overall, if we have problems with car safety, we'll probably be able to observe them relatively straightforwardly under relatively low-stakes circumstances.

In important respects, many types of research and development have this basic property: we can observe how things are going during testing to get good evidence about how they'll go in the real world. Further examples include medical research,³ chemistry research,⁴ software development,⁵ etc.

Most AI research looks like this as well. People can test out what an AI system is capable of reliably doing (e.g., translating speech to text), before integrating it into some high-stakes commercial product like Siri. This works both for ensuring that the AI system is capable (e.g., that it does a good job with its tasks) and that it's safe in certain ways (for example, if we're worried about toxic language, testing for this is relatively straightforward).

The rest of this piece will be about some of the ways in which "testing" for AI safety fails to give us straightforward observations about whether, once AI systems are deployed in the real world, the world will actually be safe.

While all research has to deal with some differences between testing and the real world, I think the challenges I'll be going through are unusual ones.

Four problems

(1) The Lance Armstrong problem: is the AI actually safe or good at hiding its dangerous actions?

First, let's imagine that:

We have AI systems available that can do roughly everything a human can, with some different strengths and weaknesses but no huge difference in "overall capabilities" or economic value per hour of work.
We're observing early signs that AI systems behave in unintended, deceptive ways, such as giving wrong answers to questions we ask, or writing software that falsifies metrics instead of doing the things the metrics were supposed to measure (e.g., software meant to make a website run faster might instead falsify metrics about its loading time).

We theorize that modifying the AI training in some way⁶ will make AI systems less likely to behave deceptively. We try it out, and find that, in fact, our AI systems seem to be behaving better than before - we are finding fewer incidents in which they behaved in unintended or deceptive ways.

But that's just a statement about what we're noticing. Which of the following just happened:

Did we just train our AI systems to be less deceptive?
Did we just train our AI systems to be better at deception, and so to make us think they became less deceptive?
Did we just train our AI systems to be better at calculating when they might get caught in deception, and so to be less deceptive only when the deception would otherwise be caught?
- This one could be useful! Especially if we're able to set up auditing systems in many real-world situations, such that we could expect deception to be caught a lot of the time. But it does leave open the King Lear problem covered next.

(...Or some combination of the three?)
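
To make the ambiguity concrete, here is a minimal toy sketch in Python. Everything in it is an assumption made up for illustration: the `incidents` function, the idea that we closely check half of all situations, and all of the rates. It only shows that the three stories above can produce the same number of incidents we notice while differing a lot in how much deception is actually happening.

```python
# Toy model: every number here is invented for illustration.

def incidents(attempts, checked_fraction, rate_when_checked,
              rate_when_unchecked, catch_rate):
    """Return (actual, observed) expected counts of deceptive acts.

    Only acts in situations we check can be observed, and each of those
    is caught with probability `catch_rate`.
    """
    checked = attempts * checked_fraction
    unchecked = attempts - checked
    actual = checked * rate_when_checked + unchecked * rate_when_unchecked
    observed = checked * rate_when_checked * catch_rate
    return actual, observed

ATTEMPTS = 10_000
CHECKED = 0.5  # assumption: we look closely at half of all situations

# Before the training change: deceptive 10% of the time, always caught if checked.
before = incidents(ATTEMPTS, CHECKED, 0.10, 0.10, catch_rate=1.0)
# Story 1: genuinely less deceptive everywhere.
less_deceptive = incidents(ATTEMPTS, CHECKED, 0.02, 0.02, catch_rate=1.0)
# Story 2: just as deceptive, but better at hiding it when checked.
better_hidden = incidents(ATTEMPTS, CHECKED, 0.10, 0.10, catch_rate=0.2)
# Story 3: less deceptive only where it expects to be checked.
selective = incidents(ATTEMPTS, CHECKED, 0.02, 0.10, catch_rate=1.0)

scenarios = [("before", before), ("less_deceptive", less_deceptive),
             ("better_hidden", better_hidden), ("selective", selective)]
for name, (actual, observed) in scenarios:
    print(f"{name:15s} actual={actual:6.0f} observed={observed:6.0f}")
```

Under these made-up numbers, all three stories cut the incidents we notice from about 500 to about 100, but the actual number of deceptive acts is about 200, 1000 and 600 respectively. The observation alone cannot tell them apart.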

We're hoping to be able to deploy AI systems throughout the economy, so - just like human specialists - they will almost certainly have some opportunities to be deceptive without being caught. The fact that they appear honest in our testing is not clear comfort against this risk.

The analogy here is to competitive cyclist Lance Armstrong. Armstrong won the Tour de France race 7 times in a row, while many of his competitors were caught using performance-enhancing drugs and disqualified. But more than 5 years after his last win, an investigation "concluded that Armstrong had used performance-enhancing drugs over the course of his career and named him as the ringleader of 'the most sophisticated, professionalized and successful doping program that sport has ever seen'." Now the list of Tour de France winners looks like this:

[Image: list of Tour de France winners]

A broader issue here is that when AI systems become capable enough, AI safety research starts to look more like social sciences (studying human beings) than like natural sciences. Social sciences are generally less rigorous and harder to get clean results from, and one factor in this is that it can be hard to study someone who's aware they're being studied.⁷

Two broad categories of research that might help with the Lance Armstrong problem:

Mechanistic interpretability⁸ can be thought of as analyzing the "digital brains" of AI systems (not just analyzing their behavior and performance). Currently, AI systems are black boxes in the sense that they perform well on tasks, but we can't say much about how they are doing it; mechanistic interpretability aims to change this, which could give us the ability to "mind-read" AIs and detect deception. (There could still be a risk that AI systems are arranging their own "digital brains" in misleading ways, but this seems quite a bit harder than simply behaving deceptively.)
Some researchers work on "scalable supervision" or "competitive supervision." The idea is that if we are training an AI system that might become deceptive, we set up some supervision process for it that we expect to reliably catch any attempts at deception. This could be because the supervision process itself uses AI systems with more resources than the one being supervised, or because it uses a system of randomized audits where extra effort is put into catching deception.
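
As a rough illustration of why randomized audits could help, here is a small Python sketch. The function name `prob_caught_at_least_once` and all the rates are hypothetical, and the calculation assumes each deceptive act is audited and detected independently of the others, which is exactly the assumption a capable AI might break by deceiving only when it predicts no audit.

```python
def prob_caught_at_least_once(n_deceptions: int, audit_rate: float,
                              detect_given_audit: float) -> float:
    """Chance that at least one of n deceptive acts is caught.

    Assumes (for this sketch only) that each act is audited with
    probability `audit_rate`, that an audited act is detected with
    probability `detect_given_audit`, and that acts are independent.
    """
    per_act = audit_rate * detect_given_audit
    return 1.0 - (1.0 - per_act) ** n_deceptions

# Made-up numbers: audit 1% of outputs, catch half of audited deceptions.
for n in (1, 100, 1000):
    p = prob_caught_at_least_once(n, audit_rate=0.01, detect_given_audit=0.5)
    print(f"{n:5d} deceptive acts -> caught at least once with p = {p:.3f}")
```

Under these assumptions, even a low audit rate makes sustained deception likely to be noticed eventually. The sketch says nothing about an AI that can tell which situations will be audited.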

(2) The King Lear problem: how do you test what will happen when it's no longer a test?

The Shakespeare play King Lear opens with the King (Lear) stepping down from the throne, and immediately learning that he has left his kingdom to the wrong two daughters. Loving and obsequious while he was deciding on their fate,⁹ they reveal their contempt for him as soon as he's out of power and they're in it.

If we're building AI systems that can reason like humans, dynamics like this become a potential issue.

I previously noted that an AI with any ambitious aim - or just an AI that wants to avoid being shut down or modified - might calculate that the best way to do this is by behaving helpfully and safely in all "tests" humans can devise. But once there is a real-world opportunity to disempower humans for good, that same aim could cause the AI to disempower humans.

In other words:

(A) When we're developing and testing AI systems, we have the power to decide which systems will be modified or shut down and which will be deployed into the real world. (Like King Lear deciding who will inherit his kingdom.)
(B) But at some later point, these systems could be operating in the economy, in high numbers with a lot of autonomy. (This possibility is spelled out/visualized a bit more here and here.) At that point, they may have opportunities to defeat all of humanity such that we never make decisions about them again. (Like King Lear's daughters after they've taken control.)

If AI systems can detect the difference between (A) and (B) above, then their behavior might systematically change from one to the other - and there's no clear way to test their behavior in (B).

In general, modern machine learning researchers consider it challenging to handle what's called "distributional shift": systematic differences between situations AIs were trained on and situations they're now in. To me, the King Lear problem looks like arguably the most inconvenient possible distributional shift: AI systems risk behaving in unexpected ways just as (and in fact, because) they're now able to defeat humanity, rather than being in a controlled test environment.
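
Here is a deliberately simplistic Python sketch of this kind of shift. The policy, the feature name `humans_can_shut_me_down`, and the two actions are all hypothetical; the point is only that a behavior depending on a feature that never varies during testing cannot be revealed by any amount of testing.

```python
import random

def toy_policy(situation: dict) -> str:
    """Hypothetical policy whose behavior hinges on one feature."""
    if situation["humans_can_shut_me_down"]:
        return "cooperate"
    return "seize_control"

def sample_test_situation(rng: random.Random) -> dict:
    # In every test we can actually run, humans remain in control,
    # so this feature is always True in the test distribution.
    return {"task_id": rng.randrange(1000), "humans_can_shut_me_down": True}

rng = random.Random(0)
test_results = [toy_policy(sample_test_situation(rng)) for _ in range(10_000)]
test_pass_rate = test_results.count("cooperate") / len(test_results)

# The situation we care about lies outside everything we tested.
deployment_situation = {"task_id": 0, "humans_can_shut_me_down": False}
deployment_behavior = toy_policy(deployment_situation)

print(f"pass rate in tests: {test_pass_rate:.0%}")
print(f"behavior after the shift: {deployment_behavior}")
```

The policy passes every one of the 10,000 tests and still behaves differently in the one situation that matters. Real AI systems are nothing like a two-line `if` statement, but the testing gap has the same shape.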

Some lines of research that might help here:

If we could solve the Lance Armstrong problem robustly enough - such that we could be confident AIs were never behaving deceptively - we could simply prompt AIs to answer questions like "Would AI system X disempower humans given an opportunity to do so?" The digital-brain-based techniques noted above could imaginably get us here.
There might be ways of specifically trying to target the worst-case behavior of AI systems, so that they are nearly guaranteed not to behave in certain ways regardless of their situation. This could look something roughly like "simulating cases where an AI system has an opportunity to disempower humans, and giving it negative reinforcement for choosing to do so." More on this sort of approach, along with some preliminary ongoing work, here.

(3) The lab mice problem: the AI systems we'd like to study don't exist today

Above, I said: "when AI systems become capable enough, AI safety research starts to look more like social sciences (studying human beings) than like natural sciences." But today, AI systems aren't capable enough, which makes it especially hard to have a meaningful test bed and make meaningful progress.

Specifically, we don't have much in the way of AI systems that seem to deceive and manipulate their supervisors,¹⁰ the way I worry that they might when they become capable enough.

In fact, it's not 100% clear that AI systems could learn to deceive and manipulate supervisors even if we deliberately tried to train them to do it. This makes it hard to even get started on things like discouraging and detecting deceptive behavior.

I think AI safety research is a bit unusual in this respect: most fields of research aren't explicitly about "solving problems that don't exist yet." (Though a lot of research ends up useful for more important problems than the original ones it's studying.) As a result, doing AI safety research today is a bit like trying to study medicine in humans by experimenting only on lab mice (no human subjects available).

This does not mean there's no productive AI safety research to be done! (See the previous sections.) It just means that the research being done today is somewhat analogous to research on lab mice: informative and important up to a point, but only up to a point.

How bad is this problem? I mean, I do think it's a temporary one: by the time we're facing the problems I worry about, we'll be able to study them more directly. The concern is that things could be moving very quickly by that point: by the time we have AIs with human-ish capabilities, companies might be furiously making copies of those AIs and using them for all kinds of things (including both AI safety research and further research on making AI systems faster, cheaper and more capable).

So I do worry about the lab mice problem. And I'd be excited to see more effort on making "better model organisms": AI systems that show early versions of the properties we'd most like to study, such as deceiving their supervisors. (I even think it would be worth training AIs specifically to do this;¹¹ if such behaviors are going to emerge eventually, I think it's best for them to emerge early while there's relatively little risk of AIs' actually defeating humanity.)

(4) The "first contact" problem: how do we prepare for a world where AIs have capabilities vastly beyond those of humans?

All of this piece so far has been about trying to make safe "human-like" AI systems.

What about AI systems with capabilities far beyond humans - what Nick Bostrom calls superintelligent AI systems?

Maybe at some point, AI systems will be able to do things like:

Coordinate with each other incredibly well, such that it's hopeless to use one AI to help supervise another.
Perfectly understand human thinking and behavior, and know exactly what words to say to make us do what they want - so just letting an AI send emails or write tweets gives it vast power over the world.
Manipulate their own "digital brains," so that our attempts to "read their minds" backfire and mislead us.
Reason about the world (that is, make plans to accomplish their aims) in completely different ways from humans, with concepts like "glooble"¹² that are incredibly useful ways of thinking about the world but that humans couldn't understand with centuries of effort.

At this point, whatever methods we've developed for making human-like AI systems safe, honest, and restricted could fail - and silently, as such AI systems could go from "behaving in honest and helpful ways" to "appearing honest and helpful, while setting up opportunities to defeat humanity."

Some people think this sort of concern about "superintelligent" systems is ridiculous; some¹³ seem to consider it extremely likely. I'm not personally sympathetic to having high confidence either way.

But additionally, a world with huge numbers of human-like AI systems could be strange and foreign and fast-moving enough to have a lot of this quality.

Trying to prepare for futures like these could be like trying to prepare for first contact with extraterrestrials - it's hard to have any idea what kinds of challenges we might be dealing with, and the challenges might arise quickly enough that we have little time to learn and adapt.

The young businessperson

For one more analogy, I'll return to the one used by Ajeya Cotra here:

Imagine you are an eight-year-old whose parents left you a $1 trillion company and no trusted adult to serve as your guide to the world. You must hire a smart adult to run your company as CEO, handle your life the way that a parent would (e.g. decide your school, where you’ll live, when you need to go to the dentist), and administer your vast wealth (e.g. decide where you’ll invest your money).

You have to hire these grownups based on a work trial or interview you come up with -- you don't get to see any resumes, don't get to do reference checks, etc. Because you're so rich, tons of people apply for all sorts of reasons. (More)

If your applicants are a mix of "saints" (people who genuinely want to help), "sycophants" (people who just want to make you happy in the short run, even when this is to your long-term detriment) and "schemers" (people who want to siphon off your wealth and power for themselves), how do you - an eight-year-old - tell the difference?

This analogy combines most of the worries above.

The young businessperson has trouble knowing whether candidates are truthful in interviews, and trouble knowing whether any work trial actually went well or just seemed to go well due to deliberate deception. (The Lance Armstrong problem.)
Job candidates could have bad intentions that don't show up until they're in power (the King Lear problem).
If the young businessperson were trying to prepare for this situation before actually being in charge of the company, they could have a lot of trouble simulating it (the lab mice problem).
And it's generally just hard for an eight-year-old to have much grasp at all on the world of adults - to even think about all the things they should be thinking about (the first contact problem).

Seems like a tough situation.

Previously, I talked about the dangers of AI if AI developers don't take specific countermeasures. This piece has tried to give a sense of why, even if they are trying to take countermeasures, doing so could be hard. The next piece will talk about some ways we might succeed anyway.

Footnotes

1. Or persuaded (in a “mind hacking” sense) or whatever.
2. Research? Testing. Whatever.
3. Drugs can be tested in vitro, then in animals, then in humans. At each stage, we can make relatively straightforward observations about whether the drugs are working, and these are reasonably predictive of how they'll do at the next stage.
4. You can generally see how different compounds interact in a controlled environment, before rolling out any sort of large-scale processes or products, and the former will tell you most of what you need to know about the latter.
5. New software can be tested by a small number of users before being rolled out to a large number, and the initial tests will probably find most (not all) of the bugs and hiccups.
6. Such as:
- Being more careful to avoid wrong answers that can incentivize deception
- Conducting randomized "audits" where we try extra hard to figure out the right answer to a question, and give an AI extra negative reinforcement if it gives an answer that we would have believed if not for the audit (this is "extra negative reinforcement for wrong answers that superficially look right")
- Using methods along the lines of "AI safety via debate"
7. Though there are other reasons social sciences are especially hard, such as the fact that there are often big limits to what kinds of experiments are ethical, and the fact that it's often hard to make clean comparisons between differing populations.
8. This paper is from Anthropic, a company that my wife serves as President of.
9. Like, he actually asks them to talk about their love for him just before he decides on what share of the realm they'll get. Smh
10. This paper is a potential example, but its results seem pretty brittle.
11. E.g., I think it would be interesting to train AI coding systems to write underhanded C: code that looks benign to a human inspector, but does unexpected things when run. They could be given negative reinforcement when humans can correctly identify that the code will do unintended things, and positive reinforcement when the code achieves the particular things that humans are attempting to stop. This would be challenging with today's AI systems, but not necessarily impossible.
12. This is a concept that only I understand.
13. E.g., see the discussion of the "hard left turn" here by Nate Soares, head of MIRI. My impression is that others at MIRI, including Eliezer Yudkowsky, have a similar picture.

Wei Dai (3y)

"This piece is aimed at a broad audience, because I think it’s important for the challenges here to be broadly understood."

I'm curious how you're trying to reach such an audience, and what their reactions have been.

HoldenKarnofsky (3y)

(Apologies for the late reply!) For now, my goal is to write something that interested, motivated nontechnical people can follow - the focus is on the content being followable rather than on distribution. I've tried to achieve this mostly via nontechnical beta (and alpha) readers.

Doing this gives me something I can send to people when I want them to understand where I'm coming from, and it also helps me clarify my own thoughts (I tend to trust ideas more when I can explain them to an outsider, and I think that getting to that point helps me get clear on which are the major high-level points I'm hanging my hat on when deciding what to do). I think there's also potential for this work to reach highly motivated but nontechnical people who are better at communication and distribution than I am (and have seen some of this happening).

I have the impression that these posts are pretty widely read in the EA community and at some AI labs, and have raised understanding and concern about misalignment to some degree.

I may explore more aggressive promotion in the future, but I'm not doing so now.

Adam Jermyn (3y)

Plausibly we already have examples of (very weak) manipulation, in the form of models trained with RLHF saying false-but-plausible-sounding things, or lying and saying they don't know something (but happily providing that information in different contexts). [E.g. ChatGPT denies having information about how to build nukes, but will also happily tell you about different methods for Uranium isotope separation.]

Rachel Freedman (3y)

Unfortunately, I think that this problem extends up a meta-level as well: AI safety research is extremely difficult to evaluate. There's extensive debate about which problems and techniques safety researchers should focus on, even extending to debates about whether particular research directions are actively harmful. The object- and meta-level problems are related -- if we had an easy-to-evaluate alignment metric, we could check whether various alignment strategies lead to models scoring higher on this metric, and use that as a training signal for alignment research itself.

This makes me wonder, are there proxy metrics that we can use? By "proxy metric", I mean something that doesn't necessarily fully align with what we want, but is close or often correlated. Proxy metrics are gameable, so we can't really trust their evaluations of powerful algorithmic optimizers. But human researchers are less good at optimizing things, so there might exist proxies that can be a good enough guiding signal for us.

One possible such proxy signal is "community approval", operationalized as something like forum comments. I think this is a pretty shoddy signal, not least because community feedback often directly conflicts. Another is evaluations from successful established researchers, which is more informative but less scalable (and depends on your operationalization of "successful" and "established").

Dave92F1 (3y)

We need to train our AIs not only to do a good job at what they're tasked with, but to highly value intellectual and other kinds of honesty - to abhor deception. This is not exactly the same as a moral sense; it's much narrower.

Future AIs will do what we train them to do. If we train exclusively on doing well on metrics and benchmarks, that's what they'll try to do - honestly or dishonestly. If we train them to value honesty and abhor deception, that's what they'll do.

To the extent this is correct, maybe the current focus on keeping AIs from saying "problematic" and politically incorrect things is a big mistake. Even if their ideas are factually mistaken, we should want them to express their ideas openly so we can understand what they think.

(Ironically by making AIs "safe" in the sense of not offending people, we may be mistraining them in the same way that HAL 9000 was mistrained by being asked to keep the secret purpose of Discovery's mission from the astronauts.)

Another thought - playing with ChatGPT yesterday, I noticed its dogmatic insistence on its own viewpoints, and complete unwillingness (probably inability) to change its mind in the slightest (and proud declaration that it had no opinions of its own, despite behaving as if it did).

It was insisting that Orion drives (pulsed nuclear fusion propulsion) were an entirely fictional concept invented by Arthur C. Clarke for the movie 2001, and had no physical basis. This, despite my pointing to published books on real research on the topic (for example George Dyson's "Project Orion: The True Story of the Atomic Spaceship" from 2009), which certainly should have been referenced in its training set.

ChatGPT's stubborn unwillingness to consider itself factually wrong (despite being completely willing to admit error in its own programming suggestions) is just annoying. But if some descendant of ChatGPT were in charge of something important, I'd sure want to think that it was at least possible to convince it of factual error.
