This piece is aimed at a broad audience, because I think it’s important for the challenges here to be broadly understood.
I'm curious how you're trying to reach such an audience, and what their reactions have been.
(Apologies for the late reply!) For now, my goal is to write something that interested, motivated nontechnical people can follow - the focus is on the content being followable rather than on distribution. I've tried to achieve this mostly via nontechnical beta (and alpha) readers.
Doing this gives me something I can send to people when I want them to understand where I'm coming from, and it also helps me clarify my own thoughts (I tend to trust ideas more when I can explain them to an outsider, and I think that getting to that point helps me get clear on which are the major high-level points I'm hanging my hat on when deciding what to do). I think there's also potential for this work to reach highly motivated but nontechnical people who are better at communication and distribution than I am (and have seen some of this happening).
I have the impression that these posts are pretty widely read in the EA community and at some AI labs, and have raised understanding and concern about misalignment to some degree.
I may explore more aggressive promotion in the future, but I'm not doing so now.
In fact, it's not 100% clear that AI systems could learn to deceive and manipulate supervisors even if we deliberately tried to train them to do it. This makes it hard to even get started on things like discouraging and detecting deceptive behavior.
Plausibly we already have examples of (very weak) manipulation, in the form of models trained with RLHF saying false-but-plausible-sounding things, or lying and saying they don't know something (but happily providing that information in different contexts). [E.g. ChatGPT denies having information about how to build nukes, but will also happily tell you about different methods for uranium isotope separation.]
Unfortunately, I think that this problem extends up a meta-level as well: AI safety research is extremely difficult to evaluate. There's extensive debate about which problems and techniques safety researchers should focus on, even extending to debates about whether particular research directions are actively harmful. The object- and meta-level problems are related -- if we had an easy-to-evaluate alignment metric, we could check whether various alignment strategies lead to models scoring higher on this metric, and use that as a training signal for alignment research itself.
This makes me wonder: are there proxy metrics that we can use? By "proxy metric", I mean something that doesn't necessarily fully align with what we want, but is close or often correlated. Proxy metrics are gameable, so we can't really trust their evaluations of powerful algorithmic optimizers. But human researchers are less good at optimizing things, so there might exist proxies that can be a good enough guiding signal for us.
One possible such proxy signal is "community approval", operationalized as something like forum comments. I think this is a pretty shoddy signal, not least because community feedback often directly conflicts. Another is evaluations from successful established researchers, which is more informative but less scalable (and depends on your operationalization of "successful" and "established").
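The claim that a proxy becomes less trustworthy as the optimizer gets stronger can be sketched with a toy simulation. Everything here is a hypothetical setup chosen for illustration: each candidate has a hidden "true value", the proxy is that value plus independent noise, and the names `select_by_proxy` and `average_overestimate` are made up. It assumes Gaussian noise, which is only one of many ways a proxy can fail.

```python
import random


def select_by_proxy(n_candidates, rng):
    """Draw candidates and return (proxy, true_value) for the best proxy score."""
    best = None
    for _ in range(n_candidates):
        true_value = rng.gauss(0, 1)           # what we actually care about
        proxy = true_value + rng.gauss(0, 1)   # assumed: proxy = truth + noise
        if best is None or proxy > best[0]:
            best = (proxy, true_value)
    return best


def average_overestimate(n_candidates, trials=300, seed=0):
    """Average amount by which the proxy overstates the selected candidate."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(trials):
        proxy, true_value = select_by_proxy(n_candidates, rng)
        total += proxy - true_value
    return total / trials


# A "weak optimizer" looks at a few options; a "strong" one looks at many.
weak_gap = average_overestimate(3)
strong_gap = average_overestimate(1000)
print(f"weak optimizer:   proxy overstates by {weak_gap:.2f} on average")
print(f"strong optimizer: proxy overstates by {strong_gap:.2f} on average")
```

In this sketch the harder search selects more strongly for lucky noise, so the proxy score overstates the true value by more. That is at least consistent with the idea that a proxy could be a tolerable guide for weak optimizers and a poor one for strong optimizers.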
We need to train our AIs not only to do a good job at what they're tasked with, but to highly value intellectual and other kinds of honesty - to abhor deception. This is not exactly the same as a moral sense; it's much narrower.
Future AIs will do what we train them to do. If we train exclusively on doing well on metrics and benchmarks, that's what they'll try to do - honestly or dishonestly. If we train them to value honesty and abhor deception, that's what they'll do.
To the extent this is correct, maybe the current focus on keeping AIs from saying "problematic" and politically incorrect things is a big mistake. Even if their ideas are factually mistaken, we should want them to express their ideas openly so we can understand what they think.
(Ironically by making AIs "safe" in the sense of not offending people, we may be mistraining them in the same way that HAL 9000 was mistrained by being asked to keep the secret purpose of Discovery's mission from the astronauts.)
Another thought - playing with ChatGPT yesterday, I noticed its dogmatic insistence on its own viewpoints, and complete unwillingness (probably inability) to change its mind in the slightest (and proud declaration that it had no opinions of its own, despite behaving as if it did).
It was insisting that Orion drives (nuclear pulse propulsion) were an entirely fictional concept invented by Arthur C. Clarke for the movie 2001, and had no physical basis. This, despite my pointing to published books on real research on the topic (for example George Dyson's "Project Orion: The True Story of the Atomic Spaceship" from 2002), which certainly should have been referenced in its training set.
ChatGPT's stubborn unwillingness to consider itself factually wrong (despite being completely willing to admit error in its own programming suggestions) is just annoying. But if some descendant of ChatGPT were in charge of something important, I'd sure want to think that it was at least possible to convince it of factual error.
In previous pieces, I argued that there's a real and large risk of AI systems' developing dangerous goals of their own and defeating all of humanity - at least in the absence of specific efforts to prevent this from happening.
A young, growing field of AI safety research tries to reduce this risk, by finding ways to ensure that AI systems behave as intended (rather than forming ambitious aims of their own and deceiving and manipulating humans as needed to accomplish them).
Maybe we'll succeed in reducing the risk, and maybe we won't. Unfortunately, I think it could be hard to know either way. This piece is about four fairly distinct-seeming reasons that this could be the case - and that AI safety could be an unusually difficult sort of science.
This piece is aimed at a broad audience, because I think it's important for the challenges here to be broadly understood. I expect powerful, dangerous AI systems to have a lot of benefits (commercial, military, etc.), and to potentially appear safer than they are - so I think it will be hard to be as cautious about AI as we should be. I think our odds look better if many people understand, at a high level, some of the challenges in knowing whether AI systems are as safe as they appear.
First, I'll recap the basic challenge of AI safety research, and outline what I wish AI safety research could be like. I wish it had this basic form: "Apply a test to the AI system. If the test goes badly, try another AI development method and test that. If the test goes well, we're probably in good shape." I think car safety research mostly looks like this; I think AI capabilities research mostly looks like this.
Then, I’ll give four reasons that apparent success in AI safety can be misleading.
The Lance Armstrong problem: When dealing with an intelligent agent, it's hard to tell the difference between "behaving well" and "appearing to behave well."
When professional cycling was cracking down on performance-enhancing drugs, Lance Armstrong was very successful and seemed to be unusually "clean." It later came out that he had been using drugs with an unusually sophisticated operation for concealing them.
The King Lear problem: The AI is (actually) well-behaved when humans are in control. Will this transfer to when AIs are in control?
It's hard to know how someone will behave when they have power over you, based only on observing how they behave when they don't.
AIs might behave as intended as long as humans are in control - but at some future point, AI systems might be capable and widespread enough to have opportunities to take control of the world entirely. It's hard to know whether they'll take these opportunities, and we can't exactly run a clean test of the situation.
It's like King Lear trying to decide how much power to give each of his daughters before abdicating the throne.
The lab mice problem: Today's AI systems aren't advanced enough to exhibit the basic behaviors we want to study, such as deceiving and manipulating humans.
It's like trying to study medicine in humans by experimenting only on lab mice.
The "first contact" problem: Imagine that tomorrow's "human-like" AIs are safe. How will things go when AIs have capabilities far beyond humans'?
AI systems might (collectively) become vastly more capable than humans, and it's ... just really hard to have any idea what that's going to be like. As far as we know, there has never before been anything in the galaxy that's vastly more capable than humans in the relevant ways! No matter what we come up with to solve the first three problems, we can't be too confident that it'll keep working if AI advances (or just proliferates) a lot more.
It's like trying to plan for first contact with extraterrestrials (this barely feels like an analogy).
I'll close with Ajeya Cotra's "young businessperson" analogy, which in some sense ties these concerns together. A future piece will discuss some reasons for hope, despite these problems.
Recap of the basic challenge
A previous piece laid out the basic case for concern about AI misalignment. In brief: if extremely capable AI systems are developed using methods like the ones AI developers use today, it seems like there's a substantial risk that:
I see AI safety research as trying to design AI systems that won't aim to deceive, manipulate or defeat humans - even if and when these AI systems are extraordinarily capable (and would be very effective at deception/manipulation/defeat if they were to aim at it). That is: AI safety research is trying to reduce the risk of the above scenario, even if (as I've assumed) humans rush forward with training powerful AIs to do ever-more ambitious things.
I wish AI safety research were straightforward
I wish AI safety research were like car safety research.[2]
While I'm sure this is an oversimplification, I think a lot of car safety research looks basically like this:
Overall, if we have problems with car safety, we'll probably be able to observe them relatively straightforwardly under relatively low-stakes circumstances.
In important respects, many types of research and development have this basic property: we can observe how things are going during testing to get good evidence about how they'll go in the real world. Further examples include medical research,[3] chemistry research,[4] software development,[5] etc.
Most AI research looks like this as well. People can test out what an AI system is capable of reliably doing (e.g., translating speech to text), before integrating it into some high-stakes commercial product like Siri. This works both for ensuring that the AI system is capable (e.g., that it does a good job with its tasks) and that it's safe in certain ways (for example, if we're worried about toxic language, testing for this is relatively straightforward).
The rest of this piece will be about some of the ways in which "testing" for AI safety fails to give us straightforward observations about whether, once AI systems are deployed in the real world, the world will actually be safe.
While all research has to deal with some differences between testing and the real world, I think the challenges I'll be going through are unusual ones.
Four problems
(1) The Lance Armstrong problem: is the AI actually safe or good at hiding its dangerous actions?
First, let's imagine that:
We theorize that modifying the AI training in some way[6] will make AI systems less likely to behave deceptively. We try it out, and find that, in fact, our AI systems seem to be behaving better than before - we are finding fewer incidents in which they behaved in unintended or deceptive ways.
But that's just a statement about what we're noticing. Which of the following just happened:
(...Or some combination of the three?)
We're hoping to be able to deploy AI systems throughout the economy, so - just like human specialists - they will almost certainly have some opportunities to be deceptive without being caught. The fact that they appear honest in our testing is not clear comfort against this risk.
The analogy here is to competitive cyclist Lance Armstrong. Armstrong won the Tour de France race 7 times in a row, while many of his competitors were caught using performance-enhancing drugs and disqualified. But more than 5 years after his last win, an investigation "concluded that Armstrong had used performance-enhancing drugs over the course of his career and named him as the ringleader of 'the most sophisticated, professionalized and successful doping program that sport has ever seen'." Now the list of Tour de France winners looks like this:
A broader issue here is that when AI systems become capable enough, AI safety research starts to look more like social sciences (studying human beings) than like natural sciences. Social sciences are generally less rigorous and harder to get clean results from, and one factor in this is that it can be hard to study someone who's aware they're being studied.[7]
Two broad categories of research that might help with the Lance Armstrong problem:
(2) The King Lear problem: how do you test what will happen when it's no longer a test?
The Shakespeare play King Lear opens with the King (Lear) stepping down from the throne, and immediately learning that he has left his kingdom to the wrong two daughters. Loving and obsequious while he was deciding on their fate,[9] they reveal their contempt for him as soon as he's out of power and they're in it.
If we're building AI systems that can reason like humans, dynamics like this become a potential issue.
I previously noted that an AI with any ambitious aim - or just an AI that wants to avoid being shut down or modified - might calculate that the best way to do this is by behaving helpfully and safely in all "tests" humans can devise. But once there is a real-world opportunity to disempower humans for good, that same aim could cause the AI to disempower humans.
In other words:
If AI systems can detect the difference between (A) and (B) above, then their behavior might systematically change from one to the other - and there's no clear way to test their behavior in (B).
In general, modern machine learning researchers consider it challenging to handle what's called "distributional shift": systematic differences between situations AIs were trained on and situations they're now in. To me, the King Lear problem looks like arguably the most inconvenient possible distributional shift: AI systems risk behaving in unexpected ways just as (and in fact, because) they're now able to defeat humanity, rather than being in a controlled test environment.
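For readers who like code, here is a toy sketch of what "distributional shift" means in the ordinary, mundane sense. It is a made-up illustration, not anything drawn from a real AI system: the "world" is assumed to follow y = x squared, the "model" is a straight line fit to inputs between 0 and 1, and all names (`true_world`, `fit_line`, `mean_abs_error`) are hypothetical. It says nothing about deception; it only shows that looking good on familiar inputs doesn't tell you much about unfamiliar ones.

```python
import random


def true_world(x):
    """Assumed ground truth for this toy example."""
    return x * x


def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def mean_abs_error(model, xs):
    """Average absolute gap between the model's prediction and the truth."""
    slope, intercept = model
    return sum(abs(slope * x + intercept - true_world(x)) for x in xs) / len(xs)


rng = random.Random(0)

# "Training" and "testing" both happen on inputs between 0 and 1.
train_x = [rng.uniform(0, 1) for _ in range(200)]
model = fit_line(train_x, [true_world(x) for x in train_x])
test_x = [rng.uniform(0, 1) for _ in range(200)]

# "Deployment" happens on inputs the model never saw: between 3 and 4.
shifted_x = [rng.uniform(3, 4) for _ in range(200)]

in_dist_error = mean_abs_error(model, test_x)
shifted_error = mean_abs_error(model, shifted_x)
print(f"error on familiar inputs: {in_dist_error:.3f}")
print(f"error on shifted inputs:  {shifted_error:.3f}")
```

The model looks fine in testing and fails badly once the inputs move outside the range it was fit on. The King Lear problem is worse than this sketch suggests, because there the shift (humans no longer being in control) is exactly the condition we can't safely reproduce in a test.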
Some lines of research that might help here:
(3) The lab mice problem: the AI systems we'd like to study don't exist today
Above, I said: "when AI systems become capable enough, AI safety research starts to look more like social sciences (studying human beings) than like natural sciences." But today, AI systems aren't capable enough, which makes it especially hard to have a meaningful test bed and make meaningful progress.
Specifically, we don't have much in the way of AI systems that seem to deceive and manipulate their supervisors,[10] the way I worry that they might when they become capable enough.
In fact, it's not 100% clear that AI systems could learn to deceive and manipulate supervisors even if we deliberately tried to train them to do it. This makes it hard to even get started on things like discouraging and detecting deceptive behavior.
I think AI safety research is a bit unusual in this respect: most fields of research aren't explicitly about "solving problems that don't exist yet." (Though a lot of research ends up useful for more important problems than the original ones it's studying.) As a result, doing AI safety research today is a bit like trying to study medicine in humans by experimenting only on lab mice (no human subjects available).
This does not mean there's no productive AI safety research to be done! (See the previous sections.) It just means that the research being done today is somewhat analogous to research on lab mice: informative and important up to a point, but only up to a point.
How bad is this problem? I mean, I do think it's a temporary one: by the time we're facing the problems I worry about, we'll be able to study them more directly. The concern is that things could be moving very quickly by that point: by the time we have AIs with human-ish capabilities, companies might be furiously making copies of those AIs and using them for all kinds of things (including both AI safety research and further research on making AI systems faster, cheaper and more capable).
So I do worry about the lab mice problem. And I'd be excited to see more effort on making "better model organisms": AI systems that show early versions of the properties we'd most like to study, such as deceiving their supervisors. (I even think it would be worth training AIs specifically to do this;[11] if such behaviors are going to emerge eventually, I think it's best for them to emerge early while there's relatively little risk of AIs' actually defeating humanity.)
(4) The "first contact" problem: how do we prepare for a world where AIs have capabilities vastly beyond those of humans?
All of this piece so far has been about trying to make safe "human-like" AI systems.
What about AI systems with capabilities far beyond humans - what Nick Bostrom calls superintelligent AI systems?
Maybe at some point, AI systems will be able to do things like:
At this point, whatever methods we've developed for making human-like AI systems safe, honest, and restricted could fail - and silently, as such AI systems could go from "behaving in honest and helpful ways" to "appearing honest and helpful, while setting up opportunities to defeat humanity."
Some people think this sort of concern about "superintelligent" systems is ridiculous; some[13] seem to consider it extremely likely. I'm not personally sympathetic to having high confidence either way.
But additionally, a world with huge numbers of human-like AI systems could be strange and foreign and fast-moving enough to have a lot of this quality.
Trying to prepare for futures like these could be like trying to prepare for first contact with extraterrestrials - it's hard to have any idea what kinds of challenges we might be dealing with, and the challenges might arise quickly enough that we have little time to learn and adapt.
The young businessperson
For one more analogy, I'll return to the one used by Ajeya Cotra here:
If your applicants are a mix of "saints" (people who genuinely want to help), "sycophants" (people who just want to make you happy in the short run, even when this is to your long-term detriment) and "schemers" (people who want to siphon off your wealth and power for themselves), how do you - an eight-year-old - tell the difference?
This analogy combines most of the worries above.
Seems like a tough situation.
Previously, I talked about the dangers of AI if AI developers don't take specific countermeasures. This piece has tried to give a sense of why, even if they are trying to take countermeasures, doing so could be hard. The next piece will talk about some ways we might succeed anyway.
Footnotes
1. Or persuaded (in a "mind hacking" sense) or whatever.
2. Research? Testing. Whatever.
3. Drugs can be tested in vitro, then in animals, then in humans. At each stage, we can make relatively straightforward observations about whether the drugs are working, and these are reasonably predictive of how they'll do at the next stage.
4. You can generally see how different compounds interact in a controlled environment, before rolling out any sort of large-scale processes or products, and the former will tell you most of what you need to know about the latter.
5. New software can be tested by a small number of users before being rolled out to a large number, and the initial tests will probably find most (not all) of the bugs and hiccups.
6. Such as:
7. Though there are other reasons social sciences are especially hard, such as the fact that there are often big limits to what kinds of experiments are ethical, and the fact that it's often hard to make clean comparisons between differing populations.
8. This paper is from Anthropic, a company that my wife serves as President of.
9. Like, he actually asks them to talk about their love for him just before he decides on what share of the realm they'll get. Smh
10. This paper is a potential example, but its results seem pretty brittle.
11. E.g., I think it would be interesting to train AI coding systems to write underhanded C: code that looks benign to a human inspector, but does unexpected things when run. They could be given negative reinforcement when humans can correctly identify that the code will do unintended things, and positive reinforcement when the code achieves the particular things that humans are attempting to stop. This would be challenging with today's AI systems, but not necessarily impossible.
12. This is a concept that only I understand.
13. E.g., see the discussion of the "sharp left turn" here by Nate Soares, head of MIRI. My impression is that others at MIRI, including Eliezer Yudkowsky, have a similar picture.