Sam Altman, CEO of OpenAI, was interviewed by Connie Loizos last week, and the video was posted two days ago. Here are some AI safety-relevant parts of the discussion, with light editing by me for clarity, based on this automated transcript.

[starting in part two of the interview, which is where the discussion about AI safety is] 

Connie: So moving on to AI, which is where you've obviously spent the bulk of your time since I saw you when we sat here three years ago. You were telling us what was coming, and we all thought you were being sort of hyperbolic, but you were dead serious. Why do you think that ChatGPT and DALL-E so surprised people?

Sam: I genuinely don't know. I've reflected on it a lot. We had the model for ChatGPT in the API for I don't know 10 months or something before we made ChatGPT. And I sort of thought someone was going to just build it or whatever and that enough people had played around with it. Definitely, if you make a really good user experience on top of something. One thing that I very deeply believed was the way people wanted to interact with these models was via dialogue. We kept telling people this, we kept trying to get people to build it, and people wouldn't quite do it. So we finally said, all right, we're just going to do it, but yeah, I think the pieces were there for a while.

One of the reasons I think DALL-E surprised people is that if you had asked five or seven years ago, the kind of ironclad wisdom on AI was that first it comes for physical labor -- truck driving, working in the factory -- then this sort of less demanding cognitive labor, then the really demanding cognitive labor like computer programming, and then very last of all, or maybe never, because maybe it's like some deep human special sauce, was creativity. And of course, we can look now and say it really looks like it's going to go in exactly the opposite direction. But I think that is not super intuitive, and so I can see why DALL-E surprised people. But I genuinely felt somewhat confused about why ChatGPT did.

One of the things we really believe is that the most responsible way to put this out in society is very gradually and to get people, institutions, policy makers, get them familiar with it, thinking about the implications, feeling the technology, and getting a sense for what it can do and can't do very early. Rather than drop a super powerful AGI in the world all at once. And so we put GPT-3 out almost three years ago and then we put it into an API like two and a half years ago. And the incremental update from that to ChatGPT I felt should have been predictable, and I want to do more introspection on why I was sort of miscalibrated on that.

Connie: So you know you had talked when you were here about releasing things in a responsible way. What gave you the confidence to release what you have released already? I mean do you think we're ready for it? Are there enough guardrails in place? 

Sam: We do have an internal process where we try to break things in and study impacts. We use external auditors, we have external red teamers, we work with other labs, and have safety organizations look at stuff. 

There are societal changes that ChatGPT is going to cause or is causing. There's a big one going on now about the impact of this on education, academic integrity, all of that. But I think it's better to start on these now, where the stakes are still relatively low, rather than just putting out what the whole industry will have in a few years with no time for society to update -- that would be bad. Covid did show us, for better or for worse, that society can update to massive changes sort of faster than I would have thought in many ways.

But I still think, given the magnitude of the economic impact we expect here, more gradual is better. And so putting out a very weak and imperfect system like ChatGPT, and then making it a little better this year, a little better later this year, and a little better next year, seems much better than the alternative.

Connie: Can you comment on whether GPT-4 is coming out in the first quarter, first half of the year?

Sam: It'll come out at some point when we are confident that we can do it safely and responsibly. I think in general we are going to release technology much more slowly than people would like. We're going to sit on it for much longer than people would like. And eventually, people will be happy with our approach to this, but in the moment, I realize people want the shiny toy and it's frustrating. I totally get that.

Connie: I saw a visual, and I don't know if it was accurate, but it showed GPT-3.5 versus, I guess, what GPT-4 is expected to be, and I saw that thing on Twitter...

Sam: The GPT-4 rumor mill is like a ridiculous thing, I don't know where it all comes from. I don't know why people don't have like better things to speculate on. I get a little bit of it, like it's sort of fun, but it's been going for like six months at this volume. People are begging to be disappointed and they will be. The hype is just like... we don't have an actual AGI, and I think that's sort of what is expected of us, and yeah, we're going to disappoint those people.

[skipping part] 

Connie: You obviously have figured out a way to make some revenue, you're licensing your model...

Sam: We're very early. 

Connie: So right now, licensing to startups, you are early on. And people are sort of looking at the whole of what's happening out there, and they're saying: you've got Google, which could potentially release things this year, you have a lot of AI upstarts nipping at your heels. Are you worried about what you're building being commoditized?

Sam: To some degree, I hope it is. The future I would like to see is where access to AI is super democratized, where there are several AGIs in the world that can help allow for multiple viewpoints and not have anyone get too powerful. Like the cost of intelligence and energy, because it gets commoditized, trends down and down and down, and the massive surplus there, access to the systems, and eventually governance of the systems benefits all of us. So yeah I sort of hope that happens. I think competition is good at least until we get to AGI. I deeply believe in capitalism and competition to offer the best service at the lowest price, but that's not great from a business standpoint. That's fine, we'll be fine. 

Connie: I also find it interesting that you say differing viewpoints or these AGIs would have different viewpoints. They're all being trained on all the data that's available in the world, so how do we come up with differing viewpoints? 

Sam: I think what is going to have to happen is society will have to agree and set some laws on what an AGI can never do, or what one of these systems should never do. And one of the cool things about the path of the technology tree that we're on -- which is very different from before we came along, when it was sort of DeepMind having these games where agents would play each other and try to deceive each other and kill each other and all that, which I think could have gone in a bad direction -- is that now we have these language models that can understand language. So we can say, hey model, here's what we'd like you to do, here are the values we'd like you to align to. And we don't have it working perfectly yet, but it works a little and it'll get better and better.

And the world can say all right, here are the rules, here's the very broad-bounds absolute rules of a system. But within that people should be allowed very different things that they want their AI to do. And so if you want the super never offend safe for work model you should get that. And if you want an edgier one that is sort of creative and exploratory but says some stuff you might not be comfortable with or some people might not be comfortable with, you should get that.

I think there will be many systems in the world that have different settings of the values that they enforce. And really what I think -- and this will take longer -- is that you as a user should be able to write up a few pages of: here's what I want, here are my values, here's how I want the AI to behave, and it reads it and thinks about it and acts exactly how you want. Because it should be your AI and it should be there to serve you and do the things you believe in. So that to me is much better than one system where one tech company says here are the rules.

Connie: When we sat down it was right before your partnership with Microsoft, so when you say we're gonna be okay I wonder...

Sam: We're just going to build a fine business. Even if the competitive pressure pushes the price that people will pay per token down, we're gonna do fine. We also have this capped-profit model, so we don't have this incentive to just like capture all of the infinite dollars out there anyway. As for generating enough money for our equity structure, yeah, I believe we'll be fine.

Connie: Well I know you're not crazy about talking about deal-making so we won't, but can you talk a little bit about your partnership with Microsoft, and how it's going?

Sam: It's great. They're the only tech company out there that I think I'd be excited to partner with this deeply. I think Satya is an amazing CEO, but more than that, an amazing human being, and he understands -- as do Kevin Scott and McHale [?], who we work with closely as well -- the stakes of what AGI means and why we need to have all the weirdness we do in our structure and our agreement with them. So I really feel like it's a very values-aligned company. And there are some things they're very good at, like building very large supercomputers and the infrastructure we operate on and putting the technology into products, and there are things we're very good at, like doing research.

[skipping part]

Connie: Your pact with Microsoft, does it preclude you from building software and services? 

Sam: No -- we built -- I mean, we just built ChatGPT, as we talked about. We have lots more cool stuff coming.

Connie: What about other partnerships other than with Microsoft?

Sam: Also fine. Yeah, in general, we are very much here to build AGI. Products and services are tactics in service of that. Partnerships too, but important ones. We really want to be useful to people. I think if we just build this in a lab and don't figure out how to get it out into the world, then -- like somehow -- we're really falling short there.

Connie: Well I wondered what you made of the fact that Google has said to its employees it's too imperfect, it could harm our reputation, we're not ready?

Sam: I hope when they launch something anyway you really hold them to that comment. I'll just leave it there. 

[skipping part] 

Connie: I also wanted to ask about Anthropic, a rival I guess, founded by a former...?

Sam: I think super highly of those people. Very very talented. And multiple AGIs in the world I think is better than one. 

Connie: Well, what I was going to ask, and just for some background, it was founded by a former OpenAI VP of research who, I think, you met when he was at Google. But it is stressing an ethical layer as a kind of distinction from other players. And I just wondered if you think that systems should adopt a kind of a common code of principles, and also whether that should be regulated?

Sam: Yeah that was my earlier point, I think society should regulate what the wide bounds are, but then I think individual users should have a huge amount of liberty to decide how they want their experience to go. So I think it is like a combination of society -- you know there are a few asterisks on the free speech rules -- and society has decided free speech is not quite absolute. I think society will also decide language models are not quite absolute. But there is a lot of speech that is legal that you find distasteful, that I find distasteful, that he finds distasteful, and we all probably have somewhat different definitions of that, and I think it is very important that that is left to the responsibility of individual users and groups. Not one company. And that the government, they govern, and not dictate all of the rules.

[skipping part]

Question from an audience member: What is your best case scenario for AI and worst case? Or more pointedly what would you like to see and what would you not like to see out of AI in the future? 

Sam: I think the best case is so unbelievably good that it's hard to -- it's like hard for me to even imagine. I can sort of. I can sort of think about what it's like when we make more progress of discovering new knowledge with these systems than humanity has done so far, but like in a year instead of seventy thousand. I can sort of imagine what it's like when we kind of like launch probes out to the whole universe and find out really everything going on out there. I can sort of imagine what it's like when we have just like unbelievable abundance and systems that can help us resolve deadlocks and improve all aspects of reality and let us all live our best lives. But I can't quite. I think the good case is just so unbelievably good that you sound like a really crazy person to start talking about it. 

And the bad case -- and I think this is important to say -- is like lights out for all of us. I'm more worried about an accidental misuse case in the short term where someone gets a super powerful -- it's not like the AI wakes up and decides to be evil. I think all of the traditional AI safety thinkers reveal a lot more about themselves than they mean to when they talk about what they think the AGI is going to be like. But I can see the accidental misuse case clearly and that's super bad. So I think it's like impossible to overstate the importance of AI safety and alignment work. I would like to see much much more happening. 

But I think it's more subtle than most people think. You hear a lot of people talk about AI capabilities and AI alignment as in orthogonal vectors. You're bad if you're a capabilities researcher and you're good if you're an alignment researcher. It actually sounds very reasonable, but they're almost the same thing. Deep learning is just gonna solve all of these problems and so far that's what the progress has been. And progress on capabilities is also what has let us make the systems safer and vice versa surprisingly. So I think none of the sort of sound-bite easy answers work

Connie: Alfred Lynn [?] told me to ask you, and I was going to ask anyway, how far away do you think AGI is? He said Sam will probably tell you sooner than you thought. 

Sam: The closer we get, the harder time I have answering, because I think that it's going to be much blurrier, and much more of a gradual transition, than people think. Imagine a two-by-two matrix of short versus long timelines until the AGI takeoff era begins, and then a slow takeoff versus a fast takeoff. The world I think we're heading to and the safest world, the one I most hope for, is the short timeline slow takeoff. But I think people are going to have hugely different opinions about when and where you kind of like declare victory on the AGI thing.

[skipping part] 

Question from an audience member: So given your experience with OpenAI safety and the conversation around it, how do you think about safety and other AI fields like autonomous vehicles?

Sam: I think there's like a bunch of safety issues for any new technology, and particularly any narrow vertical of AI. We have learned a lot in the past seven or eight decades of technological progress about how to do really good safety engineering and safety systems management. And a lot of that -- how we learn to build safe systems and safe processes -- will translate. Imperfectly, and there will be mistakes, but we know how to do that.

I think the AGI safety stuff is really different, personally. And worthy of study as its own category. Because the stakes are so high and the irreversible situations are so easy to imagine we do need to somehow treat that differently and figure out a new set of safety processes and standards. 

[there's a bit more but it's not about AI safety] 


-- it's not like the AI wakes up and decides to be evil. I think all of the traditional AI safety thinkers reveal a lot more about themselves than they mean to when they talk about what they think the AGI is going to be like.

I think Sam Altman is "inventing a guy to be mad at" here. Who anthropomorphizes models?

 

And the bad case -- and I think this is important to say -- is like lights out for all of us. (..) But I can see the accidental misuse case clearly and that's super bad. So I think it's like impossible to overstate the importance of AI safety and alignment work. I would like to see much much more happening. 

This reinforces my position that the fundamental dispute between the opposing segments of the AI safety landscape is based mainly on how hard it is to prevent extreme accidents, rather than on irreconcilable value differences. Of course, I can't judge who is right, and there might be quite a lot of uncertainty until shortly before very transformative events are possible.

On the one hand, I do think people around here say a lot of stuff that feels really silly to me, some of which definitely comes from analogies to humans, so I can sympathize with where Sam is coming from.

On the other hand, I think this response mischaracterizes the misalignment concern and is generally dismissive and annoying. Implying that "if you think an AI might behave badly, that really shows that it is you who would behave badly" is kind of rhetorically effective (and it is a non-zero signal) but it's a tiny consideration and either misunderstands the issues or is deliberately obtuse to score rhetorical points. It would be really worrying if people doubled down on this kind of rhetorical strategy (which I think is plausible) or if it was generally absorbed as part of the culture of OpenAI. Unfortunately some other OpenAI employees have made similarly worrying statements.

I agree that it's not obvious what is right. I think there is maybe a 50% chance that the alignment concerns are totally overblown and either emerge way too late to be relevant or are extremely easily dealt with. I hope that it will be possible to make measurements to resolve this dispute well before something catastrophic happens, and I do think there are plausible angles for doing so. In the meantime I personally just feel pretty annoyed at people on both sides who seem so confident and dismissive. I'm more frustrated at Eliezer because he is in some sense "on my side" of this issue, but I'm more worried about Sam since erring in the other direction would irreversibly disempower humanity.

That said, I agree with Sam that in the short term more of the harm comes from misuse than misalignment. I just think the "short term" could be quite short, and normal people are not myopic enough that the costs of misuse are comparable to say a 3% risk of death in 10 years. I also think "misuse" vs "misalignment" can be blurry in a way that makes both positions more defensible, e.g. a scenario where OpenAI trains a model which is stolen and then deployed recklessly can involve both. Misalignment is what makes that event catastrophic for humanity, but from OpenAI's perspective any event where someone steals their model and applies it recklessly might be described as misuse.

Sam: I genuinely don't know. I've reflected on it a lot. We had the model for ChatGPT in the API for I don't know 10 months or something before we made ChatGPT. And I sort of thought someone was going to just build it or whatever and that enough people had played around with it. Definitely, if you make a really good user experience on top of something. One thing that I very deeply believed was the way people wanted to interact with these models was via dialogue. We kept telling people this, we kept trying to get people to build it, and people wouldn't quite do it. So we finally said, all right, we're just going to do it, but yeah, I think the pieces were there for a while.

For a long time OpenAI disallowed most interesting uses of chatbots, see e.g. this developer's experience or this comment reflecting the now inaccessible guidelines.

It feels strange hearing Sam say that their products are released whenever they feel as though 'society is ready.' Perhaps they can afford to do that now, but I cannot help but think that market dynamics will inevitably create strong incentives for race conditions very quickly (perhaps it is already happening) which will make following this approach pretty hard. I know he later says that he hopes for competition in the AI-space until the point of AGI, but I don't see how he balances the knowledge of extreme competition with the hope that society is prepared for the technologies they release; it seems that even current models, which appear to be far from the capabilities of AGI, are already transformative.

A few comments: 

  1. A lot of slow takeoff, gradual capabilities ramp-up, multipolar AGI world type of thinking. Personally, I agree with him this sort of scenario seems both more desirable and more likely. But this seems to be his biggest area of disagreement with many others here. 
  2. The biggest surprise to me was when he said that he thought short timelines were safer than long timelines. The reason for that is not obvious to me. Maybe something to do with contingent geopolitics. 
  3. Doesn't seem great to dismiss people's views based on psychologizing about them. But, these are off-the-cuff remarks, held to a lower standard than writing. 

The biggest surprise to me was when he said that he thought short timelines were safer than long timelines. The reason for that is not obvious to me. Maybe something to do with contingent geopolitics.

What do you expect him to say? "Yeah, longer timelines and consolidated AGI development efforts are great, I'm shorting your life expectancies as we speak"? The only way you can be a Sam Altman is by convincing yourself that nuclear proliferation makes the world safer.

Good point. 

I know your question was probably just rhetorical, but to answer it regardless -- I was confused in part because it would have made sense to me if he had said it would be "better" if AGI timelines were short.

Lots of people want short AGI timelines because they think the alignment problem will be easy or otherwise aren't concerned about it and they want the perceived benefits of AGI for themselves/their family and friends/humanity (eg eliminating disease, eliminating involuntary death, abundance, etc). And he could have just said "better" without really changing the rest of his argument. 

At least the word "better" would make sense to me, even if, as you imply, it might be wrong and plenty of others would disagree with it. 

So I expect I am missing something in his internal model that made him use the word "safer" instead of "better". I can only guess at possibilities. Like thinking that if AGI timelines are too long, then the CCP might take over the USA/the West in AI capabilities, and care even less about AGI safety when it matters the most. 

A lot of slow takeoff, gradual capabilities ramp-up, multipolar AGI world type of thinking. Personally, I agree with him this sort of scenario seems both more desirable and more likely.

I think the operative word in "seems more likely" here is "seems". It seems like a more sophisticated, more realistic, more modern and satisfyingly nuanced view, compared to "the very first AGI we train explodes like a nuclear bomb and unilaterally sets the atmosphere on fire, killing everyone instantly". The latter seems like an old view, a boringly simplistic retrofuturistic plot. It feels like there's a relationship between these two scenarios, and that the latter one is a rough first-order approximation someone lifted out of e.g. The Terminator to get people interested in the whole "AI apocalypse" idea at the onset of it all. Then we gained a better understanding, sketched out detailed possibilities that take into account how AI and AI research actually work in practice, and refined that rough scenario. As a result, we got that picture of a slower multipolar catastrophe.

A pleasingly complicated view! One that respectfully takes into account all of these complicated systems of society and stuff. It sure feels like how these things work in real life! "It's not like the AI wakes up and decides to be evil," perish the thought.

That seeming has very little to do with reality. The unilateral-explosion view isn't the old, outdated scenario -- it's simply a different scenario that's operating on a different model of how intelligence explosions proceed. And as far as its proponents are concerned, its arguments haven't been overturned at all, and nothing about how DL works rules it out.

But it sure seems like the rough naive view that the Real Experts have grown out of a while ago; and that those who refuse to update simply haven't done that growing-up, haven't realized there's a world outside their chosen field with all these Complicated Factors you need to take into account.

It makes it pretty hard to argue against. It's so low-status.

... At least, that's how that argument feels to me, on a social level.

(Edit: Uh, to be clear, I'm not saying that there are no reasons to buy the multipolar scenario other than "it seems shiny", or that a reasonable person could not come to believe it for valid reasons. I think it's incorrect, and that there are some properties that unfairly advantage it in the social context, but I'm not saying it's totally illegitimate.)

the very first AGI we train explodes like a nuclear bomb and unilaterally sets the atmosphere on fire, killing everyone instantly

To understand whether this is the kind of thing that could be true or false, it seems like you should say what  "the very first AGI" means. What makes a system an AGI?

I feel like this view is gradually looking less plausible as we build increasingly intelligent and general systems, and they persistently don't explode (though only gradually because it's unclear what the view means).

It looks to me like what is going to happen is that AI systems will gradually get better at R&D. They can help with R&D by some unknown combination of complementing and replacing human labor, either way leading to acceleration. The key question is the timeline for that acceleration---how long between "AIs are good enough at R&D that their help increases the rate of progress (on average, over relevant domains) by more than doubling the speed of human labor would" and "dyson sphere." I feel that plausible views range from 2 months to 20 years, based on quantitative questions of returns curves and complementarity between AI and humans and the importance of capital. I'd overall guess 2-8 years.

Yeah, I think there's a sharp-ish discontinuity at the point where we get to AGI. "General intelligence" is, to wit, general — it implements some cognition that can efficiently derive novel heuristics for solving any problem/navigating arbitrary novel problem domains. And a system that can't do that is, well, not an AGI.

Conceptually, the distinction between an AGI and a pre-AGI system feels similar to the distinction between a system that's Turing-complete and one that isn't:

  • Any Turing-complete system implements a set of rules that suffices to represent any mathematical structure/run any program. A system that's just "slightly below" Turing-completeness, however, is dramatically more limited.
  • Similarly, an AGI has a complete set of some cognitive features that make it truly universal — features it can use to bootstrap any other capability it needs, from scratch. By contrast, even a slightly "pre-AGI" system would be qualitatively inferior, not simply quantitatively so.
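A toy illustration of that threshold point (purely illustrative, and not something from the comment itself): a tiny language with unbounded loops can express open-ended search, while the same language restricted to a pre-declared loop budget cannot, no matter how large the budget -- a qualitative gap produced by a small structural difference.

```python
# Purely illustrative: an "unbounded" evaluator (while-loops allowed) vs. a
# "bounded" one (only loops with a pre-declared iteration budget). The first can
# express open-ended search -- here, counting Collatz steps until reaching 1 --
# while the second must commit to a budget up front and fails when it's too small.
from typing import Optional

def collatz_steps_unbounded(n: int) -> int:
    steps = 0
    while n != 1:                       # run however long it takes
        n = 3 * n + 1 if n % 2 else n // 2
        steps += 1
    return steps

def collatz_steps_bounded(n: int, budget: int) -> Optional[int]:
    steps = 0
    for _ in range(budget):             # fixed, pre-declared budget
        if n == 1:
            return steps
        n = 3 * n + 1 if n % 2 else n // 2
        steps += 1
    return None                         # budget exhausted; unbounded "search until done" can't be expressed this way

print(collatz_steps_unbounded(27))      # 111
print(collatz_steps_bounded(27, 50))    # None -- the budget was too small
print(collatz_steps_bounded(27, 200))   # 111
```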

There's still some fuzziness around the edges, like whether any significantly useful R&D capabilities only happen at post-AGI cognition, or to what extent being an AGI is a sufficient and not merely a necessary condition for an omnicidal explosion.

But I do think there's a meaningful sense in which AGI-ness is a binary, not a continuum. (I'm also hopeful regarding nailing all of this down mathematically, instead of just vaguely gesturing at it like this.)

It seems like a human is universal in the sense that they can think about new problem solving strategies, evaluate them, adopt successful ones, etc. Most of those new problem-solving strategies are developed by long trial and error and cultural imitation of successful strategies.

If a language model could do the same thing with chain of thought, would you say that it is an AGI? So would the existence of such systems, without an immediate intelligence explosion, falsify your view?

If such a system seemed intuitively universal but wasn't exploding, what kind of observation would tell you that it isn't universal after all, and therefore salvage your claim? Or maybe to put this more sharply: how did you decide that text-davinci-003, prompted to pursue an open-ended goal and given the ability to instantiate and delegate to new copies of itself, isn't an AGI?

It seems to me that you will probably have an "AGI" in the sense you are gesturing at here well before the beginning of explosive R&D growth. I don't really see the argument for why an intelligence explosion would follow quickly from that point. (Indeed, my view is that you could probably build an AGI in this sense out of text-davinci-003, though the result would be uneconomical.)

If such a system seemed intuitively universal but wasn't exploding, what kind of observation would tell you that it isn't universal after all, and therefore salvage your claim?

Given that we don't understand how current LLMs work, and what the "space of problems" generally looks like, it's difficult to come up with concrete tests that I'm confident I won't goalpost-move on. A prospective one might be something like this:

If you invent or find a board game of similar complexity to chess that [the ML model] has never seen before and explain the rules using only text (and, if [the ML model] is multimodal, also images), [a pre-AGI model] will not be able to perform as well at the game as an average human who has never seen the game before and is learning it for the first time in the same way.

I. e., an AGI would be able to learn problem-solving in completely novel, "basically-off-distribution" domains. And if a system that has capabilities like this doesn't explode (and it's not deliberately trained to be myopic or something), that would falsify my view.
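Operationally, a test like that might be scored roughly as in the sketch below. Everything concrete is a placeholder: the "novel game" is a Nim-like stand-in, the novice baseline is a random player, and a real run would wrap an LLM prompted with only the rules text and compare it against human novices learning from the same text.

```python
# Sketch of the evaluation protocol: explain a novel game's rules in text only,
# have the model play many matches, and compare its win rate to a novice baseline.
import random
from typing import Callable, List, Tuple

State = List[int]                 # heap sizes; the player taking the last object wins
Move = Tuple[int, int]            # (heap index, number of objects to remove)

def legal_moves(state: State) -> List[Move]:
    return [(i, k) for i, heap in enumerate(state) for k in range(1, heap + 1)]

def apply_move(state: State, move: Move) -> State:
    i, k = move
    new_state = list(state)
    new_state[i] -= k
    return new_state

def play_match(agent_a: Callable[[State], Move],
               agent_b: Callable[[State], Move],
               start: State) -> int:
    """Returns 0 if agent_a wins, 1 if agent_b wins."""
    state, agents, turn = list(start), (agent_a, agent_b), 0
    while any(state):
        state = apply_move(state, agents[turn](state))
        turn = 1 - turn
    return 1 - turn               # whoever just moved took the last object and wins

def novice_baseline(state: State) -> Move:
    return random.choice(legal_moves(state))

def model_agent(state: State) -> Move:
    # Placeholder: a real test would query an LLM given only the rules text.
    return random.choice(legal_moves(state))

wins = sum(play_match(model_agent, novice_baseline, [3, 4, 5]) == 0 for _ in range(1_000))
print(f"model win rate vs. novice baseline: {wins / 1_000:.2f}")
```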

But for me to be confident in putting weight on that test, we'd need to clarify some specific details about the "minimum level of complexity" of the new board game, and that it's "different enough" from all known board games for the AI to be unable to just generalize from them... And given that it's unclear in which directions it's easy to generalize, I expect I wouldn't be confident in any metric we'd be able to come up with.

I guess a sufficient condition for AGI would be "is able to invent a new scientific field out of whole cloth, with no human steering, as an instrumental goal towards solving some other task". But that's obviously an overly high bar.

As far as empirical tests for AGI-ness go, I'm hoping for interpretability-based ones instead. I. e., that we're able to formalize what "general intelligence" means, then search for search in our models.

As far as my epistemic position, I expect three scenarios here:

  • We develop powerful interpretability tools, and they directly show whether my claims about general intelligence hold.
  • We don't develop powerful interpretability tools before an AGI explodes like I fear and kills us all.
  • AI capabilities gradually improve until we get to scientific-field-inventing AGIs, and my mind changes only then.

In scenarios where I'm wrong, I mostly don't expect to ever encounter any black-boxy test like you're suggesting which seems convincing to me, before I encounter overwhelming evidence that makes convincing me a moot point.

(Which is not to say my position on this can't be moved at all — I'm open to mechanical arguments about why cognition doesn't work how I'm saying it does, etc. But I don't expect to ever observe an LLM whose mix of capabilities and incapabilities makes me go "oh, I guess that meets my minimal AGI standard but it doesn't explode, guess I was wrong on that".)

You're right that the operative word in "seems more likely" is "seems"! I used the word "seems" because I find this whole topic really confusing and I have a lot of uncertainty. 

It sounds like there may be a concern that I am using the absurdity heuristic or something similar against the idea of fast take-off and associated AI apocalypse. Just to be clear, I most certainly do not buy absurdity heuristic arguments in this space, would not use them, and find them extremely annoying. We've never seen anything like AI before, so our intuition (which might suggest that the situation seems absurd) is liable to be very wrong. 

Oh, I think I should've made clearer that I wasn't aiming that rant at you specifically. Just outlining my general impression of how the two views feel socially.

Deep learning is just gonna solve all of these problems...

- Sam Altman

In context:

Deep learning is just gonna solve all of these problems and so far that's what the progress has been. And progress on capabilities is also what has let us make the systems safer and vice versa surprisingly. So I think none of the sort of sound-bite easy answers work

This is a pretty common view. I think there are three aspects here; two I agree with, and the other seems like a kind of incoherent argument that is used as a justification for bad behavior.

First the one that I disagree with: I'm most concerned about the risk from AI systems that are "trying" to cause harm. It's true that more capable systems are more able to understand that humans wouldn't want them to do that, and are more able to act as effective checks and balances on other AI, and so on. But that doesn't mean that making AIs more capable improves the situation with respect to safety. 

From this perspective, the relevant question is how safe AI systems are at the point when they have given capabilities. If smart AI systems will understand that they shouldn't kill you, then you don't need to worry about smart AI systems killing you. Making the AI systems smarter faster doesn't help, because the dumb AI systems definitely weren't going to kill you. You could imagine a regime where AI systems are smart enough to kill you but getting less dangerous as they get smarter (since they e.g. better defend against other AIs, or can realize that humans didn't want to be murdered), but that's obviously not where we are at now, and there aren't really plausible stories where that happens.

Two other versions of this claim that I do agree with:

  • A bunch of stuff that helps with safety also helps with making AI systems more useful and arguably more "capable," and some people in the safety community will say "that stuff is net negative, our overwhelming priority is delaying AI development." Salient examples are robustness and RLHF. I think following the implied strategy---of avoiding any safety work that improves capabilities ("capability externalities")---would be a bad idea. You would give up on realistic approaches for safety in return for small delays in adoption, the cost-benefit analysis just doesn't seem to work out or even be close.
  • There are significant (but mostly non-catastrophic) risks from AI errors. A smarter AI is less likely to accidentally start a war by misunderstanding the strategic situation, or crash a car, or accidentally cause a nuclear meltdown. Making smarter AI increases adoption of AI, but decreases the risk of such errors, and so in some sense faster progress seems very good for "safety." The problem is that this is not a big part of the potential harms from OpenAI's research.

Salient examples are robustness and RLHF. I think following the implied strategy---of avoiding any safety work that improves capabilities ("capability externalities")---would be a bad idea.

There are plenty of topics in robustness, monitoring, and alignment that improve safety differentially without improving vanilla upstream accuracy: most adversarial robustness research does not have general capabilities externalities; topics such as transparency, trojans, and anomaly detection do not; honesty efforts so far do not have externalities either. Here is analysis of many research areas and their externalities.

Even though the underlying goal is to improve the safety-capabilities ratio, this is not the best decision-making policy. Given uncertainty, the large incentives for making models superhuman, motivated reasoning, and competition pressures, aiming for minimal general capabilities externalities should be what influences real-world decision-making (playing on the criterion of rightness vs. decision procedure distinction).

If safety efforts are to scale to a large number of researchers, the explicit goal should be to measurably avoid general capabilities externalities rather than, say, "pursue particular general capabilities if you expect that it will help reduce risk down the line," though perhaps I'm just particularly risk-averse. Without putting substantial effort in finding out how to avoid externalities, the differentiation between safety and capabilities at many places is highly eroded, and in consequence some alignment teams are substantially hastening timelines. For example, an alignment team's InstructGPT efforts were instrumental in making ChatGPT arrive far earlier than it would have otherwise, which is causing Google to become substantially more competitive in AI and causing many billions to suddenly flow into different AGI efforts. This is decisively hastening the onset of x-risks. I think minimal externalities may be a standard that is not always met, but I think it should be more strongly incentivized.

I agree that some forms of robustness research don't have capabilities externalities, but the unreliability of ML systems is a major blocker to many applications. So any robustness work that actually improves the robustness of practical ML systems is going to have "capabilities externalities" in the sense of making ML products more valuable.

I disagree even more strongly with "honesty efforts don't have externalities:" AI systems confidently saying false statements is a major roadblock to lots of applications (e.g. any kind of deployment by Google), so this seems huge from a commercial perspective.

I agree that interpretability doesn't always have big capabilities externalities, but it's often far from zero. Work that sheds meaningful light on what models are actually doing internally seems particularly likely to have such externalities.

In general I think "solve problems that actually exist" is a big part of how the ML community is likely to make progress, and many kinds of safety progress will be addressing problems that people care about today and hence have this kind of capabilities externality.

The safety-capabilities ratio is the criterion of rightness

I think 1 unit of safety and 0 units of capabilities is worse than 10 units of safety and 1 unit of capabilities (where a unit is something taking similar labor to uncover). I think it's more like (safety progress) - X * (timelines acceleration) for some tradeoff rate X.
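As a toy version of that tradeoff (the two options are the ones just described; the value function and the X values are purely illustrative):

```python
# Toy back-of-the-envelope for "value = safety progress - X * timelines acceleration".
# Option A: 1 unit of safety, 0 units of acceleration; Option B: 10 units and 1 unit.
def value(safety: float, acceleration: float, x: float) -> float:
    return safety - x * acceleration

option_a = (1.0, 0.0)
option_b = (10.0, 1.0)

for x in (1.0, 5.0, 9.0, 20.0):
    a, b = value(*option_a, x), value(*option_b, x)
    print(f"X = {x:>4}: A = {a:5.1f}, B = {b:5.1f}, better = {'B' if b > a else 'A (or tie)'}")
# B beats A exactly when 10 - X > 1, i.e. when the tradeoff rate X is below 9;
# a pure safety-capabilities *ratio* would always prefer A (its ratio is infinite).
```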

Ultimately this seems like it should be a quantitative discussion, and so far from the safety community I'm just not seeing reasonable-looking back-of-the-envelope calculations (botecs) supporting an emphasis on capabilities externalities. I'm not a very ends-justify-the-means kind of person, but this seems like an application of deontology in a case where most of the good arguments for deontology don't apply.

(It also feels like people are using "capabilities" to just mean "anything that makes AI more valuable in the short term," which I think is a really fuzzy definition for which this argument is particularly inappropriate.)

rather than, say, "pursue particular general capabilities if you expect that it will help reduce risk down the line,"

I'm significantly more sympathetic to the argument that you shouldn't scale up faster to bigger models (or improve capabilities in other ways) in order to be able to study safety issues sooner. Instead you should focus on the research that works well at the current scale, try to design experiments to detect problems faster, and prepare to do work on more capable models as they become available.

I think this is complicated and isn't close to a slam dunk, but it's my best guess and e.g. I find the countervailing arguments from OpenAI and Anthropic unpersuasive.

This is pretty similar to your line about avoiding capability externalities, but I'm more sympathetic to this version because:

  • The size of the "capability externality" there is significantly higher, since you are deliberately focusing on improving capabilities.
  • There's a reasonable argument that you could just wait to do work that requires higher-capability models later. Most of the apparent value of doing that work today comes from picking low hanging fruit that can just as well be picked tomorrow (it's still good to do earlier, but there is a systematic incentive to overvalue being the first person to pick the low hanging fruit). This contrasts with the apparent proposal of just indefinitely avoiding work with capabilities externalities, which would mean you never pick the low hanging fruit in many areas.

For example, an alignment team's InstructGPT efforts were instrumental in making ChatGPT arrive far earlier than it would have otherwise, which is causing Google to become substantially more competitive in AI and causing many billions to suddenly flow into different AGI efforts.

I think ChatGPT generated a lot of hype because of the way it was released (available to anyone in the public at OpenAI's expense). I think Anthropic's approach is a reasonable model for good behavior here---they trained an extremely similar conversational agent a long time ago, and continued to use it as a vehicle for doing research without generating ~any buzz at all as far as I can tell. That was a deliberate decision, despite the fact that they believed demonstrations would be great for their ability to raise money.

I think you should push back on the step where the lab deliberately generates hype to advantage themselves, not on the step where safety research helps make products more viable. The latter just doesn't slow down progress very much, and comes with big costs. (It's not obvious OpenAI deliberately generated hype here, but I think it is a sufficiently probable outcome that it's clear they weren't stressing about it much.)

In practice I think if 10% of researchers are focused on safety, and none of them worry at all about capabilities externalities, you should expect them to accelerate overall progress by <1%. Even an extra 10% of people doing capabilities work probably only speeds it up by say 3% (given crowding effects and the importance of compute), and safety work will have an even smaller effect since it's not trying to speed things up. Obviously that's just a rough a priori argument and you should look at the facts on the ground in any given case, but I feel like people discussing this in the safety community often don't know the facts on the ground and have a tendency to overestimate the importance of research (and other actions) from this community.
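A minimal sketch of that rough a-priori calculation (the 10% and 3% figures come from the paragraph above; the "incidental fraction" of safety work that accelerates capabilities at all is an assumed, purely illustrative number):

```python
# Rough a-priori sketch: labor has strongly diminishing returns on overall progress
# (compute, crowding), and safety researchers aren't even trying to accelerate.
extra_capabilities_labor = 0.10   # +10% of researchers doing capabilities work...
labor_to_speed_elasticity = 0.3   # ...only speeds overall progress by ~3%
print(f"capabilities case: ~{extra_capabilities_labor * labor_to_speed_elasticity:.1%}")

safety_labor = 0.10               # 10% of researchers focused on safety
incidental_fraction = 0.25        # assumed: fraction of safety output that accelerates capabilities at all
print(f"safety case:       ~{safety_labor * labor_to_speed_elasticity * incidental_fraction:.1%}")
# ~3.0% vs ~0.8% -- consistent with the "<1%" acceleration claim above.
```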

Sorry, I am just now seeing this since I'm on here irregularly.

So any robustness work that actually improves the robustness of practical ML systems is going to have "capabilities externalities" in the sense of making ML products more valuable.
 

Yes, though I do not equate general capabilities with making something more valuable. As written elsewhere,

It’s worth noting that safety is commercially valuable: systems viewed as safe are more likely to be deployed. As a result, even improving safety without improving capabilities could hasten the onset of x-risks. However, this is a very small effect compared with the effect of directly working on capabilities. In addition, hypersensitivity to any onset of x-risk proves too much. One could claim that any discussion of x-risk at all draws more attention to AI, which could hasten AI investment and the onset of x-risks. While this may be true, it is not a good reason to give up on safety or keep it known to only a select few. We should be precautious but not self-defeating.

I'm discussing "general capabilities externalities" rather than "any bad externality," especially since the former is measurable and a dominant factor in AI development. (Identifying any sort of externality can lead people to say we should defund various useful safety efforts because it can lead to a "false sense of security," which safety engineering reminds us is not the right policy in any industry.)

I disagree even more strongly with "honesty efforts don't have externalities:" AI systems confidently saying false statements is a major roadblock to lots of applications (e.g. any kind of deployment by Google), so this seems huge from a commercial perspective.

I distinguish between honesty and truthfulness; I think truthfulness has way too many externalities since it is too broad. For example, I think Collin et al.'s recent paper, an honesty paper, does not have general capabilities externalities. As written elsewhere,

Encouraging models to be truthful, when defined as not asserting a lie, may be desired to ensure that models do not willfully mislead their users. However, this may increase capabilities, since it encourages models to have better understanding of the world. In fact, maximally truth-seeking models would be more than fact-checking bots; they would be general research bots, which would likely be used for capabilities research. Truthfulness roughly combines three different goals: accuracy (having correct beliefs about the world), calibration (reporting beliefs with appropriate confidence levels), and honesty (reporting beliefs as they are internally represented). Calibration and honesty are safety goals, while accuracy is clearly a capability goal. This example demonstrates that in some cases, less pure safety goals such as truth can be decomposed into goals that are more safety-relevant and those that are more capabilities-relevant.

 

I agree that interpretability doesn't always have big capabilities externalities, but it's often far from zero.


To clarify, I cannot name a time a state-of-the-art model drew its accuracy-improving advancement from interpretability research. I think it hasn't had a measurable performance impact, and anecdotally, empirical researchers aren't gaining insights from that body of work which translate to accuracy improvements. It looks like a reliably beneficial research area.

It also feels like people are using "capabilities" to just mean "anything that makes AI more valuable in the short term,"

I'm taking "general capabilities" to be something like

general prediction, classification, state estimation, efficiency, scalability, generation, data compression, executing clear instructions, helpfulness, informativeness, reasoning, planning, researching, optimization, (self-)supervised learning, sequential decision making, recursive self-improvement, open-ended goals, models accessing the Internet, ...

These are extremely general instrumentally useful capabilities that improve intelligence. (Distinguish from models that are more honest, power averse, transparent, etc.) For example, ImageNet accuracy is the main general capabilities notion in vision, because it's extremely correlated with downstream performance on so many things. Meanwhile, an improvement for adversarial robustness harms ImageNet accuracy and just improves adversarial robustness measures. If it so happened that adversarial robustness research became the best way to drive up ImageNet accuracy, then the capabilities community would flood in and work on it, and safety people should then instead work on other things.

Consequently what counts as safety should be informed by how the empirical results are looking, especially since empirical phenomena can be so unintuitive or hard to predict in deep learning.

I agree with your other points, but on this one:

In practice I think if 10% of researchers are focused on safety, and none of them worry at all about capabilities externalities, you should expect them to accelerate overall progress by <1%.

It looks to me that some of the highest value ideas come from safety folk. On my model there are some key things that are unusually concentrated among people concerned with AI safety, like any ability to actually visualize AGI, and to seek system designs more interesting than "stack more layers".

Your early work on human feedback, extrapolated forward by others, seems like a prime example here, at least of a design idea that took off and is looking quite relevant to capabilities progress? And it continues to mostly be pushed forward by safety folk afaict.

I anticipate that the mechanistic interpretability folk may become another example of this, by inspiring and enabling other researchers to invent better architectures (e.g. https://arxiv.org/abs/2212.14052).

Maybe the RL with world models stuff (https://worldmodels.github.io/) is a counterexample, in which non-"safety" folk are trying successfully to push the envelope in a non-standard way. I think they might be in our orbit though.

I agree that safety people have lots of ideas more interesting than stack more layers, but they mostly seem irrelevant to progress. People working in AI capabilities also have plenty of such ideas, and one of the most surprising and persistent inefficiencies of the field is how consistently it overweights clever ideas relative to just spending the money to stack more layers. (I think this is largely down to sociological and institutional factors.)

Indeed, to the extent that AI safety people have plausibly accelerated AI capabilities I think it's almost entirely by correcting that inefficiency faster than might have happened otherwise, especially via OpenAI's training of GPT-3. But this isn't a case of safety people incidentally benefiting capabilities as a byproduct of their work, it was a case of some people who care about safety deliberately doing something they thought would be a big capabilities advance. I think those are much more plausible as a source of acceleration!

(I would describe RLHF as pretty prototypical: "Don't be clever, just stack layers and optimize the thing you care about." I feel like people on LW are being overly mystical about it.)

tbc, I don't feel very concerned by safety-focused folk who are off working on their own ideas. I think the more damaging things are (1) trying to garner prestige with leading labs and the AI field by trying to make transformative ideas work (which I think is a large factor in ongoing RLHF efforts?); and (2) trying to "wake up" the AI field into a state of doing much more varied stuff than "stack layers".

Much more importantly, I think RLHF has backfired in general, due to breaking myopia and making models have non-causal decision theories, and only one of these is necessary to make this alignment scheme net negative.

small downvote, strong agree: imo this is an awkward phrasing but I agree that RLHF has backfired. Not sure I agree with the reasons why.

making them have non-causal decision theories

How does it distinctly do that?

It's from the post: Discovering Language Model Behaviors with Model-Written Evaluations, where they have this to say about it:

non-CDT-style reasoning (e.g. one-boxing on Newcomb's problem).

Basically, the AI is intending to one-box on Newcomb's problem, which is a sure sign of non-causal decision theories, since causal decision theory chooses to two-box on Newcomb's problem.

Link below:

https://www.lesswrong.com/posts/yRAo2KEGWenKYZG9K/discovering-language-model-behaviors-with-model-written
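For readers without the setup cached: in Newcomb's problem, a highly accurate predictor puts $1M in an opaque box only if it predicts you will take that box alone, while a transparent box always holds $1k. A quick expected-value sketch (the 99% predictor accuracy is an illustrative assumption) shows why one-boxing is the signature of non-causal reasoning:

```python
# Standard Newcomb payoffs; the 0.99 predictor accuracy is an illustrative assumption.
ACC, BIG, SMALL = 0.99, 1_000_000, 1_000

# Treating your choice as evidence about the prediction (EDT/FDT-style):
ev_one_box = ACC * BIG                    # 990,000
ev_two_box = (1 - ACC) * BIG + SMALL      #  11,000 -> one-boxing looks far better
print(ev_one_box, ev_two_box)

# CDT-style: the boxes are already filled, so condition on each fixed state instead.
for big_is_in_the_box in (True, False):
    one_box = BIG if big_is_in_the_box else 0
    two_box = one_box + SMALL             # two-boxing gains $1,000 in either fixed state
    print(big_is_in_the_box, one_box, two_box)
# Hence CDT two-boxes, and a model that one-boxes is exhibiting non-CDT reasoning.
```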

One-boxing on Newcomb's Problem is good news IMO. Why do you believe it's bad?

It basically comes down to the fact that agents using too smart decision theories like FDT or UDT can fundamentally be deceptively aligned, even if myopia is retained by default.

That's the problem with one-boxing in Newcomb's problem, because it implies that our GPTs could very well become deceptively aligned.

Link below:

https://www.lesswrong.com/posts/LCLBnmwdxkkz5fNvH/open-problems-with-myopia

The LCDT decision theory does prevent deception, assuming it's implemented correctly.

Link below:

https://www.lesswrong.com/posts/Y76durQHrfqwgwM5o/lcdt-a-myopic-decision-theory

We had the model for ChatGPT in the API for I don't know 10 months or something before we made ChatGPT. And I sort of thought someone was going to just build it or whatever and that enough people had played around with it.

 

I assume he's talking about text-davinci-002, a GPT-3.5 model supervised-finetuned on InstructGPT data. And he was expecting someone to finetune it on dialog data with OpenAI's API. I wonder how that would have compared to ChatGPT, which was finetuned with RL and can't be replicated through the API.

You can't finetune GPT-3.5 through the API, just GPT-3
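For context, fine-tuning a base GPT-3 model on dialog data through the API at the time looked roughly like the sketch below (reconstructed from memory of the 2022-era openai Python package; the file name and examples are placeholders, and details of the calls may be off):

```python
# Sketch only: 2022-era OpenAI fine-tuning expected prompt/completion pairs in a
# JSONL file and supported base models like "davinci" -- not the 3.5-series models,
# which is the limitation pointed out in the comment above.
import json
import openai  # the pre-1.0 "openai" package is assumed; API key read from OPENAI_API_KEY

dialog_examples = [
    {"prompt": "User: How do I reverse a list in Python?\nAssistant:",
     "completion": " Use my_list[::-1] or my_list.reverse().\n"},
    # ...more (prompt, completion) pairs distilled from conversations
]

with open("dialog_finetune.jsonl", "w") as f:
    for example in dialog_examples:
        f.write(json.dumps(example) + "\n")

uploaded = openai.File.create(file=open("dialog_finetune.jsonl", "rb"), purpose="fine-tune")
job = openai.FineTune.create(training_file=uploaded.id, model="davinci")
print(job.id)
```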

Thank you for sharing! I found these two quotes to be the most interesting (bolding added by me):

Yeah that was my earlier point, I think society should regulate what the wide bounds are, but then I think individual users should have a huge amount of liberty to decide how they want their experience to go. So I think it is like a combination of society -- you know there are a few asterisks on the free speech rules -- and society has decided free speech is not quite absolute. I think society will also decide language models are not quite absolute. But there is a lot of speech that is legal that you find distasteful, that I find distasteful, that he finds distasteful, and we all probably have somewhat different definitions of that, and I think it is very important that that is left to the responsibility of individual users and groups. Not one company. And that the government, they govern, and not dictate all of the rules.

And the bad case -- and I think this is important to say -- is like lights out for all of us. I'm more worried about an accidental misuse case in the short term where someone gets a super powerful -- it's not like the AI wakes up and decides to be evil. I think all of the traditional AI safety thinkers reveal a lot more about themselves than they mean to when they talk about what they think the AGI is going to be like. But I can see the accidental misuse case clearly and that's super bad. So I think it's like impossible to overstate the importance of AI safety and alignment work. I would like to see much much more happening. 

But I think it's more subtle than most people think. You hear a lot of people talk about AI capabilities and AI alignment as in orthogonal vectors. You're bad if you're a capabilities researcher and you're good if you're an alignment researcher. It actually sounds very reasonable, but they're almost the same thing. Deep learning is just gonna solve all of these problems and so far that's what the progress has been. And progress on capabilities is also what has let us make the systems safer and vice versa surprisingly. So I think none of the sort of sound-bite easy answers work.

But I think it's more subtle than most people think. You hear a lot of people talk about AI capabilities and AI alignment as in orthogonal vectors. You're bad if you're a capabilities researcher and you're good if you're an alignment researcher. It actually sounds very reasonable, but they're almost the same thing. Deep learning is just gonna solve all of these problems and so far that's what the progress has been. And progress on capabilities is also what has let us make the systems safer and vice versa surprisingly. So I think none of the sort of sound-bite easy answers work.

Pointing this out, since I don't agree with this nearly that much, IMO -- at least not strongly enough to say that capabilities and safety are the same thing. Also, I note a motivated-reasoning alert here, since this is what someone would write to make sure that their belief that their AI capabilities work is good gets reinforced, given that the inconvenient world where the Orthogonality Thesis and instrumental convergence are true would be personally disastrous for OpenAI.

It's great to see that Sam cares about AI safety, is willing to engage with the topic, and has clear, testable beliefs about it. Some paragraphs from the interview that I found interesting and relevant to AI safety:

"One of the things we really believe is that the most responsible way to put this out in society is very gradually and to get people, institutions, policy makers, get them familiar with it, thinking about the implications, feeling the technology, and getting a sense for what it can do and can't do very early. Rather than drop a super powerful AGI in the world all at once."

"The world I think we're heading to and the safest world, the one I most hope for, is the short timeline slow takeoff."

"I think there will be many systems in the world that have different settings of the values that they enforce. And really what I think -- and this will take longer -- is that you as a user should be able to write up a few pages of here's what I want here are my values here's how I want the AI to behave and it reads it and thinks about it and acts exactly how you want. Because it should be your AI and it should be there to serve you and do the things you believe in."

"multiple AGIs in the world I think is better than one."

"I think the best case is so unbelievably good that it's hard to -- it's like hard for me to even imagine."

"And the bad case -- and I think this is important to say -- is like lights out for all of us. I'm more worried about an accidental misuse case in the short term where someone gets a super powerful -- it's not like the AI wakes up and decides to be evil. I think all of the traditional AI safety thinkers reveal a lot more about themselves than they mean to when they talk about what they think the AGI is going to be like. But I can see the accidental misuse case clearly and that's super bad. So I think it's like impossible to overstate the importance of AI safety and alignment work. I would like to see much much more happening."

"But I think it's more subtle than most people think. You hear a lot of people talk about AI capabilities and AI alignment as in orthogonal vectors. You're bad if you're a capabilities researcher and you're good if you're an alignment researcher. It actually sounds very reasonable, but they're almost the same thing. Deep learning is just gonna solve all of these problems and so far that's what the progress has been. And progress on capabilities is also what has let us make the systems safer and vice versa surprisingly. So I think none of the sort of sound-bite easy answers work"

"I think the AGI safety stuff is really different, personally. And worthy of study as its own category. Because the stakes are so high and the irreversible situations are so easy to imagine we do need to somehow treat that differently and figure out a new set of safety processes and standards."

Here is my summary of Sam Altman's beliefs about AI and AI safety as a list of bullet points:

  • Sub-AGI models should be released soon and gradually increased in capability so that society can adapt and the models can be tested.
  • Many people believe that AI capabilities and AI safety are orthogonal vectors but they are actually highly correlated and this belief is confirmed by recent advances. Advances in AI capabilities advance safety and vice-versa.
  • To align AI it should be possible to write about our values and ask AGIs to read these instructions and behave according to them. Using this approach, we could have AIs tailored to each individual.
  • There should be multiple AGIs in the world with a diversity of different settings and values.
  • The upside of AI is extremely positive and potentially utopian. The worst-case scenarios are extremely negative and include scenarios involving human extinction.
  • Sam is more worried about accidents than the AI itself acting maliciously.

I agree that AI models should gradually increase in capabilities so that we can study their properties and think about how to make them safe.

Sam seems to believe that the orthogonality thesis is false in practice. For reference, here is the definition of the orthogonality thesis:

"Intelligence and final goals are orthogonal axes along which possible agents can freely vary. In other words, more or less any level of intelligence could in principle be combined with more or less any final goal."

The classic example is the superintelligent paperclip AI that only cares about paperclips. I think the idea is that capabilities and alignment are independent, so scaling capabilities will not, by itself, produce any more alignment. However, in practice, AI researchers are trying to scale both capabilities and alignment.

I think the orthogonality thesis is true but it doesn't seem very useful. I believe any combination of intelligence and goals is possible, but we also want to know which combinations are most likely. In other words, we want to know the strength of the correlation between capabilities and alignment, which may depend on the architecture used.
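
To make "strength of the correlation" concrete, here is a toy sketch of how one might estimate it from per-model benchmark scores. Every number below is a made-up placeholder, and a real analysis would need many models and better alignment metrics than currently exist:

    # Toy sketch: correlating capability scores with alignment-ish scores
    # across a set of models. All numbers are hypothetical placeholders.
    import numpy as np

    capability = np.array([44.0, 60.0, 70.0, 86.0])  # e.g. an MMLU-like benchmark, %
    alignment = np.array([28.0, 47.0, 59.0, 81.0])   # e.g. a TruthfulQA-like benchmark, %

    r = np.corrcoef(capability, alignment)[0, 1]
    print(f"Pearson correlation: {r:.2f}")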

I think recent events have actually shown that capabilities and alignment are not necessarily correlated. For example, GPT-3 was powerful but not particularly aligned and OpenAI had to come up with new and different methods such as RLHF to make it more aligned.

Sam seems to believe that the value loading problem will be easy and that we can simply ask the AI for what we want. I think it will become easier to create AIs that can understand our values. But whether there is any correlation between understanding and caring seems like a different and more important question. Future AIs could be like an empathetic person who understands and cares about what we want or they could use this understanding to manipulate and defeat humanity.
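
As an aside, the naive version of "write up your values and the AI reads them" is already expressible as plain prompting. The sketch below (file name, values text, and request are all hypothetical) shows why this only tests whether the model can follow a stated description, not whether it cares about it:

    # Sketch: the "write a few pages of your values and the AI reads them"
    # idea, implemented naively as prompting. Everything here is a placeholder.
    import openai

    openai.api_key = "sk-..."  # placeholder

    values_doc = open("my_values.txt").read()  # a few pages written by the user

    prompt = (
        "Here is a description of my values and how I want you to behave:\n"
        f"{values_doc}\n\n"
        "User request: Help me plan my week.\n"
        "Assistant:"
    )

    response = openai.Completion.create(
        model="text-davinci-003",
        prompt=prompt,
        max_tokens=300,
    )
    print(response.choices[0].text)

At best this checks that the model can act consistently with the description while being observed; it says nothing about whether the model's own objectives match it, which is exactly the understanding-versus-caring distinction above.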

I'm skeptical about the idea that multiple AGIs would be desirable. According to the book Superintelligence, race dynamics and competition would tend to be worse with more actors building AGI. Actions such as temporarily slowing down global AI development would be more difficult to coordinate with more teams building AGI.

I agree with the idea that the set of future AGI possibilities includes a wide range of possible futures including very positive and negative outcomes.

In the short term, I'm more worried about accidents and malicious actors but when AI exceeds human intelligence, it seems like the source of most of the danger will be the AI's own values and decision-making.

From these points, these are the beliefs I'm most uncertain about:

  • The strength of the correlation, if any, between capabilities and alignment in modern deep learning systems.
  • Whether we can program AI models to understand and care about (adopt) our values simply by prompting them with a description of our values.
  • Whether one or many AGIs would be most desirable.
  • Whether we should be more worried about AI accidents and misuse or the AI itself being a dangerous agent.

Sam Altman: "multiple AGIs in the world I think is better than one".  Strongly disagree.  if there is a finite probability than an AGI decides to capriciously/whimsically/carelessly end humanity (and many technological modalities by which it can) then each additional independent instance multiplies that probability to an end point where it near certain.

One of the main counterarguments here is that the existence of multiple AGIs allows them to compete with one another in ways that could benefit humanity. E.g. policing one another to ensure alignment of the AGI community with human interests. Of course, whether this actually would outweigh your concern in practice is highly uncertain and depends on a lot of implementation details. 

What if we aim to insert some innate code into the AI before it becomes AGI, so that when it generates an idea that goes "against the objective benefit of the long-term human race", an automatic mechanism forces it to return to pre-sapient levels?

Because we don't know how to insert such code. We don't know how to get the AI to evaluate what the long-term objective benefit of the human race happens to be.
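
To spell out why, here is a hypothetical skeleton of the proposed tripwire; the function names are made up, and the unimplemented predicate in the middle is precisely the part nobody knows how to write:

    # Skeleton of the proposed "revert to pre-sapient levels" tripwire.
    # The structure is trivial; the core predicate is the entire unsolved problem.
    def is_against_long_term_human_benefit(plan) -> bool:
        # Evaluating whether an arbitrary plan harms humanity's long-term
        # interests is essentially the alignment problem itself.
        raise NotImplementedError

    def maybe_revert(agent, plan):
        if is_against_long_term_human_benefit(plan):
            agent.revert_to_pre_sapient_checkpoint()  # hypothetical mechanism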

Write the code and we'll add it. It's time to just sit down and do the impossible; can't walk away...