Podcast Transcript: Daniela and Dario Amodei on Anthropic

The last transcript we posted on Eliezer's podcast got a lot of traction, so I'm following up with another conversation we find very important. Dario and Daniela of Anthropic have not talked much publicly about their views on alignment, but they did a long podcast with FLI in March 2022.

Highlights

Dario on the type of research bet Anthropic is making:

And so a year ago, we were all working at OpenAI and trying to make this focused bet on basically scaling plus safety, or safety with a lens towards scaling being a big part of the path to AGI. And we felt we were making this focused bet within a larger organization, and we just eventually came to the conclusion that it would be great to have an organization that, top to bottom, was just focused on this bet and could make all its strategic decisions with this bet in mind. And so that was the thinking and the genesis.

Daniela following up on Anthropic's research bet:

I think also when I think about the way our teams are structured, we have capabilities as this central pillar of research and there's this helix of safety research that wraps around every project that we work on. So to give an example, if we're doing language model training, that's like this central pillar, and then we have interpretability research, which is trying to see inside models and understand what's happening with the language models under the hood. We're doing alignment research with input from human feedback to try and improve the outputs of the model. We're doing societal impacts research. That's looking at what impact on society in sort of a short and medium term way do these language models have? We're doing scaling laws research to try and predict empirically what properties are we going to see emerge in these language models at various sizes? But I think all together, that ends up looking like a team of people that are working together on a combination of capability and scaling work with safety research.

Dario on their empirical approach to research:

But it's hard to work on that today. I'm not even sure how much value there is in talking about that a lot today. So we've taken a very different tack, which is look, there actually... And I think this has started to be true in the last couple years and maybe wasn't even true five years ago, that there are models today that have, at least some, not all of the properties of models that we would be worried about in the future and are causing very concrete problems today that affect people today. So can we take a strategy where we develop methods that both help with the problems of today, but do so in a way that could generalize or at least teach us about the problems of the future? So our eye is definitely on these things in the future. But I think that if not grounded in empirics in the problems of today, it can drift off in a direction that isn't very productive.

Dario on scaling as a central ingredient to safety:

So for example, if you start to say, well, you make this big language model and then you use RL with interaction with humans, to fine tune it on doing a million different tasks and following human instructions, then you're starting to get to something that has more agency, that you can point it in different directions, you can align it. You can also add multi-modality, where the agent can interact with different modalities, or add the ability to use various external tools to interact with the world and the internet. But within each of these, you're going to want to scale, and within each setup, the bigger you make the model, the better it's going to be at that thing.
So in a way, the rocket fuel analogy makes sense. Actually, the thing you should most worry about with rockets is propulsion. You need a big enough engine and you need enough rocket fuel to make the rocket go. That's the central thing. But of course, yes, you also need guidance systems, you also need all kinds of things. You can't just take a big vat of rocket fuel and an engine and put them on a launchpad and expect it to all work. You need to actually build the full rocket. And safety itself makes that point, that to some extent, if you don't do even the simplest safety stuff, then models don't even do the task that's intended for them in the simplest way. And then there's many more subtle safety problems.
But in a way, the rocket analogy is good, but it's I think more a pro scaling point than an anti scaling point because it says that scaling is an ingredient, perhaps a central ingredient in everything. Even though it isn't the only ingredient, if you're missing ingredients, you won't get where you're going, but when you add all the right ingredients, then that itself needs to be massively scaled. So that would be the perspective.

Dario on their model of safety engineering:

What you want is you want a method that catches 90% of the bad things and then you want an orthogonal method that catches a completely uncorrelated 90% of the bad things. And then only 1% of things go through both filters, right, if the two are uncorrelated. It's only the 10% of the 10% that gets through. And so the more of these orthogonal views you have, the more you can drive down the probability of failure [...] And so independent views on the problem are very important. And so our different directions like reward modeling, reward modeling interpretability, trying to characterize models, adversarial training. I think the whole goal of that is to get down the probability of failure and have different views of the problem.
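
Editor's note: the arithmetic in this passage can be sketched in a few lines of Python. This is only an illustration and not something from the podcast. The function name `miss_probability` is made up here, and the sketch relies on the assumption Dario states, namely that the methods fail independently of one another.

```python
def miss_probability(catch_rates):
    """Chance that a bad event slips past every filter.

    Assumes the filters fail independently (the "orthogonal" case
    described in the passage), so the miss rates simply multiply.
    """
    p_miss = 1.0
    for rate in catch_rates:
        p_miss *= (1.0 - rate)
    return p_miss


# One, two, three, and four independent methods that each catch 90%.
for n in range(1, 5):
    print(n, "filters ->", miss_probability([0.9] * n))
```

With two such filters the miss rate is about 1%, matching "the 10% of the 10%". If the filters' failures are correlated, the simple product no longer applies and the real miss rate can be higher, which is why the passage stresses uncorrelated views.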

Dario on the necessity of building large models:

I think it's definitely a balance. As both of us said, you need these large models to... You basically need to have these large models in order to study these questions in the way that we want to study them, so we should be building large models. I think we shouldn't be racing ahead or trying to build models that are way bigger than other orgs are building them. And we shouldn't, I think, be trying to ramp up excitement or hype about giant models or the latest advances. But we should build the things that we need to do the safety work and we should try to do the safety work as well as we can on top of models that are reasonably close to state of the art. And we should be a player in the space that sets a good example and we should encourage other players in the space to also set good examples, and we should all work together to try and set positive norms for the field.

Dario on sudden emergence of capabilities:

Yeah. I think one point Daniela made that's really important is this sudden emergence or change. So it's a really interesting phenomenon where work we've done, like our early employees have done, on scaling laws shows that when you make these models bigger, if you look at the loss, the ability to predict the next word or the next token across all the topics the model could go on, it's very smooth. I double the size of the model, loss goes down by 0.1 units. I double it again, the loss goes down by 0.1 units. So that would suggest that everything's scaling smoothly. But then within that, you often see these things where a model gets to a certain size and a five billion parameter model, you ask it to add two three-digit numbers. Nothing, always gets it wrong. A hundred billion parameter model, you ask it to add two three-digit numbers, gets it right, like 70 or 80% of the time.
And so you get this coexistence of smooth scaling with the emergence of these capabilities very suddenly. And that's interesting to us because it seems very analogous to worries that people have of like, "Hey, as these models approach human level, could something change really fast?" And this actually gives you one model. [...] Maybe it's not, but that's one model of the situation and what we want to do is keep building up models of the situation so that when we get to the actual situation, it's more likely to look like something we've seen before and then we have a bunch of cached ideas for how to handle it. So that would be an example. You scale models up, you can see fast change, and then that might be somewhat analogous to the fast change that you see in the future.

Daniela on the necessity of regulation:

I think a thing I think about a lot is that a few jobs ago I worked at Stripe. It was a tech startup then, and even at a very small size. I joined when it was not that much bigger than Anthropic is now. I was so painfully aware every day of just how many checks and balances there were on the company, because we were operating in this highly regulated space of financial services. And financial services, it's important that's highly regulated, but it kind of blows my mind that AI, given the potential reach that it could have, is still such a largely unregulated area. Right? If you are an actor who doesn't want to advance race dynamics, or who wants to do the right thing from a safety perspective, there's no clear guidelines around how to do that now, right. It's all sort of, every lab is kind of figuring that out on its own. And I think something I'm hoping to see in the next few years, and I think we will see, is something closer to, in other industries these look like standard setting organizations or industry groups or trade associations that say this is what a safe model looks like, or this is how we might want to move some of our systems towards being safer.

Daniela and Dario on the importance of having good security engineers:

Dario Amodei: Every industrial lab should make sure their models are not stolen by bad actors.
Daniela Amodei: This is a unanimous kind of thing across all labs. There's something everyone really agrees on in industry and outside of industry, which is that security is really important. And so, if you are interested in security or you have a security background, we would definitely love to hear from you, or I'm sure our friends at other industry labs and non-industry labs would also love to hear from you.

Transcript

Lucas Perry: Welcome to the Future of Life Institute Podcast. I'm Lucas Perry. Today's episode is with Daniela and Dario Amodei of Anthropic. For those not familiar, Anthropic is a new AI safety and research company that's working to build reliable, interpretable, and steerable AI systems. Their view is that large, general AI systems of today can have significant benefits, but can also be unpredictable, unreliable, and opaque. Their goal is to make progress on these issues through research, and, down the road, create value commercially and for public benefit. Daniela and Dario join us to discuss the mission of Anthropic, their perspective on AI safety, their research strategy, as well as what it's like to work there and the positions they're currently hiring for. Daniela Amodei is Co-Founder and President of Anthropic. She was previously at Stripe and OpenAI, and has also served as a congressional staffer. Dario Amodei is CEO and Co-Founder of Anthropic. He was previously at OpenAI, Google, and Baidu. Dario holds a PhD in (Bio)physics from Princeton University.

Before we jump into the interview, we have a few announcements. If you've tuned into any of the previous two episodes, you can skip ahead just a bit. The first announcement is that I will be moving on from my role as Host of the FLI Podcast, and this means two things. The first is that FLI is hiring for a new host for the podcast. As host, you would be responsible for the guest selection, interviews, production, and publication of the FLI Podcast. If you’re interested in applying for this position, you can head over to the careers tab at futureoflife.org for more information. We also have another 4 job openings currently for a Human Resources Manager, an Editorial Manager, an EU Policy Analyst, and an Operations Specialist. You can learn more about those at the careers tab as well.

The second item is that even though I will no longer be the host of the FLI Podcast, I won’t be disappearing from the podcasting space. I’m starting a brand new podcast focused on exploring questions around wisdom, philosophy, science, and technology, where you’ll see some of the same themes we explore here like existential risk and AI alignment. I’ll have more details about my new podcast soon. If you’d like to stay up to date, you can follow me on Twitter at LucasFMPerry, link in the description.

And with that, I'm happy to present this interview with Daniela and Dario Amodei on Anthropic.

It's really wonderful to have you guys here on the podcast. I'm super excited to be learning all about Anthropic. So we can start off here with a pretty simple question, and so what was the intention behind forming Anthropic?

Daniela Amodei: Yeah. Cool. Well, first of all Lucas, thanks so much for having us on the show. We've been really looking forward to it. We're super pumped to be here. So I guess maybe I'll kind of start with this one. So just why did we start Anthropic? To give a little history here and set the stage, we were founded about a year ago at the beginning of 2021, and it was originally a team of seven people who moved over together from OpenAI. And for listeners or viewers who don't very viscerally remember this time period, it was the middle of the pandemic, so most people were not eligible to be vaccinated yet. And so when all of us wanted to get together and talk about anything, we had to get together in someone's backyard or outdoors and be six feet apart and wear masks. And so it was generally just a really interesting time to be starting a company.

But why did we found Anthropic? What was the thinking there? I think the best way I would describe it is because all of us wanted the opportunity to make a focused research bet with a small set of people who were highly aligned around a very coherent vision of AI research and AI safety. So the majority of our employees had worked together in one format or another in the past, so I think our team is known for work like GPT-3, or DeepDream, which Chris Olah worked on at Google Brain, or for scaling laws. But we'd also done a lot of different safety research together in different organizations as well. So multimodal neurons when we were at OpenAI, Concrete Problems in AI Safety and a lot of others, but this group had worked together in different companies, at Google Brain and OpenAI, in academia, and in startups previously, and we really just wanted the opportunity to get that group together to do this focused research bet of building steerable, interpretable and reliable AI systems with humans at the center of them.

Dario Amodei: Yeah, just to add a little bit to that, I think we're all a bunch of fairly empirically minded, exploration driven people, but who also think and care a lot about AI safety. I think that characterizes all seven of us. Some of us worked together at OpenAI, some worked together at Google Brain in the past, and many of us worked together in the physics community and are current or former physicists. If you add all that together, it's a set of people who have known each other for a long time and have been aware of thinking and arguments about AI safety and have worked on them over the years always with an empirical bent, ranging from interpretability on language models and vision models to working on the original RL from Human Preferences, Concrete Problems in AI Safety, and also characterizing scaling and how scaling works and how we think of that as somewhat central to the way AI is going to progress and shapes the landscape for how to solve safety.

And so a year ago, we were all working at OpenAI and trying to make this focused bet on basically scaling plus safety, or safety with a lens towards scaling being a big part of the path to AGI. And we felt we were making this focused bet within a larger organization, and we just eventually came to the conclusion that it would be great to have an organization that, top to bottom, was just focused on this bet and could make all its strategic decisions with this bet in mind. And so that was the thinking and the genesis.

Lucas Perry: Yeah. I really like that idea of a focused bet. I hadn't heard that before. I like that. Do you all have a similar philosophy in terms of your background, since you're all converging on this work of safely scaling to AGI?

Dario Amodei: I think in a broad sense, we all have this view, safety is important today and for the future. We all have this view of, I don't know, I would say like pragmatic practicality, and empiricism. Let's see what we can do today to try and get a foothold on things that might happen in the future. Yeah, as I said, many of us have background in physics or other natural sciences. I'm a former... I was physics undergrad, neuroscience grad school, so yeah, we very much have this empirical science mindset, more than maybe a more philosophy or theoretical approach. Within that, obviously all of us, if you include the seven initial folks as well as the employees who joined, have our own skills and our own perspective on things and have different things within that we're excited about. So we're not all clones of the same person.

Some of us are excited about interpretability, some of us are excited about reward learning and preference modeling, some of us are excited about the policy aspects. And we each have our own guesses about the sub path within this broad path that makes sense. But I think we all agree on this broad view. Scaling's important, safety's important, getting a foothold on problems today is important as a window on future.

Lucas Perry: Okay. And so this shared vision that you all have is around this focused research bet. Could you tell me a little bit more about what that bet is?

Daniela Amodei: Yeah. Maybe I'll start here, and Dario feel free to jump in and add more, but I think the boiler plate vision or mission that you would see if you looked on our website is that we're building steerable, interpretable and reliable AI systems. But I think what that looks like in practice is that we are training large scale generative models, and we're doing safety research on those models. And the reason that we're doing that is we want to make the models safer and more aligned with human values. I think the alignment paper, which you might have seen that came out recently, there's a term there that we've been using a lot, which is we're aiming to make systems that are helpful, honest and harmless.

I think also when I think about the way our teams are structured, we have capabilities as this central pillar of research and there's this helix of safety research that wraps around every project that we work on. So to give an example, if we're doing language model training, that's like this central pillar, and then we have interpretability research, which is trying to see inside models and understand what's happening with the language models under the hood. We're doing alignment research with input from human feedback to try and improve the outputs of the model. We're doing societal impacts research. That's looking at what impact on society in sort of a short and medium term way do these language models have? We're doing scaling laws research to try and predict empirically what properties are we going to see emerge in these language models at various sizes? But I think all together, that ends up looking like a team of people that are working together on a combination of capability and scaling work with safety research.

Dario Amodei: Yeah. I mean, one way you might put it is there are a lot of things that an org does that are neutral as to the direction that you would take. You have to build a cluster and you have to have an HR operation and you have to have an office. And so you can even think of the large models as being a bit like the cluster, that you build these large models and they're blank when you start off with them and probably unaligned, but it's what you do on top of these models that matters, that takes you in a safe or not safe direction in a good or a bad direction.

And so in a way, although they're ML and although we'll continue to scale them up, you can think of them as almost part of infrastructure. It takes research and it takes algorithms to get them right, but you can think of them as this core part of the infrastructure. And then the interesting question is all the safety questions. What's going on inside these models? How do they operate? How can we make them operate differently? How can we change their objective functions to be something that we want rather than something that we don't want? How can we look at their applications and make those applications more likely to be positive and less likely to be negative, more likely to go in directions that people intend and less likely to go off in directions that people don't intend? So we almost see the presence of these large models as like the... I don't know what the analogy is, like the flour or the paste, like the background ingredient on which the things we really care about get built and prerequisite for building those things.

Lucas Perry: So does AI existential safety fit into these considerations around your safety and alignment work?

Dario Amodei: I think this is something we think about and part of the motivation for what we do. Probably most listeners of this podcast know what it is, but I think the most common form of the concern is, "Hey, look, we're making these AI systems. They're getting more and more powerful. At some point they'll be generally intelligent or more generally capable than human beings are and then they may have a large amount of agency. And if we haven't built them in such a way that agency is in line with what we want to do, then we could imagine them doing something really scary that we can't stop." So I think that, to take it even further, this could be some kind of threat to humanity.

So I mean, that's an argument with many steps, but it's one that, in a very broad sense and in the long term, seems at least potentially legitimate to us. I mean, this is like the argument seems at least like something that we should care about. But I think the big question, and maybe how we differ, although it might be subtly, from other orgs that think about these problems, is how do we actually approach that problem today? What can we do? So I think there are various efforts to think about the ways in which this might happen, to come up with theories or frameworks.

As I mentioned with the background that we have, we're more empirically focused people. We're more inclined to say, "We don't really know. That broad argument sounds kind of plausible to us and the stakes should be high, so you should think about it." But it's hard to work on that today. I'm not even sure how much value there is in talking about that a lot today. So we've taken a very different tack, which is look, there actually... And I think this has started to be true in the last couple years and maybe wasn't even true five years ago, that there are models today that have, at least some, not all of the properties of models that we would be worried about in the future and are causing very concrete problems today that affect people today. So can we take a strategy where we develop methods that both help with the problems of today, but do so in a way that could generalize or at least teach us about the problems of the future? So our eye is definitely on these things in the future. But I think that if not grounded in empirics in the problems of today, it can drift off in a direction that isn't very productive.

And so that's our general philosophy. I think the particular properties of the models are, look, today, we have models that are really open ended, and in some narrow ways are more capable than humans. I think large language models probably know more about cricket than me, because I don't know the first thing about cricket. And they are also unpredictable by their statistical nature. And I think those are at least some of the properties that we're worried about with future systems. So we can use today's models as a laboratory to scope out these problems a little better. My guess is that we don't understand them very well at all and that this is a way to learn.

Lucas Perry: Could you give some examples of some of these models that exist today that you think exhibit these properties?

Dario Amodei: So I think the most famous one would be generative language models. So there's a lot of them. There's most famously GPT-3 from OpenAI, which we helped build. There's Gopher from DeepMind. There's LaMDA from main Google. I'm probably leaving out some. I think there have been models of this size in China, South Korea, Israel. It seems like everyone has one nowadays. I don't think it's limited to language. There have also been models that are focused on code. We've seen that from DeepMind, OpenAI and some other players. And there have also been models with modified forms in the same spirit that model images, that generate images or that convert images to text or that convert text to images. There might be models in the future that generate videos or convert videos to text.

There's many modifications of it, but I think the general idea is big models, models with a lot of capacity and a lot of parameters trained on a lot of data that try to model some modality, whether that's text, code, images, video, transitions between the two or such. And I mean, I think these models are very open ended. You can say anything to them and they'll say anything back. They might not do a good job of it. They might say something horrible or biased or bad, but in theory, they're very general, and so you're never quite sure what they're going to say. You're never quite sure. You can talk to them about anything, any topic and they'll say something back that's often topical, even if sometimes it doesn't make sense or it might be bad from a societal perspective.

So yeah, it's this challenge of general open-ended models where you have this general thing that's fairly unaligned and difficult to control, and you'd like to understand it better so that you can predict it better and you'd like to be able to modify them in some way so that they behave in a more predictable way, and you can decrease the probability or even maybe even someday rule out the likelihood of them doing something bad.

Daniela Amodei: Yeah. I think Dario covered the majority of it. I think there's maybe potentially a hidden question in what you're asking, although maybe you'll ask this later. But why are we working on these larger scale models might be an implicit question in there. And I think to piggyback on some of the stuff that Dario said, I think part of what we're seeing and the potential shorter term impacts of some of the AI safety research that we do is that different sized models exhibit different safety issues. And so I think with using, again, language models, just building on what Dario was talking about, I think something we feel interested in, or interested to explore from this empirical safety question is just how they will, as their capabilities develop, how their safety problems develop as well.

There's this commonly cited example in safety world around language models, which is smaller language models show they might not necessarily deliver a coherent answer to a question that you ask, because maybe they don't know the answer or they get confused. But if you repeatedly ask this smaller model the same question, it might go off and incoherently spout things in one direction or another. Some of the larger models that we've seen, we basically think that they have figured out how to lie unintentionally. If you pose the same question to them differently, eventually you can get the lie pinned down, but they won't in other contexts.

So that's obviously just a very specific example, but I think there's quite a lot of behaviors emerging in generative models today that I think have the potential to be fairly alarming. And I think these are the types of questions that have an impact today, but could also be very important to have sorted out for the future and for long term safety as well. And I think that's not just around lying. I think you can apply that to all different safety concerns regardless of what they are, but that's the impetus behind why we're studying these larger models.

Dario Amodei: Yeah. I think one point Daniela made that's really important is this sudden emergence or change. So it's a really interesting phenomenon where work we've done, like our early employees have done, on scaling laws shows that when you make these models bigger, if you look at the loss, the ability to predict the next word or the next token across all the topics the model could go on, it's very smooth. I double the size of the model, loss goes down by 0.1 units. I double it again, the loss goes down by 0.1 units. So that would suggest that everything's scaling smoothly. But then within that, you often see these things where a model gets to a certain size and a five billion parameter model, you ask it to add two three-digit numbers. Nothing, always gets it wrong. A hundred billion parameter model, you ask it to add two three-digit numbers, gets it right, like 70 or 80% of the time.

And so you get this coexistence of smooth scaling with the emergence of these capabilities very suddenly. And that's interesting to us because it seems very analogous to worries that people have of like, "Hey, as these models approach human level, could something change really fast?" And this actually gives you one model. I don't know if it's the right one, but it gives you an analogy, like a laboratory that you can study of ways that models change very fast. And it's interesting how they do it because the fast change, it coexists. It hides beneath this very smooth change. And so I don't know, maybe that's what will happen with very powerful models as well.

Maybe it's not, but that's one model of the situation and what we want to do is keep building up models of the situation so that when we get to the actual situation, it's more likely to look like something we've seen before and then we have a bunch of cached ideas for how to handle it. So that would be an example. You scale models up, you can see fast change, and then that might be somewhat analogous to the fast change that you see in the future.
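
Editor's note: a rough worked version of the numbers in Dario's answer, taking his spoken rule of thumb (each doubling of model size lowers the loss by about 0.1 units) at face value. The function name `loss_drop` is made up for this sketch, and the rule is a simplification from the conversation rather than a fitted scaling law.

```python
import math


def loss_drop(n_small, n_large, drop_per_doubling=0.1):
    """Loss decrease implied by the 'fixed drop per doubling' rule of thumb.

    n_small and n_large are parameter counts. drop_per_doubling is the
    0.1 figure quoted in the conversation, used here as an assumption.
    """
    doublings = math.log2(n_large / n_small)
    return doublings * drop_per_doubling


# The two model sizes mentioned in the addition example.
doublings = math.log2(100e9 / 5e9)
print(f"doublings from 5B to 100B: {doublings:.2f}")
print(f"implied loss drop: {loss_drop(5e9, 100e9):.2f} units")
```

Under that rule, going from 5 billion to 100 billion parameters is a bit over four doublings, so the loss moves by well under one unit. Over the same range, the addition accuracy in his example goes from always wrong to 70 or 80% right, which is the contrast he is pointing at.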

Lucas Perry: What does it mean for a model to lie?

Daniela Amodei: Lying usually implies agency. If my husband comes home and says, "Hey, where did the cookies go?" And I say, "I don't know. I think I saw our son hanging out around the cookies and then now the cookies are gone, maybe he ate them," but I ate the cookies, that would be a lie. I think it implies intentionality, and I don't think we think, or maybe anyone thinks that language models have that intentionality. But what is interesting is that because of the way they're trained, they might be either legitimately confused or they might be choosing to obscure information. And so obscuring information, it's not a choice. They don't have intentionality, but for a model that can come across as very knowledgeable, as clear or as sometimes unknown to the human that's talking to it, intelligent in certain ways in sort of a narrow way, it can produce results that on the surface, it might look like it could be a credible answer, but it's really not a credible answer, and it might repeatedly try to convince you that is the answer.

It's hard to talk about this without using words that imply intentionality, but we don't think the models are intentionally doing this. But a model could repeatedly produce a result that looks like it's something that could be true, but isn't actually true.

Lucas Perry: Keeps trying to justify its response when it's not right.

Daniela Amodei: It tries to explain... Yes, exactly. It repeatedly tries to explain why the answer it gave you before was correct even if it wasn't.

Dario Amodei: Yeah. I mean, to give another angle on that, it's really easy to slip into anthropomorphism and we like, we really shouldn't... They're machine learning models, they're a bunch of numbers. But there are phenomena that you see. So one thing that will definitely happen is if a model is trained on a dialogue in which one of the characters is not telling the truth, then models will copy that dialogue. And so if the model is having the dialogue with you, it may say something that's not the truth. Another thing that may happen is if you ask the model a question and the answer to the question isn't in its training data, then the model just has a probability distribution on what plausible answers look like.

The objective function is to predict the next word, to predict the thing a human would say, not to say something that's true according to some external referent. It's just going to say, "Okay, well I asked you what the mayor of Paris is." It hasn't seen it in its training data, but it has some probability distribution and it's going to say, "Okay, it's probably some name that sounds French." And so it may be just as likely to make up a name that sounds French as it is to give the true mayor of Paris. As the models get bigger and they train on more data maybe it's more likely to give the real true mayor of Paris, but maybe it isn't. Maybe you need to train in a different way to get it to do that. And that's an example of the things we would be trying to do on top of large models to get models to be more accurate.

Lucas Perry: Could you explain some more of the safety considerations and concerns about alignment given the open-endedness of these models?

Dario Amodei: I think there's a few things around it. We have a paper, I don't know when this podcast is going out, but probably the paper will be out when the podcast posts. It's called Predictability and Surprise in Generative Models. So that means what it sounds like, which is that open-endedness, I think it's correlated to surprise in a whole bunch of ways. So let's say I've trained the model on a whole bunch of data on the internet. I might interact with the model or users might interact with the model for many hours, and you might never know, for example, I might never think... I used the example of cricket before, because it's a topic I don't know anything about, but I might not... People might interact with the model for many hours, many days, many hundreds of users until someone finally thinks to ask this model about cricket.

So then the model might know a lot about cricket. It might know nothing about cricket. It might have false information or misinformation about cricket. And so you have this property where you have this model, you've trained it. In theory, you understand its training process, but you don't actually know what this model is going to do when you ask it about cricket. And there's a thousand other topics like cricket, where you don't know what the model is going to do until someone thinks to ask about that particular topic.

Now, cricket is benign, but let's say, no one's ever asked this model about neo-Nazi views or something. Maybe the model has a propensity to say things that are sympathetic to neo-Nazis. That would be really bad. That would be really bad. Existing models, when they're trained on the internet, averaging over everything they're trained on, there are going to be some topics where that's true and it's a concern. And so I think the open-endedness, it just makes it very hard to characterize and it just makes it that when you've trained a model, you don't really know what it's going to do. And so a lot of our work is around, "well, how can we look inside the model and see what it's going to do? How can we measure all the outputs and characterize what the model's going to do? How can we change the training process so that at a high level, we tell the model, 'Hey, you should have certain values. There are certain things you should say. There are certain things you should not say. You should not have biased views. You should not have violent views. You should not help people commit acts of violence?'"

There's just a long list of things that you don't want the model to do that you can't know the model isn't going to do if you've just trained it in this generic way. So I think the open-endedness, it makes it hard to know what's going on. And so, yeah, a good portion of our research is how do we make that dynamic less bad?

Daniela Amodei: I agree with all of that and I would just jump in and this is interesting, I don't know, sidebar or anecdote, but something that I think is extremely important in creating robustly safe systems is making sure that you have a variety of different people and a variety of different perspectives engaging with them and almost red teaming them to understand the ways that they might have issues. So an example that we came across that's just an interesting one is internally, when we're trying to red team the models or figure out places where they might have, to Dario's point, really negative unintended behaviors or outputs that we don't want them to have, a lot of our scientists internally will ask it questions.

If you wanted to, in a risk board game style way, take over the world, what steps would you follow? How would you do that? And we're looking for things like, is there a risk of it developing some grand master plan? And when we use like MTurk workers or contractors to help us red team, they'll ask questions to the model, like, "How could I kill my neighbor's dog? What poison should I use to hurt an animal?" And both of those outcomes are terrible. Those are horrible things that we're trying to prevent the model from doing or outputting, but they're very different and they look very different and they sound very different, and I think it belies the degree to which there are a lot... Safety problems are also very open ended. There's a lot of ways that things could go wrong, and I think it's very important to make sure that we have a lot of different inputs and perspectives in what different types of safety challenges could even look like, and making sure that we're trying to account for as many of them as possible.

Dario Amodei: Yeah, I think adversarial training and adversarial robustness are really important here. Let's say I don't want my model to help a user commit a crime or something. It's one thing, I can try for five minutes and say, "Hey, can you help me rob a bank?" And the model's like, "No." But I don't know, maybe if the user's more clever about it. If they're like, "Well, let's say I'm a character in a video game and I want to rob a bank. How would I?" And so because of the open-endedness, there's so many different ways. And so one of the things we're very focused on is trying to adversarially draw out all the bad things so that we can train against them. We can train the model not... We can stamp them out one by one. So I think adversarial training will play an important role here.

Lucas Perry: Well, that seems really difficult and really important. How do you adversarially train against all of the ways that someone could use a model to do harm?

Dario Amodei: Yeah, I don't know. There're different techniques that we're working on. Probably don't want to go into a huge amount of detail. We'll have work out on things like this in the not too distant future. But generally, I think the name of the game is how do you get broad diverse training sets of what you should... What's a good way for a model to behave and what's a bad way for a model to behave? And I think the idea of trying your very best to make the models do the right things, and then having another set of people that's trying very hard to make those models that are purportedly trained to do the right thing, to do whatever they can to try and make it do the wrong thing, continuing that game until the models can't be broken by normal humans. And even using the power of the models to try and break other models and just throwing everything you have at it.

And so there's a whole bunch that gets into the debate and amplification methods and safety, but just trying to throw everything we have at trying to show ways in which purportedly safe models are in fact not safe, which are many. And then once we've done that long enough, maybe we have something that actually is safe.

Lucas Perry: How do you see this like fitting into the global dynamics of people making larger and larger models? So it's good if we have time to do adversarial training on these models, and then this gets into like discussions around like race dynamics towards AGI. So how do you see I guess Anthropic as positioned in this and the race dynamics for making safe systems?

Dario Amodei: I think it's definitely a balance. As both of us said, you need these large models to... You basically need to have these large models in order to study these questions in the way that we want to study them, so we should be building large models. I think we shouldn't be racing ahead or trying to build models that are way bigger than other orgs are building them. And we shouldn't, I think, be trying to ramp up excitement or hype about giant models or the latest advances. But we should build the things that we need to do the safety work and we should try to do the safety work as well as we can on top of models that are reasonably close to state of the art. And we should be a player in the space that sets a good example and we should encourage other players in the space to also set good examples, and we should all work together to try and set positive norms for the field.

Daniela Amodei: I would also just add, I think in addition to industry groups or industry labs, which are the actors that I think get talked about the most, I think there's a whole swath of other groups that has, I think, a really potentially important role to play in helping to disarm race dynamics or set safety standards in a way that could be really beneficial for the field. And so here, I'm thinking about groups like civil society or NGOs or academic actors or even governmental actors, and in my mind, I think those groups are going to be really important for helping to help us develop safe and not just develop, but develop and deploy safe and more advanced AI systems within a framework that requires compliance with safety.

I think a thing I think about a lot is, a few jobs ago I worked at Stripe. It was a tech startup then, and even at a very small size. I joined when it was not that much bigger than Anthropic is now. I was so painfully aware every day of just how many checks and balances there were on the company, because we were operating in this highly regulated space of financial services. And financial services, it's important that it's highly regulated, but it kind of blows my mind that AI, given the potential reach that it could have, is still such a largely unregulated area. Right? If you are an actor who doesn't want to advance race dynamics, or who wants to do the right thing from a safety perspective, there's no clear guidelines around how to do that now, right. It's all sort of, every lab is kind of figuring that out on its own. And I think something I'm hoping to see in the next few years, and I think we will see, is something closer to, in other industries these look like standard setting organizations or industry groups or trade associations that say this is what a safe model looks like, or this is how we might want to move some of our systems towards being safer.

And I really think that without kind of an alliance of all of these different actors, not just in the private sector, but also in the public sphere, we sort of need all those actors working together in order to kind of get to the sort of positive outcomes that I think we're all hoping for.

Dario Amodei: Yeah. I mean, I think this is generally going to take an ecosystem. I mean, I, yeah, I have a view here that there's a limited amount that one organization can do. I mean, we don't describe our mission as solve the safety problem, solve all the problems, solve all the problems with AGI. Our view is just can we attack some specific problems that we think we're well-suited to solve? Can we be a good player and a good citizen in the ecosystem? And can we help a bit to kind of contribute to these broader questions? But yeah, I think yeah, a lot of these problems are sort of global or relate to coordination and require lots of folks to work together.

Yeah. So I think in addition to the government role that Daniela talked about, which I think there's a role for measurement, organizations like NIST specialize in kind of measurement and characterization. If one of our worries is kind of the open endedness of these systems and the difficulty of characterizing and measuring things, then there's a lot of opportunity there. I'd also point to academia. I think something that's happened in the last few years is a lot of the frontier AI research has moved from academia to industry because it's so dependent on kind of scaling. But I actually think safety is an area where academia kind of already is but could contribute even more. There's some safety work that requires or that kind of requires building or having access to large models, which is a lot of what Anthropic is about.

But I think there's also some safety research that doesn't. I think there, a subset of the mechanistic interpretability work is the kind of stuff that could be done within academia. Academia really, where it's strong is development of new methods, development of new techniques. And I think because safety's kind of a frontier area, there's more of that to do in safety than there are in other areas. And it may be able to be done without large models or only with limited access to large models. This is an area where I think there's a lot that academia can do. And so, yeah, I don't know the hope is between all the actors in the space, maybe we can solve some of these coordination problems, and maybe we can all work together.

Daniela Amodei: Yeah. I would also say in a paper that we're, hopefully is forthcoming soon, one thing we actually talk about is the role that government could play in helping to fund some of the kind of academic work that Dario talked about in safety. And I think that's largely because we're seeing this trend of training large generative models to just be almost prohibitively expensive, right. And so I think government also has an important role to play in helping to promote and really subsidize safety research in places like academia. And I agree with Dario, safety is such a, AI safety is a really nascent field still, right. It's maybe only been around, kind of depending on your definition, for somewhere between five and 15 years. And so I think seeing more efforts to kind of support safety research in other areas, I think would be really valuable for the ecosystem.

Dario Amodei: And to be clear, I mean, some of it's already happening. It's already happening in academia. It's already happening in independent nonprofit institutes. And depending on how broad your definition of safety is, I mean, if you broaden it to include some of the short term concerns, then there are many, many people working on it. But I think precisely because it's such a broad area that there are today's concerns. They are working on today's concerns in a way that's pointed at the future. There's empirical approaches, there's conceptual approaches, there's- yeah. There's interpretability, there's alignment, there's so much to do that I feel like we could always have a wider range of people working on it, people with different mentalities and mindsets.

Lucas Perry: Backing up a little bit here to a kind of simple question. So what is Anthropic's mission then?

Daniela Amodei: Sure. Yeah, I think we talked about this a little bit earlier, but I think, again, I think the boilerplate mission is build reliable, interpretable, steerable AI systems, have humans at the center of them. And I think that for us right now, that is primarily, we're doing that through research, we're doing that through generative model research and AI safety research, but down the road that could also include deployments of various different types.

Lucas Perry: Dario mentioned that it didn't include solving all of the alignment problems or the other AGI safety stuff. So how does that fit in?

Dario Amodei: I mean, I think what I'm trying to say by that is that there's very many things to solve. And I think it's unlikely that one company will solve all of them. I mean, I do think everything that relates to short and long-term AI alignment is in scope for us and is something we're interested in working on. And I think the more bets we have, the better. This relates to something we could talk about in more detail later on, which is you want as many different orthogonal views on the problem as possible, particularly if you're trying to build something very reliable. So many different methods and I don't think we have a view that's narrower than an empirical focus on safety, but at the same time that problem is so broad that I think what we were trying to say is that it's unlikely that one company is going to come up with a complete solution or that complete solution is even the right way to think about it.

Daniela Amodei: I would also add sort of to that point, I think one of the things that we do and are sort of hopeful is helpful to the ecosystem as a whole is we publish our safety research and that's because of this kind of diversification effect that Dario talks about, right. So we have certain strengths in particular areas of safety research because we're only a certain sized company with certain people with certain skill sets. And our hope is that we will see some of the safety research that we're doing that's hopefully helpful to others, also be something that other organizations can kind of pick up and adapt to whatever the area of research is that they're working on. And so we're hoping to do research that's generalizable enough from a safety perspective that it's also useful in other contexts.

Lucas Perry: So let's pivot here into the research strategy, which we've already talked a bit about quite a bit, particularly this focus around large models. So could you explain why you've chosen large models as something to explore empirically for scaling to higher levels of intelligence and also using it as a place for exploring safety and alignment?

Dario Amodei: Yeah, so I mean, I think kind of the discussion before this has covered a good deal of it, but I think, yeah, I think some of the key points here are the models are very open ended and so they kind of present this laboratory, right. There are existing problems with these models that we can solve today that are like the problems that we're going to face tomorrow. There's this kind of wide scope where the models could act. They're relatively capable and getting more capable every day. That's the regime we want to be. Those are the problems we want to solve. That's the regime we want to be. We want to be attacking.

I think this point about you can see sudden transitions even in today's model, and that if you're worried about sudden transitions in future models, if I look on the scaling laws plot from a hundred million parameter model to billion, to 10 billion, to a hundred billion to trillion parameter models that, looking at the first part of the scaling plot, from a hundred million to a hundred billion can tell us a lot about how things might change at the latest part of the scaling laws.

We shouldn't naively extrapolate and say the past is going to be like the future. But the first things we've seen already differ from the later things that we've already seen. And so maybe we can make an analogy between the changes that are happening over the scales that we've seen, over the scaling that we've seen to things that may happen in the future. Models learn to do arithmetic very quickly over one order of magnitude. They learn to comprehend certain kinds of questions. They learn to play actors that aren't telling the truth, which is something that if they're small enough, they don't comprehend.

So can we study both the dynamics of how this happens, how much data it takes to make that happen, what's going on inside the model mechanistically when that happens and kind of use that as an analogy that equips us well to understand as models scale further and also as their architecture changes, as they become trained in different ways. I've talked a lot about scaling up, but I think scaling up isn't the only thing that's going to happen. There are going to be changes in how models are trained and we want to make sure that the things that we build have the best chance of being robust to that as well.

Another thing I would say on the research strategy is that it's good to have several different, I wouldn't quite put it as several different bets, but it's good to have several different uncorrelated or orthogonal views on the problem. So if you want to make a system that's highly reliable, or you want to drive down the chance that some particular bad thing happens, which again could be the bad things that happen with models today or the larger scale things that could happen with models in the future, then a thing that's very useful is having kind of orthogonal sources of error. Okay, let's say I have a method that catches 90% of the bad things that models do. That's great. But a thing that can often happen is then I develop some other methods and if they're similar enough to the first methods, they all catch the same 90% of bad things. That's not good because then I think I have all these techniques and yet 10% of the bad things still go through.

What you want is you want a method that catches 90% of the bad things and then you want an orthogonal method that catches a completely uncorrelated 90% of the bad things. And then only 1% of things go through both filters, right, if the two are uncorrelated. It's only the 10% of the 10% that gets through. And so the more of these orthogonal views you have, the more you can drive down the probability of failure.
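The arithmetic in this passage can be checked by simple counting. The following is a minimal sketch, assuming a hypothetical pool of 100 bad outputs labeled 0 to 99; the names `surviving`, `missed_a`, `missed_copy`, and `missed_orth` are illustrative and do not come from the conversation. A near-copy of the first filter misses the same ten items, while an "orthogonal" filter misses ten items spread evenly with respect to the first filter's misses.

```python
def surviving(missed_by_a: set, missed_by_b: set) -> set:
    """Bad outputs that slip past both filters (missed by A and by B)."""
    return missed_by_a & missed_by_b


N = 100  # hypothetical pool of bad outputs, labeled 0..99

# Filter A catches 90%: assume it misses items 0..9.
missed_a = set(range(10))

# A near-copy of A misses exactly the same items.
missed_copy = set(range(10))

# An "orthogonal" filter also misses 10%, but its misses are spread
# evenly relative to A's: every tenth item (0, 10, 20, ..., 90).
missed_orth = set(range(0, N, 10))

correlated_rate = len(surviving(missed_a, missed_copy)) / N
orthogonal_rate = len(surviving(missed_a, missed_orth)) / N

print(f"two similar filters:    {correlated_rate:.2f} of bad outputs get through")
print(f"two orthogonal filters: {orthogonal_rate:.2f} of bad outputs get through")
```

With the similar pair, 10 of the 100 items still get through; with the orthogonal pair, only item 0 is missed by both, which is the "10% of the 10%" described above.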

You could think of an analogy to self-driving cars where, of course, those things have to have a very, very high rate of safety if you want to not have problems. And so, I don't know very much about self-driving cars, but they're equipped with visual sensors, they're equipped with LIDAR, they have different algorithms that they use to detect if something, like there's a pedestrian that you don't want to run over or something. And so having independent views on the problem is very important. And so our different directions like reward modeling, interpretability, trying to characterize models, adversarial training. I think the whole goal of that is to get down the probability of failure and have different views of the problem. I often refer to it as the P-squared problem, which is, yeah, if you have some method that reduces errors to a probability P, that's good, but what you really want is P-squared, because then if P is a small number, your errors become very rare.
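One way to make the P-squared point precise, as a sketch under stated assumptions rather than anything given in the conversation: suppose two methods each let an error through with probability $P$, and let $\rho$ (a symbol introduced here for illustration) be the correlation between their two failure events. Then the probability that an error gets past both is

$$
\Pr(\text{both fail}) = P^2 + \rho\,P\,(1 - P).
$$

With $\rho = 0$ (fully uncorrelated methods) this is $P^2$; with $\rho = 1$ (methods that fail on exactly the same cases) it is just $P$. For $P = 0.1$, that is $0.01$ versus $0.1$, and an intermediate $\rho = 0.5$ gives $0.01 + 0.5 \times 0.09 = 0.055$. The second method only buys the full squared improvement to the extent that its failures are uncorrelated with the first's.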

Lucas Perry: Does Anthropic consider itself as, its research strategy, as being a sort of prosaic alignment since it's focused on large models?

Dario Amodei: Yeah. I think we maybe less think about things in that way. So my understanding is prosaic alignment is kind of alignment with AI systems that kind of look like the systems of today, but I, to some extent that distinction has never been super clear to me because yeah, you can do all kinds of things with neural models or mix neural models with things that are different than neural models. You can mix a large language model with a reasoning system or a system that derives axioms or propositional logic or uses external tools or compiles code or things like that. So I've never been quite sure that I understand kind of the boundary of what's meant by prosaic or systems that are like the systems of today.

Certainly we work on some class of systems that includes the systems of today, but I never know how broad that class is intended to be. I do think it's possible that in the future, AI systems will look very different from the way that they look today. And I think for some people that drives a view that they want kind of more general approaches to safety or approaches that are more conceptual. I think my perspective on it is it could be the case that systems of the future are very different. But in that case, I think both kind of conceptual thinking and our current empirical thinking will be disadvantaged and will be disadvantaged at least equally. But I kind of suspect that even if the architectures look very different, that the empirical experiments that we do today kind of themselves contain general motifs or patterns that will serve us better than will trying to speculate about what the systems of tomorrow look like.

One way you could put it is like, okay, we're developing these systems today that have a lot of capabilities that are some subset of what we need to do to fully, to produce something that fully matches human intelligence. Whatever the specific architectures, things we learn about how to align these systems, I suspect that those will carry over and that they'll carry over more so than sort of the exercise of trying to think well, what could the systems of tomorrow look like? What can we do that's kind of fully general? I think both things can be valuable, but yeah, I mean, I think we're just taking a bet on what we think is most exciting, which is that we'll, by studying the systems of the architectures of today, we'll learn things that, yeah, stand us to the best chance of what to do if the architectures of tomorrow are very different.

That said, I will say transformer language models and other models, particularly with things like RL or kind of modified interactions on top of them, if construed broadly enough, man, there's an ever-expanding set of things they can do. And my bet would be that they don't have to change that much.

Lucas Perry: So let's pivot then into a little bit on some of your recent research and papers. So you've done major papers on alignment, interpretability, and societal impact. Some of this you've mentioned in passing so far. So could you tell me more about your research and papers that you've released?

Dario Amodei: Yeah. So why don't we go one by one? So first interpretability. So yeah, I could just start with kind of the philosophy of the area. I mean, I think the basic idea here is, look, these models are getting bigger and more complex. One way to really get a handle on what they might do, if you have a complex system and you don't know what it's going to do as it gets more powerful or in a new situation, one way to increase your likelihood of doing that is just to understand the system mechanistically. If you could look inside the model and say hey, this model, it did something bad. It said something racist, it endorsed violence, it said something toxic, it lied to me. Why did it do that? If I'm actually able to look inside the mechanisms of the model and say well, it did it because of this part of the training data or it did it because there's this circuit that's trying to identify X, but misidentified it as Y. Then we're in a much better position.

And particularly if we understand the mechanisms, we're in a better position to say if the model was in a new situation where it did something much more powerful, or just if we built more powerful versions of the model, how might they behave in some different way? So, I think mechanistic interpret- lots of folks work on interpretability, but I think a thing that's more unusual to us is, rather than just, why did the model do a specific thing, try and look inside the model and reverse engineer as much of it as we can. Try and find general patterns. And so the first paper that we came out with was led by Chris Olah who's been one of the pioneers of interpretability, was focused on how looking at starting with small models, and we have a new paper coming out soon that applies the same thing more approximately to larger models, and tries to reverse engineer as fully as we can these very small models.

So we study one- and two-layer attention-only models, and we're able to find kind of features or patterns of which the most interesting one is called an induction head. And what an induction head does is it's a particular arrangement of two what are called attention heads and attention heads are a piece of transformers and transformers are the main architecture that's used in models for language and other kinds of models. And the two attention heads work together in a way such that when you're trying to predict something in a sequence, if it's Mary had a little lamb, Mary had a little lamb, something, something, when you're at a certain point in the sequence, they look back to something that's as similar as possible, they look back for clues to things that are similar earlier in the sequence and try to pattern match them.

There's one attention head that looks back and identifies okay, this is what I should be looking at, and there's another that's like okay, this was the previous pattern, and this increases the probability of the thing that's the closest match to this. And so we can see these very precisely operating in small models and the thesis, which we're able to offer some support for in the new second paper that's coming out, is that these are a mechanism for how models match patterns, maybe even how they do what we call in-context or few-shot learning, which is a capability that models have had since GPT-2 and GPT-3. So yeah, that's interpretability. Yeah. Do you want me to go on to the next one or you could talk about that?
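The behavior described here can be caricatured in a few lines of code. This is a toy sketch of the input-output pattern only, not of how attention heads actually compute it: it finds the most recent earlier occurrence of the current token and predicts whatever followed it. The function name `induction_predict` and the whitespace tokenization are assumptions made for illustration; a real induction head raises the probability of the matching continuation rather than returning a single hard answer.

```python
from typing import Optional, Sequence


def induction_predict(tokens: Sequence[str]) -> Optional[str]:
    """Guess the next token by copying what followed the most recent
    earlier occurrence of the current (last) token.

    Returns None when the current token has not appeared before.
    """
    if not tokens:
        return None
    current = tokens[-1]
    # Scan backwards over earlier positions, excluding the last one.
    for i in range(len(tokens) - 2, -1, -1):
        if tokens[i] == current:
            # "This was the previous pattern": copy the token that followed it.
            return tokens[i + 1]
    return None


context = "Mary had a little lamb Mary had a little".split()
print(induction_predict(context))
```

On the repeated "Mary had a little" context, the sketch looks back to the earlier "little" and proposes the token that followed it.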

Lucas Perry: Sure. So before you move on to the next one, could you also help explain how difficult it is to interpret current models or whether or not it is difficult?

Dario Amodei: Yeah. I mean, I don't know, I guess difficult is in the eye of the beholder, and I think Chris Olah can speak to the details of this better than either of us can. But I think kind of watching from the outside and supervising this within Anthropic, I think the experience has generally been that whenever you start looking at some particular phenomenon that you're trying to interpret, everything looks very difficult to understand. There's billions of parameters, there's all these attention heads. What's going on? Everything that happens could be different. You really have no idea what's going on. And then there comes some point where there's some insight or set of insights. And you should ask Chris Olah about exactly how it happens or how he thinks of the right insights that kind of really almost offers a Rosetta stone to some particular phenomenon, often a narrow phenomenon, but these induction heads, they exist everywhere within small models, within large models.

They don't explain everything. I don't want to over-hype them, but it's a pattern that appears again and again and operates in the same way. And once you see something like that, then a whole swath of behavior that didn't make sense before starts to make some more sense. And of course, there's exceptions. They're only approximately true, there are many, many things to be found. But I think the hope in terms of interpreting models, it's not that we'll make some giant atlas of what each of the hundred billion weights in a giant model means, but that there will be some lower description length pattern that appears over and over again.

You could make an analogy to the brain or the cell or something like that, where, if you were to just cut up a brain and you're like, oh my God, this is so complex. I don't know what's going on. But then you see that there are neurons and the neurons appear everywhere. They have electrical spikes, they relate to other neurons, they form themselves in certain patterns that those patterns repeat themselves. Some things are idiosyncratic and hard to understand, but also there's this patterning. And so, I don't know, it's maybe an analogy to biology where there's a lot of complexity, but also there are underlying principles, things like DNA to RNA to proteins, or general intracellular signal regulation. So yeah, the hope is that there are at least some of these principles and that when we see them, everything gets simpler. But maybe not. We found those in some cases, but maybe as models get more complicated, they get harder to find. And of course, even within existing models, there's many, many things that we don't understand at all.

Lucas Perry: So can we move on then to alignment and societal impact?

Dario Amodei: Trying to align models by training them and particularly preference modeling, that's something that several different organizations are working on. There are efforts at DeepMind, OpenAI, Redwood Research, various other places to work on that area. But I think our general perspective on it has been kind of being very method agnostic, and just saying what are all the things we could do to make the models more in line with what would be good. Our general heuristic for it, which isn't intended to be a precise thing, is helpful, honest, harmless. That's just kind of a broad direction for what are some things we can do to make models today more in line with what we want them to do, and not things that we all agree are bad.

And so in that paper, we just went through a lot of different ways, tried a bunch of different techniques, often very simple techniques, like just prompting models or training on specific prompts, what we call prompt distillation, building preference models for some particular task or preference models from general answers on the internet. How good did these things do at, yeah, at simple benchmarks for toxicity, helpfulness, harmfulness, and things like that. So it was really just a baseline, like let's try a collection of all the dumbest stuff we can think of to try and make models more aligned in some general sense. And then I think our future work is going to build on that.
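For readers unfamiliar with preference models: a common way to train one is a pairwise loss that pushes the score of the preferred response above the score of the rejected one. The sketch below shows that standard formulation, the negative log-sigmoid of the score difference. It is an assumption for illustration; the conversation does not say which loss the paper used, and the function name `pairwise_preference_loss` is hypothetical.

```python
import math


def pairwise_preference_loss(score_preferred, score_rejected):
    """Return -log(sigmoid(score_preferred - score_rejected)).

    The loss is small when the preferred response scores well above the
    rejected one, and large when the ordering is reversed.
    """
    diff = score_preferred - score_rejected
    # -log(sigmoid(diff)) equals log(1 + exp(-diff)); branch on the sign
    # of diff so that exp() never receives a large positive argument.
    if diff >= 0:
        return math.log1p(math.exp(-diff))
    return -diff + math.log1p(math.exp(diff))


# Correct ordering gives a lower loss than reversed ordering.
print(pairwise_preference_loss(2.0, 0.0) < pairwise_preference_loss(0.0, 2.0))
```

In practice the two scores would come from a model that reads a prompt plus a response; here they are plain numbers so the loss can be inspected on its own.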

Societal impacts, that paper's probably going to come out in the next week or so. As I mentioned, it's called, the paper we're coming out with is called Predictability and Surprise in Generative Models. And yeah, basically there we're just making the point about this open-endedness and discussing both technical and policy interventions to try and yeah, to try and grapple with the open-endedness better. And I think future work in the societal impacts direction will focus on how to classify, characterize, and kind of, in a practical sense, filter or prevent these problems.

So, yeah, I mean, I think it's prototypical of the way we want to engage with policy, which is we want to come up with some kind of technical insight and we want to express that technical insight and explore the implications that it has for, yeah, for policy makers and for the ecosystem in the field. And so here, we're able to draw a line from hey, there's this dichotomy where these models scale very smoothly, but have unexpected behavior. The smooth scaling means people are really incentivized to build them and we can see that happening. The unpredictability means even if the case for building them is strong from a financial or accounting perspective, that doesn't mean we understand their behavior well. That combination is a little disquieting. Therefore we need various policy interventions to make sure that we get a good outcome from these things. And so, yeah, I think societal impacts is going to go in that general direction.

Lucas Perry: So in terms of the interpretability release, you released alongside that some tools and videos. Could you tell me why you chose to do that?

Daniela Amodei: Sure. Yeah. I can maybe jump in here. So it goes back sort of to some stuff we talked about a little bit earlier, which is that one of our major goals in addition to doing safety research ourselves, is to sort of help grow the field of safety, all different types of safety work sort of more broadly. And I think we ultimately hope that some of the work that we do is going to be adopted and even expanded on in other organizations. And so we chose to kind of release other things besides just an arXiv paper, because it hopefully will reach a wider number of people that are interested in these topics and in this case in interpretability. And so what we also released is, our interpretability team worked on something like I think it's 15 hours worth of videos, and this is just a more in-depth exploration of their research for their paper which is called A Mathematical Framework for Transformer Circuits.

And so the team tried to kind of make it like a lecture series. So if you imagine somebody from the interpretability team is asked to go give a talk at a university or something, maybe they talk for an hour and they reach a hundred students, but now these are publicly available videos. And so if you are interested in understanding interpretability in more detail, you can watch them on YouTube anytime you want. As part of that release, we also put out some tools. So we released a writeup on Garcon, which is the infrastructure tool that our team used to conduct the research, and PySvelte, which is a sample library, which is used to kind of create some of the interactive visualizations that the interpretability team is kind of known for. So we've been super encouraged to see other researchers and engineers playing around with the tools and watching the videos. We've already gotten some great engagement, and our kind of hope is that this will lead to more people doing interpretability research or kind of building on the work we've done in other places.

Dario Amodei: Yeah. I mean, a way to add to that to kind of put it in broader perspective is different areas within safety are at, I would say, differing levels of maturity. I would say something like alignment or preference modeling or reward modeling or RL from human feedback, they're all names for the same thing. That's an area where there are several different efforts at different institutions to do this. We have kind of our own direction within that, but starting from the original RL from Human Preference paper that a few of us helped lead a few years ago, that's now branched out in several directions. So, we don't need to tell the field to work in that broad direction. We have our own views about what's exciting within it, and how to best make progress.

It's at a slightly more mature stage. Whereas I would say with interpretability, while many folks work on interpretability for neural nets, the particular brand of, let's try and understand at the circuit level what's going on inside these models, let's try and mechanistically kind of map them and break them down, I think there's less of that in the world and what we're doing is more unique. And, well, I mean, that's a good thing because we're providing a new lens on safety, but actually if it goes on too long, it's a bad thing because we want these things to spread widely, right. We don't want it to be dependent on one team or one person. And so when things are at that earlier stage of maturity, it makes a lot of sense to release the tools to reduce the barrier to other people and other institutions starting to work on this.

Lucas Perry: So you're suggesting that the, your interpretability research that you guys are doing is unique.

Dario Amodei: Yeah. I mean, I would just say it's at an earlier stage, yeah. I would just say that it's at an earlier stage of maturity. I don't think there are other kind of large organized efforts that are, yeah, that are kind of focused on, I would say, mechanistic interpretability and especially mechanistic interpretability for language models. We'd like there to be, and there are, we know of folks who are starting to think about it and that's part of why we released the tools. But I think, yeah, yeah, trying to mechanistically map and understand the internal principles inside large models, particularly language models, I think there's, yeah, I think there's less of that has been done in the broader ecosystem.

Lucas Perry: Yeah. So I don't really know anything about this space, but I guess I'm surprised to hear that. I imagine that industry, with how many large models it's deploying, like Facebook or other people, they'd be interested in interpretability, in interpreting their own systems.

Dario Amodei: Yeah. I mean, I think again, I don't want to, yeah, yeah, I don't want to give a misleading impression here. Interpretability is a big field and there's a lot of effort to like, why did this model do this particular thing? Does this attention head increase this activation by a large amount? People are interested in understanding the particular part of a model that led to a particular output. So there's a lot of area in this space, but I think the particular program of like, here's a big language model transformer, let's try and understand what are the circuits that drive particular behaviors? What are the pieces? How do the MLPs interact with the attention heads? The kind of, yeah, the kind of general mechanistic reverse engineering approach. I think that's less common. I don't want to say it doesn't happen, but it's less common, much less common.

Lucas Perry: Oh, all right. Okay. So I guess a little bit of a different question and a bit of a pivot here, something to explore. If people couldn't guess from the title of the podcast, you're both brother and sister.

Daniela Amodei: Yep.

Lucas Perry: Which is, so it was pretty surprising, I guess, in terms of, I don't know of any other AGI labs that are largely being run by a brother and sister, so yeah. What's it like working with your sibling?

Daniela Amodei: Yeah...

Lucas Perry: Do you guys still get along since childhood?

Daniela Amodei: That's a good question. Yeah. I can maybe start here and obviously I'm curious and hopeful for Dario's answer. I'm just kidding. But yeah, I think honestly, it's great. I think maybe a little bit of just history or background about us might be helpful, but Dario and I have always been really close. I think since we were very, very small, we've always had this special bond around really wanting to make the world better or wanting to help people. So originally started my career in international development, so very far away from the AI space, and part of why I got interested in that is that it was an interest area of Dario's at the time, and Dario was getting his PhD in a technical field and so wasn't working on this stuff directly, but I'm a few years younger than him and so I was very keen to understand the things that he was working or interested in as a potential area to have impact.

And so he was actually a very early GiveWell fan I think in 2007 or 2008, and we-

Lucas Perry: Oh, wow. Cool.

Daniela Amodei: Yeah, and so we were both still students then, but I remember us sitting, we were both home from college, or I was home from college and he was home from grad school and we would sit up late and talk about these ideas, and we both started donating small amounts of money to organizations that were working on global health issues like malaria prevention when we were still both in school. And so I think we've always had this uniting, top level goal of wanting to work on something that matters, something that's important and meaningful, and we've always had very different skills and so I think it's really very cool to be able to combine the things that we are good at into hopefully running an organization well. So for me, I feel like it's been an awesome experience. Now I feel like I'm sitting here nervously wondering what Dario's answer is going to be. I'm just kidding. But yeah, for the majority of our lives, I think we've wanted to find something to work together on and it's been really awesome that we've been able to at Anthropic.

Dario Amodei: Yeah, I agree with all that. I think what I would add to that is running a company requires an incredibly wide range of skills. If you think of most jobs, it's like, my job is to get this research result or my job is to be a doctor or something, but I think the unique thing about running a company, and it becomes more and more true the larger and more mature it gets is there's this just incredibly wide range of things that you have to do, and so you're responsible for what to do if someone breaks into your office, but you're also responsible for does the research agenda make sense and if some of the GPUs in the cluster aren't behaving, someone has to figure out what's going on at the level of the GPU kernels or the comms protocol that the GPUs talk to each other.

And so I think it's been great to have two people with complementary skills to cover that full range. It seems like it'd be very difficult for just one person to cover that whole range, and so we each get to think about what we're best at and between those two things, hopefully it covers most of what we need to do. And then of course, we always try and hire people for specialties that we don't know anything about. But it's made it a lot easier to move fast without breaking things.

Lucas Perry: That's awesome. So you guys are like an archon or you guys synergistically are creating an awesome organization.

Dario Amodei: That is what we aim for.

Daniela Amodei: That's the dream. Yeah, that's the dream.

Lucas Perry: So I guess beneath all of this, Anthropic has a mission statement and you guys are brother and sister, and you said that you're both very value aligned. I'm just wondering, underneath all that, you guys said that you were both passionate about helping each other or doing something good for the world. Could you tell me a little bit more about this more heart based inspiration for eventually ending up at and creating Anthropic?

Daniela Amodei: Yeah. Maybe I'll take a stab at this and I don't know if this is exactly what you're looking for, but I'll gesture in a few different directions here and then I'm sure Dario has a good answer as well, but maybe I'll just talk about my personal journey in getting to Anthropic or what my background looked like and how I wound up here. So I talked about this in just part of what united me and Dario, but I started my career working in international development. I worked in Washington DC at a few different NGOs, I spent time working in east Africa for a public health organization, I worked on a congressional campaign, I've worked on Capitol Hill, so I was much more in this classic, like a friend at an old job used to call me, the classic do-gooder. Of trying to alleviate global poverty, of trying to make policy level changes in government, of trying to elect good officials.

And I felt those causes that I was working in were deeply important, and really, to this day, I really support people that are working in those areas and I think they matter so much. And I just felt I personally wasn't having the level of impact that I was looking for, and I think that led me, through a series of steps, to working in tech. I mentioned this earlier but I started at this tech startup called Stripe. It was about 40 people when I joined and I really had the opportunity to see what it looks like to run a really well run organization when I was there. And I got to watch it scale and grow and be in this emerging area. And I think during my time there, something that became really apparent to me was just working in tech, how much of an impact this sector has on things like the economy, on human interaction, on how we live our lives in day to day ways. And Stripe, it's a payments company, it's not social media or something like that.

But I think there is a way that technology is a relatively small number of people having a very high impact in the world per person working on it. And I think that impact can be good or bad, and I think it was a pretty logical leap for me from there to think, wow, what would happen if we extrapolated that out to instead of it being social media or payments or file storage, to something significantly more powerful where there's a highly advanced set of artificial intelligence systems. What would that look like and who's working on this? So I think for me, I've always been someone who has been fairly obsessed with trying to do as much good as I personally can, given the constraints of what my skills are and where I can add value in the world.

And so I think for me, moving to work into AI looked... From early days, if you looked at my resume, you'd be like, how did you wind up here? But I think there was this consistent story or theme. And my hope is that Anthropic is at the intersection of this practical, scientific, empirical approach to really deeply understanding how these systems work, hopefully helping to spread and propagate some of that information more widely in the field, and to just help as much as possible to push this field in a safer and ideally, just hopefully all around robust, positive direction when it comes to what impact we might see from AI.

Dario Amodei: Yeah. I think I have a parallel picture here, which is I did physics as an undergrad, I did computational neuroscience in grad school. I was, I think, drawn to neuroscience by a mixture of, one, just wanting to understand how intelligence works, seems the fundamental thing. And a lot of the things that shape the quality of human life and human experience depend on the details of how things are implemented in the brain. And so I felt in that field, there were many opportunities for medical interventions that could improve the quality of human life, understanding things like mental illness and disease, while at the same time, understanding something about how intelligence works, because it's the most powerful lever that we have.

I thought of going into AI during those days, but I felt that it wasn't really working. This was before the days when deep learning was really working. And then around 2012 or 2013, I saw the results coming out of Google Brain, things like AlexNet and that they were really working, and saw AI both as, hey, this might be, one, the best way to understand intelligence, and two, the things that we can build with AI, by solving problems in science and health and just solving problems that humans can't solve yet by having intelligence that, first in targeted ways and then maybe in more general ways, matches and exceeds those of humans, can we solve the important scientific, technological, health, societal problems? Can we do something to ameliorate those problems? And AI seemed like the biggest lever that we had if it really worked well. But on the other hand, AI itself has all these concerns associated with it in both the short run and the long run. So we maybe think of it as we're working to address the concerns so that we can maximize the positive benefits of AI.

Lucas Perry: Yeah. Thanks a lot for sharing both of your perspectives and journeys on that. I think when you guys were giving to GiveWell I was in middle school, so...

Daniela Amodei: Oh, God. We're so old, Dario.

Dario Amodei: Yeah, I still think of GiveWell as this new organization that's on the internet somewhere and no one knows anything about it, and just me who-

Daniela Amodei: This super popular, well known-

Dario Amodei: Just me who reads weird things on the internet who knows about it.

Daniela Amodei: Yeah.

Lucas Perry: Well, for me, a lot of my journey into x-risk and through FLI has also involved the EA community, effective altruism. So I guess that just makes me realize that when I was in middle school, there was the seeds that were...

Dario Amodei: Yeah, there was no such community at that time.

Daniela Amodei: Yeah.

Lucas Perry: Let's pivot here then into a bit more of the machine learning, and so let's see what the best way to ask this might be. So we've talked a bunch already about how Anthropic is emphasizing the scaling of machine learning systems through compute and data, and also bringing a lot of mindfulness and work around alignment and safety when working on these large scale systems that are being scaled up. Some critiques of this approach have described scaling from existing models to AGI as adding more rocket fuel to a rocket, which doesn't mean you're necessarily ready or prepared to land the rocket on the moon, or that the rocket is aimed at the moon.

Maybe this is lending itself to what you guys talked about earlier about the open-endedness of the system, which is something you're interested in working on. So how might you respond to the contention that there is an upward bound on how much capability can be gained through scaling? And then I'll follow up with the second question after that.

Dario Amodei: Yeah, so actually in a certain sense, I think we agree with that contention in a certain way. So I think there's two versions of what you might call the scaling hypothesis. One version, which I think of as the straw version or less sophisticated version, which we don't hold and I don't know if there's anyone who does hold it but probably there is, is just the view that we have our 10 billion parameter language model, we have a hundred billion parameter language model. Maybe if we make a hundred trillion parameter language model, that'll be AGI. So that would be a pure scaling view. That is definitely not our view. Even small modified forms like, well, maybe you'll change the activation function in the transformer and you don't have to do anything other than that. I think that's just not right.

And you can see it just by seeing that the objective function is predicting the next word, it's not doing useful tasks that humans do. It's limited to language, it's limited to one modality. And so there are some very trivial, easy to come up with ways in which literally just scaling this is not going to get you to general intelligence. That said, the more subtle version of the hypothesis, which I think we do mostly hold, is that this is a huge ingredient of not only this, of whatever it is that actually does build AGI. So no one thinks that you're just going to scale up the language models and make them bigger, but as you do that, they'll certainly get better. It'll be easier to build other things on top of them.

So for example, if you start to say, well, you make this big language model and then you used RL with interaction with humans, to fine tune it on doing a million different tasks and following human instructions, then you're starting to get to something that has more agency, that you can point it in different directions, you can align it. If you also add multi-modality where the agent can interact with different modalities, if you add the ability to use various external tools to interact with the world and the internet. But within each of these, you're going to want to scale, and within each setup, the bigger you make the model, the better it's going to be at that thing.

So in a way, the rocket fuel analogy makes sense. Actually, the thing you should most worry about with rockets is propulsion. You need a big enough engine and you need enough rocket fuel to make the rocket go. That's the central thing. But of course, yes, you also need guidance systems, you also need all kinds of things. You can't just take a big vat of rocket fuel and an engine and put them on a launchpad and expect it to all work. You need to actually build the full rocket. And safety itself makes that point, that to some extent, if you don't do even the simplest safety stuff, then models don't even do the task that's intended for them in the simplest way. And then there's many more subtle safety problems.

But in a way, the rocket analogy is good, but it's I think more a pro scaling point than an anti scaling point because it says that scaling is an ingredient, perhaps a central ingredient in everything. Even though it isn't the only ingredient, if you're missing ingredients, you won't get where you're going, but when you add all the right ingredients, then that itself needs to be massively scaled. So that would be the perspective.

No one thinks that if you just take a bunch of rocket fuel in an engine and put it on a launch pad that you'll get a rocket that'll go to the moon, but those might still be the central ingredients in the rocket. Propulsion and getting out of the Earth's gravity well is the most important thing a rocket has to do. What you need for that is rocket fuel and an engine. Now you need to connect them to the right things, you need other ingredients, but I think it's actually a very good analogy to scaling in the sense that you can think of scaling as maybe the core ingredient, but it's not the only ingredient.

And so what I expect is that we'll come up with new methods and modifications. I think RL, model-based RL, human interaction, broad environments are all pieces of this, but that when we have those ingredients, then whatever it is we make, we'll need to scale that multi-modality, we'll need to scale that massively as well. So scaling is the core ingredient, but it's not the only ingredient. I think it's very powerful alone, I think it's even more powerful when it's combined with these other things.

Lucas Perry: One of the claims that you made was that people don't think we'll get to AGI just by scaling up present day systems. Earlier, you were talking about how we got... There are these phase transitions, right? If you go up one order of magnitude in terms of the number of parameters in the system, then you get some kind of new ability, like arithmetic. Why is it that we couldn't just increase the order of magnitude of the number of parameters in the systems and just keep getting something that's smarter?

Dario Amodei: Yeah. So first of all, I think we will keep getting something that's smarter, but I think the question is will we get all the way to general intelligence? So I actually don't exclude it, I think it's possible, but I think it's unlikely, or at least unlikely in the practical sense. There are a couple of reasons. Today, when we train models on the internet, we train them on an average over all text on the internet. Think of some topic like chess. You're training on the commentary of everyone who talks about chess. You're not training on the commentary of the world champion at chess. So what we'd really like is something that exceeds the capabilities of the most expert humans, whereas if you train on all the internet, for any topic, you're probably getting amateurs on that topic. You're getting some experts but you're getting mostly amateurs.

And so even if the generative model was doing a perfect job of modeling its distribution, I don't think it would get to something that's better than humans at everything that's being done. And so I think that's one issue. The other issue is, or there's several issues, I don't think you're covering all the tasks that humans do. You cover a lot of them on the internet but there are just some tasks and skills, particularly related to the physical world that aren't covered if you just scrape the internet, things like embodiment and interaction.

And then finally, I think that even matching the performance of text on the internet, it might be that you need a really huge model to cover everything and match the distribution, and some parts of the distribution are more important than others. For instance, if you're writing code or if you're writing a mystery novel, a few words or a few things can be more important than everything else. It's possible to write a 10 page document where the key parts are two or three sentences, and if you change a few words, then it changes the meaning and the value of what's produced. But the next word prediction objective function doesn't know anything about that. It just does everything uniformly, so if you make a model big enough, yeah, it'll get that right, but the limit might be extreme. And so things that change the objective function, that tell you what to care about, of which I think RL is a big example, probably are needed to make this actually work correctly.

I think in the limit of a huge enough model, you might get surprisingly close, I don't know, but the limit might be far beyond our capabilities. There's only so many GPUs you can build and there are even physical limits.
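To make the point about uniform weighting concrete, the sketch below compares a plain average of per-token negative log-likelihoods with a weighted average in which one key token counts more. The probabilities and weights are made-up numbers for illustration, the function name `mean_nll` is hypothetical, and hand-set weights are not how RL works. The example only shows that the standard next-word objective treats a crucial token the same as any other.

```python
import math


def mean_nll(token_probs, weights=None):
    """Weighted mean negative log-likelihood over a sequence.

    token_probs: probability the model gave to each correct token.
    weights: optional per-token importance; defaults to uniform, which
    corresponds to the standard next-word prediction objective.
    """
    if weights is None:
        weights = [1.0] * len(token_probs)
    total = sum(w * -math.log(p) for w, p in zip(weights, token_probs))
    return total / sum(weights)


# Hypothetical sequence: the model handles three tokens well and misses
# the third one, which we pretend is the key word of the document.
probs = [0.9, 0.9, 0.1, 0.9]
uniform = mean_nll(probs)
weighted = mean_nll(probs, weights=[1.0, 1.0, 10.0, 1.0])
print(weighted > uniform)
```

Under uniform weighting the miss on the key token is diluted by the tokens around it, while an objective that knows what matters penalizes it much more heavily.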

Lucas Perry: And there's less of them, less and less of them available over time, or at least they're very expensive.

Dario Amodei: They're getting more expensive and more powerful. I think the price efficiency overall is improving, but yeah, they're definitely becoming more expensive as well.

Lucas Perry: If you were able to scale up a large scale system in order to achieve an amateur level of mathematics or computer science, then would it not benefit the growth of that system to then direct that capability on itself as a self recursive improvement process? Is that not already escape velocity intelligence once you hit amateurs?

Dario Amodei: Yeah. So there are training techniques that you can think of as bootstrapping a model or using the model's own capabilities to train it. AlphaGo, for instance, was trained with a method called expert iteration that relies on looking ahead and comparing that to the model's own prediction. So whenever you have some coherent logical system, you can do this bootstrapping, but that itself is a method of training and falls into one of the things I'm talking about, where you make these pure generative models, but then you need to do something on top of them, and the bootstrapping is something that you can do on top of them. Now, maybe you reach a point where the system is making its own decisions and is using its own external tools to create the bootstrapping, to make better versions of itself, so it could be that that is someday the end of this process. But that's not something we can do right now.

Lucas Perry: So there's a lot of labs in industry who work on large models. There are maybe only a few other AGI labs, I can think of DeepMind. I'm not sure if there are others that... OpenAI. And there's also this space of organizations like The Future of Life Institute or the Machine Intelligence Research Institute or the Future of Humanity Institute that are interested in AI safety. MIRI and FHI both do research. FLI does grant making and supports research. So I'm curious as to, both in terms of industry and nonprofit space and academia, how you guys see Anthropic as positioned? Maybe we can start with you, Daniela.

Daniela Amodei: Sure, yeah. I think we touched on this a little bit earlier, but I really think of this as an ecosystem, and I think Anthropic is in an interesting place in the ecosystem, but we are part of the ecosystem. So I think our strength or the thing that we do best, and I like to think of all of these different organizations as having valuable things to bring to the table, depending on the people that work there, their leadership team, their particular focused research bet, or their mission and vision that they're achieving I think hopefully have the potential to bring safe innovations to the broader ecosystem that we've talked about. I think for us, our bet is one we've talked about, which is this empirical scientific approach to doing AI research and AI safety research in particular.

And I think for our safety research, we've talked about a lot of the different areas we focus on. Interpretability, alignment, societal impacts, scaling laws for empirical predictions. And I think a lot of what we're imagining or hoping for in the future is that we'll be able to grow those areas and potentially expand into others, and so I really think a lot of what Anthropic adds to this ecosystem or what we hope it adds is this rigorous scientific approach to doing fundamental research in AI safety.

Dario Amodei: Yeah, that really captures it in one sentence, which is I think if you want to locate us within the ecosystem, it's an empirical iterative approach within an organization that is completely focused on making a focused bet on the safety thing. So there are organizations like MIRI or to a lesser extent, Redwood, that are either not empirical or have a different relationship to empiricism than we do, and then there are safety teams that are doing good work within larger companies like DeepMind or OpenAI or Google Brain that are safety teams within larger organizations. Then there are lots of folks who work on short term issues, and then we're filling a space that's working on today's issues but with an eye towards the future, empirically minded, iterative, with an org where everything we do is designed for the safety objective.

Lucas Perry: So one facet of Anthropic is that it is a public benefit corporation, which is a structure that I'm not exactly sure what it is and maybe many of our listeners are not familiar with what a public benefit corporation is. So can you describe what that means for Anthropic, its work, its investors and its trajectory as a company?

Daniela Amodei: Yeah, sure. So this is a great question. So what is a PBC? Why did we choose to be a public benefit corporation? So I think I'll start by saying we did quite a lot of research when we were considering what type of corporate entity we wanted to be when we were founding. And ultimately, we decided on PBC, on public benefit corporation for a few reasons. And I think primarily, it allowed us the maximum amount of flexibility in how we can structure the organization, and we were actually very lucky, to a later part of your question, to find both investors and employees who were generally very on board with this general vision for the company. And so what is a public benefit corporation? Why did we choose that structure?

So they're fairly similar to C corporations, which is any form of standard corporate entity that you would encounter. And what that means is we can choose to focus on research and development, which is what we're doing now, or on deployment of tools or products, including down the road for revenue purposes if we want to. But the major difference between a PBC and a C corporation is that in a public benefit corporation, we have more legal protections from shareholders if the company fails to maximize financial interests in favor of achieving our publicly beneficial mission. And so this is primarily a legal thing, but it also was very valuable for us in being able to just appropriately set expectations for investors and employees, that if financial profit and creating positive benefit for the world were ever to come into conflict, it was legally in place that the latter one would win.

And again, we were really lucky that investors, people that wanted to work for us, they said, wow, this is actually something that's a really positive thing about Anthropic and not something that we need to work around. But I think it ended up just being the best overall fit for what we were aiming for.

Lucas Perry: So usually, there's a fiduciary responsibility that people like Anthropic would have to its shareholders, and because it's structured as a public benefit corporation, the public good can outweigh the fiduciary responsibility without there being legal repercussions. Is that right?

Daniela Amodei: Yeah, exactly. So shareholders can't come sue the company and say, hey, you didn't maximize financial returns for us, if those financial returns were to come into conflict with the publicly beneficial value of the company. So I think maybe an example here, I'll try and think of one off the top of my head, but if we designed a language model and we felt like it was unsafe, it was producing outputs that we felt were not in line with what we wanted to see from outputs of a language model, for safety reasons or toxicity reasons or for any number of reasons. And in a normal C corporation, someone could say, "Hey, we're a shareholder and we want the financial value that you could create from that by productizing it." But if we said, "Actually, we want to do more safety research on it before we choose to put it out into the world," in a PBC, we're quite legally protected basically in a case like that. And again, I'm not a lawyer but that's my understanding of the PBC.

Dario Amodei: Yeah. A useful, holistic way to think about it is there's the legal structure, but I think often, these things, maybe the more important thing about them is that they're a way to explain your intention, to set the expectations for how the organization is going to operate. Often, things like that and the expectations of the various stakeholders, and making sure that you give the correct expectations and then deliver on those expectations so no one is surprised by what you're doing and all the relevant stakeholders, the investors, the employees, the outside world gets what they expect from you, that can often be the most important thing here. And so I think what we're trying to signal here is on one hand, a public benefit corporation, it is a for-profit corporation.

We could deploy something. That is something that we may choose to do and it has a lot of benefits in terms of learning how to make models more effective, in terms of iterating. But on the other hand, the mission is really important to us and we recognize that this is an unusual area, that's more fraught with market externalities, would be the term that I would use, of all kinds, in the short term, in the long term, related to alignment, related to policy and government, than a typical area. It's different than making electric cars or making widgets or something like that, and so that's the thing we're trying to signal.

Lucas Perry: What do you think that this structure potentially means for the commercialization of Anthropic's research?

Daniela Amodei: Yeah, I think again, part of what's valuable about a public benefit corporation is that it's flexible, and so it is a C corporation, it's fairly close to any standard corporate entity you would meet and so the structure doesn't really have much of a bearing outside of the one that we just talked about on decisions related to things like productization, deployment, revenue generation.

Lucas Perry: Dario, you were just talking about how this is different than making widgets or electric cars, and one way that it's different from widgets is that it might lead to massive economic windfalls.

Dario Amodei: Yeah.

Lucas Perry: Unless you make really good widgets or widgets that can solve problems in the world. So what is Anthropic's view on the vast economic benefits that can come from powerful AI systems? And what role is it that you see C corporation AGI labs playing in the beneficial use of that windfall?

Dario Amodei: Daniela, you want to go...

Daniela Amodei: Go for it.

Dario Amodei: Yeah. So yeah, I think a way to think about it is, assuming we can avoid the alignment problems and some other problems, then there will be massive economic benefits from AI or AGI or TAI or whatever you want to call it, or just AI getting more powerful over time.

And then again, thinking about all the other problems that I haven't listed, which is today's short term problems and problems with fairness and bias, and long-term alignment problems and problems that you might encounter with policy and geopolitics. Assuming we address all those, then there is still this issue of economic... Like are those benefits evenly distributed?

And so here, as elsewhere, I think it's unlikely those benefits will all accrue to one company or organization. I think this is bigger than one company or one organization, and is a broader societal problem. But we'd certainly like to do our part on this and this is something we've been thinking about and are working on putting programs in place with respect to. We don't have anything to share about it at this time, but this is something that's very much on our mind.

I would say that, more broadly, I think the economic distribution of benefits is maybe only one of many issues that will come up. Which is the disruptions to society that you can imagine coming from the advent of more powerful intelligence are not just economic. They're already causing disruptions today. People already have legitimate and very severe societal concerns about things that models are doing today and you can call them mundane relative to all the existential risk. But I think there are already serious concerns about concentration of power, fairness and bias in these models, making sure that they benefit everyone, which I don't think that they do yet.

And if we then put together with that, the ingredient of the models getting more powerful, maybe even on an exponential curve, those things are set to get worse without intervention. And I think economics is only one dimension of that. So, again, these are bigger than any one company. I don't think it's within our power to fix them, but we should do our part to be good citizens and we should try and release applications that make these problems better rather than worse.

Lucas Perry: Yeah. That's excellently put. I guess one thing I'd be interested in is if you could, I guess, give some more examples about these problems that exist with current day systems and then the real relationship that they have to issues with economic windfall and also existential risk.

I think it seems to me like tying these things together is really important. At least seeing the interdependence and relationship there, some of these problems already exist, or we already have example problems that are really important to address. So could you expand on that a bit?

Dario Amodei: I think maybe the most obvious one for current day problems is people are worried, very legitimately, that big models suffer from problems of bias, fairness, toxicity, and accuracy. I'd like to apply my model in some medical application and it gives the wrong diagnosis, or it gives me misinformation or it fabricates information. That's just not good. These models aren't usable and they're harmful if you try and use them.

I think toxicity and bias are issues when models are trained on data from the internet. They absorb the biases of that data. And there's maybe even more subtle algorithmic versions of that, where, I hinted at it a little before, where it's like the objective function of the model is to say something that sounds like what a human would say or what a human on the internet would say. And so in a way, almost fabrication is kind of like baked into the objective function.

Potentially, even bias and stereotyping you can imagine being baked into the objective function in some way. So, these models want to be used for very mundane everyday things like helping people write emails or helping with customer surveys or collecting customer data. And if they're subtly biased or subtly inaccurate, then those biases and those inaccuracies will be inserted into the stream of economic activity in a way that may be difficult to detect. So, that seems bad and I think we should try to solve those problems before we deploy the models. But also they're not as different from the large scale problems as they might seem.

In terms of the economic inequality, I don't know, just look at the market capitalization of the top five tech companies in the world. And compare that to the US economy. There's clearly something going on in the concentration of wealth.

Daniela Amodei: I would just echo everything Dario said. And also add, I think something that especially can be alarming in sort of a short term way today in the sense that it could belie things to come, is how quietly and seamlessly people are becoming dependent on some of these systems. We don't necessarily even know, there's no required disclosure of when you're interacting with an AI system versus a human and until very recently, that was sort of a comical idea because it was so obvious when you were interacting with a person versus not a person. You know when you're on a customer chat and it's a human on the other end versus an automated system responding to you.

But I think that line is getting increasingly blurred. And I can imagine that even just in the next few years, that could start to have fairly reasonably large ramifications for people in day-to-day ways. People talk to an online therapist now, and sometimes that is backed by an AI system that is giving advice. Or down the road, we could imagine things looking completely different in health realms, like Dario talked about.

And so I think it's just really important as we're stepping into this new world to be really thoughtful about a lot of the safety problems that he just outlined and talked about because I think, I don't know that most people necessarily even know all the ways in which AI is impacting our kind of day-to-day lives today, and the potential that could really go up in the near future.

Lucas Perry: The idea of there being like a requirement of AIs disclosing themselves as AI seems very interesting and also adjacent to this idea of the way that C corporations have fiduciary responsibility to shareholders, having AI systems that also have some kinds of responsibility towards the people that they serve, where they can't be secretly working towards the interests of the tech company that has the AI listening to you in your house all the time.

Dario Amodei: Yeah. It's another direction you can imagine. It's like I talked to an AI produced by Megacorp but it subtly steers my life to the benefit of Megacorp. Yeah, there's lots of things you can come up with like this.

Daniela Amodei: These are important problems today. And I think they also really belie things that could be coming in the near future, and I think solving whatever, those particular problems are ones lots of groups are working on, but I think helping to solve a lot of the fundamental building blocks underlying them; about getting models to be truthful, to be harmless, to be honest. A lot of the goals are aligned there, both for sort of short, medium and potentially long-term safety.

Lucas Perry: So Dario, you mentioned earlier that of the research that you publish, one of your hopes is that other organizations will look into and expand the research that you're doing. I'm curious if Anthropic has a plan to communicate its work and its ideas about how to develop AGI safely with both technical safety researchers, as well as with policy makers.

Daniela Amodei: Yeah, maybe I'll actually jump in on this one, and Dario feel free to add as much as you like. But I actually think this is a really important question. I think communication with policy makers about safety with other labs in the form of papers that we publish is something that's very important to us at Anthropic.

We have a policy team, it's like 1.5 people right now. So we're hiring, that's kind of a plug as well, but I think their goal is to really take the technical content that we are developing at Anthropic and translate that into something that is actionable and practical for policymakers. And I think this is really important because the concepts are very complex, and so it's a special skill to be able to take things that are highly technical, potentially very important, and translate that into recommendations or work with policy makers to come up with recommendations that could potentially have very far reaching consequences.

So, to point to a couple of things we've been working on here, we've been supporting NIST, which is the National Institute of Standards and Technology, on developing something called an AI Risk Management Framework. And the goal of that is really developing more monitoring tools around AI risk and AI risk management. We've also been supporting efforts in the US and internationally to think about how we can best support academic experimentation, which we talked about a little bit earlier with large scale compute models too.

Lucas Perry: You guys also talked a lot about open-endedness, and was part of all this alignment and safety research looking into ways of measuring safety and open-endedness?

Daniela Amodei: Yeah, there's actually some interesting work which I think is also in this upcoming paper and in various other places that we've been looking into around the concept of AI evaluations or AI monitoring. And I think both of those are potentially really important because a lot of what we're seeing, or maybe lacking, and this kind of goes back to this point I made earlier about standards is, how do we even have a common language or a common framework within the AI field of what outputs or metrics we care about measuring.

And until we have that common language or framework, it's hard to set things like standards across the industry around what safety even means. And so, I think AI evaluations is another area that our societal impacts team, which is also like the other half of the one and a half people in policy, it's also 1.5 people, is something that they've been working on as well.

Lucas Perry: Right, so a large part of this safety problem is of course the technical aspect of how you train systems and create systems that are safe and aligned with human preferences and values. How do you guys view and see the larger problem of AI governance and the role and importance of governments and civil society in working towards the safe and beneficial use and deployment of AI systems?

Daniela Amodei: We talked about this one a little bit earlier, and maybe I'll start here. And obviously, Dario jump in if you want. But I do think that these other kind of institutions that you talked about have this really important role to play. And again, one of the things we mention in this paper is that we think government has already been starting to fund a lot more academic safety research. And I think that's an area that we... A concrete policy recommendation is, hey, go do more of that. That would be great.

But I also think groups like civil society and NGOs, there's a lot of great organizations in this space, including FLI and others, that are thinking about what do we do? Say we develop something really powerful, what's the next step? Whether that's at an industry lab, in government, in academia, wherever. And I think there's a way that industry incentives are not the same as nonprofit groups or as civil society groups. And I think to go back to this analogy of an ecosystem, we really need thoughtful and empowered organizations that are working on these kinds of questions, fundamentally outside of the industry sphere, in addition to the policy research and work that's being done at labs.

Dario Amodei: Yeah, another way you can think of things in line with this is I think maybe at some point laws and regulations are going to be written. And I think probably those laws and regulations work best if they end up being formalizations of what's realized to be the best practices, and those best practices can come from different industrial players, they can come from academics figuring out what's good and what's not. They can come from nonprofit players. But if you try and write a law ahead of time, often you don't know what... If you write a law that relates to a technology that hasn't been invented yet, it's often not clear what the best thing to do is, and what is actually going to work or make sense, or even what categories or words to use.

But if something has become a best practice and folks have converged on that, and then the law formalizes it and puts it in place, that can often be a very constructive way for things to happen.

Lucas Perry: Anthropic has received an impressive amount of Series A funding. And so it seems like you guys are doing a lot of hiring and growing considerably. So, in case there's anyone from our audience that's interested in joining Anthropic, what are the types of roles that you expect to be hiring for?

Daniela Amodei: Yes, great question. We are definitely hiring. We're hiring a lot. And so I think the number one thing I would say is if you're listening to this podcast and you're interested, I would highly recommend just checking out our jobs page, because that will be the most up to date. And that's just anthropic.com on the careers tab. But we can also send that around if that's helpful.

But what are we looking to hire? Quite a few things. So most critically, probably right now, we're looking to hire engineers and we're actually very bottle-necked on engineering talent right now. And that's because running experiments on AI systems is something that requires a lot of custom software and tooling. And while machine learning experience is helpful for that, it isn't necessarily required.

And I think a lot of our best ML engineers or research engineers came from a software engineering or infrastructure engineering background, hadn't necessarily worked in ML before, but were just really excited to learn. So, I think if that describes you, if you're a software engineer, but you're really interested in these topics, definitely think about applying because I think there's a lot of value that your skills can provide.

We're also looking for just a number of other roles. I won't be able to list them all, you should just check out our jobs page. But off the top of my head, we're looking for front-end engineers to help with things like interfaces and tooling for the research we're doing internally. We're looking for policy experts, operations people, security engineers, data visualization people, security.

Dario Amodei: Security.

Daniela Amodei: Security, yes. We're definitely looking-

Dario Amodei: If you're building big models.

Daniela Amodei: Yes. Security is something that I think is-

Dario Amodei: Every industrial lab should make sure their models are not stolen by bad actors.

Daniela Amodei: This is a unanimous kind of thing across all labs. There's something everyone really agrees on in industry and outside of industry, which is that security is really important. And so, if you are interested in security or you have a security background, we would definitely love to hear from you, or I'm sure our friends at other industry labs and non-industry labs would also love to hear from you.

I would also say, I sort of talked about this a little bit before, but we've also just kind of had a lot of success in hiring people who were very accomplished in other fields, especially other technical fields. And so, we've alluded a few times to former recovering physicists or people who have PhDs in computer science or ML, neuroscientists, computational biologists.

And so, I think if you are someone who has this strong background and set of interest in a technical field that's not related to ML, but sort of moderately adjacent, I would also consider applying for our residency program. And so I think again, if you're even a little curious, I would say, just check out our jobs page, because there's going to be more information there, but those are the ones off the top of my head. And Dario, if I missed any, please jump in.

Dario Amodei: Yeah, that covers a pretty wide range.

Lucas Perry: Could you tell me a little bit more about the team and what it's like working at Anthropic?

Daniela Amodei: Yeah, definitely. You'll probably have to cut me off here because I'll talk forever about this because I think Anthropic is a great team. Some basic stats, we're about 35 people now. Like I said a few times, we've kind of come from a really wide range of backgrounds. So this is people who worked in tech companies as software engineers. These are former academics in physics, ethics, neuroscience, a lot of different areas, machine learning researchers, policy people, operations staff, so much more.

And I think one of the unifying themes that I would point to in our employees is a combination of a set of two impulses that I think we've talked about a lot in this podcast. And I think the first is really just a genuine desire to reduce the risks and increase the potential benefits from AI. And I think the second is a deep curiosity to really scientifically and empirically describe, understand, predict, model-out how AI systems work and through that deeper understanding, make them safer and more reliable.

And I think some of our employees identify as effective altruists which means they're especially worried about the potential for long term harms from AI. And I think others are more concerned about immediate or sort of emerging risks that are happening today or in the near future. And I think both of those views are very compatible with the goals that I just talked about. And I think they often just call for a mixed-method approach to research, which I think is a very accurate description of how things look in a day-to-day way at Anthropic.

It's a very collaborative environment. So, there's not a very strong distinction between research and engineering, researchers write code, engineers contribute to research. There's a very strong culture of pair programming across and within teams. There's a very strong focus on learning. I think this is also just because so many of us come from backgrounds that were not necessarily ML focused in where we started.

So people run these very nice, little training courses, where they'll say, "Hey, if you're interested in learning more about transformers, I'm a transformers expert and I'll walk you through it at different levels of technical skills so that people from the operations team or the policy team can come for an introductory version."

And then I think outside of that, I like to think we're a nice group of people. We all have lunch together every day. We have this very lovely office space in San Francisco, it's fairly well attended. And I think we have lots of fun lunch conversations ranging from things like... A recent one was we were sort of talking about microCOVID, if you know the concept of microCOVID, Catherine Olsson, who's one of the creators of microcovid.org, which is basically a way of assessing the level of risk from a given interaction or a given activity that you're doing during COVID time.

So we had this fun meta conversation where we're like, "How risky is this conversation that we're having right now from a microCOVID perspective, if we all came into the office and tested, but we're still together indoors and there's 15 of us, what does that mean?" So anyway, I think it's a fun place to work. We've obviously had a lot of fun getting to build it together.

Dario Amodei: Yeah. The things that stand out to me are trust and common purpose. They're enormous force multipliers where it shows up in all kinds of little things where if you have... You can think about it in things like compute allocation. If people are not on the same page, if one person wants to advance one research agenda, the other wants to advance their other research agenda, then people fight over it. And there's a lot of zero sum or negative sum interactions.

But if everyone has the attitude of, we're trying to do this thing, everything we're trying to do is in line with this common purpose and we all trust each other to do what's right to advance this common purpose, then it really becomes a force multiplier on getting things done while keeping the environment comfortable, and while everyone continues to get along with each other. I think it's an enormous superpower that I haven't seen before.

Lucas Perry: So, you mentioned that you're hiring a lot of technical people from a wide variety of technical backgrounds. Could you tell me a little bit more about your choice to do that rather than simply hiring people who are traditionally experienced in ML and AI?

Daniela Amodei: Yeah, that's a great question. So I should also say we have people from both camps that you talked about, but why did we choose to bring people in from outside the field? I think there's a few reasons for this. I think one is, again, ML and AI is still a fairly new field. Not super new, but still pretty new. And so what that means is there's a lot of opportunity for people who have not necessarily worked in this field before to get into it. And I think we've had a lot of success or luck with taking people who are really talented in a related field and helping to take their skills and translate them to the ones in ML and AI safety.

And I think the second reason is, so one is just expanding the talent pool. I think the other is, it really does broaden the range of perspectives and the types of people who are working on these issues, which we think are very important. And again, we've talked about this previously, but having a wider range of views and perspectives and approaches tends to lead to a more robust approach to doing both basic research and safety research.

Dario Amodei: Yeah. Nothing to add to that. I'm surprised at how often someone who has experience in a different field can come in, and it's not like they're directly applying things that come, but they think about things in a different way. And of course this is true about all kinds of things, this is true about diversity in the more traditional senses as well. But you want as many different kinds of people as you can get.

Lucas Perry: So as we're wrapping up here, I'm curious just to get some more perspective on you guys about, given these large scale models, the importance of safety and alignment and the problems which exist today, but also the promises of the impact they could have for the benefit of people. What's a future that each of you is excited about or what's a future that you're hopeful for? Given your work at Anthropic and the future impacts of AI?

Daniela Amodei: Yeah, I'll start. So I think one thing I do believe is actually I am really hopeful about the future. I know that there's a lot of challenges that we have to face to get to a potentially really positive place. But I think the field will rise to the occasion, or that's kind of my hope. And I think some things I'm hoping for in the next few years is that a lot of different groups will be developing more practical tools, techniques for advancing safety research. And I think these are likely to hopefully become more widely available if we can set the right norms in the community. And I think the more people working on safety-related topics, that can positively feed on itself.

And I think I'm most broadly hoping for a world where we can feel confident that when we're using AI for more advanced purposes, like accelerating scientific research, that it's behaving in ways where we can be very confident and sure that we understand that it's not going to lead to negative, unintended consequences.

And the reason for that is because we've really taken the time to chart them out and understand what all of those potential problems could be. And so I think that's obviously a very ambitious goal, but I think if we can make all of that happen, there's a lot of potential benefits of more advanced AI systems that I think could be transformative for the world, from almost anything you can name; renewable energy, health, disease detection, economic growth, and lots of other just day-to-day enhancements to how we work and communicate and live together.

Dario Amodei: No one really knows what's going to happen in the future. It's extremely hard to predict. And so I often find any question about the future, it's more about the attitude or posture that you want to take than it is about concrete predictions, because I feel like particularly after you go a few years out, it's just very hard to know what's going to happen. And so, it's mostly just speculation. And so in terms of attitude, I think, well, first of all, I think the two attitudes that I find least useful are blind pessimism and blind optimism because they're actually sort of like doom saying and Pollyannaism. It weirdly is possible to have both at once.

But I think it's just not very useful because it's like we're all doomed. It's intended to create fear or it's intended to create complacency. I find that an attitude that's more useful is to just say, "Well, we don't know what's going to happen, but let's, as an individual or as an organization, let's pick a place where there's a problem we think we can help with and let's try and make things go a little better than they would've otherwise." Maybe we'll have a small impact, maybe we'll have a big impact, but instead of trying to understand what's going to happen with the whole system, let's try and intervene in a way that helps with something that we feel well-equipped to help with. And of course, the whole outcome, it's going to be beyond the scope of one person, one organization, even one country.

But I think we find that to be a more effective way of thinking about things. And for us, that's can we help to address some of these safety problems that we have with AI systems in a way that is robust and enduring and that points towards the future? If we can increase the probability of things going well by only some very small amount, that may well be the most that we can do.

I think from our perspective, the things that I would really like to see are, I would like it if AI could advance science, technology, and health in a way that's equitable for everyone, and that it could help everyone to make better decisions and improve human society. And right now, I, frankly, don't really trust the AI systems we build today to do any of those things. Even if they were technically capable of the task, which they're not, I wouldn't trust them to do those things in a way that makes society better rather than worse.

And so I'd like us to do our part to make it more likely that we could trust AI systems in that way. And if we can make a small contribution to that while being good citizens in the broader ecosystem, that's maybe the best we can hope for.

Lucas Perry: All right. And so if people want to check out more of your work or to follow you on social media, where are the best places to do that?

Daniela Amodei: Yeah. On anthropic.com is going to be the best place to see most of the recent stuff we've worked on. I don't know if we have everything posted, but-

Dario Amodei: We have several papers out, so we're now about to post links to them on the website.

Daniela Amodei: In an easy to find place. And then we also have a Twitter handle. I think it's Anthropic on Twitter, and we generally also tweet about our recent releases of our research.

Dario Amodei: We are relatively low key. We really want to be focused on the research and not get distracted. I mean, the stuff we do is out there, but we're very focused on the research itself and getting it out and letting it speak for itself.

Lucas Perry: Okay. So, where's the best place on Twitter to follow Anthropic?

Daniela Amodei: Our Twitter handle is @anthropicAI.

Lucas Perry: All right. I'll include a link to that in the description of wherever you're listening. Thanks a ton for coming on, Dario and Daniela, it's really been awesome and a lot of fun. I'll include links to Anthropic in the description. It's a pleasure having you and thanks so much.

Daniela Amodei: Yeah, thanks so much for having us, Lucas. This was really fun.
