+1. A few other questions I'm interested in:
I'm sympathetic to the fact that it might be costly (in terms of time and possibly other factors like reputation) to respond to some of these questions. With that in mind, I applaud DeepMind's alignment team for engaging with some of these questions, I applaud OpenAI for publicly stating their alignment plan, and I've been surprised that Anthropic has engaged the least (at least publicly, to my knowledge) with these kinds of questions.
A "Core Views on AI Safety" post is now available at https://www.anthropic.com/index/core-views-on-ai-safety
(Linkpost for that is here: https://www.lesswrong.com/posts/xhKr5KtvdJRssMeJ3/anthropic-s-core-views-on-ai-safety.)
Here's a wild guess. They just "stole" a bunch of core people from OpenAI, and that doesn't happen to any organization without tension and bad feelings. Now they are in direct competition with OpenAI for funding, staff, and press coverage. Even worse!
Perhaps they made peace and agreed not to make public releases for some time. Or it could be they want to differentiate themselves before they release their strategy.
Thanks for writing this up! It seems very helpful to have open, thoughtful discussions about different strategies in this space.
Here is my summary of Anthropic’s plan, given what you’ve described (let me know if it seems off):
Leaving aside concerns about arms races and big models being scary in and of themselves, this seems like a pretty reasonable approach to me. In particular, I’m pretty on board with points 1, 2, and 3—i.e., if you don’t have theories, then getting your feet wet with the actual systems, observing them, experimenting, tinkering, and so on, seems like a pretty good way to eventually figure out what’s going on with the systems in a more formal/mechanistic way.
I think the part I have trouble with (which might stem from me just not knowing the relevant stuff) is point 4. Why do you need to do all of this on current models? I can see arguments for this, for instance, perhaps certain behaviors emerge in large models that aren’t present in smaller ones. But I’ve never seen, e.g., a list of such things and why they are important or cruxy enough to justify the emphasis on large models given the risks involved. I would really like to see such an argument! (Perhaps it does exist and I am not aware).
I also have a bit of trouble with the “top player” framing—at the moment I just don’t see why this is necessary. I understand that Anthropic works on large models, and that this is on par with what other “top players” in the field are doing. But why not just say that you want to work with large models? Why mention being competitive with DeepMind or OpenAI at all? The emphasis on “top player” makes me think that something is left unsaid about the motivation, aside from the emphasis on current systems. To the extent that this is true, I wish it were stated explicitly. (To be clear, "you" means Anthropic, not Miranda.)
Your summary seems fine!
Why do you need to do all of this on current models? I can see arguments for this, for instance, perhaps certain behaviors emerge in large models that aren’t present in smaller ones.
I think that Anthropic's current work on RL from AI Feedback (RLAIF) and Constitutional AI is based on large models exhibiting behaviors that don't work in smaller models? (But it'd be neat if someone more knowledgeable than me wanted to chime in on this!)
My current best understanding is that running state-of-the-art models is expensive in terms of infrastructure and compute, the next generation of models will get even more expensive to train and run, and Anthropic doesn't have (and doesn't expect to realistically be able to get) enough philanthropic funding to work on the current best models, let alone future ones – so they need investment and revenue streams.
There's also a consideration that Anthropic wants to have influence in AI governance/policy spaces, where it helps to have a reputation/credibility as one of the major stakeholders in AI work.
Anthropic’s internal culture supports all of its staff in expressing and talking about their doubts, and questioning whether deploying an advanced system or publishing a particular paper might be harmful, and these doubts are taken seriously.
I'm interested in asking why you believe what you believe here.
As am I. So many organizations have a whistleblower policy or a safety culture. I've worked in industry, and to put it gently, how these cultures work in practice can be quite a bit different from the stated intention.
It's because, from a management perspective, letting anyone ask questions has to be balanced against getting things done and having some top-down leadership.
Note that that was inside "Staff at Anthropic list the following as protective factors".
I'd be curious what the OP and what the staff would say more specifically here. "Doubts are taken seriously" is quite a large range, from "can change the overall strategy" to "is diplomatically 'listened to' to use up any dissenting energy". E.g. what would happen at Anthropic with inquiries that could lead to changing the whole strategic direction, as in, "what if we shouldn't be advancing capabilities?"?
My sense is that it's been somewhere in between – on some occasions staff have brought up doubts, and the team did delay a decision until they were addressed, but it's hard to judge how much the end result was a different decision from what would have been made otherwise, versus just happening later.
The sense I've gotten of the culture is compatible with (current) Anthropic being a company that would change their entire strategic direction if staff started coming in with credible arguments that "what if we shouldn't be advancing capabilities?", but I think this hasn't yet been put to the test – people who choose to work at Anthropic are going to be selected for agreeing with the premises behind the Anthropic strategy – and it's hard to know for sure how it would go.
Anthropic’s founding team consists of, specifically, people who formerly led safety and policy efforts at OpenAI
This claim seems misleading at best: Dario, Anthropic's founder and CEO, led OpenAI's work on GPT-2 and GPT-3, two crucial milestones in terms of public AI capabilities.
Given that I don't have much time to evaluate each claim one by one, and Gell-Mann amnesia, I am a bit more skeptical of the other ones.
Was Dario Amodei not the former head of OpenAI’s safety team?
He wrote "Concrete Problems in AI Safety".
I don't see how the claim isn't just true/accurate. Whether or not he led/contributed to the GPT series, (I am under the impression that) Dario Amodei did lead safety efforts at OpenAI.
Was Dario Amodei not the former head of OpenAI’s safety team?
He wrote "Concrete Problems in AI Safety".
I don't see how the claim isn't just true/accurate.
If someone reads "Person X is Head of Safety", they wouldn't assume that the person led the main AI capabilities efforts of the company for the last 2 years.
Only saying "head of the safety team" implies that this was his primary activity at OpenAI, which is just factually wrong.
According to his LinkedIn, from 2018 until end of 2020, when he left, he was Director of Research and then VP of Research of OpenAI, where he "set overall research direction" and "led the effort to build GPT2 and 3". He led the safety team before that, between 2016 and 2018.
I do think it's fair to consider the work on GPT-3 a failure of judgement and a bad sign about Dario's commitment to alignment, even if at the time (also based on LinkedIn) it sounds like he was also still leading other teams focused on safety research.
(I've separately heard rumors that Dario and the others left because of disagreements with OpenAI leadership over how much to prioritize safety, and maybe partly related to how OpenAI handled the GPT-3 release, but this is definitely in the domain of hearsay and I don't think anything has been shared publicly about it.)
They’ve recently been hiring for a product team, in order to get more red-teaming of models and eventually have more independent revenue streams.
I think Anthropic believes that this is the most promising route to making AGI turn out well for humanity, so it’s worth taking the risk of being part of the competition and perhaps contributing to accelerating capabilities.
On a reread, I noticed that I don't actually know what Anthropic's strategy is. This is actually a question about a couple of things.
The first is what endpoint they're targeting - "solve and implement alignment" is the ultimate goal, of course, but one can coherently imagine targeting something else, as with Encultured, which is explicitly not targeting "solve alignment" but a much smaller subset of what they expect will be a larger ecosystem adding up to a "solution".
The second is what strategy they're currently following in pursuit of that endpoint.
There are some details that can be extracted based on the implied premises it relies on, but it would be great to hear from Anthropic directly what the current strategy is, in a way that either rules out substantial chunks of action-space, or requires very specific actions. (I think that in a very meaningful sense, a strategy is a special case of a prediction, which must constrain your expectations about your future actions.)
This post is my attempt to understand Anthropic’s current strategy, and lay out the facts to consider in terms of whether Anthropic’s work is likely to be net positive and whether, as a given individual, you should consider applying.
I just want to add that "whether you should consider applying" probably depends massively on what role you're applying for. E.g. even if you believed that pushing AI capabilities was net negative right now, you might still want to apply for an alignment role.
Not saying you intended this, but I worry about people thinking "it's an alignment role and therefore good" when considering joining companies that are pushing the state of the art, and not thinking about it much harder than that.
What else should people be thinking about? You'd want to be sure that you'll, in fact, be allowed to work on alignment. But what other hidden downsides are there?
People should be thinking about:
Personally, if you're working with cutting-edge LLMs, you need to pass a high burden of proof/reasoning that this is good. Incentives like prestige, salary, and "meaning" mean one ought to question oneself pretty hard when doing the equivalent of entering the nuclear weapons or conventional arms manufacturing industries (especially during wartime).
"is this a role from which I will push forward alignment faster than I advance capabilities?" is a very different question than "does this job have 'alignment' in the title?" I assume when it's put like that you would choose the first question, but in practice a lot of people take jobs that are predictable nos to the first question and justify it by claiming they're alignment jobs. Given that, I think it's good Ruby pushed back on something that would end up supporting the latter form of the question, even if it wasn't intended.
Anthropic’s corporate structure is set up to try to mitigate some of the incentives problems with being a for-profit company that takes investment (and thus has fiduciary duties, and social pressure, to focus on profitable projects). They do take investment and have a board of stakeholders, and plan to introduce a structure to ensure the mission continues to be prioritized over profit.
Is there anything specifically about their corporate structure now that mitigates the incentive problems? I know they are a public benefit corporation, but many of us are unclear on what that actually means besides "Anthropic thinks they have a good mission" - since as you point out they're still a for-profit company with investors. (I actually wasn't able to find any info about Anthropic's board when I searched recently, so the "board of stakeholders" is news to me.)
I know there is a ton involved in building a company like this, so it's ok if they really do have plans to set up a more beneficial structure and just haven't gotten around to it. But since the stakes with AGI are so high, it would be really nice to know more about what those plans are and to see them implemented so that we're not just taking their word for it.
Thanks for doing this post series btw, it's a really great discussion for us to get to have.
Great update from Anthropic on giving majority control of the board to a financially disinterested trust: https://twitter.com/dylanmatt/status/1680924158572793856
Yes, benefit corporations were created to provide an alternative to "shareholder primacy", otherwise widely accepted in law and custom, per Wikipedia: Benefit_corporation#Differences_from_traditional_corporations. Further quoting:
By contrast, benefit corporations expand the fiduciary duty of directors to require them to consider non-financial stakeholders as well as the interests of shareholders.[28] This gives directors and officers of mission-driven businesses the legal protection to pursue an additional mission and consider additional stakeholders.[29][30] The enacting state's benefit corporation statutes are placed within existing state corporation codes so that the codes apply to benefit corporations in every respect except those explicit provisions unique to the benefit corporation form.
Registering as a Public Benefit corporation means that they, the board of directors of the corporation, can't be sued for failing to maximize shareholder value, and potentially could be challenged if they "fail to consider the effect of decisions on stakeholders beyond shareholders."
It would be interesting if they filed as a certified benefit corporation, a B Corp, but I'm not sure what would be at stake if they failed to live up to that standard. Perhaps B Lab (the non-profit that certifies B Corps), or a similar new entity, should endeavor to create a new status for recognizing safe and responsible creation, handling, and governance controls of powerful AIs. With external certifications one worries about Goodhart's law, and "safety-washing" taking the place of "green-washing", especially given the (current) non-enforceability of B Corp standards.
Do you find OpenAI's LP entity more credible? Do you have ideas about another legal structure?
Thank you for sharing this; I'd be excited to see more writeups that attempt to analyze the strategy of AI labs.
This post is my attempt to understand Anthropic’s current strategy, and lay out the facts to consider in terms of whether Anthropic’s work is likely to be net positive and whether, as a given individual, you should consider applying.
I found that this introduction raised my expectations for the post and misled me a bit. After reading the introduction, I was expecting to see more analysis of the pros and cons of Anthropic's strategy, as well as more content from people who disagree with Anthropic's strategy.
(For example, you have a section in which you list protective factors as reported by Anthropic staff, but there is no corresponding section that features criticisms from others-- e.g., independent safety researchers, OpenAI employees, etc.)
To be clear, I don't think you should have to do any of that to publish a post like this. I just think that the expectation-setting could have been better. (I plan to recommend this post, but I won't say "here's a post that lays out the facts to consider in terms of whether Anthropic's work is likely to be net positive"; instead, I'll say "here's a post where someone lists some observations about Anthropic's strategy, and my impression is that this was informed largely by talking to Anthropic staff and Anthropic supporters. It seems to underrepresent the views of critics, but I still think it's a valuable read.")
It's deliberate that this post covers mostly specifics that I learned from Anthropic staff, and further speculation is going to be in a separate later post. I wanted to make a really clear distinction between "these are things that were said to me about Anthropic by people who have context" (which is, for the most part, people in favor of Anthropic's strategy), and my own personal interpretation and opinion on whether Anthropic's work is net positive, which is filtered through my worldview and which I think most people at Anthropic would disagree with.
Part two is more critical, which means I want to write about it with a lot of effort and care, so I expect I'll put it up in a week or two.
+1. I think this framing is more accurate than the current first paragraph (which, in my reading of it, seems to promise a more balanced and comprehensive analysis).
It does! I think I'd make it more explicit, though, that the post focuses on the views/opinions of people at Anthropic. Maybe something like this (new text in bold):
This post is the first half of a series about my attempt to understand Anthropic’s current strategy, and lay out the facts to consider in terms of whether Anthropic’s work is likely to be net positive and whether, as a given individual, you should consider applying. (The impetus for looking into this was to answer the question of whether I should join Anthropic's ops team.) As part of my research, I read a number of Anthropic’s published papers, and spoke to people within and outside of Anthropic.
This post focuses on opinions that I heard from people who work at Anthropic. The second post will focus on my own personal interpretation and opinion on whether Anthropic's work is net positive (which is filtered through my worldview and which I think most people at Anthropic would disagree with.)
This post is the first half of a series about my attempt to understand Anthropic’s current strategy and lay out the facts to consider in terms of whether Anthropic’s work is likely to be net positive and whether, as a given individual, you should consider applying. (The impetus for looking into this was to answer the question of whether I should join Anthropic's ops team.) As part of my research, I read a number of Anthropic’s published papers, and spoke to people within and outside of Anthropic.
This post contains “observations” only, which I wanted to write up as a reference for anyone considering similar questions. I will make a separate post about the inferences and conclusions I’ve reached personally about working at Anthropic, based on the info I’m sharing here.
Anthropic is planning to grow. They’re aiming to be one of the “top players”, competitive with OpenAI and DeepMind, working with a similar level of advanced models. They have received outside investment, because keeping up with the state of the art is expensive, and is going to get more so. They’ve recently been hiring for a product team, in order to get more red-teaming of models and eventually have more independent revenue streams.
I think Anthropic believes that this is the most promising route to making AGI turn out well for humanity, so it’s worth taking the risk of being part of the competition and perhaps contributing to accelerating capabilities. Alternatively stated, Anthropic leadership believes that you can’t solve the problem of aligning AGI independently from developing AGI.
My current sense is that this strategy makes sense under a particular set of premises:
I think someone could disagree or have doubts on any of these points, and I would like to know more about the range of opinions on 1-4 from people who have more technical AI safety background than I do. I’m mainly going to focus on 5, 6, and 7.
Implications for Anthropic’s structure and processes
The staff whom I spoke to believe that Anthropic’s leadership, and the Anthropic team as a whole, have thought very hard about this; that the leadership team applied considerable effort to setting the company up to avoid mission drift, and continue to be cautious and thoughtful around deploying advanced systems or publishing research.
Staff at Anthropic list the following as protective factors, some historical and some ongoing:
In this post I’ve done my best to neutrally report the information I have about Anthropic’s strategy, reasoning, and structure as relayed to me by staff and others who were kind enough to talk to me, and tried to avoid injecting my own worldview.
In my upcoming post (“Personal musings on Anthropic and incentives”), I intend to talk less neutrally about my reactions to the above and how it plays into my personal decision-making.
Note: I believe Anthropic thinks that large-scale, state-of-the-art models are necessary for their current work on constitutional AI and using AI-based reinforcement learning to train LLMs to be “helpful, harmless, and honest”, and that while some initial progress can be made on their mechanistic interpretability work on transformers using smaller models, they also believe this will need to be scaled up in future to get the full value.
I am told that Anthropic has had three doublings of headcount in two years, which is closer to 3x year-over-year growth, and may stay at more like 2x year-over-year, and that this is nothing like OpenAI’s early growth rate of 8x (where purportedly no filtering for cultural fit/alignment interest was applied).