What important questions would you want to see discussed and debated here about Anthropic? Suggest and vote below.
(This is the third such poll; see the first and second, linked.)
How to use the poll
- Reacts: Click on the agree/disagree reacts to help people see how much disagreement there is on the topic.
- Karma: Upvote positions that you'd like to read discussion about.
- New Poll Option: Add new positions for people to take sides on. Please add the agree/disagree reacts to new poll options you make.
The goal is to show people where a lot of interest and disagreement lies. This can be used to find discussion and dialogue topics in the future.
My Anthropic take, which is sort of a reply to this thread between @aysja and @LawrenceC but felt like enough of a new topic to just put here.
It seems overwhelmingly the case that Anthropic is trying to thread some kind of needle between "seeming like a real, profitable AI company that is worth investing in" and "at the very least paying lip service to, and maybe actually taking really seriously, x-risk."
(This all goes for OpenAI too. OpenAI seems much worse on these dimensions to me right now. Anthropic feels like it still has the potential to actually be a good/safe org, whereas OpenAI feels beyond hope atm, so I'm picking on Anthropic.)
For me, the open, interesting questions are:
Like, it seems like Anthropic is trying to market itself to investors and consumers as "our products are powerful (and safe)", and trying to market itself to AI Safety folk as "we're being responsible as we develop along the frontier." These are naturally in tension.
I think it's plausible (although I am suspicious) that Anthropic's strategy is actually good. I.e., maybe you really do need to iterate on frontier AI to do meaningful safety work, maybe you do need to stay on the frontier because the world is accelerating whether Anthropic wants it to or not, maybe pausing now is bad. Maybe this all means you need a lot of money, which means you need investors and consumers to believe your product is good.
But, like, for the AI safety community to be epistemically healthy, we need to have some way of engaging with this question.
I would like to live in a world where it's straightforwardly good to always spell out true things loudly/clearly. I'm not sure I have the luxury of living in that world. I think I need to actually engage with the possibility that it's necessary for Anthropic to murkily say one thing to investors and another thing to AI safety peeps. But, I do not think Anthropic has earned my benefit of the doubt here.
But the way I wish the conversation were playing out is less like "did Anthropic say a particular misleading thing?" and more like "how should EA/x-risk/safety folk comport themselves, such that they don't have to trust Anthropic? And how should Anthropic comport itself, such that it doesn't have to be running on trust, when it absorbs talent and money from the EA landscape?"
I think it’s pretty unlikely that Anthropic’s murky strategy is good.
In particular, I think that balancing building AGI with building AGI safely only goes well for humanity in a pretty narrow range of worlds. Like, if safety is relatively easy and can roughly keep pace with capabilities, then I think this sort of thing might make sense. But the more the expected world departs from this—the more that you might expect safety to be way behind capabilities, and the more you might expect that it’s hard to notice just how big that gap is and/or how much of...