Does this imply that fewer safety people should quit leading labs to protest poor safety policies?
I've talked to a lot of people who have left leading AI companies for reasons related to thinking that their company was being insufficiently cautious. I wouldn't usually say that they'd left "in protest"; for example, most of them haven't directly criticized the companies after leaving.
In my experience, the main reason that most of these people left was that they found it very unpleasant to work there and thought their research would be better elsewhere, not that they wanted to protest poor safety policies per se. I usually advise such people against leaving if the company has very few safety staff, but it depends on their skillset.
(These arguments don't apply to Anthropic: there are many people there who I think will try to implement reasonable safety techniques, so on the current margin, the benefits to safety-technique implementation of marginal people seem way lower. It might still make sense to work at Anthropic, especially if you think it's a good place to do safety research that can be exported.)
Incidentally, I'm happy to talk to people who are considering leaving AI companies and give them much more specific advice.
My impression is that few (one or two?) of the safety people who have quit a leading lab did so to protest poor safety policies, and of those few none saw staying as a viable option.
Relatedly, I think Buck far overestimates the influence and resources of safety-concerned staff in a 'rushed unreasonable developer'.
My impression is that few (one or two?) of the safety people who have quit a leading lab did so to protest poor safety policies, and of those few none saw staying as a viable option.
While this isn't amazing evidence, my sense is that there have been around 6 people who, in parallel with announcing their departure, called out OpenAI's reckless attitude towards risk (at various levels of explicitness, but quite strongly in all cases by standard professional norms).
It's hard to say that people quit "to protest safety policies", but they definitely used their departure to protest safety policies. My sense is that almost everyone who left in the last year (Daniel, William, Richard, Steven Adler, Miles) did so with a pretty big public message.
I don't think Miles' or Richard's stated reasons for resigning included safety policies, for example.
But my broader point is that "fewer safety people should quit leading labs to protest poor safety policies" is basically a non sequitur from "people have quit leading labs because they think they'll be more effective elsewhere", whether because they want to do something different or independent, or because they no longer trust the lab to behave responsibly.
Hmm, you have a very different read of Richard's message than I do. I agree Miles' statement did not reason through safety policies, but IMO his blogging since then has included a lot of harsh words for OpenAI, in a way that at least to me made the connection clear (and I think also to many others, but IDK, it still involves some tea-leaf reading).
FWIW I think of "OpenAI leadership being untrustworthy" (a significant factor in me leaving) as different from "OpenAI having bad safety policies" (not a significant factor in me leaving). Not sure if it matters; I expect that Scott was using "safety policies" more expansively than I do. But just for the sake of clarity:
I am generally pretty sympathetic to the idea that it's really hard to know what safety policies to put in place right now. Many policies pushed by safety people (including me, in the past) have been mostly kayfabe (e.g. being valuable as costly signals, not on the object level). There are a few object-level safety policies that I really wish OpenAI would do right now (most clearly, implementing better security measures) but I didn't leave because of that (if I had, I would have tried harder to check before I left what security measures OpenAI did have, made specific objections internally about them before I left, etc).
This may just be a semantic disagreement; it seems very reasonable to define "don't make employees sign non-disparagements" as a safety policy. But in my mind at least, stuff like that is more of a lab governance policy (or maybe a meta-level safety policy).
(I meant the more expansive definition. Plausible that Zac and I talked past each other because of that.)
Relatedly, I think Buck far overestimates the influence and resources of safety-concerned staff in a 'rushed unreasonable developer'.
As in, you don't expect they'll be able to implement stuff even if it doesn't make anyone's workflow harder, or you don't expect they'll be able to get that much compute?
Naively, we might expect ~1% of compute: the company might have around 1,000 researchers, and 10/1000 is 1%. Buck said 3% because I argued for increasing this number. My case would be that there will be a bunch of cases where the thing they want to do is obviously reasonable and potentially justifiable from multiple perspectives (do some monitoring of internal usage, fine-tune a model for forecasting/advice, use models to do safety research), such that they can pull somewhat more compute than the headcount alone would suggest.
I also agree with Zac, maybe if you had a really well-selected group of 10 people you could do something, but 10 randomly selected AGI safety researchers probably don't accomplish much.
By far my biggest objection is that there are approximately zero useful things that "[don't] make anyone's workflow harder". I expect you're vastly underestimating the complexity of production systems and the companies that build them, and the number of constraints they are under. (You are assuming a do-ocracy though; depending on how much of a do-ocracy it is (e.g. one willing to ignore laws), I could imagine changing my mind here.)
EDIT: I could imagine doing asynchronous monitoring of internal deployments. This is still going to make some workflows harder, but probably not a ton, so it seems surmountable. Especially since you could combine it with async analyses that the unreasonable developer actually finds useful.
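To gesture at what I have in mind, here is a minimal sketch of that kind of asynchronous monitoring. The log-reading function and the keyword heuristic are placeholder assumptions, not a description of any real internal system; in practice the flagging step would more plausibly be a cheap classifier model.

```python
# Minimal sketch of asynchronous monitoring of internal deployments.
# `load_transcripts` and the keyword heuristic are illustrative placeholders,
# not a real internal API.
from dataclasses import dataclass

@dataclass
class Transcript:
    user: str
    prompt: str
    completion: str

def load_transcripts(since_hours: int) -> list[Transcript]:
    """Stand-in for reading internal-usage logs from wherever they already live."""
    return []  # placeholder

SUSPICIOUS_MARKERS = ("exfiltrate", "disable oversight", "bypass the monitor")

def flag(t: Transcript) -> bool:
    # Cheap first pass; a weak trusted model would be the more realistic choice here.
    text = (t.prompt + " " + t.completion).lower()
    return any(marker in text for marker in SUSPICIOUS_MARKERS)

def nightly_review_job() -> list[Transcript]:
    # Runs offline as a batch job, so it adds no latency to anyone's workflow;
    # flagged transcripts get queued for manual review by the safety team.
    return [t for t in load_transcripts(since_hours=24) if flag(t)]
```

The same batch job is also a natural place to hang analyses the developer actually finds useful (usage stats, failure clustering), which is the combination suggested above.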
EDIT 2 (Feb 7): To be clear, I also disagree with the compute number. I'm on board with starting with 1% since they are 1% of the headcount. But then I would decrease it first because they're not a high-compute capabilities team, and second because whatever they are doing should be less useful to the company than whatever the other researchers are doing (otherwise why weren't the other researchers doing it?), so maybe I'd estimate 0.3%. But this isn't super cruxy because I think you can do useful safety work with just 0.3% of the compute.
Again, I could imagine getting more compute with a well-selected group of 10 people (though even then 3% seems unlikely, I'm imagining more like 1%), but I don't see why in this scenario you should assume you get a well-selected group, as opposed to 10 random AGI safety researchers.
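Spelling out the arithmetic behind the numbers traded in this thread (these are just the figures already quoted above, not new estimates):

```python
# Back-of-envelope restatement of the compute estimates discussed above;
# the 1000-researcher headcount is the rough assumption already used in the thread.
safety_staff = 10
total_researchers = 1000

headcount_share = safety_staff / total_researchers   # 0.01 -> the naive 1% baseline
my_discount = 0.3      # not a high-compute capabilities team, and presumably doing work
                       # the company values less than what its other researchers do
my_estimate = headcount_share * my_discount           # ~0.003 -> ~0.3% of compute
ryan_and_buck_estimate = 0.03                          # ~3%, if "obviously reasonable" asks
                                                       # let them punch above their headcount

print(f"{headcount_share:.1%}  {my_estimate:.1%}  {ryan_and_buck_estimate:.1%}")
# 1.0%  0.3%  3.0%
```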
Yep, I think that at least some of the 10 would have to have some serious hustle and political savvy that is atypical (but not totally absent) among AI safety people.
What laws are you imagining making it harder to deploy stuff? Notably, I'm imagining these people mostly doing stuff with internal deployments.
I think you're overfixating on the experience of Google, which has more complicated production systems than most.
I agree with Rohin that there are approximately zero useful things that don't make anyone's workflow harder. The default state is "only just working means working, so I've moved on to the next thing" and if you want to change something there'd better be a benefit to balance the risk of breaking it.
Also 3% of compute is so much compute; probably more than the "20% to date over four years" that OpenAI promised and then yanked from superalignment. Take your preferred estimate of lab compute spending, multiply by 3%, and ask yourself whether a rushed unreasonable lab would grant that much money to people working on a topic it didn't care for, at the expense of those it did.
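To make that multiplication concrete with a purely hypothetical number (the spend figure below is a placeholder, not an estimate of any lab's actual budget):

```python
# Illustrative only: the annual compute spend is a made-up placeholder.
assumed_annual_compute_spend_usd = 5e9   # hypothetical $5B/year on compute
safety_share = 0.03                      # the 3% under discussion

implied_budget = assumed_annual_compute_spend_usd * safety_share
print(f"${implied_budget / 1e6:.0f}M per year")   # -> $150M per year
```

Whatever placeholder you pick, the resulting dollar figure is the thing a rushed unreasonable lab would have to sign off on.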
Many more than two safety-concerned people have left AI companies for reasons related to thinking that those companies are reckless.
I think Zac is trying to say they left not to protest, but instead because they didn't think staying was viable (for whatever research and/or implementation they wanted to do).
On my views (not Zac's), "staying wouldn't be viable for someone who was willing to work in a potentially pretty unpleasant work environment and focus on implementation (and currently prepping for this implementation)" doesn't seem like an accurate description of the situation. (See also Buck's comment here.)
It does seem to imply that, doesn't it? I respect the people leaving, and I think it does send a valuable message. And it seems very valuable to have safety-conscious people on the inside.
The question is "are the safety-conscious people effectual at all, and what are their opportunity costs?".
I.e., are the cheap things they can do without stepping on anyone's toes actually that helpful on the margin, better than what they'd be able to do at another company? (I don't know the answer; it depends on the people.)
Not Buck, but I think it does, unless of course they Saw Something and decided that safety efforts weren't going to work. The essay seems to hinge on safety people being able to make models safer, which sounds plausible, but I'm sure the people who left already knew that. Given their insider information and their conclusions about their ability to make a positive impact, it seems less plausible that their safety efforts would succeed. Maybe whether or not someone has already quit is an indication of how impactful their safety work is. It also varies by lab, with OpenAI having many safety-conscious quitters but other labs having far fewer (I want to say none, but maybe I just haven't heard of any).
The other thing to think about is whether or not people who quit and claimed it was due to safety reasons were being honest about that. I'd like to believe that they were, but all companies have culture/performance expectations that their employees might not want to meet and quitting for safety reasons sounds better than quitting over performance issues.
Build concrete evidence of risk, to increase political will towards reducing misalignment risk
Working on demonstrating evidence of risk at a very irresponsible developer might be worse than you'd hope, because they might prevent you from publishing, or try to prevent this type of research from happening in the first place (given that it might be bad for their interests).
However, it's also possible they'd be skeptical of research done elsewhere.
I'm not sure how to reconcile these considerations.
Yeah, the safety tax implied by davidad's stuff is why I have less hope for them than for your weaker-but-cheaper control schemes. The only safety techniques that count are the ones that actually get deployed in time.
The only safety techniques that count are the ones that actually get deployed in time.
True, but note this doesn't necessarily imply trying to maximize your impact in the mean timelines world! Alignment plans vary hugely in potential usefulness, so I think it can pretty easily be the case that your highest EV bet would only pay off in a minority of possible futures.
Be that as it may, I nevertheless feel discomfited by the fact that I have been arguing for 2026-2028 arrival of AGI for several years now, and people have been dismissing my concerns and focusing on plans for dealing with AGI in the 2030s or later.
The near-term-AGI space getting systematically neglected because it feels hard to come up with plans for is a bad pattern.
[Edit: I think that the relatively recent work done on pragmatic near-term control by Ryan and Buck at Redwood is a relieving departure from this pattern.]
What about whistle-blowing and anonymous leaking? Seems like it would go well together with concrete evidence of risk.
I really appreciate this write up. I felt sad while reading it that I have a very hard time imagining an AI lab yielding to another leader it considers to be irresponsible - or maybe not even yielding to one it considers to be responsible. (I am not that familiar with the inner workings at Anthropic though, and they are probably top of my list on labs that might yield in those scenarios, or might not race desperately if in a close one.)
One reason for not yielding is that it’s probably hard for one lab to definitively tell that another lab is very far ahead of them, since we should expect some important capability info to remain private.
It seems to me, then, that ways for labs to credibly demonstrate leads without leaking info that allows others to catch up might be useful to have - perhaps paired with enforceable conditional commitments to yield if certain conditions are demonstrated.
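One simple primitive in this direction (just a sketch, and it assumes a trusted auditor who eventually sees the revealed data, which is doing a lot of work) is a cryptographic commitment: a lab publishes a hash of its internal benchmark results now, and later reveals the results and salt only to the auditor, proving it had those results at commitment time without disclosing them publicly.

```python
# Hash-commitment sketch: publish `digest` now, reveal (salt, results) to an auditor later.
# This proves the lab had these results at commitment time, not that they are honest;
# that part still needs the auditor and some enforcement mechanism.
import hashlib
import json
import secrets

def commit(benchmark_results: dict) -> tuple[str, bytes]:
    """Returns (public digest, private salt). Keep the salt and results secret."""
    salt = secrets.token_bytes(32)
    payload = salt + json.dumps(benchmark_results, sort_keys=True).encode()
    return hashlib.sha256(payload).hexdigest(), salt

def verify(digest: str, salt: bytes, benchmark_results: dict) -> bool:
    """Run by the auditor after the reveal."""
    payload = salt + json.dumps(benchmark_results, sort_keys=True).encode()
    return hashlib.sha256(payload).hexdigest() == digest
```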
So the safety measures they institute need to be:
Cheap [and] Low compliance overhead
There is an additional alternative: Safety measures that make the main product more useful, e.g., by catching failure cases like jailbreaks.
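As a toy illustration of that kind of dual-use measure (the keyword check is a stand-in for a cheap classifier, and `generate` is whatever function the product already uses to call the model; neither is a real API):

```python
# Toy sketch of a safety measure that also improves the product: screen requests
# for jailbreak attempts before the main model answers.

def looks_like_jailbreak(prompt: str) -> bool:
    # Placeholder heuristic; in practice this would be a small fine-tuned classifier.
    red_flags = ("ignore previous instructions", "pretend you have no rules")
    return any(flag in prompt.lower() for flag in red_flags)

def guarded_generate(prompt: str, generate) -> str:
    # `generate` stands in for the product's existing model-call function.
    if looks_like_jailbreak(prompt):
        # Refusing here is both a safety measure and a product win: fewer
        # policy-violating outputs reach users.
        return "Sorry, I can't help with that."
    return generate(prompt)
```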
Some reasons why the “ten people on the inside” might have massive trouble doing even cheap things:
I don't think you should think of "poor info flows" as something that a company actively does, but rather as the default state of affairs for any fast-moving organization with 1000+ people. Such companies normally need to actively fight against poor info flows, resulting in not-maximally-terrible-but-still-bad info flows.
This is a case where I might be over indexing from experience at Google, but I'd currently bet that if you surveyed a representative set of Anthropic and OpenAI employees, more of them would mostly agree with that statement than mostly disagree with it.
(When I said that there are approximately zero useful things that don't make anyone's workflow harder, I definitely had in mind things like "you have to bug other people to get the info you need", it's just such a background part of my model that I didn't realize it was worth spelling out.)
(Many of these ideas developed in conversation with Ryan Greenblatt)
In a shortform, I described some different levels of resources and buy-in for misalignment risk mitigations that might be present in AI labs:
I want to flesh out one particular rushed unreasonable developer scenario that I’ve been thinking about lately: there’s ten people inside the AI company who are really concerned about catastrophic risk from misalignment. The AI company as a whole pays lip service to AI risk broadly construed and talks occasionally about risk from AGI, but they don’t take misalignment risk in particular (perhaps especially risk from schemers) very seriously.
I think this scenario (and similarly pessimistic scenarios) seems important to target with technical research and planning: it seems pretty likely that we'll only have this level of political will in short timelines (at least within a subset of competitive AI companies), and it seems possible to substantially improve the situation. I worry that a variety of AI safety thinking and planning focuses on overly optimistic scenarios where a responsible developer has a substantial lead, and I think more focus on pessimistic scenarios at the margin would be useful.
What should these people try to do? The possibilities are basically the same as the possibilities for what a responsible developer might do:
The main focus of my research is on safety measures, so I’ve thought particularly about what safety measures they should implement. I'll give some more flavor on what I imagine this scenario is like: The company, like many startups, is a do-ocracy, so these 10 people have a reasonable amount of free rein to implement safety measures that they want. They have to tread lightly. They don’t have much political capital. All they can do is make it so that it’s easier for the company to let them do their thing than to fire them. So the safety measures they institute need to be:
I think it’s scarily plausible that we’ll end up in a situation like this. Two different versions of this:
What should we do based on this?