Fwiw, my own position is that for both infosec and racing, it's the brute fact that the USG sees fit to centralise all resources and develop AGI asap that would cause China to 1) try much harder to steal the weights than if private companies had developed the same capabilities themselves, and 2) try much harder to race to AGI themselves.
Quick clarification on terminology. We've used 'centralised' to mean "there's just one project doing pre-training". So having regulations that enforce good safety practice or gate-keep new training runs doesn't count. I think this is a more helpful use of the term, as it directly links to the power concentration concerns we've raised. I think the best versions of non-centralisation will involve regulations like these, but that's importantly different from one project having sole control of an insanely powerful technology.
Compelling experimental evidence
Currently there's basically no empirical evidence that misaligned power-seeking emerges by default, let alone scheming. If we got strong evidence that scheming happens by default, then I expect that all projects would do way more work to check for and avoid scheming, whether centralised or not. Attitudes would change at all levels: project technical staff, technical leadership, regulators, open-source projects.
You can also iterate experimentally to understand the conditions that cause scheming, allowing empirical progress on scheming in a way that was never before possible.
This seems like a massive game changer to me. I truly believe that if we picked one of today's top-5 labs at random and all the others were closed, this would be meaningfully less likely to happen and that would be a big shame.
Scalable alignment solution
You're right that there are IP reasons against sharing. I believe it would be in line with many companies' missions to share, but they may not. Even so, there's a lot you can do with aligned AGI. You could use it to produce compelling evidence about whether other AIs are aligned. You could find a way of proving to the world that your AI is aligned, which other labs can't replicate, giving you an economic advantage. It would be interesting to explore threat models where AI takes over despite a project solving this, and it doesn't seem crazy, but I'd predict we'd conclude the odds are better if there are 5 projects of which 2 have solved it than if there's one project with a 2/5 chance of success.
RSPs
Maybe you think everything is hopeless unless there are fundamental breakthroughs? My view is that we face severe challenges ahead, and have very tough decisions to make. But I believe that a highly competent and responsible project could likely find a way to leverage AI systems to solve AI alignment safely. Doing this isn't just about "having the right values". It's much more about being highly competent, focussed on what really matters, prioritising well, and having good processes. If just one lab figures out how to do all this in a way that is commercially competitive and viable, that's a proof of concept that developing AGI safely is possible. Excuses won't work for other labs, as we can say "well, lab X did it".
Overall
I'm not confident "one apple saves the bunch". But I expect most people on LW to assume "one apple spoils the bunch", and I think the alternative perspective is very underrated. My synthesis would probably be that at current capability levels, and in the next few years, "one apple saves the bunch" wins by a large margin, but that at some point when AI is superhuman it could easily reverse, because AI gets powerful enough to design world-ending WMDs.
(Also, I wanted to include this debate in the post but we felt it would over-complicate things. I'm glad you raised it and strongly upvoted your initial comment.)
I agree with Rose's reply, and would go further. I think there are many actions that just one responsible lab could take that would completely change the game board:
Your comment argues that "one bad apple spoils the bunch", but it's also plausible that "one good apple saves the bunch".
I think the argument for combining separate US and Chinese projects into one global project is probably stronger than the argument for centralising US development. That's because racing between US companies can potentially be handled by USG regulation, but racing between the US and China can't be similarly handled.
OTOH, the 'info security' benefits of centralisation mostly wouldn't apply.
I think a massive power imbalance makes it less likely that the post-AGI world is one where many different actors with different beliefs and values can experiment, interact, and reflect. And so I'd expect its long-term future to be worse.
Thanks for the pushback!
Reducing access to these services will significantly disempower the rest of the world: we’re not talking about whether people will have access to the best chatbots or not, but whether they’ll have access to extremely powerful future capabilities which enable them to shape and improve their lives on a scale that humans haven’t previously been able to.
If you're worried about this, I don't think you quite realise the stakes. Capabilities mostly proliferate anyway. People can wait a few more years.
Our worry here isn't that people won't get to enjoy AI benefits for a few years. It's that there will be a massive power imbalance between those with access to AI and those without. And that could have long-term effects.
Thanks! Great point.
We do say:
Bureaucracy. A centralised project would probably be more bureaucratic.
But you're completely right that we frame this as a reason that centralisation might not increase the lead on China, and therefore as a point against centralisation.
Whereas you're presumably saying that slowing down progress would buy us more time to solve alignment, and so would frame it as a significant point for centralisation.
I personally don't favour bureaucracy that slows things down and reduces competence in a non-targeted way -- I think competently prioritising work to reduce AI risk during the AI transition will be important. But I think your position is reasonable here.
It seems like you think CICERO and Sydney are bigger updates than I do. Yes, there's a continuum of cases of catching deception where it's reasonable for the ML community to update on the plausibility of AI takeover. Yes, it's important that the ML community updates before AI systems pose significant risk, and there's a chance that they won't do so. But I don't see the lack of a strong update towards p(doom) from CICERO as good evidence that the ML community won't update if we get evidence of systematic scheming (including trying to break out of the lab when there was never any training signal incentivising that behaviour). I think that kind of evidence would be much more relevant to AI takeover risk than CICERO.
To clarify my position in case I've been misunderstood: I'm not saying the ML community will definitely update in time. I'm saying that if there is systematic scheming and we catch it red-handed (as I took Buck to be describing), then there will likely be a very significant update. And CICERO seems like a weak counterexample (but not zero evidence).
I think people did point out that CICERO lies, and that was a useful update about how shallow attempts to prevent AI deception can fail. I think it could be referenced, and has been referenced, in relevant discussions. I don't think CICERO provides much, if any, evidence that we'll get the kind of scheming that could lead to AI takeover, so it's not at all surprising that the empirical ML community hasn't done a massive update. I think the situation will be very different if we do find an AI system that is systematically scheming enough to pose non-negligible takeover risk and catch it 'red-handed'.
It seems quite different to the ESG case. Customers don't personally benefit from using a company with good ESG. They will benefit from using an aligned AI over a misaligned one.
Again though, customers currently have no selfish reason to care.
It's quite common for only a very small number of people to have the individual ability to verify a safety case, but for many more to defer to their judgement. People may defer to an AISI, or a regulatory agency.