Previously "Lanrian" on here. Research analyst at Redwood Research. Views are my own.
Feel free to DM me, email me at [my last name].[my first name]@gmail.com, or send something anonymously to https://www.admonymous.co/lukas-finnveden
then the schemers need to optimise their research for looking good to each variant
Not necessarily. The nice AIs also need to be able to win a debate against the schemers, as judged by humans. It's not enough for the variants to be able to recognize poor research if they can't show their work in an irrefutable (by the schemers) way.
I explore similar considerations here in The Case for Mixed Deployment (5 min read)
Oh nice, thanks!
However, I think that (contra your proposal) most of the oomph comes from the AIs monitoring and cross-examining each other's work, rather than running them in parallel.
Yeah, I think the case for diverse monitors to achieve high-stakes control is actually a lot more straightforward. It's easy to implement a monitor in a way such that it has very little ability to autonomously cause bad things to happen, so basically the only cost is the extra compute it takes to train and run it. And in the high-stakes/concentrated failures regime, humans will typically be able to recognize attempted catastrophes once pointed out to them.
I agree that debate-style approaches seem good. I do think it seems a lot more feasible to judge such debates once the AIs have already accomplished a lot of work, vs. early on in the research when AIs are making taste-driven judgments about direction, so debate complements letting AIs do a lot of work in parallel.
Here's a story for how we could get lots of AI help with AI safety research even if schemers are somewhat common and diffuse control doesn't work to get them to help us:
Downsides with this approach:
Overall, thinking about this makes me:
I think supermajorities could do things like this pretty reliably, if it's something they care a lot about. In the US, if a supermajority of people in congress want something to happen, and are incentivized to vote their beliefs because a supermajority of voters agree, then they can probably pass a law to make it happen. The president would probably be part of the supermajority and therefore cooperative, and it might work even if they aren't. Laws can do a lot.
Of course, it's easy to construct supermajorities of citizens who can't do this kind of thing, if they disproportionately include non-powerful people and don't include powerful people. But that's more about power being unevenly distributed between humans, and less about humans as a collective being disempowered.
Dario’s strategy is that we have a history of pulling through seemingly at the last minute under dark circumstances. You know, like Inspector Clouseau, The Flash or Buffy the Vampire Slayer.
He is the CEO of a frontier AI company called Anthropic.
Nice pun. I don't think that the anthropic shadow is real though. See e.g. here.
It seems like I have a tendency to get out of my bets too late (same thing happened with Bitcoin), which I'll have to keep in mind in the future.
Reporting from the future: Bitcoin has kept going up, so I don't think you made the mistake of getting out of it too late.
(I'm also confused about why you thought so in April 2020, given that Bitcoin's price was pretty high at the time.)
Nagy et al (2013) (h/t Carl) looks at time series data of 62 different technologies and compares Wright's law (cost decreases as a power law of cumulative production) vs. generalized Moore's law (technologies improve exponentially with time).
They find that the two laws are close to equally good, because exponential increases in production are so common, but that Wright's law is slightly better. (They also test various other measures, e.g. economies of scale with instantaneous production levels, time and experience, and experience and scale, but find that Wright's law does best.)
I don't know what they find about semiconductors in particular, but given their larger dataset I'm inclined to prefer Wright's law over Moore's for novel domains.
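For reference, the two functional forms being compared are roughly as follows (my paraphrase of the standard formulations; Nagy et al.'s exact parameterization may differ):

```latex
% Wright's law: unit cost falls as a power law of cumulative production x(t)
C(t) = C_0 \left( \frac{x(t)}{x_0} \right)^{-\alpha}

% Generalized Moore's law: unit cost falls exponentially with calendar time t
C(t) = C_0 \, e^{-\beta (t - t_0)}
```

Note that if cumulative production itself grows exponentially, x(t) ≈ x_0 e^{g t}, then Wright's law collapses into an exponential in time with rate β = αg, which is why the two laws are so hard to tell apart on most of these time series.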
(FYI @Daniel Kokotajlo, @ryan_greenblatt.)
Dario strongly implies that Anthropic "has this covered" and wouldn't be imposing a massively unreasonable amount of risk if Anthropic proceeded as the leading AI company with a small buffer to spend on building powerful AI more carefully. I do not think Anthropic has this covered and in an (optimistic for Anthropic) world where Anthropic had a 3 month lead I think the chance of AI takeover would be high, perhaps around 20%.
I didn't get this impression. (Or maybe I technically agree with your first sentence, if we remove the word "strongly", but I think the focus on Anthropic being in the lead is weird and that there's incorrect implicature from talking about total risk in the second sentence.)
As far as I can tell, the essay doesn't talk much at all about the difference between Anthropic being 3 months ahead vs. 3 months behind.
"I believe the only solution is legislation" + "I am most worried about societal-level rules" and associated statements strongly imply that there's significant total risk even if the leading company is responsible. (Or alternatively, that at some point, absent regulation, it will be impossible to be both in the lead and to take adequate precautions against risks.)
I do think the essay suggests that the main role of legislation is to (i) make the 'least responsible players' act roughly as responsibly as Anthropic, and (ii) to prevent the race & commercial pressures from heating up even further, which might make it "increasingly hard to focus on addressing autonomy risks" (thereby maybe forcing Anthropic to do less to reduce autonomy risks than they are now).
Which does suggest that, if Anthropic could keep spending their current amount of overhead on safety, then there wouldn't be a huge amount of risk coming from Anthropic's own models. And I would agree with you that this is very plausibly false, and that Anthropic will very plausibly be forced to either proceed in a way that creates a substantial risk of Claude taking over, or massively increase their ratio of effort on safety vs. capabilities relative to where it is today. (In which case you'd want legislation to substantially reduce commercial pressures relative to where they are today, and not just make everyone invest about as much in safety as Anthropic is doing today.)
Thanks, that makes sense.
One way to help clarify the effect from (I) would be to add error bars to the individual data points. Presumably models with fewer data points would have wider error bars, and then it would make sense that they pull less hard on the regression.
Error bars would also be generally great to better understand how much weight to give to the results. In cases where you get a low p-value, I have some sense of this. But in cases like figure 13 where it's a null result, it's hard to tell whether that's strong evidence of an absence of effect, or whether there's just too little data and not much evidence either way.
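To illustrate what I mean by "pull less hard" (purely a toy sketch with made-up numbers; I don't know what regression setup the post actually uses, and I'm assuming something like inverse-variance weighted least squares):

```python
import numpy as np

# Toy illustration: points with larger error bars get down-weighted in a fit.
# x: some model-level covariate; y: the estimated effect per model;
# sigma: the standard error on each y (wider for models with fewer data
# points). All values here are made up.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([0.9, 2.1, 2.9, 4.2, 7.5])      # last point is way off-trend...
sigma = np.array([0.2, 0.2, 0.2, 0.2, 2.0])  # ...but has a wide error bar

# Unweighted fit: every point pulls equally hard on the slope.
slope_ols, _ = np.polyfit(x, y, 1)

# Weighted fit: np.polyfit's `w` multiplies the residuals, so w = 1/sigma
# gives the usual inverse-variance weighting on squared residuals.
slope_wls, _ = np.polyfit(x, y, 1, w=1 / sigma)

print(f"unweighted slope: {slope_ols:.2f}, weighted slope: {slope_wls:.2f}")
```

In this toy example the noisy, wide-error-bar point barely moves the weighted slope (it stays close to the slope of the four precise points), while it drags the unweighted slope up substantially.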
I think self-exfiltration via manipulation seems pretty hard. I think we're likely to have transformatively useful systems that can't do that, for some amount of time. (Especially since there's no real reason to train them to be good at manipulation, though of course they might generalize from other stuff.) I agree people should definitely be thinking about it as a potential problem and try to mitigate and estimate the risk.
Some scientists do optimize for (something like) "impressiveness" of their work, regardless of whether it's good or bad. It's true that they don't intentionally compromise impressiveness in order to make the work more misleading. That said, if some models optimize for impressiveness, and some models compromise impressiveness to make their work more misleading, then I guess the humans should be somewhat more likely to use the more impressive work? So maybe that limits the degree of influence that intentionally misleading work could have.
I am definitely pretty worried about humans' ability to judge what work is good or bad. In addition to the disanalogy you mention, I'd also highlight (i) non-experts typically struggle to judge which expert is right in a dispute, and if humans are slow & dumb enough relative to the AIs, humans might be analogous to non-experts, and (ii) it's plausible to me that human science benefits a lot from the majority of scientists (and maybe especially the most capable scientists?) having at least a weak preference for truth, e.g. because it's easier to recognize a few bad results when you can lean on the presumed veracity of most established results.