What's up with incredibly successful geniuses having embarrassing & confusing public meltdowns? What's up with them getting into Nazism in particular?
Components of my model:
I think it's fine that Eliezer wrote it, though. Not maximally strategic by any means, but the man's done a lot and he's allowed his Hail Mary outreach plans.
I think at the time I and others were worried this would look bad for "safety as a whole", but at this point concerns about AI risk are common and varied enough, and people with those concerns often have strong local reputations w/ different groups. So this is no longer as big of an issue, which I think is really healthy for AI risk folks -- it means we can have Pause AI and Eliezer and Hendrycks and whoever all doing their own things, each able to say "no, I'm not like those folks, here's my POV", and not feeling like they should get a veto over each other. And in retrospect I think we should have anticipated and embraced this vision earlier on.
tbh, this is part of what I think went wrong with EA -- a shared sense that community reputation was a resource everyone benefitted from and everyone wanted to protect and polish, and that people should get vetoes over what others do and say. I think it's healthy that there's much less of a centralized and burnished "EA brand" these days, and much more of a bunch of people following their visions of the good. Though there's still the problem of Open Phil as a central node in the network, through which reputation effects flow.
You're missing some ways Eliezer could have predictably done better with the Time article, if he were framing it for national security folks (rather than as an attempt at brutal honesty, or perhaps most accurately a cri de coeur).
@davekasten - Eliezer wasn't arguing for bombing as retaliation for a cyberattack. Rather, he was proposing it as a preemptive measure against noncompliant AI development:
If intelligence says that a country outside the agreement is building a GPU cluster, be less scared of a shooting conflict between nations than of the moratorium being violated; be willing to destroy a rogue datacenter by airstrike.
If you zoom out several layers of abstraction, that's not too different from the escalation ladder concept described in this paper. A crucial difference, though, is that Eliezer doesn't mention escalation ladders at all -- or other concepts that would help neutral readers be like "OK, this guy gets how big a lift all this would be, and he has some ideas for building blocks to enact it". Examples include "how do you get an international agreement on this stuff", "how do you track all the chips", "how do you prevent people from building super-powerful AI as the compute threshold drops", "what about all the benefits of AI that we'd be passing up" (besides a brief mention that narrow-AI-for-bio might be worth it), and "how confident can we be that we'd know if someone was contravening the deal".
Second, there was a huge inferential gap around this idea of AGI as a key national security threat -- there's still a large one today, despite the rhetoric around AGI. And Eliezer doesn't do enough to meet readers in the middle here.
He gives the high-level argument that is sufficient for him but is/was not convincing to most people -- that AI by some metrics is growing fast, that it can in principle be superhuman, etc. Unfortunately, most people in government don't have the combination of capacity, inclination, and time to assess these kinds of first-principles arguments for themselves, and they really need concreteness in the form of evidence or expert opinion.
Also, frankly, I just think Eliezer is wrong to be as confident in his view of "doom by default" as he is, and the strategic picture looks very, very different if you place, say, 20% or even 50% probability on it.
If I had Eliezer's views, I'd probably focus on evals and red-teaming type research to provide fire alarms, convince technical folks that p(doom) was really high, and then use that technical consensus or quasi-consensus to shift policy. This isn't totally distinct from what Eliezer did in the past with more abstract arguments, and it kinda worked (there are a lot of people with >10% p(doom) in the policy world, and there was that 2023 moment where everyone was talking about it). I think in worlds where Eliezer's right, but timelines are, say, more like 2030 than 2027, there's real scope for people to be convinced of high p(doom) as AI advances, and that could motivate some real policy change.
I think a realistic example would be useful! I suspect a lot of the nuance (nuance that might feel obvious to you) is in how to apply this over a long conversation with lots of data points, amendments on both sides, etc.
Nope, you're right, I was reading quickly & didn't parse that :)
Yeah, good point on this being about HHH. I would note that some of the stuff like "kill / enslave all humans" feels much less aligned with human values (outside of a small but vocal e/acc crowd, perhaps), but it does pattern-match well as "the opposite of HHH-style harmlessness".
This technique definitely won't work on base models that are not trained on data after 2020.
The other two predictions make sense, but I'm not sure I understand this one. Are you thinking "not trained on data after 2020 AND not trained to be HHH"? If so, that seems plausible to me.
I could imagine that a model with some assistant-style training that isn't quite the same as HHH would still learn an abstraction similar to HHH-style harmlessness. But plausibly that abstraction would encode different things, e.g. it wouldn't necessarily couple "scheming to kill humans" with "conservative gender ideology". Likewise, "harmlessness" seems like a somewhat natural abstraction even in pre-training space, though it might have different subcomponents like "agreeableness", "risk-avoidance", and adherence to different cultural norms.
Thanks, that's cool to hear about!
The trigger thing makes sense intuitively, if I imagine the model can represent processes that look aligned-and-competent, aligned-and-incompetent, or misaligned-and-competent. The trigger word can delineate when to do case 1 vs. case 3, while examples lacking a trigger word might look like a mix of all three.
Fascinating paper, thank you for this work!
I'm confused about how to parse this. One response is "great, maybe 'alignment' -- or specifically being a trustworthy assistant -- is a coherent direction in activation space."
Another is "shoot, maybe misalignment is convergent, it only takes a little bit of work to knock models into the misaligned basin, and it's hard to get them back." Waluigi effect type thinking.
Relevant parameters:
Neat, thanks a ton for the algorithmic-vs-labor update -- I appreciated that you'd distinguished those in your post, but I forgot to carry that through in mine! :)
And oops, I really don't know how I got to 1.6 instead of 1.5 there. Thanks for the flag, have updated my comment accordingly!
The square relationship idea is interesting -- that factor of 2 is a huge deal. It would be neat to see a Guesstimate or Squiggle version of this calculation that tries to account for the various nuances Tom mentions and has error bars on each of the terms, so we get both a distribution of r and a sensitivity analysis. (Maybe @Tom Davidson already has this somewhere? If not, I might try to make a crappy version myself, or poke talented folks I know to do a good version :)
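To gesture at the shape I have in mind, here's a minimal Monte Carlo sketch in Python (standing in for Guesstimate/Squiggle). The distributions, parameter names, and the combining formula are all placeholders I made up for illustration -- not Tom's actual model -- and the real terms (including the square relationship and factor of 2) would need to be swapped in.

```python
# A minimal Monte Carlo sketch (Python standing in for Guesstimate/Squiggle).
# Every distribution, parameter name, and the combining formula below is a
# placeholder for illustration -- not Tom's actual model.
import numpy as np
from scipy.stats import spearmanr

rng = np.random.default_rng(0)
N = 100_000

# Hypothetical inputs, each with "error bars" expressed as a distribution.
doubling_gain    = rng.lognormal(mean=np.log(3.0), sigma=0.3, size=N)  # output gain per doubling of inputs
labor_share      = rng.uniform(0.3, 0.7, size=N)                       # share of progress attributable to labor
parallel_penalty = rng.lognormal(mean=np.log(0.5), sigma=0.2, size=N)  # stepping-on-toes exponent

# Placeholder formula for r -- swap in the real relationship (including the
# square / factor-of-2 term discussed above) once it's pinned down.
r = labor_share * np.log2(doubling_gain) / parallel_penalty

print(f"median r = {np.median(r):.2f}, "
      f"80% interval = [{np.quantile(r, 0.1):.2f}, {np.quantile(r, 0.9):.2f}]")

# Crude sensitivity analysis: rank-correlate each input with r.
for name, x in [("doubling_gain", doubling_gain),
                ("labor_share", labor_share),
                ("parallel_penalty", parallel_penalty)]:
    print(f"sensitivity of r to {name}: Spearman rho = {spearmanr(x, r).correlation:.2f}")
```

Even a crappy version along these lines would give the two outputs I mentioned: an interval on r, plus a ranking of which terms' error bars matter most.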
Thanks for your thoughts!
I was thinking Kanye as well -- hence being more interested in the general pattern. I really wasn't intending to subtweet one person in particular; I have some sense of the particular dynamics there, though your comment is illuminating. :)