Nope, you're right, I was reading quickly & didn't parse that :)
Yeah, good point on this being about HHH. I would note that some of the stuff like "kill / enslave all humans" feels much less aligned to human values (outside of a small but vocal e/acc crowd perhaps), but it does pattern-match well as "the opposite of HHH-style harmlessness"
> This technique definitely won't work on base models that are not trained on data after 2020.
The other two predictions make sense, but I'm not sure I understand this one. Are you thinking "not trained on data after 2020 AND not trained to be HHH"? If so, that seems plausible to me.
I could imagine a model with some assistantship training that isn't quite the same as HHH would still learn an abstraction similar to HHH-style harmlessness. But plausibly that would encode different things, e.g. it wouldn't necessarily couple "scheming to kill humans" and "conservative gender ideology". Likewise, "harmlessness" seems like a somewhat natural abstraction even in pre-training space, though there might be different subcomponents like "agreeableness", "risk-avoidance", and adherence to different cultural norms.
Thanks, that's cool to hear about!
The trigger thing makes sense intuitively, if I imagine it can model processes that look like aligned-and-competent, aligned-and-incompetent, or misaligned-and-competent. The trigger word can delineate when to do case 1 vs case 3, while examples lacking a trigger word might look like a mix of 1/2/3.
Fascinating paper, thank you for this work!
I'm confused about how to parse this. One response is "great, maybe 'alignment' -- or specifically being a trustworthy assistant -- is a coherent direction in activation space."
Another is "shoot, maybe misalignment is convergent, it only takes a little bit of work to knock models into the misaligned basin, and it's hard to get them back." Waluigi effect type thinking.
Relevant parameters:
Neat, thanks a ton for the algorithmic-vs-labor update -- I appreciated that you'd distinguished those in your post, but I forgot to carry that through in mine! :)
And oops, I really don't know how I got to 1.6 instead of 1.5 there. Thanks for the flag, have updated my comment accordingly!
The square relationship idea is interesting -- that factor of 2 is a huge deal. Would be neat to see a Guesstimate or Squiggle version of this calculation that tries to account for the various nuances Tom mentions, and has error bars on each of the terms, so we both get a distribution of r and a sensitivity analysis. (Maybe @Tom Davidson already has this somewhere? If not I might try to make a crappy version myself, or poke talented folks I know to do a good version :)
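To gesture at what I mean, here's a very rough Python sketch of that kind of model (Monte Carlo rather than actual Guesstimate/Squiggle). The point estimates (a 1.4 base for r, and ~1.7 vs ~2.5 for the capabilities-per-efficiency-doubling multiplier) are taken from the estimate quoted further down this thread; the spreads and weights are placeholders I made up, not anyone's considered view:

```python
# Very rough Monte Carlo sketch of a Guesstimate/Squiggle-style estimate of r.
# Point estimates come from the quoted passage below; the distributions are invented placeholders.
import numpy as np

rng = np.random.default_rng(0)
N = 100_000

# Base estimate of r before the capabilities adjustment, with made-up uncertainty.
base = rng.lognormal(mean=np.log(1.4), sigma=0.25, size=N)

# Capabilities gained per doubling of training efficiency: mix the "toy ML experiments"
# figure (~1.7) and the "human productivity studies" figure (~2.5), weighting the former.
multiplier = np.where(rng.random(N) < 0.7,
                      rng.normal(1.7, 0.2, N),
                      rng.normal(2.5, 0.3, N))

r = base * multiplier
print(f"median r = {np.median(r):.2f}")
print(f"10th-90th percentile: {np.percentile(r, 10):.2f} - {np.percentile(r, 90):.2f}")
```

Obviously the spreads are doing all the work here; the value would come from putting real thought into each term's distribution (and the other adjustments Tom mentions).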
Really appreciate you covering all these nuances, thanks Tom!
Can you give a pointer to the studies you mentioned here?
> There are various sources of evidence on how much capabilities improve every time training efficiency doubles: toy ML experiments suggest the answer is ~1.7; human productivity studies suggest the answer is ~2.5. I put more weight on the former, so I’ll estimate 2. This doubles my median estimate to r = ~2.8 (= 1.4 * 2).
Hey Ryan! Thanks for writing this up -- I think this whole topic is important and interesting.
I was confused about how your analysis related to the Epoch paper, so I spent a while with Claude analyzing it. I did a re-analysis that finds similar results, but also finds (I think) some flaws in your rough estimate. (Keep in mind I'm not an expert myself, and I haven't closely read the Epoch paper, so I might well be making conceptual errors. I think the math is right though!)
I'll walk through my understanding of this stuff first, then compare to your post. I'll be going a little slowly (A) so I can refresh my own memory by referencing this later, (B) to make it easy to call out mistakes, and (C) to hopefully make this legible to others who want to follow along.
The Epoch model
The Epoch paper models growth with the following equation:
1. $\frac{\dot{A}}{A} \propto E^{\lambda} A^{-\beta}$,
where A = efficiency and E = research input. We want to consider worlds with a potential software takeoff, meaning that increases in AI efficiency directly feed into research input, which we model as $E \propto A$. So the key consideration seems to be the ratio $\frac{\lambda}{\beta}$. If it's 1, we get steady exponential growth from scaling inputs; greater, superexponential; smaller, subexponential.[1]
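To make the sub/super-exponential distinction concrete, here's a minimal numerical sketch (in Python) of equation 1 under the software-takeoff assumption $E \propto A$, with all the constants set to 1 -- an arbitrary choice of mine, just to show the qualitative behavior:

```python
# Minimal sketch: iterate equation 1 with E proportional to A (constants absorbed),
# i.e. dA/dt = A^(1 + lambda - beta), for ratios lambda/beta below, at, and above 1.
def simulate(ratio, beta=1.0, A0=1.0, dt=0.001, T=6.0):
    """Crude Euler integration of dA/dt = A**(1 + beta*(ratio - 1))."""
    lam = ratio * beta
    A, t = A0, 0.0
    while t < T:
        A += dt * A ** (1 + lam - beta)
        t += dt
        if A > 1e12:                      # treat as "effectively infinite" (finite-time blowup)
            return t, float("inf")
    return T, A

for ratio in [0.8, 1.0, 1.2]:
    t, A = simulate(ratio)
    print(f"lambda/beta = {ratio}: A({t:.2f}) = {A:.3g}")
```

With the ratio below 1 growth is roughly polynomial, at 1 it's exponential, and above 1 it hits a finite-time singularity (the "inf").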
Fitting the model
How can we learn about this ratio from historical data?
Let's pretend history has been convenient and we've seen steady exponential growth in both variables, so $A \propto e^{g_A t}$ and $E \propto e^{g_E t}$. Then $\frac{\dot{A}}{A}$ has been constant over time, so by equation 1, $E^{\lambda} A^{-\beta}$ has been constant as well. Substituting in for A and E, we find that $e^{(\lambda g_E - \beta g_A) t}$ is constant over time, which is only possible if $\lambda g_E = \beta g_A$, so that the exponent is always zero. Thus if we've seen steady exponential growth, the historical value of our key ratio is:
2. $\frac{\lambda}{\beta} = \frac{g_A}{g_E}$.
Intuitively, if we've seen steady exponential growth while research input has increased more slowly than research output (AI efficiency), there are superlinear returns to scaling inputs.
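As a quick worked instance of equation 2, using figures that show up later in this comment (AI efficiency growing ~3.5x/year, effective research input growing ~2.3x/year):

$\frac{\lambda}{\beta} = \frac{g_A}{g_E} = \frac{\ln 3.5}{\ln 2.3} \approx \frac{1.25}{0.83} \approx 1.5$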
Introducing the Cobb-Douglas function
But wait! $E$, research input, is an abstraction that we can't directly measure. Really there are both compute and labor inputs. Those have indeed been growing roughly exponentially, but at different rates.
Intuitively, it makes sense to say that "effective research input" has grown as some kind of weighted average of the rate of compute and labor input growth. This is my take on why a Cobb-Douglas function of form (3) $E = C^{1-p} L^{p}$, with a weight parameter $p$, is useful here: it's a weighted geometric average of the two inputs, so its growth rate is a weighted average of their growth rates.
Writing that out: in general, say both inputs have grown exponentially, so $C \propto e^{g_C t}$ and $L \propto e^{g_L t}$. Then E has grown as $e^{((1-p) g_C + p g_L) t}$, so $g_E$ is the weighted average (4) $g_E = (1-p)\, g_C + p\, g_L$ of the growth rates of labor and capital.
Then, using Equation 2, we can estimate our key ratio as $\frac{\lambda}{\beta} = \frac{g_A}{(1-p)\, g_C + p\, g_L}$.
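To illustrate equation 4 with placeholder inputs (numbers I picked to be consistent with the ~2.3x/year and ~1.5 figures elsewhere in this comment, not necessarily the post's actual estimates): compute growing 4x/year, labor growing 1.6x/year, and labor weight $p = 0.6$ give

$g_E = 0.4 \ln 4 + 0.6 \ln 1.6 \approx 0.84$, i.e. effective research input growing $e^{0.84} \approx 2.3$x/year, and hence $\frac{\lambda}{\beta} \approx \frac{\ln 3.5}{0.84} \approx 1.5$.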
Plugging in your estimates:
Adjusting for labor-only scaling
But wait: we're not done yet! Under our Cobb-Douglas assumption, scaling labor by a factor of 2 isn't as good as scaling all research inputs by a factor of 2; it only multiplies $E$ by $2^{p}$ rather than by 2.
Plugging in Equation 3 (which describes research input in terms of compute and labor) to Equation 1 (which estimates AI progress based on research), our adjusted form of the Epoch model is $\frac{\dot{A}}{A} \propto A^{-\beta} \left( C^{1-p} L^{p} \right)^{\lambda}$.
Under a software-only singularity, we hold compute constant while scaling labor with AI efficiency, so $E^{\lambda} = \left( C^{1-p} L^{p} \right)^{\lambda} = L^{p\lambda}$ multiplied by a fixed compute term. Since labor scales as A, we have $\frac{\dot{A}}{A} \propto A^{p\lambda - \beta}$. By the same analysis as in our first section, we can see A grows exponentially if $\frac{p\lambda}{\beta} = 1$, and grows superexponentially if this ratio is >1. So our key ratio just gets multiplied by $p$, and it wasn't a waste to find it, phew!
Now we get the true form of our equation: we get a software-only foom iff $\frac{p\lambda}{\beta} > 1$, or (via equation 2) iff we see empirically that $p \cdot \frac{g_A}{g_E} > 1$. Call this the takeoff ratio: it corresponds to a) how much AI progress scales with inputs and b) how much of a penalty we take for not scaling compute.
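Putting equations 2-4 together as a tiny Python script -- the compute growth, labor growth, and labor weight are the same placeholder assumptions as above, chosen to line up with the numbers in the next paragraph rather than taken from the post:

```python
import math

def takeoff_ratio(efficiency_growth, compute_growth, labor_growth, p):
    """p * (lambda/beta), with lambda/beta estimated from historical growth rates via equations 2-4.
    Growth arguments are multiplicative factors per year; p is the Cobb-Douglas labor weight."""
    g_A = math.log(efficiency_growth)
    g_E = (1 - p) * math.log(compute_growth) + p * math.log(labor_growth)  # equation 4
    return p * g_A / g_E                                                   # p * (lambda/beta)

# Placeholder inputs: compute 4x/year, labor 1.6x/year, labor weight 0.6.
for eff in [3.5, 4.0]:
    ratio = takeoff_ratio(eff, compute_growth=4.0, labor_growth=1.6, p=0.6)
    print(f"{eff}x/year efficiency growth -> takeoff ratio = {ratio:.2f}")
```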
Result: Above, we got $\frac{\lambda}{\beta} \approx 1.5$, so our takeoff ratio is $p \cdot \frac{\lambda}{\beta} = 0.6 \times 1.5 = 0.9$. That's quite close! If we think it's more reasonable to think of a historical growth rate of 4x/year instead of 3.5x/year, we'd increase our takeoff ratio by a factor of $\frac{\ln 4}{\ln 3.5} \approx 1.11$, to a ratio of $\approx 1.0$, right on the knife edge of FOOM. [4] [note: I previously had the wrong numbers here: I had lambda/beta = 1.6, which would mean the 4x/year case has a takeoff ratio of 1.05, putting it into FOOM land]
So this isn't too far off from your results in terms of implications, but it is somewhat different (no FOOM for 3.5x, less sensitivity to the exact historical growth rate).
Tweaking alpha:
Your estimate of $\alpha$ is in fact similar in form to my ratio $\frac{\lambda}{\beta} = \frac{g_A}{g_E}$ - but what you're calculating instead is a ratio of growth factors over a fixed window, roughly $e^{r \cdot 1 \text{ yr}} / e^{q \cdot 1 \text{ yr}}$, rather than a ratio of the growth rates themselves.
One indicator that something's wrong is that your result involves checking whether $\alpha$ clears a threshold defined by a factor-of-2 scaleup -- i.e., whether doubling the inputs at least doubles the outputs. But the choice of 2 is arbitrary -- conceptually, you just want to check if scaling software by a factor n increases outputs by a factor n or more. Yet the value of $\alpha$ relevant to that check clearly varies with n.
One way of parsing the problem is that alpha is (implicitly) time dependent - it is equal to exp(r * 1 year) / exp(q * 1 year), a ratio of progress vs inputs in the time period of a year. If you calculated alpha based on a different amount of time, you'd get a different value. By contrast, r/q is a ratio of rates, so it stays the same regardless of what timeframe you use to measure it.[5]
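A toy example with made-up numbers: say progress grows 10x/year ($r = \ln 10$) and inputs grow 2x/year ($q = \ln 2$). Then

$\alpha(\text{1 yr}) = \frac{e^{r}}{e^{q}} = 5$, $\quad \alpha(\text{2 yr}) = \frac{e^{2r}}{e^{2q}} = 25$, $\quad \alpha(\text{6 mo}) = \frac{\sqrt{10}}{\sqrt{2}} \approx 2.2$, while $\frac{r}{q} = \frac{\ln 10}{\ln 2} \approx 3.3$ regardless of the window.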
Maybe I'm confused about what your Cobb-Douglas function is meant to be calculating - is it E within an Epoch-style takeoff model, or something else?
Does Cobb-Douglas make sense?
The geometric average of rates thing makes sense, but it feels weird that that simple intuitive approach leads to a functional form (Cobb-Douglas) that also has other implications.
Wikipedia says Cobb-Douglas functions can have the exponents not add to 1 (while both being between 0 and 1). Maybe this makes sense here? Not an expert.
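For what it's worth, the sum of the exponents is just the returns to scale: if $E = C^{a} L^{b}$, then scaling both inputs by a factor $k$ scales $E$ by $k^{a+b}$, so exponents summing to 1 means doubling all inputs exactly doubles effective research input, and a sum below (above) 1 means decreasing (increasing) returns to scale.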
How seriously should we take all this?
This whole thing relies on...
It feels like this sort of thing is better than nothing, but I wish we had something better.
I really like the various nuances you're adjusting for, like parallel vs serial scaling, and especially distinguishing algorithmic improvement from labor efficiency. [6] Thinking those things through makes this stuff feel less insubstantial and approximate...though the error bars still feel quite large.
Actually there's a complexity here, which is that scaling labor alone may be less efficient than scaling "research inputs" which include both labor and compute. We'll come to this in a few paragraphs.
This is only coincidentally similar to your figure of 2.3 :)
I originally had 1.6 here, but as Ryan points out in a reply it's actually 1.5. I've tried to reconstruct what I could have put into a calculator to get 1.6 instead, and I'm at a loss!
I was curious how aggressive the superexponential growth curve would be with a takeoff ratio of a mere 1.05. A couple of Claude queries gave me different answers (maybe because the growth is so extreme that different solvers give meaningfully different approximations?), but they agreed that growth is fairly slow in the first year (~5x) and then hits infinity by the end of the second year. I wrote this comment with the wrong numbers (0.96 instead of 0.9), so it doesn't accurately represent what you get if you plug in 4x capability growth per year. Still cool to get a sense of what these curves look like, though.
I think this can be understood in terms of the alpha-being-implicitly-a-timescale-function thing -- if you compare an alpha value with the ratio of growth you're likely to see during the same time period, e.g. alpha(1 year) and n = one doubling, you probably get reasonable-looking results.
I find it annoying that people conflate "increased efficiency of doing known tasks" with "increased ability to do new useful tasks". It seems to me that these could be importantly different, although it's hard to even settle on a reasonable formalization of the latter. Some reasons this might be okay:
I think a realistic example would be useful! I suspect a lot of the nuance (nuance that might feel obvious to you) is in how to apply this over a long conversation with lots of data points, amendments on both sides, etc.