GRI — LessWrong

Seems like a huge point here is the ability to speak unfiltered about AI companies? The Radicals working outside of AI labs would be free to speak candidly, while the Moderates would have some kind of relationship to maintain.

Replying to To be legible, evidence of misalignment probably has to be behavioral

GRI 10mo*

To be legible, evidence of misalignment probably has to be behavioral

Even if the internals-based method is extremely well supported theoretically and empirically (which seems quite unlikely), I don't think this would suffice for this to trigger a strong response by convincing relevant people

It's hard for me to imagine a world where we really have internals-based methods that are "extremely well supported theoretically and empirically," so I notice that I should take a second to try to imagine such a world before accepting the claim that internals-based evidence wouldn't convince the relevant people...

Today, the relevant people probably wouldn't do much in response to the interp team saying something like: "our deception SAE is firing when we ask the model bio risk questions, so...

GRI 10mo

"Lots of very small experiments playing around with various parameters" ... "then a slow scale up to bigger and bigger models"

This Dwarkesh timestamp with Jeff Dean & Noam Shazeer seems to confirm this.

"I'd also guess that the bottleneck isn't so much on the number of people playing around with the parameters, but much more on good heuristics regarding which parameters to play around with."

That would mostly explain this question as well: "If parallelized experimentation drives so much algorithmic progress, why doesn't gdm just hire hundreds of researchers, each with small compute budgets, to run these experiments?"

It would also imply that it would be a big deal if they had an AI with good heuristics for this kind of thing.

Replying to Whose track record of AI predictions would you like to see evaluated?

GRI Feb 25, 2025

Whose track record of AI predictions would you like to see evaluated?

I would love to see an analysis and overview of predictions from the Dwarkesh podcast with Leopold. One for Situational Awareness would be great too.

GRI 1y

Seems like a pretty similar thesis to this: https://www.lesswrong.com/posts/fPvssZk3AoDzXwfwJ/universal-basic-income-and-poverty

GRI 1y* Quick Take

I expect that within a year or two, there will be an enormous surge of people who start paying a lot of attention to AI.

This could mean that the distribution of who has influence will change a lot. (And this might be right when influence matters the most?)

I claim: your effect on AI discourse post-surge will be primarily shaped by how well you or your organization absorbs this boom.

The areas where I've thought the most about this phenomenon are:

AI safety university groups
Non-AGI-lab research organizations
AI bloggers / X influencers

(But this applies to anyone whose impact primarily comes from spreading their ideas, which is a lot of people.)

I think that you or your...

Replying to Nonpartisan AI safety

GRI1y

Nonpartisan AI safety

Securing AI labs against powerful adversaries seems like something that almost everyone can get on board with. Also, posing it as a national security threat seems to be a good framing.

The scene in planecrash where Keltham gives his first lecture, as an attempt to teach some formal logic (and a whole bunch of important concepts that usually don't get properly taught in school), is something I'd highly recommend reading! As far as I can remember, you should be able to just pick it up right here and follow the important parts of the lecture without understanding the story.

Should it be more tabooed to put the bottom line in the title?

Titles like "in defense of <bottom line>" or just "<bottom line>" seem to:

Unnecessarily make it really easy for people to select content to read based on the conclusion it comes to
Frame the post as having the goal of convincing you of <bottom line>, and set up the reader's expectations accordingly. This seems like it would put you either in "pause critical thinking to defend My Team" mode (if you agree with the title) or in "continuously search for counter-arguments" mode (if you disagree with the title).

When making safety cases for alignment, it's important to remember that defense against single-turn attacks doesn't always imply defense against multi-turn attacks.

Our recent paper shows a case where breaking up a single-turn attack into multiple prompts (spreading it out over the conversation) changes which models/guardrails are vulnerable to the jailbreak.

Robustness against the single-turn version didn't imply robustness against the multi-turn version of the attack, and robustness against the multi-turn version didn't imply robustness against the single-turn version of the attack.
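
A minimal sketch of the logical point above, assuming entirely hypothetical evaluation outcomes (the model names and booleans below are made up for illustration and are not results from the paper). The claim "robust to the single-turn version implies robust to the multi-turn version" fails as soon as one model resists the former but not the latter, and the converse fails symmetrically:

```python
from typing import Dict, List, Tuple

# Hypothetical outcomes, NOT data from the paper:
# model name -> (robust to single-turn version, robust to multi-turn version)
results: Dict[str, Tuple[bool, bool]] = {
    "model_a": (True, False),  # blocks the single-turn attack, falls to multi-turn
    "model_b": (False, True),  # falls to single-turn, blocks the multi-turn attack
    "model_c": (True, True),   # blocks both versions
}

SINGLE, MULTI = 0, 1


def counterexamples(
    outcomes: Dict[str, Tuple[bool, bool]], premise: int, conclusion: int
) -> List[str]:
    """Return models where the premise holds but the conclusion does not."""
    return sorted(
        name
        for name, outcome in outcomes.items()
        if outcome[premise] and not outcome[conclusion]
    )


# Any entry here refutes "single-turn robustness implies multi-turn robustness".
single_implies_multi = counterexamples(results, SINGLE, MULTI)
# Any entry here refutes the converse implication.
multi_implies_single = counterexamples(results, MULTI, SINGLE)

print("single-turn robust but multi-turn vulnerable:", single_implies_multi)
print("multi-turn robust but single-turn vulnerable:", multi_implies_single)
```

An empty list in either direction would only mean no counterexample was found among the evaluated models, so a safety case would still need both versions of the attack tested separately.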

GRI's Shortform

GRI

This is a special post for quick takes (aka "shortform"). Only the owner can create top-level comments.

Probably Not a Ghost Story

GRI

Something happened to me a few months back that I still don't have a satisfying explanation for.

I was in a small, 10x10 room, and on my way out. While I was still a few paces from being within arm's length of the light switch, my partner asked me to "turn off the lights, please."

The lights immediately turned off and the room went completely dark.

I stood there in the darkness, shocked, until the lights came back on, probably 3/4 of a second later.

Observations

When the lights turned off, I had an immediate (and probably visible) "wait, what?" reaction. My partner also appeared confused, though possibly reacting just after me. It's unclear who displayed their reaction first. There

...