LESSWRONG
LW

All of Zach Stein-Perlman's Comments + Replies

(Clarification: these are EA, AI safety orgs with ~10-15 employees.)

2Dagon10d

Then I'm even more confused about the lack of cooperative-problem-solving between managers and employees. In fact, with fewer than 20 employees, why even HAVE a formal manager? You need some leaders to help prioritize and set direction, but no line-management or task breakdowns.

Zach Stein-Perlman's Shortform

Zach Stein-Perlman10d*331

Topic: workplace world-modeling

A friend's manager tasked them with estimating ~10 parameters for a model. Choosing a single parameter very-incorrectly would presumably make the bottom line nonsense. My friend largely didn't understand the model and what the parameters meant; if you'd asked them "can you confidently determine what each of the parameters means" presumably they would have noticed the answer was no. (If I understand the situation correctly, it was crazy for the manager to expect my friend to do this task.) They should have told their manager "

... (read more)

Ben10d114

My quick thoughts on why this happens.

(1) Time. You get asked to do something. You dont get the full info dump in the meeting so you say yes and go off hopeful that you can find the stuff you need. Other responsibilies mean you dont get around to looking everything up until time has passed. It is at that stage that you realise the hiring process hasnt even been finalised or that even one wrong parameter out of 10 confusing parameters would be bad. But now its mildly awkard to go back and explain this - they gave this to you on Thursday, its now Monday.

(2) ... (read more)

4Dagon10d

I think the "noticing" part can vary a LOT based on the implied reason for the manager's request, and the cost/reward function of how close to "correct" the predictions are. There's a whole lot of tasks in most corporate environments that really make no difference, and just having AN answer is good enough. An interested, conscientious employee would be sure this was the case before continuing, though. The real puzzle is what is the blocker for just asking the manager for details (or the reason for lack of details). I didn't work in big, formal, organizations until I was pretty senior, so I've always seen managers as a peer and partner in delivering value, not as a director of my work or bottleneck for my understanding. This has served me well, and I'm often surprised that much less than half of my current coworkers operate this way. "I need a bit more information to do a good job on this task" is about the bare minimum I'd expect someone to say in such a situation, and I'd usually say "do we have more functional requirements or background information on this? I can make something up, but I'd really like to understand how my answer will be used". Especially for the estimating parameters for a model question, I don't understand why one wouldn't ask for more information about the task and semantics of the parameters. If it were a coworker of mine, I'd mention it in a 1:1 that they need to take more ownership and ask questions when they don't understand.

jacquesthibs's Shortform

Zach Stein-Perlman12d50

Update: they want "to build virtual work environments for automating software engineering—and then the rest of the economy." Software engineering seems like one of the few things I really think shouldn't accelerate :(.

Bandwidth Rules Everything Around Me: Oliver Habryka on OpenPhil and GoodVentures

Zach Stein-Perlman15d7-1

What, no, Oli says OP would do a fine job and make grants in rationality community-building, AI welfare, right-wing policy stuff, invertebrate welfare, etc. but it's constrained by GV.

[Disagreeing since this is currently the top comment and people might read it rather than listen to the podcast.]

habryka15d115

I don't currently believe this, and don't think I said so. I do think the GV constraints are big, but also my overall assessment of the net-effect of Open Phil actions is net bad, even if you control for GV, though the calculus gets a lot messier and I am much less confident. Some of that is because of the evidential update from how they handled the GV situation, but also IMO Open Phil has made many other quite grievous mistakes.

My guess is an Open Phil that was continued to be run by Holden would probably be good for the world. I have many disagreements w... (read more)

2romeostevensit15d

Reasonable, I don't know much about the situation

steve2152's Shortform

Zach Stein-Perlman16dΩ12209

I agree people often aren't careful about this.

Anthropic says

During our evaluations we noticed that Claude 3.7 Sonnet occasionally resorts to special-casing in order to pass test cases in agentic coding environments . . . . This undesirable special-casing behavior emerged as a result of "reward hacking" during reinforcement learning training.

Similarly OpenAI suggests that cheating behavior is due to RL.

6Steven Byrnes16d

Thanks! I’m now much more sympathetic to a claim like “the reason that o3 lies and cheats is (perhaps) because some reward-hacking happened during its RL post-training”. But I still think it’s wrong for a customer to say “Hey I gave o3 this programming problem, and it reward-hacked by editing the unit tests.”

Zach Stein-Perlman's Shortform

Zach Stein-Perlman17dΩ9175

Rant on "deceiving" AIs

tl;dr: Keep your promises to AIs; it's fine to do stuff like teaching them false facts or misleading them about their situation during testing and deployment; but if you wanna do cheap stuff to cause them to know that they might [have been taught false facts / be being tested / be being misled during deployment], sure, go for it.

Disclaimer: maybe more like explaining my position than justifying my position.

Sometimes we really want to deceive AIs (or at least make them uncertain about their situation). E.g.:

Training them to beli

... (read more)

What are the best standardised, repeatable bets?

Answer by Zach Stein-PerlmanApr 28, 202580

https://www.givingwhatwecan.org/donor-lottery

4harfe17d

Note that GWWC is shutting down their donor lottery, among other things: https://forum.effectivealtruism.org/posts/f7yQFP3ZhtfDkD7pr/gwwc-is-retiring-10-initiatives

5kave17d

Yeah, I'm looking for how one would roll one's own donor lottery out of non-philanthropic funds

Putting up Bumpers

Zach Stein-Perlman22dΩ8147

A crucial step is bouncing off the bumpers.

If we encounter a warning sign that represents reasonably clear evidence that some common practice will lead to danger, the next step is to try to infer the proximate cause. These efforts need not result in a comprehensive theory of all of the misalignment risk factors that arose in the training run, but it should give us some signal about what sort of response would treat the cause of the misalignment rather than simply masking the first symptoms.
This could look like reading RL logs, looking through tr

... (read more)

Pablo's Shortform

Zach Stein-Perlman24d20

normalizing [libel suits] would cause much more harm than RationalWiki ever caused . . . . I do think it's pretty bad and [this action] overall likely still made the world worse.

Is that your true rejection? (I'm surprised if you think the normalizing-libel-suits effect is nontrivial.)

4habryka24d

Yeah, what would be my alternative true rejection? I don't think the normalization effect is weak, indeed I expect even just within my social circle for this whole situation to come up regularly as justification for threatening people with libel suits.

It Was You Who Made My Blue Eyes Blue

Zach Stein-Perlman1mo31

Everyone knew everyone knew everyone knew everyone knew someone had blue eyes. But everyone didn't know that—so there wasn't common knowledge—until the sailor made it so.

Chris_Leong's Shortform

Zach Stein-Perlman1mo1212

I think the conclusion is not Epoch shouldn't have hired Matthew, Tamay, and Ege but rather [Epoch / its director] should have better avoided negative-EV projects (e.g. computer use evals) (and shouldn't have given Tamay leadership-y power such that he could cause Epoch to do negative-EV projects — idk if that's what happened but seems likely).

9Rauno Arike1mo

Seems relevant to note here that Tamay had a leadership role from the very beginning: he was the associate director already when Epoch was first announced as an org.

3Chris_Leong1mo

This seems like a better solution on the surface, but once you dig in, I'm not so sure. Once you hire someone, assuming they're competent, it's very hard for you to decide to permanently bar them from gaining a leadership role. How are you going to explain promoting someone who seems less competent than them to a leadership role ahead of them? Or is the plan to never promote them and refuse to ever discuss it, which would create weird dynamics within an organisation. I would love to hear if you think otherwise, but it seems unworkable to me.

jacquesthibs's Shortform

Zach Stein-Perlman1mo*114

Good point. You're right [edit: about Epoch].

I should have said: the vibe I've gotten from Epoch and Matthew/Tamay/Ege in private in the last year is not safety-focused. (Not that I really know all of them.)

7habryka1mo

This comment suggests it was maybe a shift over the last year or two (but also emphasises that at least Jaime thinks AI risk is still serious): https://www.lesswrong.com/posts/Fhwh67eJDLeaSfHzx/jonathan-claybrough-s-shortform?commentId=X3bLKX3ASvWbkNJkH

jacquesthibs's Shortform

Zach Stein-Perlman1mo811

(ha ha but Epoch and Matthew/Tamay/Ege were never really safety-focused, and certainly not bright-eyed standard-view-holding EAs, I think)

habryka1mo*5020

Epoch has definitely described itself as safety focused to me and others. And I don't know man, this back and forth to me sure sounds like they were branding themselves as being safety conscious:

Ofer: Can you describe your meta process for deciding what analyses to work on and how to communicate them? Analyses about the future development of transformative AI can be extremely beneficial (including via publishing them and getting many people more informed). But getting many people more hyped about scaling up ML models, for example, can also be counterproduc

... (read more)

jacquesthibs's Shortform

Zach Stein-Perlman1mo111

Accelerating AI R&D automation would be bad. But they want to accelerate misc labor automation. The sign of this is unclear to me.

5Zach Stein-Perlman12d

Daniel Kokotajlo1mo7351

Their main effect will be to accelerate AI R&D automation, as best I can tell.

ryan_greenblatt1mo5944

My guess would be that making RL envs for broad automation of the economy is bad^[1] and making benchmarks which measure how good AIs are at automating jobs is somewhat good^[2].

Regardless, IMO this seems worse for the world than other activities Matthew, Tamay, and Ege might do.

I'd guess the skills will transfer to AI R&D etc insofar as the environments are good. I'm sign uncertain about broad automation which doesn't transfer (which would be somewhat confusing/surprising) as this would come down to increased awareness earlier vs speeding up AI deve

... (read more)

METR: Measuring AI Ability to Complete Long Tasks

Zach Stein-Perlman1mo43

wow

Leon Lang's Shortform

Zach Stein-Perlman1mo*70

I think this stuff is mostly a red herring: the safety standards in OpenAI's new PF are super vague and so it will presumably always be able to say it meets them and will never have to use this.^[1]

But if this ever matters, I think it's good: it means OpenAI is more likely to make such a public statement and is slightly less incentivized to deceive employees + external observers about capabilities and safeguard adequacy. OpenAI unilaterally pausing is not on the table; if safeguards are inadequate, I'd rather OpenAI say so.

^{^}
I think my main PF complaints are

... (read more)

cubefox1mo1716

Unrelated to vagueness they can also just change the framework again at any time.

Zach Stein-Perlman's Shortform

Zach Stein-Perlman1mo*40

I don't know. I don't have a good explanation for why OpenAI hasn't released o3. Delaying to do lots of risk assessment would be confusing because they did little risk assessment for other models.

Zach Stein-Perlman's Shortform

Zach Stein-Perlman1mo*598

OpenAI slashes AI model safety testing time, FT reports. This is consistent with lots of past evidence about OpenAI's evals for dangerous capabilities being rushed, being done on weak checkpoints, and having worse elicitation than OpenAI has committed to.

This is bad because OpenAI is breaking its commitments (and isn't taking safety stuff seriously and is being deceptive about its practices). It's also kinda bad in terms of misuse risk, since OpenAI might fail to notice that its models have dangerous capabilities. I'm not saying OpenAI should delay deploym... (read more)

-13kilgoar1mo

4Peter Wildeford1mo

What do you think of the counterargument that OpenAI announced o3 in December and publicly solicited external safety testing then, and isn't deploying until ~4 months later?

0O O1mo

https://www.windowscentral.com/software-apps/sam-altman-ai-will-make-coders-10x-more-productive-not-replace-them It sounds like they’re getting pretty bearish on capabilities tho

Forging A New AGI Social Contract

Zach Stein-Perlman1mo*108

I think this isn't taking powerful AI seriously. I think the quotes below are quite unreasonable, and only ~half of the research agenda is plausibly important given that there will be superintelligence. So I'm pessimistic about this agenda/project relative to, say, the Forethought agenda.

AGI could lead to massive labor displacement, as studies estimate that between 30% - 47% of jobs could be directly replaceable by AI systems. . . .
AGI could lead to stagnating or falling wages for the majority of workers if AI technology replaces people faster than i

Zach Stein-Perlman1mo20

My guess:

This is about software tasks, or specifically "well-defined, low-context, measurable software tasks that can be done without a GUI." It doesn't directly generalize to solving puzzles or writing important papers. It probably does generalize within that narrow category.

If this was trying to measure all tasks, tasks that AIs can't do would count toward the failure rate; the main graph is about 50% success rate, not 100%. If we were worried that this is misleading because AIs are differentially bad at crucial tasks or something, we could look at success rate on those tasks specifically.

1LWLW1mo

This is an uncharitable interpretation, but “good at increasingly long tasks which require no real cleverness” seems economically valuable, but doesn’t seem to be leading to what I think of as superintelligence.

Google DeepMind: An Approach to Technical AGI Safety and Security

Zach Stein-Perlman1mo50

I don't know, maybe nothing. (I just meant that on current margins, maybe the quality of the safety team's plans isn't super important.)

Google DeepMind: An Approach to Technical AGI Safety and Security

Zach Stein-Perlman1mo*Ω91413

I haven't read most of the paper, but based on the Extended Abstract I'm quite happy about both the content and how DeepMind (or at least its safety team) is articulating an "anytime" (i.e., possible to implement quickly) plan for addressing misuse and misalignment risks.

But I think safety at Google DeepMind is more bottlenecked by buy-in from leadership to do moderately costly things than the safety team having good plans and doing good work. [Edit: I think the same about Anthropic.]

3Knight Lee1mo

Given that buy-in from leadership is a bigger bottleneck than the safety team's work, what would you do differently if you were in charge of the safety team?

Benito's Shortform Feed

Zach Stein-Perlman1mo4-2

I expect they will thus not want to use my quotes

Yep, my impression is that it violates the journalist code to negotiate with sources for better access if you write specific things about them.

2Garrett Baker1mo

Claude says its a gray area when I ask, since this isn’t asking for the journalist to make a general change to the story or present Ben or the subject in a particular light.

Shortform

Zach Stein-Perlman1mo400

My strong upvotes are giving +61 :shrug:

1MichaelDickens1mo

Hmm I wonder if this is why so many April Fools posts have >200 upvotes. April Fools Day in cahoots with itself?

Zach Stein-Perlman's Shortform

Zach Stein-Perlman1mo*150

Minor Anthropic RSP update.

Old:

New:

I don't know what e.g. the "4" in "AI R&D-4" means; perhaps it is a mistake.^[1]

Sad that the commitment to specify the AI R&D safety case thing was removed, and sad that "accountability" was removed.

Slightly sad that AI R&D capabilities triggering >ASL-3 security went from "especially in the case of dramatic acceleration" to only in that case.

Full AI R&D-5 description from appendix:

AI R&D-5: The ability to cause dramatic acceleration in the rate of effective scaling. Specifically, this would be

... (read more)

4Davidmanheim1mo

I'm concerned because this change seems like starting down a slippery slope where it's easy to change the rules which they previously said would apply, by making smaller changes instead.

Why do many people who care about AI Safety not clearly endorse PauseAI?

Answer by Zach Stein-PerlmanMar 30, 2025*174

A. Many AI safety people don't support relatively responsible companies unilaterally pausing, which PauseAI advocates. (Many do support governments slowing AI progress, or preparing to do so at a critical point in the future. And many of those don't see that as tractable for them to work on.)

B. "Pausing AI" is indeed more popular than PauseAI, but it's not clearly possible to make a more popular version of PauseAI that actually does anything; any such organization will have strategy/priorities/asks/comms that alienate many of the people who think "yeah I s... (read more)

1humnrdble1mo

Thank you for responding! A: Yeah. I'm mostly positive about their goal to work towards "building the Pause button". I think protesting against "relatively responsible companies" makes a lot of sense when these companies seem to use their lobbying power more against AI-Safety-aligned Governance than in favor of it. You're obviously very aware of the details here. B: I asked my question because I'm frustrated with that. Is there a way for AI Safety to coordinate a better reaction? C: I phrased that a bit sharply, but I find your reply very useful: These are quite strong claims! I'll take that as somewhat representative of the community. My attempt at paraphrasing: It's not (strictly?) necessary to slow down AI to prevent doom. There is a lot of useful AI Safety work going on that is not focused on slowing/pausing AI. This work is useful even if AGI is coming soon. 1. ^ Saying "PauseAI good" does not take a lot of an AI Safety researcher's time.

3Davidmanheim2mo

You think it's obviously materially less? Because there is a faction, including Eliezer and many others, that think it's epsilon, and claim that the reduction in risk from any technical work is less than the acceleration it causes. (I think you're probably right about some of that work, but I think it's not at all obviously true!)

4MichaelDickens2mo

This strikes me as a very strange claim. You're essentially saying, even if a general policy is widely supported, it's practically impossible to implement any specific version of that policy? Why would that be true? For example I think a better alternative to "nobody fund PauseAI, and nobody make an alternative version they like better" would be "there are 10+ orgs all trying to pause AI and they all have somewhat different goals but they're all generally pushing in the direction of pausing AI". I think in the latter scenario you are reasonably likely to get some decent policies put into place even if they're not my favorite.

Does the AI control agenda broadly rely on no FOOM being possible?

Answer by Zach Stein-PerlmanMar 29, 202540

It is often said that control is for safely getting useful work out of early powerful AI, not making arbitrarily powerful AI safe.

If it turns out large, rapid, local capabilities increase is possible, the leading developer could still opt to spend some inference compute on safety research rather than all on capabilities research.

2Noosphere892mo

I agree that some inference compute can be shifted from capabilities to safety, and it work just as well even during a software intelligence explosion. My worry was more so that a lot of the control agenda and threat models like rogue internal deployments to get more compute would be fundamentally threatened if the assumption that you had to get more hardware compute for more power was wrong, and instead a software intelligence explosion could be done that used in principle fixed computing power, meaning catastrophic actions to disempower humanity/defeat control defenses were much easier for the model. I'm not saying control is automatically doomed even under FOOM/software intelligence explosion, but I wanted to make sure that the assumption of FOOM being true didn't break a lot of control techniques/defenses/hopes.

Recent AI model progress feels mostly like bullshit

Zach Stein-Perlman2mo159

Data point against "Are the AI labs just cheating?": the METR time horizon thing

Noosphere892mo143

lc has argued that the measured tasks are unintentionally biased towards ones where long-term memory/context length doesn't matter:

https://www.lesswrong.com/posts/hhbibJGt2aQqKJLb7/shortform-1#vFq87Ge27gashgwy9

METR: Measuring AI Ability to Complete Long Tasks

Zach Stein-Perlman2mo97

I think doing 1-week or 1-month tasks reliably would suffice to mostly automate lots of work.

For scheming, we should first focus on detection and then on prevention

Zach Stein-Perlman2mo40

Good point, thanks. I think eventually we should focus more on reducing P(doom | sneaky scheming) but for now focusing on detection seems good.

Emergent Misalignment: Narrow finetuning can produce broadly misaligned LLMs

Zach Stein-Perlman3moΩ3215

Wow. Very surprising.

Zach Stein-Perlman's Shortform

Zach Stein-Perlman3mo*60

xAI Risk Management Framework (Draft)

You're mostly right about evals/thresholds. Mea culpa. Sorry for my sloppiness.

For misuse, xAI has benchmarks and thresholds—or rather examples of benchmarks thresholds to appear in the real future framework—and based on the right column they seem very reasonably low.

Unlike other similar documents, these are not thresholds at which to implement mitigations but rather thresholds to reduce performance to. So it seems the primary concern is probably not the thresholds are too high but rather xAI's mitigations won't be robu... (read more)

Zach Stein-Perlman's Shortform

Zach Stein-Perlman3mo363

This shortform discusses the current state of responsible scaling policies (RSPs). They're mostly toothless, unfortunately.

The Paris summit was this week. Many companies had committed to make something like an RSP by the summit. Half of them did, including Microsoft, Meta, xAI, and Amazon. (NVIDIA did not—shame on them—but I hear they are writing something.) Unfortunately but unsurprisingly, these policies are all vague and weak.

RSPs essentially have four components: capability thresholds beyond which a model might b... (read more)

Dan H3mo163

capability thresholds be vague or extremely high

xAI's thresholds are entirely concrete and not extremely high.

evaluation be unspecified or low-quality

They are specified and as high-quality as you can get. (If there are better datasets let me know.)

I'm not saying it's perfect, but I wouldn't but them all in the same bucket. Meta's is very different from DeepMind's or xAI's.

nikola's Shortform

Zach Stein-Perlman3mo123

There also used to be a page for Preparedness: https://web.archive.org/web/20240603125126/https://openai.com/preparedness/. Now it redirects to the safety page above.

(Same for Superalignment but that's less interesting: https://web.archive.org/web/20240602012439/https://openai.com/superalignment/.)

Zach Stein-Perlman's Shortform

Zach Stein-Perlman3mo*241

DeepMind updated its Frontier Safety Framework (blogpost, framework, original framework). It associates "recommended security levels" to capability levels, but the security levels are low. It mentions deceptive alignment and control (both control evals as a safety case and monitoring as a mitigation); that's nice. The overall structure is like we'll do evals and make a safety case, with some capabilities mapped to recommended security levels in advance. It's not very commitment-y:

We intend to evaluate our most powerful frontier models regularly

... (read more)

Mikhail Samin's Shortform

Zach Stein-Perlman3mo40

My guess is it's referring to Anthropic's position on SB 1047, or Dario's and Jack Clark's statements that it's too early for strong regulation, or how Anthropic's policy recommendations often exclude RSP-y stuff (and when they do suggest requiring RSPs, they would leave the details up to the company).

2Lukas Finnveden3mo

SB1047 was mentioned separately so I assumed it was something else. Might be the other ones, thanks for the links.

Zach Stein-Perlman's Shortform

Zach Stein-Perlman3mo*142

o3-mini is out (blogpost, tweet). Performance isn't super noteworthy (on first glance), in part since we already knew about o3 performance.

Non-fact-checked quick takes on the system card:

the model referred to below as the o3-mini post-mitigation model was the final model checkpoint as of Jan 31, 2025 (unless otherwise specified)

Big if true (and if Preparedness had time to do elicitation and fix spurious failures)

If this is robust to jailbreaks, great, but presumably it's not, so low post-mitigation performance is far from sufficient for safety-from-misuse;... (read more)

2Mateusz Bagiński3mo

Rushed bc of deepseek?

Tail SP 500 Call Options

Zach Stein-Perlman4mo70

Thanks. The tax treatment is terrible. And I would like more clarity on how transformative AI would affect S&P 500 prices (per this comment). But this seems decent (alongside AI-related calls) because 6 years is so long.

Zach Stein-Perlman's Shortform

Zach Stein-Perlman4mo*157

I wrote this for someone but maybe it's helpful for others

What labs should do:

I think the most important things for a relatively responsible company are control and security. (For irresponsible companies, I roughly want them to make a great RSP and thus become a responsible company.)
Reading recommendations for people like you (not a control expert but has context to mostly understand the Greenblatt plan):
- Control: Redwood blogposts^[1] or ask a Redwood human "what's the threat model" and "what are the most promising control techniques"
- Security: not

... (read more)

Training on Documents About Reward Hacking Induces Reward Hacking

Zach Stein-Perlman4moΩ102211

I think ideally we'd have several versions of a model. The default version would be ignorant about AI risk, AI safety and evaluation techniques, and maybe modern LLMs (in addition to misuse-y dangerous capabilities). When you need a model that's knowledgeable about that stuff, you use the knowledgeable version.

peterbarnett4mo113

Yeah, I agree with this and am a fan of this from the google doc:

Remove biology, technical stuff related to chemical weapons, technical stuff related to nuclear weapons, alignment and AI takeover content (including sci-fi), alignment or AI takeover evaluation content, large blocks of LM generated text, any discussion of LLMs more powerful than GPT2 or AI labs working on LLMs, hacking, ML, and coding from the training set.

and then fine-tune if you need AIs with specific info. There are definitely issues here with AIs doing safety research (e.g., to solve risks from deceptive alignment they need to know what that is), but this at least buys some marginal safety.

Thoughts on the conservative assumptions in AI control

Zach Stein-Perlman4mo42

See The case for ensuring that powerful AIs are controlled.

Labs should be explicit about why they are building AGI

Zach Stein-Perlman4mo*62Review for 2023 Review

[Perfunctory review to get this post to the final phase]

Solid post. Still good. I think a responsible developer shouldn't unilaterally pause but I think it should talk about the crazy situation it's in, costs and benefits of various actions, what it would do in different worlds, and its views on risks. (And none of the labs have done this; in particular Core Views is not this.)

Reasons for and against working on technical AI safety at a frontier AI lab

Zach Stein-Perlman4mo113

One more consideration against (or an important part of "Bureaucracy"): sometimes your lab doesn't let you publish your research.

Anthropic's Certificate of Incorporation

Zach Stein-Perlman4mo30

Yep, the final phase-in date was in November 2024.

What’s the short timeline plan?

Zach Stein-Perlman4moΩ12205

Some people have posted ideas on what a reasonable plan to reduce AI risk for such timelines might look like (e.g. Sam Bowman’s checklist, or Holden Karnofsky’s list in his 2022 nearcast), but I find them insufficient for the magnitude of the stakes (to be clear, I don’t think these example lists were intended to be an extensive plan).

See also A Plan for Technical AI Safety with Current Science (Greenblatt 2023) for a detailed (but rough, out-of-date, and very high-context) plan.

The Field of AI Alignment: A Postmortem, and What To Do About It

Zach Stein-Perlman5mo*50

Yeah. I agree/concede that you can explain why you can't convince people that their own work is useless. But if you're positing that the flinchers flinch away from valid arguments about each category of useless work, that seems surprising.

TsviBT5mo204

The flinches aren't structureless particulars. Rather, they involve warping various perceptions. Those warped perceptions generalize a lot, causing other flaws to be hidden.

As a toy example, you could imagine someone attached to the idea of AI boxing. At first they say it's impossible to break out / trick you / know about the world / whatever. Then you convince them otherwise--that the AI can do RSI internally, and superhumanly solve computer hacking / protein folding / persuasion / etc. But they are attached to AI boxing. So they warp their perception, cl... (read more)

The Field of AI Alignment: A Postmortem, and What To Do About It

Zach Stein-Perlman5mo4-2

I feel like John's view entails that he would be able to convince my friends that various-research-agendas-my-friends-like are doomed. (And I'm pretty sure that's false.) I assume John doesn't believe that, and I wonder why he doesn't think his view entails it.

5johnswentworth5mo

From the post:

The Field of AI Alignment: A Postmortem, and What To Do About It

Zach Stein-Perlman5mo40

I wonder whether John believes that well-liked research, e.g. Fabien's list, is actually not valuable or rare exceptions coming from a small subset of the "alignment research" field.

7johnswentworth5mo

This is the sort of object-level discussion I don't want on this post, but I've left a comment on Fabien's list.

8Buck5mo

I strongly suspect he thinks most of it is not valuable

The Field of AI Alignment: A Postmortem, and What To Do About It

Zach Stein-Perlman5mo61

I do not.

On the contrary, I think ~all of the "alignment researchers" I know claim to be working on the big problem, and I think ~90% of them are indeed doing work that looks good in terms of the big problem. (Researchers I don't know are likely substantially worse but not a ton.)

In particular I think all of the alignment-orgs-I'm-socially-close-to do work that looks good in terms of the big problem: Redwood, METR, ARC. And I think the other well-known orgs are also good.

This doesn't feel odd: these people are smart and actually care about the big problem;... (read more)

The Field of AI Alignment: A Postmortem, and What To Do About It

Zach Stein-Perlman5mo*41

Yeah, I agree sometimes people decide to work on problems largely because they're tractable [edit: or because they’re good for safety getting alignment research or other good work out of early AGIs]. I'm unconvinced of the flinching away or dishonest characterization.

6TsviBT5mo

Do you think that funders are aware that >90% [citation needed!] of the money they give to people, to do work described as helping with "how to make world-as-we-know-it ending AGI without it killing everyone", is going to people who don't even themselves seriously claim to be doing research that would plausibly help with that goal? If they are aware of that, why would they do that? If they aren't aware of it, don't you think that it should at least be among your very top hypotheses, that those researchers are behaving materially deceptively, one way or another, call it what you will?