All of Zach Stein-Perlman's Comments + Replies

(Clarification: these are EA, AI safety orgs with ~10-15 employees.)

2Dagon
Then I'm even more confused about the lack of cooperative-problem-solving between managers and employees.  In fact, with fewer than 20 employees, why even HAVE a formal manager?  You need some leaders to help prioritize and set direction, but no line-management or task breakdowns.  

Topic: workplace world-modeling

  • A friend's manager tasked them with estimating ~10 parameters for a model. Choosing a single parameter very-incorrectly would presumably make the bottom line nonsense. My friend largely didn't understand the model and what the parameters meant; if you'd asked them "can you confidently determine what each of the parameters means" presumably they would have noticed the answer was no. (If I understand the situation correctly, it was crazy for the manager to expect my friend to do this task.) They should have told their manager "
... (read more)
Ben114

My quick thoughts on why this happens.

(1) Time. You get asked to do something. You dont get the full info dump in the meeting so you say yes and go off hopeful that you can find the stuff you need. Other responsibilies mean you dont get around to looking everything up until time has passed. It is at that stage that you realise the hiring process hasnt even been finalised or that even one wrong parameter out of 10 confusing parameters would be bad. But now its mildly awkard to go back and explain this - they gave this to you on Thursday, its now Monday.

(2) ... (read more)

4Dagon
I think the "noticing" part can vary a LOT based on the implied reason for the manager's request, and the cost/reward function of how close to "correct" the predictions are.  There's a whole lot of tasks in most corporate environments that really make no difference, and just having AN answer is good enough.  An interested, conscientious employee would be sure this was the case before continuing, though. The real puzzle is what is the blocker for just asking the manager for details (or the reason for lack of details).  I didn't work in big, formal, organizations until I was pretty senior, so I've always seen managers as a peer and partner in delivering value, not as a director of my work or bottleneck for my understanding.  This has served me well, and I'm often surprised that much less than half of my current coworkers operate this way.  "I need a bit more information to do a good job on this task" is about the bare minimum I'd expect someone to say in such a situation, and I'd usually say "do we have more functional requirements or background information on this?  I can make something up, but I'd really like to understand how my answer will be used". Especially for the estimating parameters for a model question, I don't understand why one wouldn't ask for more information about the task and semantics of the parameters.  If it were a coworker of mine, I'd mention it in a 1:1 that they need to take more ownership and ask questions when they don't understand.  

Update: they want "to build virtual work environments for automating software engineering—and then the rest of the economy." Software engineering seems like one of the few things I really think shouldn't accelerate :(.

What, no, Oli says OP would do a fine job and make grants in rationality community-building, AI welfare, right-wing policy stuff, invertebrate welfare, etc. but it's constrained by GV.

[Disagreeing since this is currently the top comment and people might read it rather than listen to the podcast.]

habryka115

I don't currently believe this, and don't think I said so. I do think the GV constraints are big, but also my overall assessment of the net-effect of Open Phil actions is net bad, even if you control for GV, though the calculus gets a lot messier and I am much less confident. Some of that is because of the evidential update from how they handled the GV situation, but also IMO Open Phil has made many other quite grievous mistakes.

My guess is an Open Phil that was continued to be run by Holden would probably be good for the world. I have many disagreements w... (read more)

2romeostevensit
Reasonable, I don't know much about the situation

I agree people often aren't careful about this.

Anthropic says

During our evaluations we noticed that Claude 3.7 Sonnet occasionally resorts to special-casing in order to pass test cases in agentic coding environments . . . . This undesirable special-casing behavior emerged as a result of "reward hacking" during reinforcement learning training.

Similarly OpenAI suggests that cheating behavior is due to RL.

6Steven Byrnes
Thanks! I’m now much more sympathetic to a claim like “the reason that o3 lies and cheats is (perhaps) because some reward-hacking happened during its RL post-training”. But I still think it’s wrong for a customer to say “Hey I gave o3 this programming problem, and it reward-hacked by editing the unit tests.”

Rant on "deceiving" AIs

tl;dr: Keep your promises to AIs; it's fine to do stuff like teaching them false facts or misleading them about their situation during testing and deployment; but if you wanna do cheap stuff to cause them to know that they might [have been taught false facts / be being tested / be being misled during deployment], sure, go for it.

Disclaimer: maybe more like explaining my position than justifying my position.


Sometimes we really want to deceive AIs (or at least make them uncertain about their situation). E.g.:
 

  1. Training them to beli
... (read more)
4harfe
Note that GWWC is shutting down their donor lottery, among other things: https://forum.effectivealtruism.org/posts/f7yQFP3ZhtfDkD7pr/gwwc-is-retiring-10-initiatives
5kave
Yeah, I'm looking for how one would roll one's own donor lottery out of non-philanthropic funds

A crucial step is bouncing off the bumpers.

If we encounter a warning sign that represents reasonably clear evidence that some common practice will lead to danger, the next step is to try to infer the proximate cause.  These efforts need not result in a comprehensive theory of all of the misalignment risk factors that arose in the training run, but it should give us some signal about what sort of response would treat the cause of the misalignment rather than simply masking the first symptoms.

This could look like reading RL logs, looking through tr

... (read more)

normalizing [libel suits] would cause much more harm than RationalWiki ever caused . . . . I do think it's pretty bad and [this action] overall likely still made the world worse.

Is that your true rejection? (I'm surprised if you think the normalizing-libel-suits effect is nontrivial.)

4habryka
Yeah, what would be my alternative true rejection? I don't think the normalization effect is weak, indeed I expect even just within my social circle for this whole situation to come up regularly as justification for threatening people with libel suits.

Everyone knew everyone knew everyone knew everyone knew someone had blue eyes. But everyone didn't know that—so there wasn't common knowledge—until the sailor made it so.

I think the conclusion is not Epoch shouldn't have hired Matthew, Tamay, and Ege but rather [Epoch / its director] should have better avoided negative-EV projects (e.g. computer use evals) (and shouldn't have given Tamay leadership-y power such that he could cause Epoch to do negative-EV projects — idk if that's what happened but seems likely).

9Rauno Arike
Seems relevant to note here that Tamay had a leadership role from the very beginning: he was the associate director already when Epoch was first announced as an org.
3Chris_Leong
This seems like a better solution on the surface, but once you dig in, I'm not so sure. Once you hire someone, assuming they're competent, it's very hard for you to decide to permanently bar them from gaining a leadership role. How are you going to explain promoting someone who seems less competent than them to a leadership role ahead of them? Or is the plan to never promote them and refuse to ever discuss it, which would create weird dynamics within an organisation. I would love to hear if you think otherwise, but it seems unworkable to me.

Good point. You're right [edit: about Epoch].

I should have said: the vibe I've gotten from Epoch and Matthew/Tamay/Ege in private in the last year is not safety-focused. (Not that I really know all of them.)

7habryka
This comment suggests it was maybe a shift over the last year or two (but also emphasises that at least Jaime thinks AI risk is still serious): https://www.lesswrong.com/posts/Fhwh67eJDLeaSfHzx/jonathan-claybrough-s-shortform?commentId=X3bLKX3ASvWbkNJkH 

(ha ha but Epoch and Matthew/Tamay/Ege were never really safety-focused, and certainly not bright-eyed standard-view-holding EAs, I think)

habryka*5020

Epoch has definitely described itself as safety focused to me and others. And I don't know man, this back and forth to me sure sounds like they were branding themselves as being safety conscious:

Ofer: Can you describe your meta process for deciding what analyses to work on and how to communicate them? Analyses about the future development of transformative AI can be extremely beneficial (including via publishing them and getting many people more informed). But getting many people more hyped about scaling up ML models, for example, can also be counterproduc

... (read more)

Accelerating AI R&D automation would be bad. But they want to accelerate misc labor automation. The sign of this is unclear to me.

5Zach Stein-Perlman
Update: they want "to build virtual work environments for automating software engineering—and then the rest of the economy." Software engineering seems like one of the few things I really think shouldn't accelerate :(.

Their main effect will be to accelerate AI R&D automation, as best I can tell. 

My guess would be that making RL envs for broad automation of the economy is bad[1] and making benchmarks which measure how good AIs are at automating jobs is somewhat good[2].

Regardless, IMO this seems worse for the world than other activities Matthew, Tamay, and Ege might do.


  1. I'd guess the skills will transfer to AI R&D etc insofar as the environments are good. I'm sign uncertain about broad automation which doesn't transfer (which would be somewhat confusing/surprising) as this would come down to increased awareness earlier vs speeding up AI deve

... (read more)

I think this stuff is mostly a red herring: the safety standards in OpenAI's new PF are super vague and so it will presumably always be able to say it meets them and will never have to use this.[1]

But if this ever matters, I think it's good: it means OpenAI is more likely to make such a public statement and is slightly less incentivized to deceive employees + external observers about capabilities and safeguard adequacy. OpenAI unilaterally pausing is not on the table; if safeguards are inadequate, I'd rather OpenAI say so.

  1. ^

    I think my main PF complaints are

... (read more)
cubefox1716

Unrelated to vagueness they can also just change the framework again at any time.

I don't know. I don't have a good explanation for why OpenAI hasn't released o3. Delaying to do lots of risk assessment would be confusing because they did little risk assessment for other models.

OpenAI slashes AI model safety testing time, FT reports. This is consistent with lots of past evidence about OpenAI's evals for dangerous capabilities being rushed, being done on weak checkpoints, and having worse elicitation than OpenAI has committed to.

This is bad because OpenAI is breaking its commitments (and isn't taking safety stuff seriously and is being deceptive about its practices). It's also kinda bad in terms of misuse risk, since OpenAI might fail to notice that its models have dangerous capabilities. I'm not saying OpenAI should delay deploym... (read more)

-13kilgoar
4Peter Wildeford
What do you think of the counterargument that OpenAI announced o3 in December and publicly solicited external safety testing then, and isn't deploying until ~4 months later?
0O O
https://www.windowscentral.com/software-apps/sam-altman-ai-will-make-coders-10x-more-productive-not-replace-them It sounds like they’re getting pretty bearish on capabilities tho

I think this isn't taking powerful AI seriously. I think the quotes below are quite unreasonable, and only ~half of the research agenda is plausibly important given that there will be superintelligence. So I'm pessimistic about this agenda/project relative to, say, the Forethought agenda.

 

AGI could lead to massive labor displacement, as studies estimate that between 30% - 47% of jobs could be directly replaceable by AI systems. . . .

AGI could lead to stagnating or falling wages for the majority of workers if AI technology replaces people faster than i

... (read more)

My guess:

This is about software tasks, or specifically "well-defined, low-context, measurable software tasks that can be done without a GUI." It doesn't directly generalize to solving puzzles or writing important papers. It probably does generalize within that narrow category.

If this was trying to measure all tasks, tasks that AIs can't do would count toward the failure rate; the main graph is about 50% success rate, not 100%. If we were worried that this is misleading because AIs are differentially bad at crucial tasks or something, we could look at success rate on those tasks specifically.

1LWLW
This is an uncharitable interpretation, but “good at increasingly long tasks which require no real cleverness” seems economically valuable, but doesn’t seem to be leading to what I think of as superintelligence. 

I don't know, maybe nothing. (I just meant that on current margins, maybe the quality of the safety team's plans isn't super important.)

I haven't read most of the paper, but based on the Extended Abstract I'm quite happy about both the content and how DeepMind (or at least its safety team) is articulating an "anytime" (i.e., possible to implement quickly) plan for addressing misuse and misalignment risks.

But I think safety at Google DeepMind is more bottlenecked by buy-in from leadership to do moderately costly things than the safety team having good plans and doing good work. [Edit: I think the same about Anthropic.]

3Knight Lee
Given that buy-in from leadership is a bigger bottleneck than the safety team's work, what would you do differently if you were in charge of the safety team?

I expect they will thus not want to use my quotes

Yep, my impression is that it violates the journalist code to negotiate with sources for better access if you write specific things about them.

2Garrett Baker
Claude says its a gray area when I ask, since this isn’t asking for the journalist to make a general change to the story or present Ben or the subject in a particular light.

My strong upvotes are giving +61 :shrug:

1MichaelDickens
Hmm I wonder if this is why so many April Fools posts have >200 upvotes. April Fools Day in cahoots with itself?

Minor Anthropic RSP update.

Old:

New:

I don't know what e.g. the "4" in "AI R&D-4" means; perhaps it is a mistake.[1]

Sad that the commitment to specify the AI R&D safety case thing was removed, and sad that "accountability" was removed.

Slightly sad that AI R&D capabilities triggering >ASL-3 security went from "especially in the case of dramatic acceleration" to only in that case.

Full AI R&D-5 description from appendix:
 

AI R&D-5: The ability to cause dramatic acceleration in the rate of effective scaling. Specifically, this would be

... (read more)
4Davidmanheim
I'm concerned because this change seems like starting down a slippery slope where it's easy to change the rules which they previously said would apply, by making smaller changes instead.
Answer by Zach Stein-Perlman*174

A. Many AI safety people don't support relatively responsible companies unilaterally pausing, which PauseAI advocates. (Many do support governments slowing AI progress, or preparing to do so at a critical point in the future. And many of those don't see that as tractable for them to work on.)

B. "Pausing AI" is indeed more popular than PauseAI, but it's not clearly possible to make a more popular version of PauseAI that actually does anything; any such organization will have strategy/priorities/asks/comms that alienate many of the people who think "yeah I s... (read more)

1humnrdble
Thank you for responding! A: Yeah. I'm mostly positive about their goal to work towards "building the Pause button". I think protesting against "relatively responsible companies" makes a lot of sense when these companies seem to use their lobbying power more against AI-Safety-aligned Governance than in favor of it. You're obviously very aware of the details here. B: I asked my question because I'm frustrated with that. Is there a way for AI Safety to coordinate a better reaction? C: I phrased that a bit sharply, but I find your reply very useful: These are quite strong claims! I'll take that as somewhat representative of the community. My attempt at paraphrasing: It's not (strictly?) necessary to slow down AI to prevent doom. There is a lot of useful AI Safety work going on that is not focused on slowing/pausing AI. This work is useful even if AGI is coming soon.   1. ^ Saying "PauseAI good" does not take a lot of an AI Safety researcher's time.
3Davidmanheim
You think it's obviously materially less? Because there is a faction, including Eliezer and many others, that think it's epsilon, and claim that the reduction in risk from any technical work is less than the acceleration it causes. (I think you're probably right about some of that work, but I think it's not at all obviously true!)
4MichaelDickens
This strikes me as a very strange claim. You're essentially saying, even if a general policy is widely supported, it's practically impossible to implement any specific version of that policy? Why would that be true? For example I think a better alternative to "nobody fund PauseAI, and nobody make an alternative version they like better" would be "there are 10+ orgs all trying to pause AI and they all have somewhat different goals but they're all generally pushing in the direction of pausing AI". I think in the latter scenario you are reasonably likely to get some decent policies put into place even if they're not my favorite.
Answer by Zach Stein-Perlman40

It is often said that control is for safely getting useful work out of early powerful AI, not making arbitrarily powerful AI safe.

If it turns out large, rapid, local capabilities increase is possible, the leading developer could still opt to spend some inference compute on safety research rather than all on capabilities research.

2Noosphere89
I agree that some inference compute can be shifted from capabilities to safety, and it work just as well even during a software intelligence explosion. My worry was more so that a lot of the control agenda and threat models like rogue internal deployments to get more compute would be fundamentally threatened if the assumption that you had to get more hardware compute for more power was wrong, and instead a software intelligence explosion could be done that used in principle fixed computing power, meaning catastrophic actions to disempower humanity/defeat control defenses were much easier for the model. I'm not saying control is automatically doomed even under FOOM/software intelligence explosion, but I wanted to make sure that the assumption of FOOM being true didn't break a lot of control techniques/defenses/hopes.

lc has argued that the measured tasks are unintentionally biased towards ones where long-term memory/context length doesn't matter:

https://www.lesswrong.com/posts/hhbibJGt2aQqKJLb7/shortform-1#vFq87Ge27gashgwy9

I think doing 1-week or 1-month tasks reliably would suffice to mostly automate lots of work.

Good point, thanks. I think eventually we should focus more on reducing P(doom | sneaky scheming) but for now focusing on detection seems good.

xAI Risk Management Framework (Draft)

You're mostly right about evals/thresholds. Mea culpa. Sorry for my sloppiness.

For misuse, xAI has benchmarks and thresholds—or rather examples of benchmarks thresholds to appear in the real future framework—and based on the right column they seem very reasonably low.

Unlike other similar documents, these are not thresholds at which to implement mitigations but rather thresholds to reduce performance to. So it seems the primary concern is probably not the thresholds are too high but rather xAI's mitigations won't be robu... (read more)

This shortform discusses the current state of responsible scaling policies (RSPs). They're mostly toothless, unfortunately.

The Paris summit was this week. Many companies had committed to make something like an RSP by the summit. Half of them did, including MicrosoftMetaxAI, and Amazon. (NVIDIA did not—shame on them—but I hear they are writing something.) Unfortunately but unsurprisingly, these policies are all vague and weak.

RSPs essentially have four components: capability thresholds beyond which a model might b... (read more)

Dan H163

capability thresholds be vague or extremely high

xAI's thresholds are entirely concrete and not extremely high.

evaluation be unspecified or low-quality

They are specified and as high-quality as you can get. (If there are better datasets let me know.)

I'm not saying it's perfect, but I wouldn't but them all in the same bucket. Meta's is very different from DeepMind's or xAI's.

There also used to be a page for Preparedness: https://web.archive.org/web/20240603125126/https://openai.com/preparedness/. Now it redirects to the safety page above.

(Same for Superalignment but that's less interesting: https://web.archive.org/web/20240602012439/https://openai.com/superalignment/.)

DeepMind updated its Frontier Safety Framework (blogpost, framework, original framework). It associates "recommended security levels" to capability levels, but the security levels are low. It mentions deceptive alignment and control (both control evals as a safety case and monitoring as a mitigation); that's nice. The overall structure is like we'll do evals and make a safety case, with some capabilities mapped to recommended security levels in advance. It's not very commitment-y:

We intend to evaluate our most powerful frontier models regularly 

 ... (read more)

My guess is it's referring to Anthropic's position on SB 1047, or Dario's and Jack Clark's statements that it's too early for strong regulation, or how Anthropic's policy recommendations often exclude RSP-y stuff (and when they do suggest requiring RSPs, they would leave the details up to the company).

2Lukas Finnveden
SB1047 was mentioned separately so I assumed it was something else. Might be the other ones, thanks for the links.

o3-mini is out (blogpost, tweet). Performance isn't super noteworthy (on first glance), in part since we already knew about o3 performance.

Non-fact-checked quick takes on the system card:

the model referred to below as the o3-mini post-mitigation model was the final model checkpoint as of Jan 31, 2025 (unless otherwise specified)

Big if true (and if Preparedness had time to do elicitation and fix spurious failures)

If this is robust to jailbreaks, great, but presumably it's not, so low post-mitigation performance is far from sufficient for safety-from-misuse;... (read more)

2Mateusz Bagiński
Rushed bc of deepseek?

Thanks. The tax treatment is terrible. And I would like more clarity on how transformative AI would affect S&P 500 prices (per this comment). But this seems decent (alongside AI-related calls) because 6 years is so long. 

I wrote this for someone but maybe it's helpful for others

What labs should do:

  • I think the most important things for a relatively responsible company are control and security. (For irresponsible companies, I roughly want them to make a great RSP and thus become a responsible company.)
  • Reading recommendations for people like you (not a control expert but has context to mostly understand the Greenblatt plan):
    • Control: Redwood blogposts[1] or ask a Redwood human "what's the threat model" and "what are the most promising control techniques"
    • Security: not
... (read more)

I think ideally we'd have several versions of a model. The default version would be ignorant about AI risk, AI safety and evaluation techniques, and maybe modern LLMs (in addition to misuse-y dangerous capabilities). When you need a model that's knowledgeable about that stuff, you use the knowledgeable version.

Related: https://docs.google.com/document/d/14M2lcN13R-FQVfvH55DHDuGlhVrnhyOo8O0YvO1dXXM/edit?tab=t.0#heading=h.21w31kpd1gl7

Somewhat related: https://www.alignmentforum.org/posts/KENtuXySHJgxsH2Qk/managing-catastrophic-misuse-without-robust-ais

Yeah, I agree with this and am a fan of this from the google doc:

Remove biology, technical stuff related to chemical weapons, technical stuff related to nuclear weapons, alignment and AI takeover content (including sci-fi), alignment or AI takeover evaluation content, large blocks of LM generated text, any discussion of LLMs more powerful than GPT2 or AI labs working on LLMs, hacking, ML, and coding from the training set. 

and then fine-tune if you need AIs with specific info. There are definitely issues here with AIs doing safety research (e.g., to solve risks from deceptive alignment they need to know what that is), but this at least buys some marginal safety. 

[Perfunctory review to get this post to the final phase]

Solid post. Still good. I think a responsible developer shouldn't unilaterally pause but I think it should talk about the crazy situation it's in, costs and benefits of various actions, what it would do in different worlds, and its views on risks. (And none of the labs have done this; in particular Core Views is not this.)

One more consideration against (or an important part of "Bureaucracy"): sometimes your lab doesn't let you publish your research.

Yep, the final phase-in date was in November 2024.

Some people have posted ideas on what a reasonable plan to reduce AI risk for such timelines might look like (e.g. Sam Bowman’s checklist, or Holden Karnofsky’s list in his 2022 nearcast), but I find them insufficient for the magnitude of the stakes (to be clear, I don’t think these example lists were intended to be an extensive plan).

See also A Plan for Technical AI Safety with Current Science (Greenblatt 2023) for a detailed (but rough, out-of-date, and very high-context) plan.

Yeah. I agree/concede that you can explain why you can't convince people that their own work is useless. But if you're positing that the flinchers flinch away from valid arguments about each category of useless work, that seems surprising.

TsviBT204

The flinches aren't structureless particulars. Rather, they involve warping various perceptions. Those warped perceptions generalize a lot, causing other flaws to be hidden.

As a toy example, you could imagine someone attached to the idea of AI boxing. At first they say it's impossible to break out / trick you / know about the world / whatever. Then you convince them otherwise--that the AI can do RSI internally, and superhumanly solve computer hacking / protein folding / persuasion / etc. But they are attached to AI boxing. So they warp their perception, cl... (read more)

I feel like John's view entails that he would be able to convince my friends that various-research-agendas-my-friends-like are doomed. (And I'm pretty sure that's false.) I assume John doesn't believe that, and I wonder why he doesn't think his view entails it.

5johnswentworth
From the post:

I wonder whether John believes that well-liked research, e.g. Fabien's list, is actually not valuable or rare exceptions coming from a small subset of the "alignment research" field.

7johnswentworth
This is the sort of object-level discussion I don't want on this post, but I've left a comment on Fabien's list.
8Buck
I strongly suspect he thinks most of it is not valuable

I do not.

On the contrary, I think ~all of the "alignment researchers" I know claim to be working on the big problem, and I think ~90% of them are indeed doing work that looks good in terms of the big problem. (Researchers I don't know are likely substantially worse but not a ton.)

In particular I think all of the alignment-orgs-I'm-socially-close-to do work that looks good in terms of the big problem: Redwood, METR, ARC. And I think the other well-known orgs are also good.

This doesn't feel odd: these people are smart and actually care about the big problem;... (read more)

Yeah, I agree sometimes people decide to work on problems largely because they're tractable [edit: or because they’re good for safety getting alignment research or other good work out of early AGIs]. I'm unconvinced of the flinching away or dishonest characterization.

6TsviBT
Do you think that funders are aware that >90% [citation needed!] of the money they give to people, to do work described as helping with "how to make world-as-we-know-it ending AGI without it killing everyone", is going to people who don't even themselves seriously claim to be doing research that would plausibly help with that goal? If they are aware of that, why would they do that? If they aren't aware of it, don't you think that it should at least be among your very top hypotheses, that those researchers are behaving materially deceptively, one way or another, call it what you will?
Load More