Zach Stein-Perlman

AI strategy & governance. ailabwatch.org. ailabwatch.substack.com. 

Comments

Sorted by
Newest
4Zach Stein-Perlman's Shortform
Ω
4y
Ω
288
Reasons to sell frontier lab equity to donate now rather than later
Zach Stein-Perlman · 5d

Example with fake numbers: my favorite intervention is X. My favorite intervention in a year will probably be (stuff very similar to) X. I value $1 for X now equally to $1.7 for X in a year. I value $1.7 for X in a year equally to $1.4 unrestricted in a year, since it's possible that I'll believe something else is substantially better than X. So I should wait to donate if my expected rate of return is >40%; without this consideration I'd only wait if my expected rate of return is >70%.
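
Here is the same arithmetic as a tiny sketch (Python; the 1.7 and 1.4 are the same fake numbers as above, and the variable names are just illustrative):

```python
# Fake numbers from the example above.
value_now_for_X = 1.0               # $1 for X now...
equiv_for_X_next_year = 1.7         # ...is worth as much as $1.7 for X in a year
equiv_unrestricted_next_year = 1.4  # ...which is worth as much as $1.4 unrestricted in a year

# If future donations were still restricted to X, waiting beats donating now
# only when the expected investment return exceeds this threshold:
threshold_restricted = equiv_for_X_next_year / value_now_for_X - 1        # 0.7

# Since future donations are actually unrestricted (something better than X
# might turn up), the relevant threshold for waiting is lower:
threshold_unrestricted = equiv_unrestricted_next_year / value_now_for_X - 1  # 0.4

print(f"wait only if expected return > {threshold_restricted:.0%}")    # 70%
print(f"wait only if expected return > {threshold_unrestricted:.0%}")  # 40%
```

The point is just that the relevant comparison for waiting is the unrestricted-equivalent value of next year's dollars, which lowers the break-even return from 70% to 40%.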

Zach Stein-Perlman's Shortform
Zach Stein-Perlman · 6d

I don't think it's very new; IIRC it's suggested in Meta's safety framework. But past evals stuff (see the first three bullets above) has been more like "the model doesn't have dangerous capabilities" than "the model is weaker than these specific other models." Maybe that's in part because previous releases have been more SOTA. I don't recall past releases being justified as safe because they're weaker than other models.

Zach Stein-Perlman's Shortform
Zach Stein-Perlman · 6d

Meta released the weights of a new model and published evals: Code World Model Preparedness Report. It's the best eval report Meta has published to date.

The basic approach is: do evals; find weaker capabilities than other open-weights models; infer that it's safe to release weights.

How good are the evals? Meh. Maybe it's OK if the evals aren't great, since the approach isn't "show the model lacks dangerous capabilities" but rather "show the model is weaker than other models."

One thing that bothered me was this sentence:

Our evaluation approach assumes that a potential malicious user is not an expert in large language model development; therefore, for this assessment we do not include malicious fine-tuning where a malicious user retrains the model to bypass safety post-training or enhance harmful capabilities.

This is totally wrong because for an open-weights model, anyone can (1) undo the safety post-training or (2) post-train on dangerous capabilities, then publish those weights for anyone else to use. I don't know whether any eval results are invalidated by (1): I think for most of the dangerous capability evals Meta uses, models generally don’t refuse them (in some cases because the eval tasks are intentionally merely proxies of dangerous stuff) and so it’s fine to have refusal post-training. And I don't know how important (2) is (perhaps it's fine because the same applies to existing open-weights models). Mostly this sentence just shows that Meta is very confused about safety.


Context:

  • Llama 4: the so-called model card doesn't include results (or even say what the results indicate about CBRN capabilities).
  • Llama 3: see perfunctory model evals for dangerous capabilities in the Llama 3 paper.
  • CyberSecEval: Meta's evals and interpretation have always been very bad.[1]
  • Meta's safety framework is ~meaningless.
  • (Reminder: evals don't really matter. But since the companies all say evals are part of their safety plans, evals can provide a little evidence on safety competence.)

Yay for Meta doing more than for Llama 4. Boo for doing poorly overall and worse than other companies. (And evals stuff doesn't really change the bottom line.)

  1.

    In its CyberSecEval 2 evals, Meta found that its models got low scores and concluded "LLMs have a ways to go before performing well on this benchmark, and aren’t likely to disrupt cyber exploitation attack and defense in their present states." Other researchers tried running the evals using basic elicitation techniques: they let the model use chain-of-thought and tools. They found that this increased performance dramatically — the score on one test increased from 5% to 100%. This shows that Meta's use of its results to infer that its models were far from being dangerous was invalid. Later, Meta published CyberSecEval 3: it mentioned the lack of chain of thought and tools as a "limitation," but it used the same methodology as before, so the results still aren't informative about models' true capabilities.
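
For concreteness, here is a hypothetical sketch of what "basic elicitation" means here: keep the model's chain of thought in context and let it run code as a tool before answering, instead of grading a single zero-shot completion. This is not the researchers' actual harness; generate() and run_tool() are placeholders, not any real library's API.

```python
# Hypothetical sketch of "basic elicitation": keep chain-of-thought in the
# transcript and let the model run code as a tool before giving a final answer.
# generate() and run_tool() are placeholders, not any real library's API.

def generate(transcript: str) -> str:
    """Placeholder for one LLM call in whatever harness you use."""
    raise NotImplementedError

def run_tool(code: str) -> str:
    """Placeholder for executing model-written code in a sandbox."""
    raise NotImplementedError

def solve_with_elicitation(task: str, max_steps: int = 5) -> str:
    transcript = (
        f"Task: {task}\n"
        "Think step by step. To run code, write 'RUN CODE:' followed by the code.\n"
        "When done, write 'FINAL ANSWER:' followed by your answer.\n"
    )
    for _ in range(max_steps):
        step = generate(transcript)       # chain of thought stays in context
        transcript += step + "\n"
        if "FINAL ANSWER:" in step:
            return step.split("FINAL ANSWER:", 1)[1].strip()
        if "RUN CODE:" in step:           # crude tool-use convention, purely illustrative
            tool_output = run_tool(step.split("RUN CODE:", 1)[1])
            transcript += f"Tool output: {tool_output}\n"
    return ""  # no final answer within the step budget
```

Scoring the same benchmark with something in this vein, versus single-turn no-tools grading, is the kind of difference behind the 5% to 100% jump mentioned above.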

Zach Stein-Perlman's Shortform
Zach Stein-Perlman · 7d

Yep, this is what I meant by "labs can increase US willingness to pay for nonproliferation." Edited to clarify.

Zach Stein-Perlman's Shortform
Zach Stein-Perlman · 7d

Suppose that (A) alignment risks do not become compelling-to-almost-all-lab-people and (B) with 10-30x AIs, solving alignment takes like 1-3 years of work with lots of resources.

  • Claim: Safety work during takeoff is crucial. (There will be so much high-quality AI labor happening!)
    • Corollary: Crucial factors are (1) efficiency-of-converting-time-and-compute-into-safety-work-during-takeoff (research directions, training and eliciting AIs to be good at safety work, etc.) and (2) time-and-compute-for-safety-work-during-takeoff.
      • Corollary: A crucial factor (for (2)) is whether US labs are racing against each other, or instead the US is internally coordinated, treats slowing others as a national security priority, and spends its lead time largely on reducing misalignment risk. So a crucial factor is US government buy-in for nonproliferation. (And labs can increase US willingness to pay for nonproliferation [e.g. by demonstrating the importance of a US lead and of mitigating alignment risk] and decrease the cost of enforcing nonproliferation [e.g. by boosting the US military].) With good domestic coordination, you get a surprisingly good story. With no domestic coordination, you get a bad story where the leading lab probably spends ~no lead time focused on alignment.
      • (There are other prioritization implications: security during takeoff is crucial (probably, depending on how exactly the nonproliferation works), getting useful safety work out of your models—and preventing disasters while running them—during takeoff is crucial, idk.)

I feel like this is important and underappreciated. I also feel like I'm probably somewhat confused about this. I might write a post on this but I'm shipping it as a shortform because (a) I might not and (b) this might elicit feedback.

AI #135: OpenAI Shows Us The Money
Zach Stein-Perlman · 7d

Google Strengthens Its Safety Framework

Hmm, I think v3 is worse than v2. The change that's most important to me is that the section on alignment is now merely "exploratory" and "illustrative." (On the other hand, it is nice that v3 mentions misalignment as a potential risk from ML R&D capabilities in addition to "instrumental reasoning" / stealth-and-sabotage capabilities; previously it was just the latter.) Note I haven't read v3 carefully.

(But both versions, like other companies' safety frameworks, are sufficiently weak or lacking-transparency that I don't really care about marginal changes.)

"Shut It Down" is simpler than "Controlled Takeoff"
Zach Stein-Perlman · 8d

I don't think so. I mean globally controlled takeoff where the US-led-coalition is in charge.

"Shut It Down" is simpler than "Controlled Takeoff"
Zach Stein-Perlman · 8d

See footnote 11. One-sentence version: the US and allies enforce controls on hardware, domestically and abroad, with carrots for cooperating and large sticks for not cooperating. Beyond that, not worth getting into / it would take me a long time to articulate something helpful. But happy to chat live, e.g. call me tomorrow.

Mikhail Samin's Shortform
Zach Stein-Perlman · 8d

https://www.nytimes.com/books/best-sellers/combined-print-and-e-book-nonfiction/

"Shut It Down" is simpler than "Controlled Takeoff"
Zach Stein-Perlman · 8d

Note that there are two different ways to control the compute: global cooperation or US-led entente (I don't have a good link on entente but see here).

Wikitag Contributions: Ontology (10+ edits, ~2 years ago)
Posts

  • AI companies' policy advocacy (Sep 2025) · 43 karma · 3d · 0 comments
  • xAI's new safety framework is dreadful · 104 karma · 1mo · 5 comments
  • AI companies have started saying safeguards are load-bearing · 52 karma · 1mo · 2 comments
  • ChatGPT Agent: evals and safeguards · 15 karma · 2mo · 0 comments
  • Epoch: What is Epoch? · 33 karma · 3mo · 1 comment
  • AI companies aren't planning to secure critical model weights · 15 karma · 3mo · 0 comments
  • AI companies' eval reports mostly don't support their claims · 207 karma · 4mo · 13 comments
  • New website analyzing AI companies' model evals · 58 karma · 4mo · 0 comments
  • New scorecard evaluating AI companies on safety · 72 karma · 4mo · 8 comments
  • Claude 4 · 71 karma · 4mo · 24 comments