Zach Stein-Perlman

AI strategy & governance. ailabwatch.org. ailabwatch.substack.com

Sequences

Slowing AI

Comments

My strong upvotes are giving +61 :shrug:

Minor Anthropic RSP update.

Old:

New:

I don't know what e.g. the "4" in "AI R&D-4" means; perhaps it is a mistake.[1]

Sad that the commitment to specify the AI R&D safety case thing was removed, and sad that "accountability" was removed.

Slightly sad that AI R&D capabilities triggering >ASL-3 security went from applying "especially in the case of dramatic acceleration" to applying only in that case.

Full AI R&D-5 description from appendix:
 

AI R&D-5: The ability to cause dramatic acceleration in the rate of effective scaling. Specifically, this would be the case if we observed or projected an increase in the effective training compute of the world’s most capable model that, over the course of a year, was equivalent to two years of the average rate of progress during the period of early 2018 to early 2024. We roughly estimate that the 2018-2024 average scaleup was around 35x per year, so this would imply an actual or projected one-year scaleup of 35^2 = ~1000x.

The 35x/year scaleup estimate is based on assuming the rate of increase in compute being used to train frontier models from ~2018 to May 2024 is 4.2x/year (reference), the impact of increased (LLM) algorithmic efficiency is roughly equivalent to a further 2.8x/year (reference), and the impact of post training enhancements is a further 3x/year (informal estimate). Combined, these have an effective rate of scaling of 35x/year.
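As a quick sanity check on the quoted numbers, here is the compounding arithmetic spelled out (my own back-of-the-envelope, not part of the RSP; the variable names are just for illustration):

```python
# Back-of-the-envelope check of the effective-scaling arithmetic quoted above.
compute_growth = 4.2          # x/year: frontier training compute, ~2018 to May 2024
algorithmic_efficiency = 2.8  # x/year: equivalent gain from (LLM) algorithmic progress
post_training = 3.0           # x/year: informal estimate for post-training enhancements

effective_scaling = compute_growth * algorithmic_efficiency * post_training
print(effective_scaling)       # ~35.3x/year, consistent with the "around 35x" figure

# AI R&D-5 trigger: two years of average progress compressed into one year.
print(effective_scaling ** 2)  # ~1245x; the RSP rounds 35^2 (= 1225) to "~1000x"
```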

Also, from the changelog:

Iterative Commitment: We have adopted a general commitment to reevaluate our Capability Thresholds whenever we upgrade to a new set of Required Safeguards. We have decided not to maintain a commitment to define ASL-N+1 evaluations by the time we develop ASL-N models; such an approach would add unnecessary complexity because Capability Thresholds do not naturally come grouped in discrete levels. We believe it is more practical and sensible instead to commit to reconsidering the whole list of Capability Thresholds whenever we upgrade our safeguards.

I'm confused by the last sentence.

See also https://www.anthropic.com/rsp-updates.

  1. ^

    Jack Clark misspeaks on twitter, saying "these updates clarify . . . our ASL-4/5 thresholds for AI R&D." But AI R&D-4 and -5 trigger ASL-3 and -4 safeguards, respectively; the RSP doesn't mention ASL-5. Maybe the RSP is wrong and AI R&D-4 is supposed to trigger ASL-4 security, and same for 5 — that would make the terms "AI R&D-4" and "AI R&D-5" make more sense...

Answer by Zach Stein-Perlman

A. Many AI safety people don't support relatively responsible companies unilaterally pausing, which PauseAI advocates. (Many do support governments slowing AI progress, or preparing to do so at a critical point in the future. And many of those don't see that as tractable for them to work on.)

B. "Pausing AI" is indeed more popular than PauseAI, but it's not clearly possible to make a more popular version of PauseAI that actually does anything; any such organization will have strategy/priorities/asks/comms that alienate many of the people who think "yeah I support pausing AI."

C. 

There does not seem to be a legible path to prevent possible existential risks from AI without slowing down its current progress.

This seems confused. Obviously P(doom | no slowdown) < 1. Many people's work reduces risk in both slowdown and no-slowdown worlds, and it seems pretty clear to me that most of them shouldn't switch to working on increasing P(slowdown).

Answer by Zach Stein-Perlman

It is often said that control is for safely getting useful work out of early powerful AI, not making arbitrarily powerful AI safe.

If it turns out large, rapid, local capabilities increase is possible, the leading developer could still opt to spend some inference compute on safety research rather than all on capabilities research.

I think doing 1-week or 1-month tasks reliably would suffice to mostly automate lots of work.

Good point, thanks. I think eventually we should focus more on reducing P(doom | sneaky scheming) but for now focusing on detection seems good.

xAI Risk Management Framework (Draft)

You're mostly right about evals/thresholds. Mea culpa. Sorry for my sloppiness.

For misuse, xAI has benchmarks and thresholds (or rather, examples of the benchmarks and thresholds that are supposed to appear in the real future framework), and based on the right column the thresholds seem reasonably low.

Unlike other similar documents, these are not thresholds at which to implement mitigations but rather thresholds to reduce performance to. So the primary concern is probably not that the thresholds are too high but rather that xAI's mitigations won't be robust to jailbreaks and that xAI won't elicit performance from post-mitigation models well. E.g. it would be inadequate to just run a benchmark with a refusal-trained model, note that it almost always refuses, and call it a success. You need something like: a capable red team tries to break the mitigations and use the model for harm, and either the red team fails or succeeding is so costly that the model doesn't make doing harm cheaper.
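To make that pass/fail criterion concrete, here is a minimal sketch (my own framing, not anything from xAI's framework; the function and parameter names are illustrative):

```python
# Sketch of the adequacy criterion above: mitigations pass only if a capable red
# team fails to elicit the dangerous capability, or succeeds only at a cost that
# means the model provides no uplift over doing the harm without it.
def mitigations_adequate(red_team_elicited_harm: bool,
                         cost_with_model: float,
                         cost_without_model: float) -> bool:
    if not red_team_elicited_harm:
        return True  # the red team couldn't break the mitigations
    # The red team succeeded: acceptable only if using the model is no cheaper
    # than causing the same harm without it.
    return cost_with_model >= cost_without_model
```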

(For "Loss of Control," one of the two cited benchmarks was published today—I'm dubious that it measures what we care about but I've only spent ~3 mins engaging—and one has not yet been published. [Edit: and, like, on priors, I'm very skeptical of alignment evals/metrics, given the prospect of deceptive alignment, how we care about worst-case in addition to average-case behavior, etc.])

This shortform discusses the current state of responsible scaling policies (RSPs). They're mostly toothless, unfortunately.

The Paris summit was this week. Many companies had committed to make something like an RSP by the summit. Half of them did, including Microsoft, Meta, xAI, and Amazon. (NVIDIA did not—shame on them—but I hear they are writing something.) Unfortunately but unsurprisingly, these policies are all vague and weak.

RSPs essentially have four components: capability thresholds beyond which a model might be dangerous by default, an evaluation protocol to determine when models reach those thresholds, a plan for how to respond when various thresholds are reached, and accountability measures.

A maximally lazy RSP—a document intended to look like an RSP without making the company do anything differently—would have vague or extremely high capability thresholds, unspecified or low-quality evaluation, responses like "we will make it safe" rather than substantive mitigations or robustness guarantees, and no accountability measures. Such a policy would be little better than the company saying "we promise to deploy AIs safely." The new RSPs are basically like that.[1]

Some aspects of some RSPs that existed before the summit are slightly better.[2]

If existing RSPs are weak, how would a strong RSP be different?

  • Evals: evals should measure relevant capabilities (including cyber, bio, and scheming), evals should be sufficiently difficult, and labs should do good elicitation. (As a lower bar, the evals should exist; many companies say they will do evals but don't seem to have a plan for what evals to do.)
  • Response: misuse
    • Rather than just saying that you'll implement mitigations such that users can't access dangerous capabilities, say how you'll tell whether your mitigations are good enough. For example, say that you'll have a skilled red team attempt to elicit dangerous stuff from the post-mitigation model, and ensure either that it doesn't provide uplift or that the elicitation required is so involved that doing harm this way is no cheaper than doing it without the AI.
  • Response: control[3]
  • Response: security (especially of model weights) — we're very far from securing model weights against determined sophisticated attackers, so:
    • Avoid expanding the Pareto frontier between powerful/dangerous and insecure much
    • Say that no AI company has secured its model weights and that this imposes unacceptable risk. Commit that if all other developers were willing to implement super strong security, even though it's costly, you would too.
  • Thresholds
    • Should be low enough that your responses trigger before your models enable catastrophic harm
    • Ideally should be operationalized in evals, but this is genuinely hard
  • Accountability
    • Publish info on evals/elicitation (such that others can tell whether it's adequate)
    • Be transparent to an external auditor; have them review your evals and RSP and publicly comment on (1) adequacy and (2) whether your decisions about not publishing various details are reasonable
    • See also https://metr.org/rsp-key-components/#accountability

(What am I happy about in current RSPs? Briefly and without justification: yay Anthropic and maybe DeepMind on misuse stuff; somewhat yay Anthropic and OpenAI on their eval-commitments; somewhat yay DeepMind and maybe OpenAI on their actual evals; somewhat yay DeepMind on scheming/monitoring/control; maybe somewhat yay DeepMind and Anthropic on security (but not adequate).)

(Companies' other commitments aren't meaningful either.)

  1. ^

    Microsoft is supposed to conduct "robust evaluation of whether a model possesses tracked capabilities at high or critical levels, including through adversarial testing and systematic measurement using state-of-the-art methods," but there are no details on what evals they'll use. The response to dangerous capabilities is "Further review and mitigations required." The "Security measures" are underspecified but do take the situation seriously. The "Safety mitigations" are less substantive, unfortunately. There's not really accountability. Nothing on alignment or internal deployment.

    Meta has high vague risk thresholds, a vague evaluation plan, vague responses (e.g. "security protections to prevent hacking or exfiltration" and "mitigations to reduce risk to moderate level"), and no accountability. But they do suggest that if they make a model with dangerous capabilities, they won't release the weights and if they do deploy it externally (via API) they'll have decent robustness to jailbreaks — there are loopholes but they hadn't articulated that principle before. Nothing on alignment or internal deployment.

    xAI's policy begins "This is the first draft iteration of xAI's risk management framework that we expect to apply to future models not currently in development." Misuse evals are cheap (like multiple-choice questions rather than uplift experiments); alignment evals reference unpublished papers and mention "Utility Functions" and "Corrigibility Score." Thresholds would be operationalized as eval results, which is nice, but they don't yet exist. For misuse, the framework includes "Examples of safeguards or mitigations" (but not "we'll know mitigations are adequate if a red team fails to break them" or other details to suggest the mitigations will be effective); there are no mitigations for alignment.

    Amazon: "Critical Capability Thresholds" are high and vague; evaluation is vague; mitigations are vague ("Upon determining that an Amazon model has reached a Critical Capability Threshold, we will implement a set of Safety Measures and Security Measures to prevent elicitation of the critical capability identified and to protect against inappropriate access risks. Safety Measures are designed to prevent the elicitation of the observed Critical Capabilities following deployment of the model. Security Measures are designed to prevent unauthorized access to model weights or guardrails implemented as part of the Safety Measures, which could enable a malicious actor to remove or bypass existing guardrails to exceed Critical Capability Thresholds."); there's not really accountability. Nothing on alignment or internal deployment. I appreciate the list of current security practices at the end of the document.

  2. ^

    Briefly and without justification:

    OpenAI is similarly vague on thresholds and evals and responses, with no accountability. (Also OpenAI is untrustworthy in general and has a bad history on Preparedness in particular.) But they've done almost-decent evals in the past.

    The DeepMind thing isn't really a commitment, and it doesn't say much about when to do evals, and the security levels are low (but this is no worse than everyone else being vague), and it doesn't have accountability, but it does mention deceptive alignment + control evals + monitoring, and DeepMind has done almost-decent evals in the past.

    The Anthropic thing is fine except it only goes up to ASL-3 and there's no control and little accountability (and the security isn't great (especially for systems after the first systems requiring ASL-3 security) but it's no worse than others).

    See The current state of RSPs modulo the DeepMind FSF update.

  3. ^

    DeepMind says something good on this (but it's not perfect and is only effective if they do good evals sufficiently frequently). Other RSPs don't seriously talk about risks from deceptive alignment.
