Frontier AI labs can boost external safety researchers by

  • Sharing better access to powerful models (early access, fine-tuning, helpful-only,[1] filters/moderation-off, logprobs, activations)[2]; see the sketch after this list
  • Releasing research artifacts besides models
  • Publishing (transparent, reproducible) safety research
  • Giving API credits
  • Mentoring
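
To make the access items concrete, here's what "logprobs" access looks like in practice: per-token log-probabilities over alternative tokens, which let a researcher measure how strongly a model prefers one continuation over another. A minimal sketch, assuming an OpenAI-style chat completions endpoint and the `openai` Python SDK; the model name is illustrative, and not every lab's API exposes this.

```python
# Sketch: reading per-token logprobs from an OpenAI-style chat completions API.
# Assumes the `openai` Python SDK and an API key in the environment; the model
# name is illustrative.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

response = client.chat.completions.create(
    model="gpt-4o-mini",  # illustrative
    messages=[{"role": "user", "content": "Is the sky green? Answer yes or no."}],
    max_tokens=1,
    logprobs=True,
    top_logprobs=5,  # top-5 alternative tokens per output position
)

for token_info in response.choices[0].logprobs.content:
    print(f"chosen token: {token_info.token!r}")
    for alt in token_info.top_logprobs:
        print(f"  {alt.token!r}: logprob {alt.logprob:.3f}")
```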

Here's what the labs have done (besides just publishing safety research[3]).

Anthropic:

Google DeepMind:

  • Publishing their model evals for dangerous capabilities and sharing resources for reproducing some of them
  • Releasing Gemma SAEs
  • Releasing Gemma weights (see the sketch after this list)
  • (External mentoring, in particular via MATS)
  • [No fine-tuning or deep access to frontier models]
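
For the Gemma releases above, the practical upshot for outside researchers is being able to run the model locally and read out its internal activations, the kind of data the released SAEs are trained on. A minimal sketch, assuming the HuggingFace `transformers` library and access to the Gemma weights on the Hub; the checkpoint name and layer index are illustrative.

```python
# Sketch: pulling residual-stream activations out of an open-weights Gemma model.
# Assumes the HuggingFace `transformers` library and access to the Gemma weights
# on the Hub; checkpoint name and layer index are illustrative.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "google/gemma-2-2b"  # illustrative released checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype=torch.bfloat16,
    output_hidden_states=True,
)

inputs = tokenizer("The capital of France is", return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# hidden_states is a tuple: (embedding output, layer 1, ..., layer N)
layer = 12  # arbitrary middle layer for illustration
acts = outputs.hidden_states[layer]  # shape: (batch, seq_len, hidden_size)
print(acts.shape)
```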

OpenAI:

Meta AI:

Microsoft:

  • [Nothing]

xAI:

  • [Nothing]

Related papers:

[1] "Helpful-only" refers to the version of the model RLHFed/RLAIFed/fine-tuned for helpfulness but not harmlessness.

[2] Releasing model weights will likely be dangerous once models are more powerful. All past releases seem fine, but Meta's poor risk assessment, and its lack of a plan to make release decisions conditional on risk assessment, are concerning.

[3] And an unspecified amount of funding via Frontier Model Forum grants.

1 comment:

Yeah, this seems like a good point. Not a lot to argue with, but yeah, underrated.