Associated with AI Lab Watch, I sent questions to some labs a week ago (except I failed to reach Microsoft). I didn't really get any replies (one person replied in their personal capacity; this was very limited and they didn't answer any questions). Here are most of those questions, with slight edits since I shared them with the labs + questions I asked multiple labs condensed into the last two sections.

Lots of my questions are normal "I didn't find public info on this safety practice and I think you should explain" questions. Some are more like "it's pretty uncool that I can't find the answer to this": breaking commitments, breaking not-quite-commitments and not explaining, having ambiguity around commitments, and taking credit for stuff[1] when it's very unclear that you should get credit are all pretty uncool.

Anthropic

Internal governance stuff (I'm personally particularly interested in these questions—I think Anthropic has tried to set up great internal governance systems and maybe it has succeeded but it needs to share more information for that to be clear from the outside):

  • Who is on the board and what's up with the LTBT?[2] In September, Vox reported "The Long-Term Benefit Trust . . . will elect a fifth member of the board this fall." Did that happen? (If so: who is it? when did this happen? why haven't I heard about this? If not: did Vox hallucinate this or did your plans change (and what is the plan)?)
  • What are the details on the "milestones" for the LTBT and how stockholders can change/abrogate the LTBT? Can you at least commit that we'd quickly hear about it if stockholders changed/abrogated the LTBT? (Why hasn't this been published?)
  • What formal powers do investors/stockholders have, besides abrogating the LTBT? (can they replace the two board members who represent them? can they replace other board members?)
  • What does Anthropic owe to its investors/stockholders? (any fiduciary duty? any other promises or obligations?) I think balancing their interests with pursuit of the mission; anything more concrete?
    • I'm confused about what such balancing-of-interests entails. Oh well.
  • Who holds Anthropic shares + how much? At least: how much is Google + Amazon?

Details of when the RSP triggers evals:

  • "During model training and fine-tuning, Anthropic will conduct an evaluation of its models for next-ASL capabilities both (1) after every 4x jump in effective compute, including if this occurs mid-training, and (2) every 3 months to monitor fine-tuning/tooling/etc improvements." Assuming effective compute scales less than 4x per 3 months, the 4x part will never matter, right? (And insofar as AI safety people fixate on the "4x" condition, they are incorrect to do so?) Or do you have different procedures for a 4x-eval vs a 3-month-eval, e.g. the latter uses the old model just with new finetuning/prompting/scaffolding/etc.?
  • Evaluation during deployment? I am concerned that improvements in fine-tuning and inference-time enhancements (prompting, scaffolding, etc.) after a model is deployed will lead to dangerous capabilities. Especially if models can be updated to increase their capabilities without evals.
    • Do you do the evals during deployment?
    • The RSP says "If it becomes apparent that the capabilities of a deployed model have been under-elicited and the model can, in fact, pass the evaluations, then we will" do stuff. How would that become apparent — via the regular evals or ad-hoc just-noticing?
    • If you do do evals during deployment: suppose you have two models such that each is better than the other at some tasks (perhaps because a powerful model is deployed and a new model is in progress with a new training setup). Every 3 months, would you do full evals on both models, or what?
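
To make the 4x-versus-3-months question above concrete, here is a minimal Python sketch of the two triggers. It rests on assumptions that are not in the RSP text: effective compute grows smoothly at a constant exponential rate, and every evaluation resets both the 3-month clock and the 4x baseline. The function name `eval_triggers` and its parameters are hypothetical.

```python
def eval_triggers(growth_per_quarter, total_months, step_months=0.5):
    """Return a list of (month, reason) evaluation events.

    Assumptions (not from the RSP): effective compute grows at a constant
    exponential rate, and each evaluation resets both the 3-month clock
    and the baseline used for the 4x comparison.
    """
    events = []
    last_eval_month = 0.0
    last_eval_compute = 1.0  # effective compute at the last evaluation
    steps = int(total_months / step_months)
    for i in range(1, steps + 1):
        month = i * step_months
        # Effective compute relative to the start of training.
        compute = growth_per_quarter ** (month / 3.0)
        if compute / last_eval_compute >= 4.0:
            reason = "4x compute"
        elif month - last_eval_month >= 3.0:
            reason = "3 months"
        else:
            continue
        events.append((month, reason))
        last_eval_month, last_eval_compute = month, compute
    return events


slow = eval_triggers(growth_per_quarter=2.0, total_months=24)
fast = eval_triggers(growth_per_quarter=16.0, total_months=24)
print({reason for _, reason in slow})
print({reason for _, reason in fast})
```

Under these assumptions, growth below 4x per quarter produces only "3 months" events, and the 4x condition fires only when growth exceeds 4x per quarter. If the 3-month evals did not reset the 4x baseline, the 4x condition could still fire on its own schedule, which is part of what the question is asking.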

Deployment commitments: does Anthropic consider itself bound by any commitments about deployment it has made in the past besides those in its RSP (and in particular not meaningfully advancing the frontier)? (Why hasn't it clarified this after the confusion around Anthropic's commitments after the Claude 3 launch?)

You shared Claude 2 with METR. Did you let METR or any other external parties do model evals for dangerous capabilities on Claude 3 before deployment? Do you plan for such things in the future? (To be clear, I don't just mean using external red-teamers, I mean using experts in eliciting capabilities and doing evals who have substantial independence.)

  • [new] You've suggested that sharing models before deployment is a hard engineering problem.[3] I get it if you're worried about leaks but I don't get how it could be a hard engineering problem — just share API access early, with fine-tuning. I think my technically inclined friends have the same attitude. If it's actually a hard engineering problem, it would be nice to explain why.

I think it would be really cool if y'all said more about lots of safety stuff, to help other labs do better. Like, safety evals, and red-teaming, and the automated systems mentioned here, and all the other safety stuff you do.

OpenAI

Preparedness Framework:

  • How does the PF interact with sharing models with others, especially Microsoft? My impression is that OpenAI is required to share dangerous models with Microsoft, and of course the PF doesn't bind Microsoft, so the PF is consistent with Microsoft deploying OpenAI's models unsafely.
  • Do you believe that a system just below the Critical threshold in each risk category is extremely unlikely to be sufficiently capable to cause a global catastrophe? (if yes, I'd like to argue with you about that. if no, don't the thresholds need to be lower?)
  • The beta PF says "we need dependable evidence that the model is sufficiently aligned that it does not initiate “critical”-risk-level tasks unless explicitly instructed to do so." I'm not sure what this really looks like. How could you get dependable evidence, particularly given the possibility that the model is scheming?
    • There are multiple ways to interpret the quoted sentence and it seems important to make sure everyone is on the same page.
  • The deployment commitments in the PF seem to just refer to external deployment. (This isn't super clear, but it says "Deployment in this case refers to the spectrum of ways of releasing a technology for external impact.") (This contrasts with Anthropic's RSP, in which "deployment" includes internal use.) Do you commit to safeguards around internal deployment (beyond the commitments about development)?
  • The PF could entail pausing deployment or development. Does OpenAI have a plan for the details of what it would do if it needed to pause for safety? (E.g.: what would staff work on during a pause? does OpenAI stay financially prepared for a pause?)
  • "We will be running these evaluations continually, i.e., as often as needed to catch any non-trivial capability change, including before, during, and after training."
    • How can you run evals before training?
    • Will you run evals during deployment?
      • If so, what if you have two models such that each is better than the other at some tasks (perhaps because a powerful model is deployed and a new model is in progress with a new training setup). Would you do full evals on both models, or what?
  • Can you commit that you'll publish changes to the PF before adopting them, and ideally seek feedback from stakeholders? Or at least that you'll publish changes when you adopt them?

Internal governance:

  • OpenAI recently said:
    Key enhancements [to OpenAI governance] include:
    * Adopting a new set of corporate governance guidelines;
    * Strengthening OpenAI’s Conflict of Interest Policy;
    * Creating a whistleblower hotline to serve as an anonymous reporting resource for all OpenAI employees and contractors; and
    * Creating additional Board committees, including a Mission & Strategy committee focused on implementation and advancement of the core mission of OpenAI.
    Can you share details? I'm particularly interested in details on the "corporate governance guidelines" and the whistleblower policy.
  • What does OpenAI owe to its investors? (any fiduciary duty? any other promises or obligations?) Do investors have any formal powers?
  • What powers does the board have, besides what's mentioned in the PF?

What are your plans for the profit cap for investors?

You shared GPT-4 with METR. Are you planning to let METR or any other external parties do model evals for dangerous capabilities before future deployments? (To be clear, I don't just mean using external red-teamers, I mean using experts in eliciting capabilities and doing evals who have substantial independence.)

DeepMind

DeepMind and Google have various councils and teams related to safety (see e.g. here): the Responsible AI Council, the Responsible Development and Innovation team, the Responsibility and Safety Council, the Advanced Technology Review Council, etc. From the outside, it's difficult to tell whether they actually improve safety. Can you point me to details on what they do, what exactly they're responsible for, what their powers are, etc.?

My impression is that when push comes to shove, Google can do whatever it wants with DeepMind; DeepMind and its leadership have no hard power. Is this correct? If it is mistaken, can you clarify the relationship between DeepMind and Google?

I hope DeepMind will soon make RSP-y commitments — like, dangerous capability evals before deployment + at least one risk threshold and how it would affect deployment decisions + a plan for making safety arguments after models have dangerous capabilities.

I wish DeepMind (or Google) would articulate a safety plan in the company's voice, clearly supported by leadership, rather than leaving DeepMind safety folks to do so in their personal voices.

I'm kind of confused about the extent to which DeepMind and Google have a mandate for preventing extreme risks and sharing the benefits of powerful AI. Would such a mandate be expressed in places besides DeepMind's About page, Google's AI Principles, and Google's Responsible AI Practices? Do any internal oversight bodies have explicit mandates along these lines?

What's the status of DeepMind's old Operating Principles?

Terms of service:

  • "You may not use the Services to develop models that compete with Gemini API or Google AI Studio. You also may not attempt to extract or replicate the underlying models (e.g., parameter weights)." Do you enforce this? How? Do you do anything to avoid helping others create powerful models (via model inversion or just imitation learning)?
  • "The Services include safety features to block harmful content, such as content that violates our Prohibited Use Policy. You may not attempt to bypass these protective measures or use content that violates the API Terms or these Additional Terms." Do you enforce this? How?
  • Do you enforce your Generative AI Prohibited Use Policy? How?

Microsoft

Microsoft says:

When it comes to frontier model deployment, Microsoft and OpenAI have together defined capability thresholds that act as a trigger to review models in advance of their first release or downstream deployment. The scope of a review, through our joint Microsoft-OpenAI Deployment Safety Board (DSB), includes model capability discovery. We established the joint DSB’s processes in 2021, anticipating a need for a comprehensive pre-release review process focused on AI safety and alignment, well ahead of regulation or external commitments mandating the same.

We have exercised this review process with respect to several frontier models, including GPT-4. Using Microsoft’s Responsible AI Standard and OpenAI’s experience building and deploying advanced AI systems, our teams prepare detailed artefacts for the joint DSB review. Artefacts record the process by which our organizations have mapped, measured, and managed risks, including through the use of adversarial testing and third-party evaluations as appropriate. We continue to learn from, and refine, the joint DSB process, and we expect it to evolve over time.

How do you measure capabilities? What are the capability thresholds? How does the review work? Can you share the artifacts? What are other details I'd want to know?

Meta

Is there a risk or capability threshold beyond which Meta AI would stop releasing model weights? What could lead Meta AI to stop releasing model weights?

Does Meta AI measure the risk of autonomous replication, per METR and the White House commitments? Does it plan to?

OpenAI-Microsoft relationship

  • What access does Microsoft have to OpenAI's models and IP?
  • What exactly is Microsoft owed or promised by OpenAI?
  • What is your joint "Deployment Safety Board"? How does it work?

[At least three different labs]

[new] Did you commit to share models with UKAISI before deployment? (I suspect Rishi Sunak or Politico hallucinated this.)

Do you let external parties do model evals for dangerous capabilities before you deploy models? (To be clear, I don't just mean using external red-teamers, I mean using experts in eliciting capabilities and doing evals who have substantial independence.)

Do you do anything to avoid helping others create powerful models (via model inversion or just imitation learning)? Do you enforce any related provisions in your terms of service?

Do you have a process for staff to follow if they have concerns about safety? (Where have you written about it, or can you share details?) What about concerns/suggestions about risk assessment policies or their implementation?

Do you ever keep research private for safety reasons? How do you decide what research to publish; how do safety considerations determine what you publish?

Do AI oversight bodies within the lab have a mandate for safety and preventing extreme risks? Please share details on how they work.

Security:

  • Can you commit to publicly disclose all breaches of your security?
  • Can you publish the reports related to your security certifications/audits/pentests (redacting sensitive details but not the overall evaluation)?
  • Do you limit uploads from clusters with model weights? How?
  • Do you use multiparty access controls? For what? How many people have access to model weights? Would your controls actually stop a compromised staff member from accessing the weights?
  • Do you secure developers' machines? How?
  • How hard would it be for China to exfiltrate model weights that you were trying to protect?

Do you require nontrivial KYC for some types of model access? If so, please explain or point me to details.

What do you do to improve adversarial robustness, e.g. to prevent jailbreaking? (During training and especially at inference-time.)

Can you promise that you don't use non-disparagement agreements (nor otherwise discourage current or past staff or board members from talking candidly about their impressions of and experiences with the company)?

I'm currently uncertain about how system prompts can prevent misuse. Maybe they can't and I won't write about this. But in case they can: How do you think about system prompts and preventing misuse?


(Note: if a public source is kinda related to a question but doesn't directly answer it, probably I'm already aware of it and mention it on AI Lab Watch.)

  1. ^

    I'm thinking of internal governance stuff and risk assessment / RSP-y stuff.

  2. ^

    At least among my friends, Anthropic gets way more credit for LTBT than it deserves based on public information. For all we know publicly, Google and Amazon can abrogate the LTBT at will! Failing to share the details on the LTBT but claiming credit for the LTBT being great is not cool.

  3. ^

    Anthropic:

    Engaging external experts has the advantage of leveraging specialized domain expertise and increasing the likelihood of an unbiased audit. Initially, we expected this collaboration [with METR] to be straightforward, but it ended up requiring significant science and engineering support on our end. Providing full-time assistance diverted resources from internal evaluation efforts.

    Anthropic cofounder Jack Clark:

    Pre-deployment testing is a nice idea but very difficult to implement.

10 comments

Akash:

@Zach Stein-Perlman, great work on this. I would be interested in you brainstorming some questions that have to do with the labs' stances toward (government) AI policy interventions.

After a quick 5 min brainstorm, here are some examples of things that seem relevant:

  • I remember hearing that OpenAI lobbied against the EU AI Act– what's up with that?
  • I heard a rumor that Congresspeople and their teams reached out to Sam/OpenAI after his testimony. They allegedly asked for OpenAI's help to craft legislation around licensing, and then OpenAI refused. Is that true? 
  • Sam said we might need an IAEA for AI at some point– what did he mean by this? At what point would he see that as valuable?
  • In general, what do labs think the US government should be doing? What proposals would they actively support or even help bring about? (Flagging ofc that there are concerns about actual and perceived regulatory capture, but there are also major advantages to having industry players support & contribute to meaningful regulation).
  • Senator Cory Booker recently asked Jack Clark something along the lines of "what is your top policy priority right now//what would you do if you were a Senator." Jack responded with something along the lines of "I would make sure the government can deploy AI successfully. We need a testing regime to better understand risks, but the main risk is that we don't use AI enough, and we need to make sure we stay at the cutting edge." What's up with that?
  • Why haven't Dario and Jack made public statements about specific government interventions? Do they believe that there are some circumstances under which a moratorium would need to be implemented, labs would need to be nationalized (or internationalized), or something else would need to occur to curb race dynamics? (This could be asked to any of the lab CEOs/policy team leads– I don't mean to be picking on Anthropic, though I think Sam/OpenAI have had more public statements here, and I think the other labs are scoring more poorly across the board//don't fully buy into the risks in the first place.)
  • Big tech is spending a lot of money on AI lobbying. How much is each lab spending (this is something you can estimate with publicly available data), and what are they actually lobbying for/against?

I imagine there's a lot more in this general category of "labs and how they are interacting with governments and how they are contributing to broader AI policy efforts", and I'd be excited to see AI Lab Watch (or just you) dive into this more.

Thanks. Briefly:

I'm not sure what the theory of change for listing such questions is.

In the context of policy advocacy, I think it's sometimes fine/good for labs to say somewhat different things publicly vs privately. Like, if I was in charge of a lab and believed (1) the EU AI Act will almost certainly pass and (2) it has some major bugs that make my life harder without safety benefits, I might publicly say "I support (the goals of) the EU AI Act" and privately put some effort into removing those bugs, which is technically lobbying to weaken the Act.

(^I'm not claiming that particular labs did ~this rather than actually lobby against the Act. I just think it's messy and regulation isn't a one-dimensional thing that you're for or against.)

Edit: this comment was misleading and partially replied to a strawman. I agree it would be good for the labs and their leaders to publicly say some things about recommended regulation (beyond what they already do) and their lobbying. I'm nervous about trying to litigate rumors for reasons I haven't explained.

Edit 2: based on https://corporateeurope.org/en/2023/11/byte-byte, https://time.com/6288245/openai-eu-lobbying-ai-act/, and background information, I believe that OpenAI, Microsoft, Google, and Meta privately lobbied to make the EU AI Act worse—especially by lobbying against rules for foundation models—and that this is inconsistent with OpenAI's and Altman's public statements.

Right now, I think one of the most credible ways for a lab to show its commitment to safety is through its engagement with governments.

I didn’t mean to imply that a lab should automatically be considered “bad” if its public advocacy and its private advocacy differ.

However, when assessing how “responsible” various actors are, I think investigating questions relating to their public comms, engagement with government, policy proposals, lobbying efforts, etc would be valuable.

If Lab A had slightly better internal governance but lab B had better effects on “government governance”, I would say that lab B is more “responsible” on net.

Yay @Zac Hatfield-Dodds of Anthropic for feedback and corrections including clarifying a couple of Anthropic's policies. Two pieces of not-previously-public information:

  • I was disappointed that Anthropic's Responsible Scaling Policy only mentions evaluation "During model training and fine-tuning." Zac told me "this was a simple drafting error - our every-three months evaluation commitment is intended to continue during deployment. This has been clarified for the next version, and we've been acting accordingly all along." Yay.
  • I said labs should have a "process for staff to escalate concerns about safety" and "have a process for staff and external stakeholders to share concerns about risk assessment policies or their implementation with the board and some other staff, including anonymously." I noted that Anthropic's RSP includes a commitment to "Implement a non-compliance reporting policy." Zac told me "Beyond standard internal communications channels, our recently formalized non-compliance reporting policy meets these criteria [including independence], and will be described in the forthcoming RSP v1.1." Yay.

I think it's cool that Zac replied (but most of my questions for Anthropic remain).

I have not yet received substantive corrections/clarifications from any other labs.

(I have made some updates to ailabwatch.org based on Zac's feedback—and revised Anthropic's score from 45 to 48—but have not resolved all of it.)

I get it if you're worried about leaks but I don't get how it could be a hard engineering problem — just share API access early, with fine-tuning

Fine-tuning access can be extremely expensive if implemented naively and it's plausible that cheap (LoRA) fine-tuning isn't even implemented for new models internally for a while at AI labs. If you make the third party groups pay for it then I suppose this isn't a problem, but the costs could be vast.

I agree that inference access is cheap.

What do you expect to be expensive? The engineer hours to build the fine-tuning infra? Or the actual compute for fine-tuning?

Given the amount of internal fine-tuning experiments going on for safety stuff, I'd be surprised if the infra was a bottleneck, though maybe there is a large overhead in making these fine-tuned models available through an API.

I'd be even more surprised if the cost of compute was significant compared to the rest of the activity the lab is doing (I think fine-tuning on a few thousand sequences is often enough for capability evaluations; you rarely need massive training runs).

Compute for doing inference on the weights if you don't have LoRA finetuning set up properly.

My implicit claim is that there maybe isn't that much fine-tuning stuff internally.

Isn't that only ~10x more expensive than running the forward-passes (even if you don't do LoRA)? Or is it much more because of communications bottlenecks + the infra being taken by the next pretraining run (without the possibility to swap the model in and out)?

What does Anthropic owe to its investors/stockholders? (any fiduciary duty? any other promises or obligations?) I think balancing their interests with pursuit of the mission; anything more concrete?

This sounds like one of those questions that's a terrible idea to answer in writing without extensive consultation with your legal department. How long was the time period between when you asked this question and when you made this post?

This post is not trying to shame labs for failing to answer before; I didn't try hard to get them to answer. (The period was one week but I wasn't expecting answers to my email / wouldn't expect to receive a reply even if I waited longer.)

(Separately, I kinda hope the answers to basic questions like this are already written down somewhere...)