@Zach Stein-Perlman, great work on this. I would be interested in you brainstorming some questions that have to do with the labs' stances toward (government) AI policy interventions.
After a quick 5 min brainstorm, here are some examples of things that seem relevant:
I imagine there's a lot more in this general category of "labs and how they are interacting with governments and how they are contributing to broader AI policy efforts", and I'd be excited to see AI Lab Watch (or just you) dive into this more.
Thanks. Briefly:
I'm not sure what the theory of change for listing such questions is.
In the context of policy advocacy, I think it's sometimes fine/good for labs to say somewhat different things publicly vs privately. Like, if I was in charge of a lab and believed (1) the EU AI Act will almost certainly pass and (2) it has some major bugs that make my life harder without safety benefits, I might publicly say "I support (the goals of) the EU AI Act" and privately put some effort into removing those bugs, which is technically lobbying to weaken the Act.
(^I'm not claiming that particular labs did ~this rather than actually lobby against the Act. I just think it's messy and regulation isn't a one-dimensional thing that you're for or against.)
Edit: this comment was misleading and partially replied to a strawman. I agree it would be good for the labs and their leaders to publicly say some things about recommended regulation (beyond what they already do) and their lobbying. I'm nervous about trying to litigate rumors for reasons I haven't explained.
Edit 2: based on https://corporateeurope.org/en/2023/11/byte-byte, https://time.com/6288245/openai-eu-lobbying-ai-act/, and background information, I believe that OpenAI, Microsoft, Google, and Meta privately lobbied to make the EU AI Act worse—especially by lobbying against rules for foundation models—and that this is inconsistent with OpenAI's and Altman's public statements.
Right now, I think one of the most credible ways for a lab to show its commitment to safety is through its engagement with governments.
I didn’t mean to imply that a lab should automatically be considered “bad” if its public advocacy and its private advocacy differ.
However, when assessing how “responsible” various actors are, I think investigating questions relating to their public comms, engagement with government, policy proposals, lobbying efforts, etc would be valuable.
If lab A had slightly better internal governance but lab B had better effects on “government governance”, I would say that lab B is more “responsible” on net.
Yay @Zac Hatfield-Dodds of Anthropic for feedback and corrections including clarifying a couple of Anthropic's policies. Two pieces of not-previously-public information:
I think it's cool that Zac replied (but most of my questions for Anthropic remain).
I have not yet received substantive corrections/clarifications from any other labs.
(I have made some updates to ailabwatch.org based on Zac's feedback—and revised Anthropic's score from 45 to 48—but have not resolved all of it.)
I get it if you're worried about leaks but I don't get how it could be a hard engineering problem — just share API access early, with fine-tuning
Fine-tuning access can be extremely expensive if implemented naively and it's plausible that cheap (LoRA) fine-tuning isn't even implemented for new models internally for a while at AI labs. If you make the third party groups pay for it then I suppose this isn't a problem, but the costs could be vast.
I agree that inference access is cheap.
What do you expect to be expensive? The engineer hours to build the fine-tuning infra? Or the actual compute for fine-tuning?
Given the amount of internal fine-tuning experiments going on for safety stuff, I'd be surprised if the infra was a bottleneck, though maybe there is a large overhead in making these fine-tuned models available through an API.
I'd be even more surprised if the cost of compute was significant compared to the rest of the activity the lab is doing (I think fine-tuning on a few thousand sequences is often enough for capability evaluations; you rarely need massive training runs).
Compute for doing inference on the weights if you don't have LoRA finetuning set up properly.
My implicit claim is that there maybe isn't that much fine-tuning stuff internally.
Maybe setting up custom fine-tuning is hard and labs often only set it up during deployment...
(Separately, it would be nice if OpenAI and Anthropic let some safety researchers do fine-tuning now.)
Isn't that only ~10x more expensive than running the forward-passes (even if you don't do LoRA)? Or is it much more because of communications bottlenecks + the infra being taken by the next pretraining run (without the possibility to swap the model in and out)?
What does Anthropic owe to its investors/stockholders? (any fiduciary duty? any other promises or obligations?) I think balancing their interests with pursuit of the mission; anything more concrete?
This sounds like one of those questions that's a terrible idea to answer in writing without extensive consultation with your legal department. How long was the time period between when you asked this question and when you made this post?
This post is not trying to shame labs for failing to answer before; I didn't try hard to get them to answer. (The period was one week but I wasn't expecting answers to my email / wouldn't expect to receive a reply even if I waited longer.)
(Separately, I kinda hope the answers to basic questions like this are already written down somewhere...)
Associated with AI Lab Watch, I sent questions to some labs a week ago (except I failed to reach Microsoft). I didn't really get any replies (one person replied in their personal capacity; this was very limited and they didn't answer any questions). Here are most of those questions, with slight edits since I shared them with the labs + questions I asked multiple labs condensed into the last two sections.
Lots of my questions are normal "I didn't find public info on this safety practice and I think you should explain" questions. Some are more like "it's pretty uncool that I can't find the answer to this". For example, breaking commitments, breaking not-quite-commitments and not explaining, having ambiguity around commitments, and taking credit for stuff[1] when it's very unclear that you should get credit are pretty uncool.
Anthropic
Internal governance stuff (I'm personally particularly interested in these questions—I think Anthropic has tried to set up great internal governance systems and maybe it has succeeded but it needs to share more information for that to be clear from the outside):
Details of when the RSP triggers evals:
Deployment commitments: does Anthropic consider itself bound by any commitments about deployment it has made in the past besides those in its RSP (and in particular not meaningfully advancing the frontier)? (Why hasn't it clarified this after the confusion around Anthropic's commitments after the Claude 3 launch?)
You shared Claude 2 with METR. Did you let METR or any other external parties do model evals for dangerous capabilities on Claude 3 before deployment? Do you plan for such things in the future? (To be clear, I don't just mean using external red-teamers, I mean using experts in eliciting capabilities and doing evals who have substantial independence.)
I think it would be really cool if y'all said more about lots of safety stuff, to help other labs do better. Like, safety evals, and red-teaming, and the automated systems mentioned here, and all the other safety stuff you do.
OpenAI
Preparedness Framework:
Internal governance:
Key enhancements [to OpenAI governance] include:
* Adopting a new set of corporate governance guidelines;
* Strengthening OpenAI’s Conflict of Interest Policy;
* Creating a whistleblower hotline to serve as an anonymous reporting resource for all OpenAI employees and contractors; and
* Creating additional Board committees, including a Mission & Strategy committee focused on implementation and advancement of the core mission of OpenAI.
Can you share details? I'm particularly interested in details on the "corporate governance guidelines" and the whistleblower policy.
What are your plans for the profit cap for investors?
You shared GPT-4 with METR. Are you planning to let METR or any other external parties do model evals for dangerous capabilities before future deployments? (To be clear, I don't just mean using external red-teamers, I mean using experts in eliciting capabilities and doing evals who have substantial independence.)
DeepMind
DeepMind and Google have various councils and teams related to safety (see e.g. here): the Responsible AI Council, the Responsible Development and Innovation team, the Responsibility and Safety Council, the Advanced Technology Review Council, etc. From the outside, it's difficult to tell whether they actually improve safety. Can you point me to details on what they do, what exactly they're responsible for, what their powers are, etc.?
My impression is that when push comes to shove, Google can do whatever it wants with DeepMind; DeepMind and its leadership have no hard power. Is this correct? If it is mistaken, can you clarify the relationship between DeepMind and Google?
I hope DeepMind will soon make RSP-y commitments — like, dangerous capability evals before deployment + at least one risk threshold and how it would affect deployment decisions + a plan for making safety arguments after models have dangerous capabilities.
I wish DeepMind (or Google) would articulate a safety plan in the company's voice, clearly supported by leadership, rather than leaving DeepMind safety folks to do so in their personal voices.
I'm kind of confused about the extent to which DeepMind and Google have a mandate for preventing extreme risks and sharing the benefits of powerful AI. Would such a mandate be expressed in places besides DeepMind's About page, Google's AI Principles, and Google's Responsible AI Practices? Do any internal oversight bodies have explicit mandates along these lines?
What's the status of DeepMind's old Operating Principles?
Terms of service:
Microsoft
Microsoft says:
How do you measure capabilities? What are the capability thresholds? How does the review work? Can you share the artifacts? What are other details I'd want to know?
Meta
Is there a risk or capability threshold beyond which Meta AI would stop releasing model weights? What could lead Meta AI to stop releasing model weights?
Does Meta AI measure the risk of autonomous replication, per METR and the White House commitments? Does it plan to?
OpenAI-Microsoft relationship
[At least three different labs]
[new] Did you commit to share models with UK AISI before deployment? (I suspect Rishi Sunak or Politico hallucinated this.)
Do you let external parties do model evals for dangerous capabilities before you deploy models? (To be clear, I don't just mean using external red-teamers, I mean using experts in eliciting capabilities and doing evals who have substantial independence.)
Do you do anything to avoid helping others create powerful models (via model inversion or just imitation learning)? Do you enforce any related provisions in your terms of service?
Do you have a process for staff to follow if they have concerns about safety? (Where have you written about it, or can you share details?) What about concerns/suggestions about risk assessment policies or their implementation?
Do you ever keep research private for safety reasons? How do you decide what research to publish; how do safety considerations determine what you publish?
Do AI oversight bodies within the lab have a mandate for safety and preventing extreme risks? Please share details on how they work.
Security:
Do you require nontrivial KYC for some types of model access? If so, please explain or point me to details.
What do you do to improve adversarial robustness, e.g. to prevent jailbreaking? (During training and especially at inference-time.)
Can you promise that you don't use non-disparagement agreements (nor otherwise discourage current or past staff or board members from talking candidly about their impressions of and experiences with the company)?
I'm currently uncertain about how system prompts can prevent misuse. Maybe they can't and I won't write about this. But in case they can: How do you think about system prompts and preventing misuse?
(Note: if a public source is kinda related to a question but doesn't directly answer it, probably I'm already aware of it and mention it on AI Lab Watch.)
I'm thinking of internal governance stuff and risk assessment / RSP-y stuff.
At least among my friends, Anthropic gets way more credit for LTBT than it deserves based on public information. For all we know publicly, Google and Amazon can abrogate the LTBT at will! Failing to share the details on the LTBT but claiming credit for the LTBT being great is not cool.
Anthropic:
Anthropic cofounder Jack Clark: