Below is the text copied from the linked post (since it's short). I'm not affiliated with Anthropic but wanted to link-post this here for discussion.


This week, Anthropic submitted a response to the National Telecommunications and Information Administration’s (NTIA) Request for Comment on AI Accountability. Today, we want to share our recommendations as they capture some of Anthropic’s core AI policy proposals.

There is currently no robust and comprehensive process for evaluating today’s advanced artificial intelligence (AI) systems, let alone the more capable systems of the future. Our submission presents our perspective on the processes and infrastructure needed to ensure AI accountability. Our recommendations consider the NTIA’s potential role as a coordinating body that sets standards in collaboration with other government agencies like the National Institute of Standards and Technology (NIST).

In our recommendations, we focus on accountability mechanisms suitable for highly capable and general-purpose AI models. Specifically, we recommend:

  • Fund research to build better evaluations
    • Increase funding for AI model evaluation research. Developing rigorous, standardized evaluations is difficult and time-consuming work that requires significant resources. Increased funding, especially from government agencies, could help drive progress in this critical area.
    • Require companies in the near-term to disclose evaluation methods and results. Companies deploying AI systems should be mandated to satisfy some disclosure requirements with regard to their evaluations, though these requirements need not be made public if doing so would compromise intellectual property (IP) or confidential information. This transparency could help researchers and policymakers better understand where existing evaluations may be lacking.
    • Develop, in the long term, a set of industry evaluation standards and best practices. Government agencies like NIST could work to establish standards and benchmarks for evaluating AI models’ capabilities, limitations, and risks that companies would be required to meet.
  • Create risk-responsive assessments based on model capabilities
    • Develop standard capabilities evaluations for AI systems. Governments should fund and participate in the development of rigorous capability and safety evaluations targeted at critical risks from advanced AI, such as deception and autonomy. These evaluations can provide an evidence-based foundation for proportionate, risk-responsive regulation.
    • Develop a risk threshold through more research and funding into safety evaluations. Once a risk threshold has been established, we can mandate evaluations for all models against this threshold.
      • If a model falls below this risk threshold, existing safety standards are likely sufficient. Verify compliance and deploy.
      • If a model exceeds the risk threshold and safety assessments and mitigations are insufficient, halt deployment, significantly strengthen oversight, and notify regulators. Determine appropriate safeguards before allowing deployment.
  • Establish pre-registration for large AI training runs
    • Establish a process for AI developers to report large training runs, ensuring that regulators are aware of potential risks. This involves determining the appropriate recipient, the required information, and appropriate cybersecurity, confidentiality, IP, and privacy safeguards.
    • Establish a confidential registry for AI developers conducting large training runs to pre-register model details with their home country’s national government (e.g., model specifications, model type, compute infrastructure, intended training completion date, and safety plans) before training commences. Aggregated registry data should be protected to the highest available standards and specifications.
  • Empower third party auditors that are…
    • Technically literate – at least some auditors will need deep machine learning experience;
    • Security-conscious – well-positioned to protect valuable IP, which could pose a national security threat if stolen; and
    • Flexible – able to conduct robust but lightweight assessments that catch threats without undermining US competitiveness.
  • Mandate external red teaming before model release
    • Mandate external red teaming for AI systems, either through a centralized third party (e.g., NIST) or in a decentralized manner (e.g., via researcher API access) to standardize adversarial testing of AI systems. This should be a precondition for developers who are releasing advanced AI systems.
    • Establish high-quality external red teaming options before they become a precondition for model release. This is critical as red teaming talent currently resides almost exclusively within private AI labs.
  • Advance interpretability research
    • Increase funding for interpretability research. Provide government grants and incentives for interpretability work at universities, nonprofits, and companies. This would allow meaningful work to be done on smaller models, enabling progress outside frontier labs.
    • Recognize that regulations demanding interpretable models would currently be infeasible to meet, but may be possible in the future pending research advances.
  • Enable industry collaboration on AI safety via clarity around antitrust
    • Regulators should issue guidance on permissible AI industry safety coordination given current antitrust laws. Clarifying how private companies can work together in the public interest without violating antitrust laws would mitigate legal uncertainty and advance shared goals.
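The risk-threshold recommendation above describes a simple gating procedure: models below an established risk threshold proceed under existing safety standards, while models above it are held back until mitigations are judged sufficient. The sketch below is a hypothetical illustration of that decision logic only; the threshold value, the idea of an aggregate risk score, and the middle "deploy with strengthened oversight" branch are illustrative assumptions, not anything specified in the submission.

```python
# Hypothetical sketch of the risk-responsive gating described in the post.
# RISK_THRESHOLD and the aggregate risk_score are illustrative placeholders;
# a real threshold would come from the safety-evaluation research the post
# calls for, not from a constant in code.

from dataclasses import dataclass


@dataclass
class EvaluationResult:
    model_name: str
    risk_score: float            # aggregate score from capability/safety evals
    mitigations_sufficient: bool # assessed by evaluators/regulators


RISK_THRESHOLD = 0.5  # placeholder value


def deployment_decision(result: EvaluationResult) -> str:
    """Return the action a regulator-facing pipeline might take."""
    if result.risk_score < RISK_THRESHOLD:
        # Below threshold: existing safety standards are likely sufficient.
        return "verify-compliance-and-deploy"
    if result.mitigations_sufficient:
        # Above threshold, but safeguards hold: deploy under stronger
        # oversight (an inferred middle case, not spelled out in the post).
        return "deploy-with-strengthened-oversight"
    # Above threshold and mitigations insufficient: halt deployment,
    # strengthen oversight, and notify regulators.
    return "halt-and-notify-regulators"
```

The point of the sketch is that the policy is risk-responsive: the burden on a developer scales with the evaluated capability of the model rather than applying uniformly.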
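The pre-registration recommendation above lists the details a developer would file before a large training run (model specifications, model type, compute infrastructure, intended training completion date, and safety plans). A minimal sketch of such a record, assuming a simple JSON submission format: the field names, the `TrainingRunRegistration` type, and the JSON encoding are all hypothetical, not a proposed standard.

```python
# Hypothetical pre-registration record for a large training run, using the
# fields named in the post. The schema and serialization format are
# illustrative assumptions only.

import json
from dataclasses import asdict, dataclass


@dataclass
class TrainingRunRegistration:
    developer: str
    model_specifications: str      # e.g. parameter count, architecture summary
    model_type: str                # e.g. "large language model"
    compute_infrastructure: str    # e.g. cluster / accelerator description
    intended_completion_date: str  # ISO 8601 date
    safety_plan_summary: str


def to_registry_payload(reg: TrainingRunRegistration) -> str:
    """Serialize a registration for submission to a confidential registry."""
    return json.dumps(asdict(reg), indent=2)
```

In practice, the hard part the post highlights is not the record itself but the surrounding safeguards: who receives it, and how the aggregated registry data is protected.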

We believe this set of recommendations will bring us meaningfully closer to establishing an effective framework for AI accountability. Doing so will require collaboration between researchers, AI labs, regulators, auditors, and other stakeholders. Anthropic is committed to supporting efforts to enable the safe development and deployment of AI systems. Evaluations, red teaming, standards, interpretability and other safety research, auditing, and strong cybersecurity practices are all promising avenues for mitigating the risks of AI while realizing its benefits.

We believe that AI could have transformative effects in our lifetime and we want to ensure that these effects are positive. The creation of robust AI accountability and auditing mechanisms will be vital to realizing this goal. We are grateful for the chance to respond to this Request For Comment.

You can read our submission in full here.

[-]Gabe M

1. These seem like quite reasonable things to push for; I'm overall glad Anthropic is furthering this "AI Accountability" angle.

2. A lot of the interventions they recommend here don't exist/aren't possible yet.

3. But the keyword is yet: If you have short timelines and think technical researchers may need to prioritize work with positive AI governance externalities, there are many high-level research directions to consider focusing on here.

> Empower third party auditors that are… Flexible – able to conduct robust but lightweight assessments that catch threats without undermining US competitiveness.

4. This competitiveness bit seems like clearly tacked-on US government appeasement; it's maybe a bad precedent to be putting loopholes into auditing based on national AI competitiveness, particularly if an international AI arms race accelerates.

> Increase funding for interpretability research. Provide government grants and incentives for interpretability work at universities, nonprofits, and companies. This would allow meaningful work to be done on smaller models, enabling progress outside frontier labs.

5. Similarly, I'm not entirely certain that massive funding for interpretability work is the best idea. Anthropic is probably somewhat biased here, as an organization that really likes interpretability, but it seems possible that interpretability work could be hazardous (mostly by leading to insights that accelerate algorithmic progress and shorten timelines), especially if it's published openly (which I imagine academia especially, but also some of those other places, would like to do).

[-]Raemon

I held off on writing a comment here because I felt like I should thoroughly read the linked thing before having an opinion, but then it turned out that was a lot of work so I didn't.

I'm hoping to read more details this week as part of LessWrong Review time, but not sure if I'll get to it.