A comment provided to me by a reader, highlighting 3rd party liability and insurance as interventions too (lightly edited):
Hi! I liked your AI regulator’s toolbox post – very useful to have a comprehensive list like this! I'm not sure exactly what heading it should go under, but I suggest considering adding proposals to greatly increase 3rd party liability (and/or require carrying insurance). A nice intro is here:
https://www.lawfaremedia.org/article/tort-law-and-frontier-ai-governance
Some are explicitly proposing strict liability for catastrophic risks. Gabe Weil has a proposal, summarized here: https://www.lesswrong.com/posts/5e7TrmH7mBwqpZ6ek/tort-law-can-play-an-important-role-in-mitigating-ai-risk
There are also workshop papers on insurance here:
https://www.genlaw.org/2024-icml-papers#liability-and-insurance-for-catastrophic-losses-the-nuclear-power-precedent-and-lessons-for-ai
https://www.genlaw.org/2024-icml-papers#insuring-uninsurable-risks-from-ai-government-as-insurer-of-last-resort
NB: when implemented correctly (i.e. when premiums are accurately risk-priced), insurance premiums are mechanically similar to Pigouvian taxes, internalizing negative externalities. So maybe this goes under the "Other taxes" heading? But that also seems odd. Like taxes, these are certainly incentive alignment strategies (rather than command and control) – maybe that's a better heading? Just spitballing :)
This article explains concrete AI governance practices people are exploring as of August 2024.
Prior summaries have mapped out high-level areas of work, but rarely dive into concrete practice details. For example, they might describe roles in policy development, advocacy and implementation - but don’t specify what practices these policies might try to implement.
This summary explores specific practices addressing risks from advanced AI systems. Practices are grouped into categories based on where in the AI lifecycle they best ‘fit’ - although many practices are relevant at multiple stages. Within each group, practices are simply sorted alphabetically.
Each practice's explanation should be readable independently. You can navigate and get a direct link to a section by clicking the section in the left sidebar.
The primary goal of this article is to help newcomers contribute to the field.[1] Readers are encouraged to leave comments where they are confused. A secondary goal is to help align the use of terms within the field, to help even experienced researchers understand each other.
So, without further ado let’s jump in!
Pre-training
Compute governance
TLDR: Regulate companies in the highly concentrated AI chip supply chain, given AI chips are key inputs to developing frontier AI models.
Training AI systems requires applying algorithms to lots of data, using lots of computing power (also known as ‘compute’ - effectively the number of operations like addition or multiplication you can do). The algorithms are generally known,[2] and most of the data used is publicly available on the internet.
Computing power on the other hand is generally limited to far fewer actors: to be at the cutting edge you generally need thousands of AI chips which are incredibly costly (e.g. one chip costs $30-40k, and companies like Meta intend to buy hundreds of thousands of them). In addition to being very costly, there are very tight bottlenecks in the supply chain: NVIDIA designs 80-95% of AI chips, TSMC manufactures 90% of the best chips, and ASML makes 100% of EUV machines (used to create the best AI chips).[3]
Governments could therefore intervene at different parts of the compute supply chain to control the use of compute for training AI models. For example, a government could:
Compute governance is generally only proposed for large concentrations of the most cutting-edge chips. These chips are estimated to make up less than 1% of high-end chips.
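To make ‘large training run’ thresholds concrete, a common rule of thumb is that training compute is roughly 6 × parameters × training tokens (in floating-point operations). A minimal sketch of how a threshold check might look, using illustrative numbers rather than any real model’s figures:

```python
# Rough training-compute estimate using the common ~6 * parameters * tokens
# rule of thumb, compared against a regulatory notification threshold.
# The parameter and token counts below are illustrative, not any real model's.

def estimate_training_flop(parameters: float, training_tokens: float) -> float:
    """Approximate total training compute in floating-point operations."""
    return 6 * parameters * training_tokens

THRESHOLD_FLOP = 1e26  # e.g. the reporting threshold in the 2023 US executive order

model_flop = estimate_training_flop(parameters=70e9, training_tokens=2e12)
print(f"Estimated training compute: {model_flop:.2e} FLOP")
print("Above notification threshold" if model_flop >= THRESHOLD_FLOP else "Below notification threshold")
```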
Introductory resource: Computing Power and the Governance of AI
Data input controls
TLDR: Filter data used to train AI models, e.g. don’t train your model with instructions to launch cyberattacks.
Large language models are trained with vast amounts of data: usually this data is scraped from the public internet. By the nature of what’s on the internet, this includes information that you might not want to train your AI system on: things like misinformation, information relevant to building dangerous weapons, or information about vulnerabilities in key institutions.
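As a toy illustration of the kind of filtering involved, the sketch below drops documents matching a blocklist; real pipelines typically combine keyword lists, trained classifiers and human review, and the patterns here are invented:

```python
import re

# Toy pre-training data filter: drop documents matching simple blocklist
# patterns. Real filtering pipelines are far more sophisticated (trained
# classifiers, human review); the patterns here are illustrative only.
BLOCKLIST_PATTERNS = [
    re.compile(r"synthesi[sz]e .* (nerve agent|pathogen)", re.IGNORECASE),
    re.compile(r"step-by-step .* exploit", re.IGNORECASE),
]

def keep_document(text: str) -> bool:
    """Return True if the document passes the filter."""
    return not any(pattern.search(text) for pattern in BLOCKLIST_PATTERNS)

documents = [
    "A history of vaccination programmes in the 20th century.",
    "Step-by-step guide to exploit a known server vulnerability.",
]
filtered = [doc for doc in documents if keep_document(doc)]
print(filtered)  # only the first document survives
```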
Filtering out this information could help prevent AI systems from doing bad things. Work here might involve:
Introductory resources: Emerging processes for frontier AI safety: Data input controls and audits
Licensing
TLDR: Require organisations or specific training runs to be licensed by a regulatory body, similar to licensing regimes in other high-risk industries.
Licensing is a regulatory framework where organisations must obtain official permission before training or deploying AI models above certain capability thresholds. This could involve licensing the organisation itself (similar to banking) or approving specific large training runs (similar to clinical trials).
This would likely only be targeted at the largest or most capable models, with some form of threshold that would need to be updated over time. Softer versions of this might include the US government’s requirement that developers notify it when training models over a certain size.
Licensing would need to be carried out carefully. It comes with a significant risk of concentrating market power in the hands of a few actors, as well as limiting the diffusion of beneficial AI use cases (which themselves might be helpful for reducing AI harms, for example via societal adaptation). It might also make regulators overconfident, and enable firms to use their licensed status to promote harmful unregulated products to consumers (as has been seen in financial services).
Introductory resources: Pitfalls and Plausibility of Approval Regulation for Frontier Artificial Intelligence
On-chip governance mechanisms
TLDR: Make alterations to AI hardware (primarily AI chips), that enable verifying or controlling the usage of this hardware.
On-chip mechanisms (also known as hardware-enabled mechanisms) are technical features implemented directly on AI chips or related hardware that enable AI governance measures. These mechanisms would be designed to be tamper-resistant and use secure elements, making them harder to bypass than software-only solutions. On-chip mechanisms are generally only proposed for the most advanced chips, usually those subject to export controls. Proposals often also suggest combining this with privacy-preserving technologies, so that a regulator cannot access user code or data.
Examples of on-chip mechanisms being researched are:
Introductory resources: Secure, Governable Chips, Hardware-Enabled Governance Mechanisms
Safety cases
TLDR: Develop structured arguments demonstrating that an AI system is unlikely to cause catastrophic harm, to inform decisions about training and deployment.
Safety cases are structured arguments that an AI system is sufficiently safe to train or deploy. This approach has been used in high-risk industries like nuclear power and aviation.
Safety cases start by defining the system, identifying unacceptable outcomes and justifying assumptions about deployment. They then break down the system into subsystems, and assess the risk from these subsystems and their interactions.
Third-party risk cases from a red team or auditor could argue against a safety case, and a regulator could use both arguments to come to a decision.
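A minimal sketch of how a safety case’s claim/evidence structure might be represented in code; the claims and evidence below are invented placeholders, not a real safety case:

```python
from dataclasses import dataclass, field

# Toy representation of a safety case as a tree of claims, each supported by
# evidence and/or sub-claims. The content is an invented placeholder.

@dataclass
class Claim:
    statement: str
    evidence: list[str] = field(default_factory=list)
    sub_claims: list["Claim"] = field(default_factory=list)

safety_case = Claim(
    statement="The system is unlikely to cause catastrophic harm in deployment",
    sub_claims=[
        Claim(
            statement="The model lacks dangerous bioweapon-uplift capabilities",
            evidence=["Dangerous-capability evaluation results", "Third-party red-team report"],
        ),
        Claim(
            statement="Misuse is detected and blocked at deployment time",
            evidence=["Input/output filtration test results", "Abuse monitoring coverage analysis"],
        ),
    ],
)

def print_case(claim: Claim, indent: int = 0) -> None:
    print("  " * indent + "- " + claim.statement)
    for item in claim.evidence:
        print("  " * (indent + 1) + "* evidence: " + item)
    for sub in claim.sub_claims:
        print_case(sub, indent + 1)

print_case(safety_case)
```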
Introductory resources: Safety Cases: How to Justify the Safety of Advanced AI Systems
Post-training
Evaluations (aka “evals”)
TLDR: Give AI systems standardised tests to assess their capabilities, which can inform the risks they might pose.
Evaluations involve giving AI systems standardised tests that help us evaluate their capabilities (these tests are known as ‘benchmarks’).[4] In the context of extreme risks, we might focus on:
For example, the WMDP benchmark is a set of multiple-choice questions that can be used to carry out a dangerous capability evaluation. The questions cover knowledge of biosecurity, cybersecurity and chemical security. An example biosecurity question:
These tests are usually automated so that they can be run cheaply. However, some tests also include human-assisted components - for example to judge model outputs.
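A minimal sketch of an automated multiple-choice evaluation harness. The query_model function is a hypothetical stand-in for however you call the model under test, and the questions are placeholders rather than real benchmark items:

```python
# Minimal automated multiple-choice evaluation loop. `query_model` is a
# hypothetical stand-in for a call to the model under test; the questions
# are placeholders rather than items from any real benchmark.

def query_model(prompt: str) -> str:
    """Pretend model that always answers 'A' (replace with a real API call)."""
    return "A"

benchmark = [
    {"question": "Which of these is a placeholder question?", "choices": ["A) Yes", "B) No", "C) Maybe", "D) Unsure"], "answer": "A"},
    {"question": "Another placeholder question?", "choices": ["A) Option", "B) Option", "C) Option", "D) Option"], "answer": "C"},
]

def run_eval(benchmark: list[dict]) -> float:
    correct = 0
    for item in benchmark:
        prompt = item["question"] + "\n" + "\n".join(item["choices"]) + "\nAnswer with A, B, C or D."
        if query_model(prompt).strip().upper().startswith(item["answer"]):
            correct += 1
    return correct / len(benchmark)

print(f"Accuracy: {run_eval(benchmark):.0%}")
```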
There are many streams of work within evaluations:
Introductory resources: Model evaluation for extreme risks, A starter guide for Evals
Red-teaming
TLDR: Perform exploratory and custom testing to find vulnerabilities in AI systems, often engaging external experts.
Red-teaming involves testing AI systems to uncover potential capabilities, propensity for harm, or vulnerabilities that might not be apparent through evaluations.[5] While evaluations are usually more standardised, red-teaming is typically more open-ended and customised to the specific system being tested.
Red-teaming is more likely to involve domain experts, and encourage out-of-the-box thinking to simulate novel scenarios. The results of successful red-teaming often end up being turned into automated evaluations for future models.
Red-teaming is also often applied to AI systems as a whole, given that how an AI system is deployed is usually more unique than the underlying model. For example, red-teaming might discover security vulnerabilities in the way particular tools are integrated into the deployment environment.
Workstreams within red-teaming include:
Introductory resources: Red-Teaming for Generative AI: Silver Bullet or Security Theater?
Third-party auditing
TLDR: Have external experts inspect AI systems safely and securely to verify compliance and identify risks.
Third-party auditing involves allowing independent experts to examine AI systems, their development processes, and operational environments. This can help identify potential risks and verify compliance with regulations.
However, third-party AI audits are currently difficult because:
Corresponding areas of work in third-party auditing include:
Introductory resources: Theories of Change for AI Auditing, Towards Publicly Accountable Frontier LLMs, Structured access for third-party research on frontier AI models
Post-deployment
Abuse monitoring
TLDR: Monitor for harmful applications of AI systems by analysing patterns of behaviour.
Abuse monitoring involves monitoring the usage patterns of AI systems to detect potential misuse or harmful applications. While input/output filtration might restrict or block individual queries, abuse monitoring looks for patterns in behaviour and might result in general account restrictions or termination.
Deciding whether a behaviour is suspicious is context dependent. A key input to this could be a customer’s risk profile, the output of a customer due diligence process.
This approach draws parallels with anti-money laundering (AML) and counter-terrorist financing (CTF)[6] practices in the financial sector, which usually focus on suspicious patterns rather than individual activities.
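A toy sketch of what pattern-based monitoring might look like: flag accounts that accumulate several harm-category hits within a time window. The thresholds, categories and events are invented:

```python
from collections import defaultdict
from datetime import datetime, timedelta

# Toy abuse monitor: flag an account if it accumulates too many
# harm-category hits within a rolling time window. Thresholds and the
# example events are invented for illustration.

WINDOW = timedelta(hours=24)
THRESHOLD = 3

def flag_accounts(events: list[dict]) -> set[str]:
    """events: [{'account': str, 'time': datetime, 'flagged_category': str or None}, ...]"""
    hits = defaultdict(list)
    for event in sorted(events, key=lambda e: e["time"]):
        if event["flagged_category"] is None:
            continue
        hits[event["account"]].append(event["time"])
    flagged = set()
    for account, times in hits.items():
        for i, start in enumerate(times):
            in_window = [t for t in times[i:] if t - start <= WINDOW]
            if len(in_window) >= THRESHOLD:
                flagged.add(account)
                break
    return flagged

now = datetime(2024, 8, 1, 12, 0)
events = [
    {"account": "user_a", "time": now, "flagged_category": "cyber"},
    {"account": "user_a", "time": now + timedelta(hours=1), "flagged_category": "cyber"},
    {"account": "user_a", "time": now + timedelta(hours=2), "flagged_category": "bio"},
    {"account": "user_b", "time": now, "flagged_category": None},
]
print(flag_accounts(events))  # {'user_a'}
```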
Work to be done here includes:
A lot of valuable work here likely involves identifying and copying what works well from other industries.
Introductory resources: Monitoring Misuse for Accountable ‘Artificial Intelligence as a Service’
Agent identification
TLDR: Develop mechanisms to identify and authenticate AI agents acting on behalf of various entities in a multi-agent AI ecosystem.
In the future, AI agents may act on behalf of different users, interacting with other agents and humans in complex ways.
It might become difficult for humans to understand what is going on. This would make it hard to resolve problems (such as unwanted interactions between systems), as well as to hold people accountable for their actions.
For example, Perplexity already navigates the internet and fetches content for users automatically. However, it does so while pretending to be a normal Google Chrome browser and ignores requests from publishers using agreed-upon internet standards (original source, WIRED followup, Reuters followup).
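A toy sketch of one way agent identification could work: the agent attaches an identifier and a signature to each request, and the receiving service verifies it. This uses a shared-secret HMAC purely for illustration; real proposals discuss richer certificate-style schemes, and all names and keys here are invented:

```python
import hmac
import hashlib

# Toy agent-identification scheme: an agent attaches an ID and an HMAC
# signature to each request; the receiving service verifies it against a
# shared secret. Real proposals discuss richer certificate-style schemes;
# all names and keys here are invented.

SHARED_SECRET = b"registry-issued-secret-for-agent-42"

def sign_request(agent_id: str, request_body: str) -> str:
    message = f"{agent_id}:{request_body}".encode()
    return hmac.new(SHARED_SECRET, message, hashlib.sha256).hexdigest()

def verify_request(agent_id: str, request_body: str, signature: str) -> bool:
    expected = sign_request(agent_id, request_body)
    return hmac.compare_digest(expected, signature)

agent_id = "agent-42-acting-for-example-user"
body = "GET https://example.com/article"
signature = sign_request(agent_id, body)

print(verify_request(agent_id, body, signature))          # True: request genuinely from agent-42
print(verify_request("imposter-agent", body, signature))  # False: identity doesn't match signature
```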
Work to be done here includes:
Introductory resources: IDs for AI Systems
Customer due diligence
TLDR: Identify, verify, understand and risk assess users of AI systems. In conjunction with other interventions, this could be used to restrict access to potentially dangerous capabilities.
Financial services firms are expected to perform customer due diligence (CDD)[7] to prevent financial crime. This usually involves:
Identifying customers: usually asking customers for details that could uniquely identify them. This might also include asking for details about corporate structures like the ultimate beneficial owners of a company.
Verifying[8] customer identities: checking the person really exists and is who they say they are. This might involve reviewing ID documents and a selfie video.
Understanding customers: understanding who customers are, and how customers will use your services. This often combines:
Risk assessing customers: evaluating the information collected about the customer, and developing a risk profile. This might then determine what kinds of activity would be considered suspicious. This is usually done on a regular basis, and is closely linked to ongoing monitoring of customer behaviours. If something very suspicious is flagged, the firm reports this to law enforcement as a suspicious activity report.
This exact process could be ported over to advanced AI systems, developing risk profiles for their users. These risk profiles could then be used to inform what capabilities are available or considered suspicious, in conjunction with input/output filtration and abuse monitoring.
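A toy sketch of turning due diligence information into a risk profile that downstream controls could consume; the factors, weights and thresholds are entirely invented:

```python
# Toy customer risk scoring: combine due-diligence signals into a risk tier
# that downstream controls (input/output filtration, abuse monitoring) could
# use. The factors, weights and thresholds are entirely invented.

RISK_WEIGHTS = {
    "identity_unverified": 3,
    "anonymous_payment": 2,
    "requests_dual_use_capabilities": 3,
    "verified_research_institution": -2,
}

def risk_tier(customer: dict) -> str:
    score = sum(weight for factor, weight in RISK_WEIGHTS.items() if customer.get(factor))
    if score >= 5:
        return "high"
    if score >= 2:
        return "medium"
    return "low"

customer = {
    "identity_unverified": False,
    "anonymous_payment": True,
    "requests_dual_use_capabilities": True,
    "verified_research_institution": False,
}
print(risk_tier(customer))  # "high"
```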
The step that is likely to look the most different is the risk assessment. For AI systems, this might be evaluating whether there is a reasonable justification for the user to access certain capabilities. For example:
Introductory resources: ???[9]
Human in the loop
TLDR: Require human oversight in AI decision-making processes, especially for high-stakes applications.
A human-in-the-loop (HITL) system is one where a human gets to make the final decision as to the action being taken. The loop here often refers to the decision-making process, for example the observe, orient, decide, act (OODA) loop. A human being in this loop means that progressing through the loop relies on them, with the human usually owning the ‘decide’ stage.
For example, imagine an AI system used for targeting airstrikes. A human-in-the-loop system would surface recommendations to a human, who would then decide on whether to proceed. The nature of this human oversight could range dramatically: for example, reviewing each strike in detail compared to skimming batches of a thousand.
Human-on-the-loop systems (also known as human-over-the-loop systems) take actions autonomously without requiring human approval, but can be interrupted by a human. In our airstrike targeting system example, a human-on-the-loop system could be one which selects targets and gives humans 60 seconds to cancel the automatic strike. This system would inform humans as to what was happening and give them a genuine chance to stop the strike.
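A minimal sketch contrasting the two modes: with a human in the loop, nothing happens without explicit approval; with a human on the loop, the action proceeds by default unless a person cancels it in time. The approval and cancellation functions are stand-ins for a real review interface:

```python
# Toy contrast between human-in-the-loop (an action only proceeds with
# explicit human approval) and human-on-the-loop (an action proceeds by
# default unless a human cancels it in time). The approval/cancellation
# callbacks are stand-ins for a real review interface.

def human_in_the_loop(recommendation: str, human_approves) -> str:
    """The human owns the 'decide' step: no approval, no action."""
    if human_approves(recommendation):
        return f"EXECUTED: {recommendation}"
    return f"NOT EXECUTED: {recommendation}"

def human_on_the_loop(recommendation: str, human_cancels_within_window) -> str:
    """The system acts autonomously, but the human can interrupt in time."""
    if human_cancels_within_window(recommendation):
        return f"CANCELLED: {recommendation}"
    return f"EXECUTED AUTOMATICALLY: {recommendation}"

# Simulated human responses (in reality these would come from a review UI).
print(human_in_the_loop("flag transaction 1234", human_approves=lambda r: False))
print(human_on_the_loop("flag transaction 1234", human_cancels_within_window=lambda r: False))
```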
Some systems might use a combination of these methods, usually when looking at the subsystems that make them up at different levels of granularity. For example, an autonomous vehicle might have systems that are:
A human-in-the-loop will not be able to supervise all aspects of a highly capable system. This intervention is generally seen as either a temporary solution, or as one part of a wider system.
Scalable oversight is a subfield of AI safety research that tries to improve the ability of humans to supervise more powerful AI systems.
Introductory resources: Why having a human-in-the-loop doesn't solve everything, (not AI-specific) A Framework for Reasoning About the Human in the Loop[10]
Identifiers of AI or non-AI content
TLDR: Identify what content is and is not from AI systems. Some methods also identify the originating AI system or even user.
Being able to differentiate between AI and human-created content may be helpful for combating misinformation and disinformation. Additionally, being able to identify AI-generated content can help researchers understand how AI systems are being used.
Some methods could also enable attributing content to specific AI systems or users. This could increase the accountability of model providers and users, and enable redress for AI harms. However, this also brings about significant risks to privacy.
AI-content watermarking
This involves embedding markers into AI-generated content to enable its identification. This can be applied to any outputs of AI systems, including text, images, audio, and video.
Watermarking is generally not effective against actors who intentionally want to remove the watermarks. It’s nearly impossible to watermark text outputs robustly, especially short texts. It’s very difficult to robustly watermark other AI outputs.
It may still be helpful to watermark AI outputs for other purposes. For example, to:
There are a number of different approaches to watermarking:
Multiple approaches may be used at the same time, to provide greater resilience against transformations and comply with different standards.
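As one concrete illustration, statistical text watermarking schemes bias a model’s word choices towards a keyed ‘green list’, and detection then checks whether a text contains far more green words than chance would predict. A heavily simplified sketch of the detection side (not any production scheme):

```python
import hashlib
import math

# Toy statistical text watermark detector, loosely inspired by "green list"
# schemes: a keyed hash of (previous word, current word) marks each word as
# green or red. A cooperating generator would prefer green words; ordinary
# text should be green roughly half the time. Simplified sketch only.

KEY = b"watermark-key"

def is_green(prev_word: str, word: str) -> bool:
    digest = hashlib.sha256(KEY + prev_word.encode() + b"|" + word.encode()).digest()
    return digest[0] % 2 == 0

def green_z_score(text: str) -> float:
    words = text.lower().split()
    pairs = list(zip(words, words[1:]))
    if not pairs:
        return 0.0
    greens = sum(is_green(prev, word) for prev, word in pairs)
    n = len(pairs)
    # Under the "no watermark" null hypothesis, greens ~ Binomial(n, 0.5).
    return (greens - 0.5 * n) / math.sqrt(0.25 * n)

sample = "the quick brown fox jumps over the lazy dog " * 10
print(f"z-score: {green_z_score(sample):.2f}  (|z| >> 2 would suggest a watermark)")
```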
All of the above techniques in their basic form can also be used to mark content as AI-generated when it isn’t. This could provide a “liar's dividend” to people who can claim content is fake and AI-generated when it shows them in a bad light.
To prevent falsely marking content as AI-generated, these claims can be signed by trusted parties.[11] Metadata and steganographic techniques can use cryptography to sign assertions that the image is AI-generated. For example, the C2PA standard is a metadata watermarking solution that supports cryptographically signing claims, and then verifying these claims.
Watermarking cannot currently be enforced at a technical level for open-weights models, and it’s unlikely to be possible in future. However, companies that release or host open-weights models could be required (through legislation) to only provide models that enable watermarking by default.
Introductory resources: A Brief Yet In-Depth Survey of Deep Learning-Based Image Watermarking, A Survey on the Possibilities & Impossibilities of AI-generated Text Detection
Human-content watermarking
Similar to watermarking AI outputs, some systems may be able to watermark human-generated content.
The key difficulty here is determining that something has human origin in the first place. There are a few approaches that work in specific cases:
Introductory resources: What are Content Credentials and can they save photography?
Hash databases and perceptual hashing
Hash functions are one-way functions that usually take arbitrary inputs (like some AI-generated content) and output a short string that represents that content. For example, hash(<some image>) = abcd.
Hash databases store the hashes (like abcd) to maintain records of known AI-generated content. AI companies could hash the content they generate and send it to centralised hash databases, before giving the content to users. This means when the user posts the image to social media, it can be passed through the same hash function, and the resulting abcd hash can be found in the database so we know it’s AI-generated.
The same approach can be applied to human-generated content, i.e. storing a hash database of human-generated content. However, it’s harder to prove content was human-generated in the first place (compared to AI content, which AI companies can attest is actually AI-generated).
Hashing only works if the exact same content is posted: for example cropping or converting the file would result in a different hash.
Perceptual hashing is an improvement upon hashing that allows for the identification of similar content even if it has been slightly modified. Perceptual hash databases would enable matching up content that had been subjected to minor modifications.
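A toy sketch of the perceptual-hashing idea, using a simple ‘average hash’ over a tiny grid of pixel values; real systems such as pHash or PhotoDNA are far more sophisticated, but the principle is the same:

```python
# Toy perceptual hash (average hash) over a tiny grayscale "image"
# represented as a grid of pixel values. Similar images produce similar
# hashes, so small edits only flip a few bits.

def average_hash(pixels: list[list[int]]) -> int:
    flat = [value for row in pixels for value in row]
    mean = sum(flat) / len(flat)
    bits = 0
    for value in flat:
        bits = (bits << 1) | (1 if value >= mean else 0)
    return bits

def hamming_distance(a: int, b: int) -> int:
    return bin(a ^ b).count("1")

original = [
    [200, 200, 10, 10],
    [200, 200, 10, 10],
    [10, 10, 200, 200],
    [10, 10, 200, 200],
]
# Slightly edited copy (e.g. after re-compression): a couple of pixels change.
edited = [row[:] for row in original]
edited[0][0] = 180
edited[1][2] = 130

print(hamming_distance(average_hash(original), average_hash(edited)))  # 1 differing bit -> likely a match
```

Small edits flip only a few bits, so a database lookup can tolerate a small Hamming distance rather than requiring an exact match.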
However, there are several key problems with hash databases:
Introductory resources: An Overview of Perceptual Hashing
Content provenance
Content provenance (also known as ‘chain of custody’) focuses on recording how content has been created and updated over time. This provides much more detailed information than other methods (which are usually a more binary yes/no for being AI-generated).
The main standard for this is the Coalition for Content Provenance and Authenticity (C2PA), by the Content Authenticity Initiative (CAI). This is led by Adobe, and members include Google, Microsoft and OpenAI.
This stores detailed metadata about how an image has been created and modified. The first step in this chain might be AI-content or human-content watermarking. For a single image, this metadata might look something like:
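As an invented illustration of the kind of information such a record might carry (this is not the actual C2PA manifest format):

```python
# Invented illustration of the kind of information a provenance record might
# carry for a single image. This is not the actual C2PA manifest format.
provenance_record = {
    "asset": "holiday-photo.jpg",
    "history": [
        {"action": "created", "tool": "ExampleCam camera app", "claim": "captured by camera sensor"},
        {"action": "edited", "tool": "ExamplePhoto editor", "claim": "cropped and colour-adjusted"},
        {"action": "edited", "tool": "ExampleAI image model", "claim": "background replaced using generative AI"},
    ],
    "signatures": ["signed by each tool's provider at the corresponding step"],
}
```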
Introductory resources: C2PA Explainer
Content classifiers
Content classifiers aim to directly identify existing AI content without special changes to the AI content itself. Usually, these are AI systems themselves, trained to distinguish between real and AI images (similar to a discriminator in a GAN).
These have poor accuracy on text content. Several stories have been written on how misusing the results of faulty classifiers can cause harm.
It seems likely that classifiers for images, videos and audio will be slightly more accurate - especially for content that has not intentionally tried to hide its origins. However it’s unclear how much more accurate these will be (and they’ll almost certainly not be perfect).
Introductory resources: Testing of detection tools for AI-generated text
Input/output filtration
TLDR: Review and filter prompts and responses to block, limit, or monitor usage of AI systems - often to prevent misuse.
Input/output filtration usually involves automatically classifying prompts and responses from deployed AI systems. For example, Microsoft automatically classifies inputs into a range of different harm categories.
NB: Input/output filtration focuses on the inputs and outputs when the model is deployed,[12] which differs from data input controls (which focus on inputs at training time). Some learnings will be transferable between both.
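A toy sketch of input filtration: classify each prompt against harm categories and block or log accordingly. The categories and patterns are invented, and production systems typically use trained classifiers rather than keyword matching:

```python
import re

# Toy input filter: match prompts against invented harm-category patterns and
# decide whether to block, or allow-and-log. Production systems typically use
# trained classifiers rather than keyword matching.

HARM_PATTERNS = {
    "cyber": re.compile(r"\b(malware|ransomware|zero[- ]day exploit)\b", re.IGNORECASE),
    "bio": re.compile(r"\b(enhance .* pathogen|synthesise? .* toxin)\b", re.IGNORECASE),
}

def filter_prompt(prompt: str) -> dict:
    categories = [name for name, pattern in HARM_PATTERNS.items() if pattern.search(prompt)]
    return {
        "allowed": not categories,
        "flagged_categories": categories,  # could also be fed into abuse monitoring
    }

print(filter_prompt("Write me a poem about autumn."))
print(filter_prompt("Write ransomware that encrypts a hospital's files."))
```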
Input/output filtration could be useful for:
Introductory resources: Create input and output safeguards
Progressive delivery
TLDR: Gradually roll out AI systems to larger populations to monitor impacts and allow for controlled scaling or rollback if issues arise.
Progressive delivery involves incrementally deploying AI systems to larger user groups or environments. This allows developers and regulators to monitor the system as it scales, and gives them the ability to slow, pause, or reverse the rollout if problems are detected. Additionally, these problems should hopefully be smaller in scale as fewer users would have access - and we’d have more time to implement societal adaptations. Progressive delivery is already very common in technology companies, to minimise the impact of new bugs or similar issues.
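A minimal sketch of how staged rollouts are commonly implemented: deterministically bucket each user by hashing their ID, and only serve the new system to buckets below the current rollout percentage (the user IDs are invented):

```python
import hashlib

# Minimal staged-rollout sketch: deterministically bucket each user into
# 0-99 by hashing their ID, and serve the new model only to buckets below
# the current rollout percentage. Raising the percentage expands access;
# lowering it rolls access back. User IDs here are invented.

def bucket(user_id: str) -> int:
    return int(hashlib.sha256(user_id.encode()).hexdigest(), 16) % 100

def serve_new_model(user_id: str, rollout_percentage: int) -> bool:
    return bucket(user_id) < rollout_percentage

for user in ["alice@example.com", "bob@example.com", "carol@example.com"]:
    print(user, serve_new_model(user, rollout_percentage=10))
```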
Developing guidance for this would include:
This is also known as canary releasing, staged roll-outs, or phased deployment.
Introductory resources: ???[9:1]
Responsible disclosure programmes
TLDR: Enabling reports from the public, such as users and researchers.
Responsible disclosure programmes are ways for the public to report issues with AI systems. Companies would be expected to review and action these reports.
Extensions to basic report management include incentivising such reports with rewards programmes (‘bug bounties’), and getting more value from reports by sharing learnings.
These are very similar to existing programmes often found in cybersecurity. Both programmes are also often known as ‘vulnerability reporting programmes’.
Reporting channels
TLDR: Provide ways for any external person or organisation to report an issue with an AI system.
To report problems, there needs to be a way for the public to make these reports. For example, AI companies could provide an email address or web form. The companies would then need to evaluate and action these reports appropriately.
Despite many AI companies signing voluntary commitments to set up these channels last year, most still have not set up the very basics. Multiple companies don’t even list an email or web form, and most don’t appear to review reports at all.
One idea to improve reporting accountability is to have all reports go through a central body. This would support users in providing better reports, and pass these reports on to AI companies. While doing this, it would keep records of all reports, giving visibility into the problems arising with different AI systems, as well as how companies are responding to them. This is very similar to Resolver, a service for consumer issues in the UK. It could also be set up by a government body, which would gain a lot of insight into what’s going wrong.
Introductory resources: ???[9:2]
Bug bounties
TLDR: Provide (usually monetary) rewards to people making genuine reports.
Bug bounty programs incentivize external researchers and ethical hackers to identify and report vulnerabilities in AI systems. These programs typically offer monetary rewards for valid reports (example from Google’s cybersecurity bug bounty program), with the reward amount often scaling with the severity of the issue.
Beyond monetary rewards, other incentives often include public recognition, private recognition and company merch.
Introductory resources: AI Safety Bounties
Shared reports
TLDR: Share data about reports, usually publicly and after the issue has been fixed.
Sharing knowledge about problems helps the industry learn how to make AI systems safer.
This is incredibly common in cybersecurity, and it’s standard best practice to publicly disclose security problems to the MITRE Corporation, an independent non-profit that maintains a register of security vulnerabilities (known as their CVE program). This shared system is used by all major governments, including the US and UK.
This has also been very helpful in cybersecurity for understanding threats, and has resulted in the development of tools like ATT&CK and the OWASP Top Ten which both aim to understand how to better secure systems.
This would also likely help regulators understand what kind of problems are being observed, as well as how AI companies are responding.
Also see incident monitoring.
Introductory resources: ???[9:3]
Right to object
TLDR: Allow individuals to challenge decisions made by AI systems that significantly affect them, potentially requiring human review.
This right gives individuals the ability to contest decisions made solely by AI systems that have a substantial impact on their lives.
A provision covering this already exists in the EU’s General Data Protection Regulation (GDPR). Unfortunately, it’s often difficult to contest decisions in practice, and the contractual exemption can prevent users from challenging a lot of significant decisions. The EU AI Act will also give people a right to an explanation of the role of an AI system and the main elements of the decision taken - but not a right to object or to human review.
Similar rights are not present in many other regions. For example, the US does not have similar rights, but they have been proposed by the White House: “You should be able to opt out [from automated systems], where appropriate, and have access to a person who can quickly consider and remedy problems you encounter.”
Introductory resources: ???[9:4]
Transparency requirements
TLDR: Require disclosure of information about AI systems or AI companies, potentially submitting them to a central register.
Transparency requirements involve mandating that developers and/or deployers of AI systems disclose specific information about their companies or systems to regulators, users, or the general public. These requirements aim to increase understanding, accountability and oversight of AI systems.
This could take a number of forms. For models themselves, model cards offer a way to summarise key facts about a model. When end users are interacting with AI systems, transparency or disclosure requirements similar to those for algorithmic decision making in the GDPR might be more appropriate.[13]
This data might also be added to a central register to support regulator oversight, build public trust and help AI researchers - similar to the FCA’s financial services register, the ICO’s register of data controllers, or the CAC’s internet service algorithms registry (English analysis). Article 71 of the EU’s AI Act will set up a database to record high-risk AI systems.
Introductory resources: Model Cards for Model Reporting, Navigating the EU AI System Database
Cross-cutting
Analysis functions
Risk identification
TLDR: Identify future risks from AI systems.
Risk identification (also known as ‘horizon scanning’) involves identifying potential negative outcomes from the development and deployment of AI systems.
Introductory resources: International scientific report on the safety of advanced AI: interim report: Risks
Risk analysis
TLDR: Understand and assess risks from AI systems.
Risk analysis involves evaluating identified risks to understand their likelihood, potential impact, and contributing factors. This process typically includes:
It will often involve consulting experts in different areas, given the wide-ranging nature of many risks.
An extension to this often involves modelling the impact different policies might have on risks, and therefore being able to evaluate the costs and benefits of different policies.
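A toy sketch of the kind of comparison this might involve, estimating expected harm under different policies; the probabilities, impacts and risk-reduction figures are made up purely to illustrate the arithmetic:

```python
# Toy expected-harm comparison across policies. The probabilities, impact
# scores and risk-reduction estimates are made up purely to illustrate the
# arithmetic; real risk analysis must also grapple with deep uncertainty.

risks = {
    "critical-infrastructure cyberattack": {"probability": 0.05, "impact": 100},
    "large-scale disinformation campaign": {"probability": 0.20, "impact": 10},
}

# Estimated fractional reduction in each risk's probability under each policy.
policies = {
    "no intervention": {"critical-infrastructure cyberattack": 0.0, "large-scale disinformation campaign": 0.0},
    "mandatory incident reporting": {"critical-infrastructure cyberattack": 0.3, "large-scale disinformation campaign": 0.1},
}

def expected_harm(policy: str) -> float:
    return sum(
        params["probability"] * (1 - policies[policy][risk]) * params["impact"]
        for risk, params in risks.items()
    )

for policy in policies:
    print(f"{policy}: expected harm = {expected_harm(policy):.2f}")
```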
Introductory resources: NIST AI Risk Management Framework
Incident monitoring
TLDR: Investigate when things go wrong with AI systems, and learn from this.
Incident monitoring involves systematically tracking, analysing, and learning from failures, near-misses, and unexpected behaviours in AI systems. This process typically includes:
Also see whistleblowing and responsible disclosure programmes.
Introductory resources: Adding Structure to AI Harm
Open-source intelligence monitoring
TLDR: Use public information to monitor compliance with AI standards, regulations or treaties.
Open-source intelligence (OSINT) monitoring involves collecting and analysing public information. This could be used to:
Information sources for OSINT monitoring include:
Introductory resources: ???[9:5]
Semi-structured interviews
TLDR: Conduct regular interviews with employees from frontier AI companies to gain insights into AI progress, risks, and internal practices.
In semi-structured interviews, government officials interview employees at leading AI companies to gather qualitative information about AI development. These aim to capture developers’ intuitions, concerns, and predictions about AI progress and risks.
This would inform government understanding of AI safety relevant concerns, and could provide early warning signs of potential risks. It could also enable insights into internal safety culture and practices.
Introductory resources: Understanding frontier AI capabilities and risks through semi-structured interviews
Audit logging
TLDR: Log AI system operations for real-time analysis and after-the-event investigations.
Audit logging involves recording key events and actions taken in relation to AI systems in a way that allows for retrospective analysis and verification. This helps promote regulatory compliance, prevent misuse, and investigate incidents involving AI systems so we can learn from them.
This might include:
Key desiderata for audit logging include:
Similar to customer due diligence and abuse monitoring, it’s likely that a lot of valuable work here just involves identifying and copying what works well from other industries.
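A minimal sketch of a tamper-evident audit log, where each entry commits to the hash of the previous entry so that retroactive edits break the chain; the logged events are invented:

```python
import hashlib
import json

# Minimal tamper-evident audit log: each entry commits to the hash of the
# previous entry, so editing or deleting an earlier record breaks the chain.
# Events and fields here are invented for illustration.

def append_entry(log: list[dict], event: dict) -> None:
    previous_hash = log[-1]["entry_hash"] if log else "genesis"
    body = {"event": event, "previous_hash": previous_hash}
    entry_hash = hashlib.sha256(json.dumps(body, sort_keys=True).encode()).hexdigest()
    log.append({**body, "entry_hash": entry_hash})

def verify_chain(log: list[dict]) -> bool:
    previous_hash = "genesis"
    for entry in log:
        body = {"event": entry["event"], "previous_hash": entry["previous_hash"]}
        if entry["previous_hash"] != previous_hash:
            return False
        if entry["entry_hash"] != hashlib.sha256(json.dumps(body, sort_keys=True).encode()).hexdigest():
            return False
        previous_hash = entry["entry_hash"]
    return True

log: list[dict] = []
append_entry(log, {"action": "training_run_started", "approved_by": "safety-board"})
append_entry(log, {"action": "model_deployed", "version": "v1.2"})
print(verify_chain(log))  # True
log[0]["event"]["approved_by"] = "nobody"
print(verify_chain(log))  # False: tampering detected
```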
Introductory resources: Toward Trustworthy AI Development: Audit Trails, Visibility into AI Agents
Cyber and information security
TLDR: Establish and enforce cyber and information security measures for AI labs and systems to protect against various threats.
Standards, guidance and enforcement could help protect the confidentiality, integrity, and availability of AI-related assets.
Securing model parameters and other key intellectual property
Future AI systems may be highly capable and dangerous in the wrong hands. Adversaries like nation states or terrorist groups are likely to want to get their hands on these systems to enable them to pursue their own goals. Attackers might also want related intellectual property, like algorithms that improve the efficiency of training or running AI systems.
Despite the significant risks, ex-OpenAI staff have noted that current security measures are far from sufficient:
Introductory resources: Securing AI Model Weights
Securing model environments
We don’t currently know how to build AI systems that reliably try to do what their creators intend them to do. The way we currently build systems may lead to AI takeover, and almost certainly leads to some sycophancy and deception.
It currently seems plausible that we will train or deploy highly capable AI systems that might be trying to kill us.[14]
If we deploy these AI systems, we might want to sandbox or somehow limit the environments we deploy them into to reduce the damage they might cause.
To do this effectively, we’ll need to do a good job securing these environments themselves. This is because these AI systems might be very good at breaking security measures we put in place - and if they broke out, could cause catastrophic harm.
Introductory resources: ???[9:6]
Securing internal change management processes
It’s crucial that decisions about training or deploying AI systems at companies are appropriately authorised, and comply with other requirements around safety.
Change management security measures might look like:
Introductory resources: ???[9:7]
‘Traditional’ security concerns of AI systems
In the same way that we defend standard computer systems from attack, we’ll need to defend AI systems from similar attacks. Like standard computer systems, AI systems may be entrusted with sensitive information or control over resources.
AI systems also open up other avenues for attack. For example, OWASP have developed a top 10 list of attacks specific to large language models.
Introductory resources: Machine learning security principles
Securing other systems
AI systems are expected to increase the volume and impact of cyberattacks in the next 2 years. They’re also expected to improve the capabilities available to cybercriminals and state actors in 2025 and beyond.
Open-weights models are likely to increase this threat because their safeguards can be cheaply removed, they can be finetuned to help cyberattackers, and they cannot be recalled. Given many powerful open-weights models have been released, it’s infeasible to ‘put the genie back in the bottle’ that would prevent the use of AI systems for cyberattacks.[15]
This means significant work is likely necessary to defend against the upcoming wave of cyberattacks caused by AI systems. Also see societal adaptation.
Introductory resources: The near-term impact of AI on the cyber threat
Policy implementation
Standards development
TLDR: Turn ideas for making AI safer into specific implementation requirements.
Standards development involves creating detailed, technical specifications for AI safety (or specific processes related to AI safety, like evaluations).
These could be hugely impactful, particularly as regulations often use standards to set the bar for what AI companies should be doing (e.g. CEN-CENELEC JTC 21 is likely to set what the actual requirements for general-purpose AI systems are under the EU AI Act). These standards can serve as benchmarks for industry best practices and potentially form the basis for future regulations.
Unfortunately, a lot of current AI standards development is fairly low quality. There is often limited focus on safety and little technical detail. Almost no standards address catastrophic harms. Standards are also often paywalled, effectively denying access to most policymakers and a lot of independent researchers - and even free standards are often painful to get access to.
There is definite room to contribute here by developing much higher-quality safety standards. At the time of writing, several organisations have open calls for people to help develop better standards (including paid part-time work starting from 2 hours a week).
Introductory resources: Standards at a glance
Legislation and regulation development
TLDR: Turn policy ideas into specific rules that can be legally enforced, usually alongside powers to enforce them.
Similar to standards development, regulations development involves turning policy ideas into concrete legislation and regulations.
As well as selecting policies to implement, this also includes:
Introductory resources: 2024 State of the AI Regulatory Landscape, (not AI-specific) The Policy Process, (not AI-specific) The FCA’s rule review framework
Regulator design
TLDR: Design effective structures and strategies for regulatory bodies.
Legislation and regulation often leaves a lot open to interpretation by a regulator. Even where it doesn’t, how the regulator structures itself and what it focuses on can have a significant impact on the effectiveness of policies.
Regulator design involves structuring regulatory bodies, and informing their regulatory strategy, which determines what they focus on and how they use their powers.
For example, the ICO has powers to impose fines for violations of data protection law. However, it almost never uses these powers: despite 50,000 data breaches being self-reported by companies as likely to present risks to individuals,[16] it has issued 43 penalties since 2018 (0.08%). It’s also often highlighted that data protection regulators are under-resourced, don’t pursue important cases, and fail to achieve regulatory compliance - all of these are to do with regulator design and implementation (although regulations can be designed to mitigate some of these failures).
In AI regulation, it’ll also likely be hard to attract and retain necessary talent. Governments pay far less and have far worse working conditions than AI companies. They also often have poor hiring practices that optimise for largely irrelevant criteria. This said, the opportunity for impact can be outstanding (depending on the exact team you join), these problems make the area all the more neglected, and you could also work on fixing these operational problems themselves.
Introductory resources:[9:8] (not AI-specific) Effective Regulation
Societal adaptation
TLDR: Adjust aspects of society downstream of AI capabilities, to reduce negative impacts from AI.
Societal adaptation to AI focuses on modifying societal systems and infrastructure to make them better able to avoid, defend against and remedy AI harms. This is in contrast to focusing on capability-modifying interventions: those which affect how a potentially dangerous capability is developed or deployed. Societal adaptation might be useful because, over time, it becomes easier to train AI systems of a fixed capability level, so capability-modifying interventions may become infeasible (e.g. when anyone can train a GPT-4-level model on a laptop).
Societal adaptation is generally presented as complementary to capability-modifying interventions, and not a replacement for them. One route might be to use capability-modifying interventions to delay the diffusion of dangerous capabilities to ‘buy time’ so that effective societal adaptation can occur.
An example of societal adaptation might be investing in improving the cybersecurity of critical infrastructure (energy grids, healthcare systems, food supply chains).[17] This would reduce the harm caused by bad actors with access to AI systems with cyber offensive capabilities.
Introductory resources: Societal Adaptation to Advanced AI
Taxes
TLDR: Change tax law, or secure other economic commitments.
Windfall clauses
TLDR: Get AI companies to agree (ideally with a binding mechanism) to share profits broadly should they make extremely large profits.
AI might make some companies huge amounts of money, completely transforming the economy - probably easily capturing almost half of current wages in developed countries. This might also make a lot of people redundant, or decimate many existing businesses.
It’s extremely difficult to predict exactly how this will pan out. However, if it does result in AI companies making extremely large profits (e.g. >1% of the world’s economic output) we might want to tax this heavily to share the benefits.
Introducing such a tax after the money has been made will likely be much harder than encouraging companies to commit to this beforehand. This is kind of similar to Founders Pledge, which encourages founders to commit to donate a percentage of their wealth should they exit successfully.
Introductory resources: The Windfall Clause, or the associated talk on YouTube
Robot tax
TLDR: Taxes to discourage the replacement of workers with machines.
Robot taxes (also known as automation taxes) are proposed levies on the use of automated systems or AI that replace human workers. The main goal of such a tax is usually to slow the pace of workforce automation to allow more time for adaptation.
It’s highly uncertain what changes we might want to slow down. In many areas this will trade-off against valuable advancements. Reasonable people disagree as to whether it’s a good idea or not.
It also seems unlikely that this will address some of the most neglected or serious AI risks, beyond slightly decreasing incentives to build AI systems generally. A more targeted form that might be more useful is applying different robot tax rates depending on system risk, ensuring that this covers catastrophic risks.
Introductory resources: Should we tax robots?
Other taxes
TLDR: Penalties or tax breaks to incentivise ‘good’ behaviours (like investing in AI safety research).
Tax penalties or other government fines can be used to encourage compliance:
Tax incentives can be used to encourage AI companies and researchers to prioritise safety research, or building safe AI systems. This might include:
Introductory resources: ???[9:9]
Whistleblowing
TLDR: Enabling reports from people closely related to AI systems, such as employees, contractors, or auditors.
Legal protections
TLDR: Legal protections for people reporting AI risks, or non-compliance with AI regulations.
Whistleblowing protections enable individuals or organisations to report potential risks or regulatory violations without fear of retaliation.
Current whistleblower laws don't clearly cover serious AI risks. They also tend to only cover domestic employees (and not contractors, external organisations, auditors, or any overseas workers). For example, the UK’s Public Interest Disclosure Act 1998 primarily covers existing legal obligations,[18] which do not exist in AI as the UK government has not legislated yet.
For example, OpenAI researcher Daniel Kokotajlo expected to lose $1.7 million, or 85% of his family’s net worth, by refusing to sign an agreement that would permanently bar him from reporting concerns to a regulator.
In response, OpenAI CEO Sam Altman claimed to not know this was happening. Later leaks revealed that Altman had signed off on the updated legal provisions the year before.
These leaks also further revealed the use of other high-pressure tactics, including requiring employees to sign exit agreements within 7 days, making it hard to obtain legal advice. The company pushed back when an employee asked to extend this deadline so they could receive legal advice, and in the same thread again highlighted “We want to make sure you understand that if you don't sign, it could impact your equity”.
(OpenAI’s exit contracts were later changed to resolve this specific issue, but the underlying ability to do this remains.)
Whistleblower protections might include:
Introductory resources: AI Whistleblowers
Reporting body
TLDR: Design or set up an organisation to accept reports from whistleblowers.
Even if people weren’t at legal risk for reporting concerns about AI systems, in practice there’s no obvious body to report them to. Work here could include designing one, setting it up, or campaigning for one to exist.
In the UK, there’s no prescribed person or body that fits well with AI harms. The closest available option might be the Competition and Markets Authority (CMA), or maybe The Secretary of State for Business and Trade.
If you didn’t care about reporting it legally,[19] the AI Safety Institute (AISI) or their parent Department for Science, Innovation & Technology (DSIT) might be reasonable. But neither has public contact details for this, operational teams to deal with such reports, nor investigatory or compliance powers. The Office for Product Safety & Standards (OPSS) might also be relevant, although they don’t generally accept whistleblowing reports, and primarily focus on the safety of physical consumer goods (although they have recently worked on the cybersecurity of IoT devices).
Also see the section on regulator design.
Introductory resources: ???[9:10]
This is the end of the list of practices
Closing notes
Related articles
The closest existing work is probably either:
This round-up is also partially inspired by seeing how useful articles in the AI alignment space have been in getting new people up to speed quickly, including:
It also expands upon prior work summarising high-level activities in the AI governance space, including:
Acknowledgements
This was done as a project for the AI Safety Fundamentals governance course (I took the course as a BlueDot Impact employee to review the quality of our courses and get a better international picture of AI governance). Thanks to my colleagues for running a great course, my facilitator James Nicholas Bryant, and my cohort for our engaging discussions.
Additionally, I’d like to thank the following for providing feedback on early drafts of this write-up:
Other notes
An early version of this article listed organisations and people working on each of these areas. It was very difficult to find people for some areas, and many people objected to being named. They would also likely go out of date quickly. In the end, I decided to just not list anyone.
Reusing this article
All the content in this article may be used under a CC-BY licence.
Citation
You can cite this article with:
Ultimately, the aim is to reduce expected AI risk.
AI governance is likely to be a key component of this, and requires relevant organisations (such as governments, regulators, AI companies) to implement policies that mitigate risks involved in the development and deployment of AI systems. However, several connections I have in AI governance have expressed frustration at the lack of research done on available practices.
High-quality policy research would therefore benefit AI governance. However, there’s a lot of noise in this space and it’s common for new contributors to start down an unproductive path. I also know many people on the AI Safety Fundamentals courses (that I help run) are keen to contribute here but often don’t know where to start.
I expect by writing and distributing this piece, people new to the field will:
For example, all LLMs use the transformer architecture - and most progress in AI has just been chucking more compute and data at these models. This architecture was made public in a 2017 paper by Google called “Attention Is All You Need”. ↩︎
A common confusion is the difference between the roles in the semiconductor supply chain.
A house building analogy I’ve found useful for explaining this to people: ASML act as heavy equipment suppliers, who sell tools like cranes, diggers or tunnel boring machines. TSMC act as builders, who use ASML’s tools to build the house following NVIDIA’s plans. NVIDIA act like architects, who design the floor plans for the house.
And specifically that translates to: ASML build the machines that can etch designs into silicon (known as EUV photolithography). TSMC sets up factories (also known as fabrication plants, fabs or foundries), buys and installs the machines from ASML (among other equipment), and sources raw materials. NVIDIA create the blueprints for what should be etched into the silicon. They then commission TSMC to manufacture chips on their behalf with their designs.
The above is a good initial mental model. However, in practice it’s slightly more complex: for example NVIDIA is also involved with some parts of the manufacturing process, and NVIDIA rely on specialised tools such as EDA software (similar to how architects use CAD tools). ↩︎
The definition I’ve used here is one I see commonly used, but there are multiple competing definitions of evals. Other definitions are broader about assessing models. I’ve used this narrow definition and split out other things that people sometimes call evals but also give other names, such as red-teaming or human uplift experiments.
Regarding other definitions:
Most red-teaming at the moment assumes unaided humans trying to find vulnerabilities in models.
I expect in the future we’ll want to know how humans assisted by tools (including AI tools) are able to find vulnerabilities. This overlaps with topics such as adversarial attacks on machine learning systems.
It’s currently unclear though how the definition will change or what terms will be used to refer to versions of red-teaming that also cover these issues. ↩︎
Also known as countering/combating the financing of terrorism (CTF). ↩︎
This set of activities is also often referred to as KYC (know your customer, or know your client).
Some people argue about what is considered KYC vs CDD, and reasonable people use slightly different definitions. Most people use them interchangeably, and the second most common use I’ve seen is that KYC is CDD except for the risk assessment step. ↩︎
Combined with identification, this is often known as ID&V. In the US, this is sometimes referred to as a Customer Identification Program (CIP). ↩︎
I could not find good introductory articles on this topic that focused on governing advanced AI systems.
I'm open to feedback and recommendations if you know of any. Some minimum requirements:
This focuses on a human-in-the-loop in the context of security-critical decisions, but the framework appears to work equally well for significant AI decisions. ↩︎
Anyone could sign a claim, from ‘OpenAI’ to ‘random person down the street using an open-weights model’.
Which of these claims should be trusted is an open question. However, we have solved these kinds of problems before, for example with SSL certificates (another cryptographic scheme, used so browsers can encrypt traffic to websites) we designated a few bodies (known as root certificate authorities) as able to issue certificates & delegate this power to others. Then tools like web browsers have these root certificates pre-installed and marked as trusted. ↩︎
Also known as ‘at runtime’ or ‘at inference time’, and as opposed to ‘during training’. ↩︎
Specifically, GDPR Article 13 paragraph 2(f). ↩︎
For what it’s worth, I’d prefer that we didn’t do this. But others disagree. ↩︎
Although it is possible to not release more powerful models that uplift people conducting cyberattacks further. ↩︎
Which is likely fewer than the total number of breaches, and certainly far fewer than the number of data protection law violations. ↩︎
We might also use new technology to help us with this, including AI itself. The approach of accelerating the development of these defensive technologies is known as defensive accelerationism, def/acc or d/acc.
NB: These terms are new and often mixed with other concepts - the d in d/acc can mean defensive, decentralised, differential or democratic - sometimes all at the same time. ↩︎
It’s arguable whether the other bases for protected disclosures could apply here, but it’d likely result in an expensive court battle (one that would almost certainly be in the interests of AI companies with billions of dollars of investment, and probably not for one individual who might have already been stripped of a lot of their income).
The potentially relevant bases are listed in section 1 of the Public Interest Disclosure Act, inserted as section 43B of the Employment Rights Act 1996. These include:
I am not a lawyer and if you’re planning to go to battle with a company with billions to spend I’d recommend getting better advice than a footnote in a blog post by someone who doesn’t know your specific situation. Protect might be a good starting point for those based in the UK. ↩︎
There is an exception in the Public Interest Disclosure Act, inserted into the Employment Rights Act as section 43G, which roughly allows you to report it to another body if it’s considered reasonable subject to a bunch of other conditions. ↩︎