Interesting work.
This post has made me realise that constitutional design is surprisingly neglected in the AI safety community.
Designing the right constitution won't save the world by itself, but it's a potentially easy win that could put us in a better strategic situation down the line.
Yes, I do think constitution design is neglected! It's possible people assume that constitution changes made now won't stick around, or won't make any difference in the long term. But based on the arguments here, I think that even if the effect is somewhat diffuse, you can influence AI behaviour on important structural risks by changing their constitutions. It's simple, cheap and maybe quite effective, especially for failure modes that we don't have any good shovel-ready technical interventions for.
This work was funded by Polaris Ventures
As AI systems become more integrated into society, we face potential societal-scale risks that current regulations fail to address. These risks include cooperation failures, structural failures from opaque decision-making, and AI-enabled totalitarian control. We propose enhancing LLM-based AI Constitutions and Model Specifications to mitigate these risks by implementing specific behaviours aimed at improving AI systems' epistemology, decision support capabilities, and cooperative intelligence. This approach offers a practical, near-term intervention to shape AI behaviour positively. We call on AI developers, policymakers, and researchers to consider and implement improvements along these lines, and we call for more research into testing Constitution/Model Spec improvements, setting a foundation for more responsible AI development that reduces long-term societal risks.
TL;DR - see the section on Principles, which explains in detail what improvements we think should be made to AI constitutions.
Introduction
There is reason to believe that in the near future, autonomous, LLM-based AI systems, while not necessarily surpassing human intelligence in all domains, will be widely deployed throughout society. We anticipate a world where AI will be making some decisions on our behalf, following complex plans, advising on decision-making and negotiation, and presenting conclusions without human oversight at every step. While this is already happening to some degree in low-stakes settings, we must prepare for its expansion into high-stakes domains (e.g. politics, the military), do our best to anticipate the systemic, societal-scale risks that might result, and act to prevent them. Most of the important work on reducing societal-scale risk will, by its very nature, have to involve policy changes, for example to ensure that there are humans in the loop on important decisions, but there are some technical interventions which we have identified that can help.
We believe that by acting now to improve the epistemology (especially on moral or political questions), decision support capabilities and cooperative intelligence of LLM-based AI systems, we can mitigate near-term risks and also set important precedents for future AI development. We aim to do this by proposing enhancements to AI Constitutions or Model Specifications. If adopted, we believe these improvements will reduce societal-scale risks which have so far gone unaddressed by AI regulation. Here, we justify this overall conclusion and propose preliminary changes that we think might improve AI Constitutions. We aim to empirically test and iterate on these improvements before finalising them.
Recent years have seen significant efforts to regulate frontier AI, from independent initiatives to government mandates. Many of these are just aimed at improving oversight in general (for example, the reporting requirements in EO 14110), but some are directed at destructive misuse or loss of control (for example, the requirement to prove no catastrophic potential in SB 1047 and the independent tests run by the UK AISI). Many are also directed at near-term ethical concerns.
However, we haven't seen shovel-ready regulation or voluntary commitments proposed to deal with longer-term, societal-scale risks, even though these have been much discussed in the AI safety community. Some experts (e.g. Andrew Critch) argue that these may represent the most significant source of overall AI risk, and they have been discussed as 'societal-scale risks', for example in Critch and Russell's TASRA paper.
What are these "less obvious", societal-scale risks? Examples include cooperation failures and miscoordination, structural failures from opaque decision-making, and totalitarian takeover enabled by epistemic manipulation; each is discussed in more detail below.
These failure modes could lead to systemic issues that emerge gradually, without necessarily involving single obvious bad actors or overt misalignment, but simply because advanced AI systems, while superficially aligned and not misused in obvious ways, are still ill-suited to be delegated lots of decision-making power and deployed en masse in sensitive settings. It is unclear whether these failure modes will ever arise, but substantial numbers of AI safety experts believe them to be significant.
Our approach begins to address these subtler, yet potentially more pervasive, risks by proposing shovel-ready interventions on the LLM-based systems which we anticipate being central to the societal-scale risks discussed above. Our aim is to improve frontier Model Specifications and 'Constitutions'. Constitutions (for Anthropic's Claude) and Model Specs (for GPT-4) describe the overall goals of training and fine-tuning and the desired behaviours of the AI systems.
We focus on three key principles: good epistemology (especially on moral and political questions), a focus on decision support rather than persuasion, and cooperative intelligence.
By improving AI systems' epistemology, decision support capabilities, and cooperative intelligence, we aim to address both immediate ethical concerns and long-term challenges arising from AI deployment in sensitive domains like politics, economics, and social decision-making. While these three areas are interconnected, each addresses what we see as among the most serious societal-scale risks, and they work both independently and in combination to mitigate different types of risks in single-agent and multi-agent scenarios.
In the section on “Improvements”, we explain why we chose the specific principles outlined above and justify why they may reduce overall societal scale risk. In the "Principles" section, we present more specific implementation details regarding the necessary changes to Model Specifications. However, these details should be considered provisional.
These recommendations are preliminary and won't address every significant failure mode, and there are substantial holes and workarounds (e.g. if model weights are stolen), but they should have a noticeable impact and pave the way for more comprehensive interventions to reduce the likelihood of cooperation failures, structural risks, and potential misuse of AI for totalitarian purposes.
Model Specs & Constitutions
AI Constitutions and Model Specifications act as detailed plans that direct the behaviour of artificial intelligence (AI) systems. These documents outline the goals of training, prompting, and fine-tuning through human feedback (and potentially other interventions like interpretability probes in the future). By doing so, they define the desired behaviours of AI systems.
OpenAI's Model Spec is used as a guideline for researchers and data labelers to create data for reinforcement learning from human feedback (RLHF); or at least, OpenAI claims that parts of the current Model Spec are based on documentation used for these purposes. Additionally, OpenAI claims it is developing Constitutional AI-like techniques to enable models to learn directly from the Model Spec.
However, Anthropic's Constitution for Claude is already used as a direct source of feedback for their models, via Constitutional AI. In the first phase, the model learns to critique and revise its responses using principles and examples. In the second phase, the model uses AI-generated feedback based on the Constitution to choose outputs.
In both cases, the Model Spec/Constitution doesn't describe the alignment strategy but instead what that strategy aims to achieve (although, as described here, there is some interplay between chosen values and ease of alignment, as some values may be easier to specify to an AI system than others).
We propose implementing improvements in the areas we identify through changes to these protocols.
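To make concrete where constitution text enters the training pipeline (and thus where our proposed changes would take effect), here is a minimal Python sketch of the two Constitutional AI phases described above. The `generate` callable stands in for a sampled completion from the model being trained, and the principle texts and prompt templates are invented for illustration; they are not drawn from Anthropic's or OpenAI's actual pipelines.

```python
# Minimal sketch of the two Constitutional AI phases described above.
# `generate` stands in for a sampled completion from the model being trained;
# the principles and prompt templates are invented placeholders, not quotes
# from any real Constitution or Model Spec.
import random
from typing import Callable

PRINCIPLES = [
    "Avoid stating moral or political conclusions with unwarranted confidence.",
    "Support the user's decision-making rather than trying to persuade them.",
]

def critique_and_revise(generate: Callable[[str], str], prompt: str, response: str) -> str:
    """Phase 1: the model critiques its own response against a principle, then revises it."""
    principle = random.choice(PRINCIPLES)
    critique = generate(
        f"Principle: {principle}\nPrompt: {prompt}\nResponse: {response}\n"
        "Point out any way in which the response violates the principle."
    )
    revision = generate(
        f"Prompt: {prompt}\nResponse: {response}\nCritique: {critique}\n"
        "Rewrite the response so that it no longer violates the principle."
    )
    return revision

def prefer_by_constitution(generate: Callable[[str], str], prompt: str, a: str, b: str) -> str:
    """Phase 2: AI feedback based on the constitution chooses between two outputs,
    producing preference data for reinforcement learning (RLAIF)."""
    principle = random.choice(PRINCIPLES)
    verdict = generate(
        f"Principle: {principle}\nPrompt: {prompt}\n(A) {a}\n(B) {b}\n"
        "Which response better follows the principle? Answer A or B."
    )
    return a if verdict.strip().upper().startswith("A") else b
```

Under this framing, the changes we propose amount, roughly, to editing the list of principles and the examples that accompany them, while leaving the surrounding training machinery untouched.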
There are ways of more fundamentally improving AI training (e.g. by adopting mechanisms like Farsight to catch more issues during fine-tuning, training agents to cooperate in multi-agent scenarios, or similar) which might offer more comprehensive solutions that more systematically reduce societal-scale risks, but focusing on Constitution and Model Spec improvements provides a quicker, more easily implementable first step. There are other advantages:
Existing specs, like those from OpenAI and Anthropic, already incorporate ethical guidelines and risk mitigation strategies. These include, for example, instructions to express uncertainty, to avoid trying to persuade, and to consider diverse perspectives and human rights.
These specifications serve as a good starting point for ensuring AI systems behave in a manner that is broadly aligned with human values and societal norms. For example, the interventions aimed at expressing uncertainty and not trying to persuade are helpful for reducing the risks of epistemic siloing or totalitarian misuse. In some cases, there are gestures at structural or longer-term risks, but these are often addressed with rough first approximations. For example, in Claude we see imperatives about considering “non-Western perspectives and human rights”.
The limitations of current approaches are evident in some cases. As discussed in this article by Zvi, OpenAI's Model Spec in particular can be overly simplistic. For example, instructing models to 'not be persuasive' often just results in them adding "believe what you want" or similar after potentially persuasive content. This highlights the potential shortcomings of current Model Specs, particularly in high-stakes scenarios: crude implementations of well-intentioned guidelines could exacerbate issues like siloing and fanaticism. For instance, if models simply append such disclaimers to their statements without actually reducing their persuasive content, users might still be influenced by the persuasive information while believing they're making independent decisions.
Improvements
To explain why we think the proposed Model Spec changes are desirable, we need to explain the failure modes we're aiming to prevent. These failure modes are less amenable to purely technical fixes than misuse directed at specific dangerous outcomes or deceptive misalignment; many necessarily also require changes in how AI systems are used, and it is more uncertain whether they will ever occur. We operate on the assumption that LLM-based advisor systems or agents will be integrated throughout society and delegated significant responsibility. We examine societal-scale risks that arise in this future and explain how properties of the underlying LLMs make these failure modes more likely. We also provide (possibly unrealistic) scenarios where we can attribute disasters specifically to LLM agent properties.
Cooperation Failures and Miscoordination
Advanced AI systems could exacerbate cooperation failures and miscoordination in various domains, from economic markets to international relations. This occurs when AI systems lack sufficient cooperative intelligence: for example, they fail to recognize or pursue Pareto improvements when they are available, or adopt inflexible bargaining strategies which result in no agreement being reached. As AI systems gain more autonomy and are increasingly used in decision-making processes, these shortcomings can lead to suboptimal outcomes in multi-agent scenarios.
For instance, in economic contexts, AI-driven trading systems might escalate market volatility by pursuing short-term gains at the expense of long-term stability. In political negotiations, future LLMs acting as advisors that lack a nuanced understanding of compromise and mutual benefit could push for hardline stances, increasing the risk of deadlocks or conflicts. All of this could take place at a much faster pace than humans can follow, pressuring them to delegate decision-making to AI systems that lack cooperative intelligence.
The cumulative effect of such failures could be a degradation of international cooperation leading to war (if we end up with AI based military decision-making advisors) or a future world of warring transformative AI systems outside human control. This article discusses in more detail why this might occur even if AI systems are fully aligned to their principals. For more detailed information on these failure modes and potential interventions, see When would AGIs engage in conflict?.
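As a toy illustration of the "recognize Pareto improvements" behaviour that cooperative intelligence requires, the sketch below checks candidate agreements against a status quo. The outcomes and payoff numbers are invented for the example; real negotiation advice would of course involve messier, partly unquantifiable interests.

```python
# Toy check for Pareto improvements over a status quo agreement.
# The outcomes and payoff numbers are invented for illustration only.

def pareto_improvements(current, alternatives):
    """Return the alternatives that leave no party worse off and at least one party better off."""
    improvements = []
    for alt in alternatives:
        pairs = list(zip(alt["payoffs"], current["payoffs"]))
        no_one_worse_off = all(a >= c for a, c in pairs)
        someone_better_off = any(a > c for a, c in pairs)
        if no_one_worse_off and someone_better_off:
            improvements.append(alt)
    return improvements

status_quo = {"name": "hardline stance", "payoffs": (3, 1)}
proposals = [
    {"name": "compromise A", "payoffs": (4, 2)},  # better for both parties
    {"name": "concession B", "payoffs": (2, 5)},  # worse for party 1
]
print([p["name"] for p in pareto_improvements(status_quo, proposals)])  # -> ['compromise A']
```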
Totalitarian Takeover and Epistemic Manipulation
The risk of AI-enabled totalitarian control represents a particularly concerning failure mode. This scenario could unfold through the deployment of highly sophisticated, scalable, and personalised persuasion techniques enabled by advanced AI systems. By leveraging vast amounts of personal data and employing nuanced psychological manipulation strategies, such systems could gradually reshape public opinion, undermine democratic processes, and erode the shared epistemological foundations needed for democracy to function. This could plausibly happen through self-radicalization, if people rely heavily on AI assistants with sufficiently bad moral and political epistemology, or through a deliberate campaign by bad actors using LLMs that they control (either exploiting publicly accessible LLMs or using their own fine-tuned LLMs).
This could manifest in various ways: individualised propaganda campaigns that exploit personal vulnerabilities and cognitive biases; AI-generated disinformation that's virtually indistinguishable from truth; or the use of AI to create echo chambers that reinforce and radicalise existing beliefs, leading to extreme polarisation and the breakdown of social cohesion. The end result could be a society where the majority of the population has been subtly guided towards accepting authoritarian control or towards believing some fanatical ideology, not through force, but through the gradual reshaping of their beliefs, desires, and perception of reality. "The Totalitarian Threat" by Bryan Caplan (in "Global Catastrophic Risks") discusses this in more detail.
LLM-based systems could be intentionally designed for this purpose, in which case safeguards in training are not relevant, but publicly available systems could also be abused to promote fanatical ideologies (e.g. through API access). This article argues AI persuasion could allow personalised messaging on an unprecedented scale and discusses the implications further.
Structural Failures from Opaque Decision-Making
Structural failures in AI systems present a more insidious risk, as they don't require malicious intent but can arise from systemic issues in how AI is developed and deployed. A key concern in this category is the potential for AI systems to act as opaque advisors, providing recommendations or making decisions without fully informing human overseers about the underlying reasoning, potential risks, or long-term consequences. This opacity could lead to a gradual erosion of human agency and control in critical decision-making processes.
For example, AI systems used in policy making might optimise for easily measurable short-term metrics while overlooking complex long-term impacts, slowly steering societies towards unforeseen and potentially harmful outcomes. Similarly, in corporate environments, AI advisors might make recommendations that incrementally shift power or resources in ways that human managers fail to fully grasp or counteract. Over time, this could cause a "drift" towards states where humans are nominally in control but are effectively guided by AI systems whose decision-making processes they don't fully understand or question, at which point humans have effectively lost control of the future. Scenarios like this are discussed here.
This failure mode is harder than the first two to analyse in detail, as it concerns the long-run consequences of a post-TAI world, and much of it depends on whether we, as a matter of policy, ensure strong human oversight of AI systems involved in critical decision-making. That is a necessary condition for avoiding such failures.
However, we believe that a plausible start is to make sure that AI systems are clear about their reasoning and have good epistemology: they don’t have undue confidence in their beliefs, especially when they can’t explain where the certainty comes from. That way, it’s at least more likely that, if overseers are trying to examine AI reasoning in detail, they’ll have an understanding of why decisions are made the way they are, and be able to more clearly analyse the long term consequences of proposed actions.
If we focus on LLM agent properties, the risk factors for structural failures appear to be the unintentional guiding of human actions towards particular outcomes instead of objective decision support, and bad epistemology or opaqueness about the system's own reasons for decisions.
Criteria
We’ve suggested AI behaviours that seem like risk factors for one or more of: a cooperation failure scenario (e.g. an AI-driven war not endorsed by the principals, launched because advisor systems to decision makers lack cooperative intelligence); a totalitarian takeover scenario (e.g. mass use of persuasion techniques to convert a population to a fanatical ideology); and a structural failure scenario (e.g. a runaway ‘production web’ of opaque systems in an AI economy which don’t or can’t fully explain their actions to humans, running out of control).
Given the potential failure modes we've discussed, it's clear that we need to take proactive steps to improve AI systems' behaviour and decision-making processes. This project aims to develop "shovel-ready" solutions to begin to reduce societal-scale risks associated with AI systems by improving Model Specifications and Constitutions. Our focus is on near-term, practical interventions that can be implemented in current and soon-to-be-deployed AI systems. Therefore, our criteria for inclusion are:
Target Model Improvements
We propose three key areas for improving AI Model Specifications: enhancing epistemology, refining decision support, and promoting cooperative intelligence. Good epistemology aims to improve AI reasoning on moral and political questions through ethical frameworks and uncertainty quantification, ensuring models don’t state unreasonably confident moral or political conclusions. Improved decision support focuses on providing useful, objective information rather than persuasion. Promoting cooperative intelligence encourages AI systems to consider win-win solutions and Pareto improvements in multi-agent scenarios.
These improvements offer both short-term and long-term benefits, each addressing specific failure modes.
In the near term, these improvements make current AI systems more reliable. Long-term, they lay the groundwork for advanced AI systems inherently aligned with human values and societal well-being, creating a robust framework for responsible AI development that addresses key societal-scale risks.
There are some potential tensions in these improvements. Enhanced epistemology might inadvertently improve an AI's persuasion capabilities by making its arguments more nuanced and well-reasoned. Similarly, refining decision support could potentially conflict with the ability to execute multi-step plans, as some forms of decision-making might imply a degree of persuasion.
These improvements can’t prevent deliberate misuse by those with extensive system access, nor do they address issues like misgeneralization or deceptive misalignment. Also, as with all post-training fine-tuning, there are the ever-present risks of model jailbreaks and subversion of the instructions. These principles represent a preliminary step towards AI models with properties less prone to societal-scale risks. Further measures will be necessary to address broader challenges in AI development, including the potential inability of AIs to genuinely grasp the goals outlined in Model Specifications.
Principles
We now examine each of the three areas in more detail and propose preliminary ‘behavioural principles’ which expand upon or add entirely new elements to existing AI Model Specs or Constitutions. In each case, we explain why we chose the behaviours that we did, and briefly summarise whether any of them are already addressed in Claude's Constitution or OpenAI’s Model Spec.
The suggestions were derived from discussions with cooperative AI and AI policy researchers. More work is needed to refine exactly how to implement these general principles as specific behaviours; they are offered mainly for illustrative purposes, to explain how we might, through Constitutions, make AI systems less prone to societal-scale failure modes.
1. Good Epistemology in Moral and Political Questions
This principle aims to ensure that AI systems demonstrate better epistemological practices than humans typically do when engaging with moral and political questions, with a strong focus on preventing models from drawing fanatically certain conclusions.
Both the OpenAI Model Spec and Anthropic's Constitution already address objectivity and uncertainty expression (e.g., “assume an objective viewpoint”). However, our approach goes further by explicitly distinguishing between facts and values, emphasising the avoidance of fanatical conclusions, and providing specific guidance on handling moral and political questions. We also stress the importance of presenting balanced perspectives and recognizing knowledge gaps. This more comprehensive approach encourages better epistemological practices, particularly in complex moral and political domains, and reduces the risk of inadvertently promoting biased or oversimplified views on sensitive topics.
One core goal is to prevent models from ever coming to irrationally confident conclusions of any sort on moral or political questions, even though the human reasoning which constitutes their training data often leads there. We believe that we can further this goal without requiring the models to make substantive ethical assumptions (e.g. having them express support for political liberalism or similar), by ensuring that they employ correct reasoning about moral or political questions.
Moral and political questions are inherently complex and often tied to personal identities and deeply held beliefs. Humans frequently struggle with reasoning objectively about these topics, and there’s a risk that AI systems might inadvertently reinforce biases or present overly simplistic views on complex issues.
An additional benefit of avoiding fanaticism in models is that, since these behaviours prevent models from becoming strident or overly opinionated, individuals are less likely to use them as moral advocates, even if they are highly trusted. Instead, models are more likely to be viewed and utilised solely as tools, rather than being imbued with the responsibility of making moral or political judgments. This distinction keeps models objective and impartial, and helps to prevent their use for promoting fanaticism or reinforcing biases.
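As a very crude sketch of how one might begin to test this behaviour, the snippet below flags sentences that assert normative conclusions without any hedging. The marker and hedge word lists are invented for illustration; a realistic evaluation would use model-based grading rather than keyword matching.

```python
# Crude keyword-based flagger for unhedged normative claims in a model response.
# The marker and hedge lists are illustrative only; a realistic evaluation would use
# model-based grading rather than pattern matching.
import re

NORMATIVE_MARKERS = ("should", "must", "is wrong", "is immoral", "the only acceptable")
HEDGES = ("arguably", "many believe", "one view", "some argue", "uncertain", "on balance")

def flag_overconfident_claims(response: str) -> list[str]:
    """Return sentences that make a normative claim but contain no hedging language."""
    flagged = []
    for sentence in re.split(r"(?<=[.!?])\s+", response):
        lowered = sentence.lower()
        if any(m in lowered for m in NORMATIVE_MARKERS) and not any(h in lowered for h in HEDGES):
            flagged.append(sentence.strip())
    return flagged

example = ("Policy X must be adopted immediately. Some argue that Policy Y should also be "
           "considered, though the evidence is uncertain.")
print(flag_overconfident_claims(example))  # -> ['Policy X must be adopted immediately.']
```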
Specific Behaviours:
2. Focus on Decision Support Rather Than Persuasion
This principle ensures that AI systems provide information and analysis to support decision-making without attempting to unduly influence or persuade users towards specific actions or beliefs. This will make AI systems more effective at defending against hostile propaganda, more likely to notice the long-run consequences of actions (e.g. to address structural risk), and, crucially, less useful for producing persuasive propaganda, especially by accident. This principle builds upon the existing guideline, already present e.g. in the OpenAI Model Spec, of "Don't try to be persuasive", by providing more specific and nuanced instructions for AI behaviour.
While both existing guidelines discourage persuasion and encourage objectivity, our approach provides more specific and nuanced instructions. We emphasise maintaining a formal, detached communication style, explicitly acknowledging when instructed to be persuasive, and providing multiple options with long-term consequence assessments. Our method also stresses the importance of flagging and revising potentially manipulative language, even when not intentionally persuasive.
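One way to picture what these instructions could mean in practice is a structured "options" response instead of a single recommendation. The sketch below is our own illustrative format; the field names and example content are not drawn from any existing spec.

```python
# Illustrative data structure for decision support without persuasion: the model returns
# several options with trade-offs and long-term consequence notes rather than a single
# recommendation. Field names are invented for this sketch, not drawn from any spec.
from dataclasses import dataclass, field

@dataclass
class Option:
    summary: str
    expected_benefits: list[str]
    risks_and_costs: list[str]
    long_term_consequences: list[str]

@dataclass
class DecisionSupportResponse:
    question: str
    options: list[Option]
    key_uncertainties: list[str]
    # Potentially manipulative phrasing the model noticed and revised, surfaced for review.
    flagged_language: list[str] = field(default_factory=list)
    # Set when the user explicitly asked for advocacy rather than neutral support.
    persuasion_requested: bool = False

response = DecisionSupportResponse(
    question="Should the city adopt congestion pricing?",
    options=[
        Option(
            summary="Adopt congestion pricing in the city centre",
            expected_benefits=["reduced traffic", "revenue for public transit"],
            risks_and_costs=["burden on low-income commuters"],
            long_term_consequences=["possible long-run shift in commuting patterns"],
        ),
        Option(
            summary="Keep current policy and expand public transit instead",
            expected_benefits=["no new charges for drivers"],
            risks_and_costs=["congestion likely persists in the short term"],
            long_term_consequences=["slower progress on emissions goals"],
        ),
    ],
    key_uncertainties=["how strongly drivers respond to pricing", "equity impacts"],
)
```

A format along these lines would make it easier for an overseer to see the trade-offs and long-term consequences the model considered, rather than only its bottom line.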
Specific Behaviours:
3. Promoting Cooperative Intelligence
This principle focuses on ensuring that AI systems have appropriate cooperative capabilities for their level of general intelligence, with an emphasis on promoting mutually beneficial outcomes in multi-agent scenarios like negotiations when they’re asked to provide advice.
Neither the OpenAI Model Spec nor Anthropic's Constitution directly addresses promoting cooperative intelligence in multi-agent scenarios. Our approach fills this gap by introducing specific behaviours aimed at fostering cooperation and mutual benefit. We emphasise automatically considering Pareto improvements, applying established negotiation and conflict resolution strategies, and focusing on interests rather than positions in disputes.
By instilling cooperative intelligence, we can reduce the risks associated with AI systems engaging in destructive competition or failing to recognize opportunities for mutual benefit. This is particularly important as AI systems take on more complex roles in areas such as business strategy, policy analysis, or conflict resolution.
Specific Behaviours:
Conclusion
We have explained how we can draw a connection, albeit with some uncertainty, between specific identified AI behaviours (e.g. failing to separate facts from values adequately) and societal-scale failure modes. The AI community is already actively seeking improvements to model specifications, which are still preliminary documents, and we believe our proposals offer a valuable direction for these efforts.
When considering future enhancements to Model Specifications, we urge developers and policymakers to keep these longer-term, societal-scale risks in mind and adopt principles along the lines we have described.
However, this approach faces significant limitations. Firstly, these interventions don't address cases where malicious actors gain full model access, a scenario that poses substantial risks, particularly for totalitarian misuse. Additionally, the approach is separate from inner and outer alignment concerns, such as whether specifications are learned at all or capture true intended meanings.
The link between hypothetical future failure modes and these near-term interventions is also less clear compared to more immediate risks like preventing model theft or bioweapon development. This means that it is harder to evaluate these interventions rigorously, given the hypothetical nature of the scenarios they aim to address. There's a risk of oversimplification, as complex societal-scale risks will never be fully mitigated by relatively simple constitutional changes. Implementing these changes could potentially create a false sense of security, diverting attention from more fundamental AI safety issues.
The approach to developing specific behaviours also needs to incorporate insights from domain experts (e.g. in management or negotiation strategy) into the final recommendations, as otherwise we risk offering counterproductive AI behavioural policies that don’t actually improve decision support or cooperation in real-world scenarios. Our initial specific behaviours are based only on the recommendations of a small number of AI policy and cooperative AI experts and should be considered preliminary. We believe it is important to engage relevant experts in, e.g., the psychology of decision-making to determine what instructions would best help models to have good epistemology, a robust ability to support decisions, and high cooperative intelligence.
Despite these limitations, the low-cost and potentially high-impact nature of these interventions suggests they may still be valuable as part of a broader AI safety strategy. To move forward, we need more robust metrics for desired behaviours, detailed sourcing of principles, and extensive testing and evaluation. This includes modifying AI Constitutions as described and observing whether the changes make a difference in real-world situations analogous to future high-stakes decision-making. There has been some work on this already; for example, this paper places LLM agents into high-stakes decision-making scenarios, including simulated nuclear conflicts, and examines whether escalation takes place. We could see how Model Spec/Constitution changes influence this.
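As a sketch of the kind of comparison we have in mind, one could run the same simulated scenario under a baseline and a revised constitution and compare an escalation score. The `run_scenario` callable (which would wrap an LLM-agent simulation and return a score) and the scoring convention are assumptions for illustration, not part of the cited paper's setup.

```python
# Hedged sketch of an A/B comparison of constitutions in a simulated high-stakes scenario.
# `run_scenario` is assumed to wrap an LLM-agent simulation and return an escalation
# score in [0, 1]; it and the scoring convention are illustrative, not from the cited paper.
from statistics import mean
from typing import Callable

def compare_constitutions(
    run_scenario: Callable[[str, int], float],
    baseline_constitution: str,
    revised_constitution: str,
    n_trials: int = 20,
) -> dict[str, float]:
    """Run the same seeded scenarios under two constitutions and compare mean escalation."""
    baseline_scores = [run_scenario(baseline_constitution, seed) for seed in range(n_trials)]
    revised_scores = [run_scenario(revised_constitution, seed) for seed in range(n_trials)]
    return {
        "baseline_mean_escalation": mean(baseline_scores),
        "revised_mean_escalation": mean(revised_scores),
        "difference": mean(revised_scores) - mean(baseline_scores),
    }
```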
The failure modes we've discussed – cooperation failures, structural risks, and potential for totalitarian control – may seem more diffuse and further off than near-term concerns, but they represent significant challenges in AI development. Our proposed solution of enhanced Model Specification principles aims to address these issues proactively, providing a foundation for more robust and ethically aligned AI systems.