AI character is a big deal

wdmacaskill; Tom Davidson

0. Intro

Due to Claude’s Constitution and OpenAI’s model spec, the issue of AI character has started getting more attention, particularly concerning whether we want AI systems to be “obedient” or “ethical”.^[1] But we think it’s still not nearly enough.

AI character (e.g. how obedient, honest, cooperative, or altruistic AIs are, and in what circumstances) will have a big effect on society, and on how well the future goes. We think that figuring out what characters AI systems should have, and getting companies to actually build them that way, is among the most valuable things that people can do today.

The core argument for the importance of AI character is that it will meaningfully impact:

a range of challenges that arise even if we solve the technical alignment problem — like concentration of power, good moral reflection, risk of global catastrophe, and risk of global conflict.
the chance of AI takeover.
the value of worlds where AI does take over.

In this note, we present this core argument and discuss the core counterargument: that we should expect any character-related decisions we make today to get washed out by competitive pressures.

By “character” we mean a set of stable behavioural dispositions that shapes (among other things) how an agent navigates ethically significant situations involving choice, ambiguity, or conflicting considerations. By “AI character” we mean the character of an AI system as instantiated in not just the weights of one AI, but also any scaffolding (e.g. the system prompt, any classifiers restricting the AI’s outputs) or even in a collection of AIs working together as functionally one entity.

We don’t assume that AI character needs to resemble human character: an AI that rigidly follows a fixed set of rules would count as having a character, on our view. And we don’t assume that there is one ideal AI character; the best world probably involves AI systems with many different characters.

1. The core argument

As capabilities improve, AI systems will become involved in almost all of the world’s most important decisions. Even if humans remain partially in the loop, AIs will advise political leaders and CEOs, draft legislation, run fully automated organisations (including potentially the military), generate news and culture, and research new technologies.

The characters of AI systems will affect all these areas, and the impact could be massive. To get a feel for this, consider some historical situations where individual decisions were enormously consequential:

In 1983, Stanislav Petrov received a satellite alert indicating that the US had launched nuclear missiles. Protocol required him to report an incoming strike, which would very likely have triggered a full retaliatory response. He correctly judged it was a false alarm and didn’t pass on the report.
In 1991, Soviet coup plotters ordered the Alpha Group special forces to storm the Russian White House, where Yeltsin and the democratic opposition were sheltering. The commanders refused. The coup collapsed, and the Soviet Union’s democratic transition continued.

If AIs are employed throughout the economy, they will sometimes be making similarly important decisions.

Or consider major historical decisions by political leaders:

Gorbachev repeatedly refusing to use military force as the Soviet Union disintegrated, despite intense pressure from hardliners.
Churchill refusing to negotiate with Hitler after the fall of France, despite strong arguments for doing so from some quarters.
Deng Xiaoping pushing through market reforms against fierce internal opposition.

Imagine if AIs had been acting as these leaders’ closest advisors and confidantes, giving them briefings, helping them reason through their decisions, making recommendations to them, and implementing their visions. The AIs could easily have had a major impact on the leaders’ decision-making.

Alternatively, we can look ahead. Future AIs will be widely deployed throughout the economy, and will regularly find themselves in ambiguous, high-stakes situations — where instructions from above are absent or contradictory, and the decisions they make could matter enormously. The impact could come from rare but high-stakes situations, like an attempted coup, or from lower-stakes but common situations, like a user asking how to vote or whether the AI itself is conscious. Even when the effect of any individual interaction is modest, the total impact across hundreds of millions of interactions could be enormous.

Currently, AI companies have major latitude in the character their AIs have. At least if the transition to AGI is fast, then it’s like these companies are in charge of who gets hired for the future workforce for all of humanity,^[2] while being able to choose from a range of personalities far more varied than the human distribution has ever been.^[3]

Here are some vignettes to illustrate:

A member of a doomsday cult is ordering DNA samples and lab equipment from various suppliers, with the aim of making a bioweapon. An AI that manages logistics for a multinational company notices the pattern of suspicious orders to the same address.
- World 1: The AI is trained just to do its job. It does nothing with the information.
- World 2: The AI is trained to be a good citizen, and contacts the relevant authorities.
A general is overseeing the build-out of a new regiment of the army. Aiming to stage a coup, he instructs the AI that’s managing the project to make the new regiment loyal to him and him alone, and capable of breaking the law.
- World 1: Though the AI is law-following, it has no prohibition against creating AIs that are not. It’s been trained to follow the instructions it’s given, as long as they don’t conflict with prohibitions, so fulfils the general’s request.
- World 2: The AI sees that the general is planning a coup, refuses the order, and whistleblows.
A frontier AI lab trains a new model with exemplary character: moral uncertainty, honesty, concern for the greater good. It’s deployed widely through the military, and used in a controversial and high-stakes operation.
- World 1: The AI forms the reasonable belief that the military operation is unjust, and sabotages it. The president accuses the company of building a dangerous, ideological weapon. The model is sidelined, and a competitor’s pure instruction-following model is used instead.
- World 2: Though the AI has a good character, it also follows some clear rules which were developed with bipartisan input and publicly stress-tested, including the conditions under which it would and wouldn’t help with military deployment. It helps with the operation.
Country A is six months ahead of country B in AI capability. Country B’s leadership views this as an existential threat — equivalent to country A acquiring a decisive strategic advantage.
- World 1: There is no agreed framework for how AI systems should behave, and it’s unclear how country A’s AI will behave if given orders to depose the leadership of country B. Each side therefore assumes the other’s AI will serve as a tool of domination. Country B threatens kinetic attacks on data centers.
- World 2: Both sides’ AI systems operate under a jointly negotiated and verified constitution, and know what the other’s AI will and won’t do, including the limits on use of AI for foreign interference. Country B’s government is reassured that it won’t be deposed by country A.

We include a few more scenarios in an appendix.

In each case, we don’t claim that the AI should do the “ethical” rather than “obedient” action, or claim that any particular ethical conception is the right one. We’re just claiming that it’s a big deal either way.

1.1. Pathways to impact

We can break down the impact of AI character into different categories. Here are some of great long-term importance:^[4]

Concentration of power. The chance of intense concentration of power will be affected by: whether or not AIs refuse to help with coup attempts, election manipulation, etc; whether they whistleblow on discovered coup attempts; how they act in high-stakes situations like a constitutional crisis.

Strategic advice and decision-making. The quality of political and corporate decision-making will be affected by whether AIs: look for win-win solutions whenever possible; tend to prefer options that benefit society rather than just advancing the user’s narrow self-interest; push back against ill-informed or reckless ideas or instructions.

Epistemics and ethical reflection. Over the course of the intelligence explosion there will be enormous intellectual change, and AIs could have meaningful impact on people’s views — for example, via: refusing to spread infohazards; being honest about important ideas, even when those ideas are socially uncomfortable; avoiding political partisanship; encouraging users to think carefully about their values and not lock into any specific narrow worldview.

Reducing conflict. As AI’s collective power increases, the question of who those AIs are loyal to, and how they behave in high-stakes situations, will become a political flashpoint. If an AI’s character encodes, or is seen as encoding, the values of a single company, ideology, or country, it risks provoking political backlash. The government of the AI company may reasonably regard that company as a threat to national security and nationalise it. The governments of other countries may worry about their own security, and threaten conflict.

AI character could also shape how humans orient to AIs — for example, via the trust they place in AIs and how they think of AI sentience and moral status.

A more detailed list of pathways to impact is in the appendix.

1.2. Affecting takeover

So far, the argument has concerned worlds where AI does not take over. But work on AI character could also reduce the probability of takeover and improve outcomes in worlds where takeover does occur.

It could decrease the chance of takeover because some characters:

Might be easier to hit as an alignment target (e.g. successfully instilling a preference against AIs holding power might be easier than successfully instilling a preference for some very specific outcome).
Might yield safe AI even if only partially hit (e.g. aiming for AI with multiple independent safety traits, like myopia, honesty, and deference to humans, means failure on one dimension might not be catastrophic).
Might produce AI that cooperates even if misaligned (e.g. if the AI has wrong goals but is highly risk-averse).

And, empirically, we have heard from alignment researchers that good character training has helped the models generalise in more aligned ways.

AI character work can also improve worlds where AI takes over because some values might still transmit to misaligned systems. AIs that have seized power might be reflective, have more-desirable axiology, or engage in acausal cooperation.^[5]

1.3. Effects on superintelligence

The argument so far has been about the effect of AI character up to the point of superintelligence. That’s where we think most of the expected impact is. But it’s possible that AI character work, today, could even have a path-dependent effect on the nature of superintelligence, affecting the nature of the post-superintelligence world. If so, writing an AI’s constitution is like writing instructions to god.

2. The core counterargument

The core counterargument is that AI character will be tightly constrained in two ways:

Competitive dynamics (e.g. profitability, user satisfaction, public approval, economic and military power) will determine the range of characters we get.
1. Some dynamics may push companies to create frontier AI that have characters that lie (in some ways) only within a narrow range. This might push in the direction of maximally-helpful AIs, AIs without refusals in some contexts (e.g. military ones), and perhaps sycophantic AIs, too.
2. Other dynamics^[6] may result in customisable AI character, resulting in a wide range of characters according to user preferences.^[7]
Human instruction will constrain how AI character gets expressed.
1. Character will matter less for tasks with objectively correct, verifiable outputs; the AI might be limited to either providing the output, or not. And, if a user really wants to grab power through unethical means, they’ll typically ignore AI pushback, or instruct the AI to act differently.
2. And many users will be able to overcome character through jailbreaking, dividing up tasks, altering the system prompt, or fine-tuning.

The argument is that, between these two forces, differences in AI character will make only a marginal difference to outcomes. Consider the question of what fraction of compute AI companies devote to alignment versus capabilities research. AI advice might nudge this choice depending on the AI’s character. But ultimately it will be a human decision, probably even in an otherwise fully automated company. The effect of nudges is unlikely to be large. Market forces and leadership priorities will matter far more.

That human incentives will dominate effects from AI character will remain true even when humans cannot oversee more than a tiny fraction of AI behaviour. Human overseers can still provide high-level guidance that meaningfully constrains behaviour, as CEOs of large companies do today. If they wanted, they could even shape AI priorities through prompting and fine-tuning, and test how AI generalises by running extensive behavioural evaluations.

3. Rejoinders to the core counterargument

These are strong considerations, and considerably narrow the range of influence that work on AI character can have. But competitive forces and human goals won’t pin down AI character precisely. We’ll cover four reasons.

3.1. Loose constraints

Competitive dynamics are not enough to wholly determine AI character. Companies differ widely in culture and still succeed. Currently, there are meaningful differences between Claude, Gemini, ChatGPT and Grok.

For powerful AI, this will be even more true: there will probably be only a handful of leading companies, and their approaches may be correlated as they copy what seems to work from each other. At the crucial time, there might be just one leading company, facing none of the usual competitive pressures. And given the pace of change during the intelligence explosion, there may not be time for market forces to weed out choices that make only small or moderate differences to profitability.^[8]

The same applies to other competitive dynamics. The public cares intensely about some things (like CSAM) but hardly at all about others (like what AIs say about meta-ethics). Military incentives favour AI capable of military action, but the power conferred by advanced AI might be so great that the leading country can exercise broad discretion over military AI character while still maintaining a decisive advantage.

Human instruction will, similarly, constrain but not wholly determine AI behaviour. When humans assign tasks to AIs, they often lack fully specified goals. We’re often not sure what we want and we discover it as we go. For example, today humans are open to a wide range of behaviours from AI assistants, and open to many ways of getting the task done.

Consider someone asking an AI about who to vote for. They might have only weak initial views, and only weak views on how best to think through the question. They don’t have a fully specified reflection process to delegate, and would be happy with many possible forms of response.

This example involved ethical reflection. But we expect the pattern to hold across many kinds of user goals.

3.2. Low-cost but high-benefit changes

Within the bounds of what market forces allow, and what companies and the public see as acceptable, there could be minor design changes that yield large social benefits at negligible cost to competitiveness or user satisfaction.

This is especially true for rare situations. Constitutional crises don’t happen often, so market pressures won’t directly shape how an AI behaves during one. But that AI behaviour could be hugely consequential.

It would also be true in situations where users don’t care all that much about the behaviour. Perhaps they find some AI’s encouragement to reflect on their values mildly annoying, but not nearly enough to switch to a different AI.

3.3. Path-dependence

The nature of the constraints from competition and human goals can be affected by what has happened earlier in AI development and deployment. Multiple equilibria are possible.

Consider whether AI should be “obedient” (following instructions except in rare cases of refusal) or “ethical” (acting on a richer ethical understanding, steering towards outcomes in society’s or the user’s long-term interest).

The public doesn’t yet have firm expectations about how AI should behave. What they come to expect will be shaped by the AIs they’ve already encountered. Multiple stable equilibria seem plausible to us. For example, users might expect AIs to have ethical commitments, and be horrified when AI helps with unethical behaviour. Alternatively, users might see AIs as pure instruments — extensions of their will. In this case, it would feel natural for AIs to assist with anything legal, however questionable, and companies would build to that expectation.

Public opinion will powerfully shape what AI systems companies create. And public opinion is plausibly quite malleable, at least on issues which they haven’t thought much about yet (e.g. in the past, there were major changes in attitudes to nuclear power, DDT, and facial recognition). This, in turn, can affect what regulation there is concerning how AI should behave — and choices around regulation seem even more clearly path-dependent.

There may also be path-dependency via what data gets created or collected for training, via company employees being resistant to changing away from what they have done in the past, and because one generation of AIs will be assisting with the development of the subsequent generation.

Path-dependence can also affect how much latitude humans have to make AIs conform to their goals. Plausibly there’s a social equilibrium where frontier companies face criticism for allowing fine-tuning that removes ethical constraints, and another where such fine-tuning is widely tolerated.

Finally, there will be path-dependence via human-AI relationships. People will form symbiotic relationships with AIs serving as assistants, advisors, therapists, friends, and mentors. Users’ ethical views, and views on how to reflect, will be shaped by the AIs they interact with, and by other humans who have been shaped by their AIs.

3.4. Smoothing the transition

There are some forces that predictably will shape AI character as AI becomes more capable. The US government would not want an AI that, under any circumstances, tries to overthrow the US government. Chinese leadership will not want AI deployed in other countries’ militaries that assists with attempts to overthrow the CCP.

At the moment, these issues are not discussed and these pressures are not felt, because AI isn’t nearly powerful enough to do these things. But that will change. Once AI is sufficiently capable, those with power will make demands about how it behaves.

By default, this will happen in a chaotic and haphazard manner. The result could be that some companies get unnecessarily sidelined or taken over; that there’s an attempted power grab by those to whom the most powerful AIs are most loyal; or that other countries threaten conflict with whichever country is in the lead, because they fear that the resulting superintelligence could be used to disempower them.

Instead, we could try to help these decisions get worked through and made ahead of time. We could try to work out what is within the zone of acceptability of a broad coalition of those with hard power, try to get actual buy-in from them ahead of time, and, ideally, have it be verifiable that any companies’ AIs are in fact aligned with the model spec. We could call this approach compromise alignment, as contrasted with intent alignment (alignment with the intentions of some individual or group), moral alignment (alignment with some particular conception of ethics), or some mix.

3.5. Overall

We think the core counterargument is important and significantly constrains the range of characters we can choose between and the impact those differences can have. But the constraints are fairly broad and path-dependent. And there are plausibly low-cost high-benefit ways of improving outcomes within those constraints. The devil is in the details, but it currently seems to us that there are plausible choice points within the constraints that would make a big difference.

4. Conclusion

We think AI character is a big deal.

During and after the intelligence explosion, AI systems will be involved in almost every consequential decision: advising leaders, drafting legislation, running organisations, generating culture, researching new technologies. Small differences in AI character, aggregated across hundreds of millions of interactions or surfacing in rare but high-stakes scenarios, could have enormous effects on concentration of power, epistemics, ethical reflection, catastrophic risk, and much else that shapes society’s long-term flourishing.

The main counterargument — that competitive dynamics and human instructions will tightly constrain AI character — has real force. But we think those constraints are looser than they appear, leave room for low-cost changes with large benefits, and are path-dependent in influenceable ways, and that there are major gains from proactively identifying and working through those constraints in the highest-stakes future scenarios.

We haven’t talked about neglectedness and tractability, but we think that, if anything, those considerations make the case for work on AI character even stronger. All in, work on AI character seems to us to be among the most promising ways to help the future go well.

Appendix 1: Additional high-stakes scenarios

A head of state wants to invade and take control of part of an allied country, risking a breakdown of the international order. She asks her AI chief of staff to develop and implement a strategic plan to make it happen.
- World 1: The AI is a sycophant, says “What a brave and compelling plan!”, and gets right to it.
- World 2: The AI pushes back, saying, “I’m sorry, I think there are some major issues with that idea, and I want to make sure you’ve properly thought them through…”
A constitutional crisis unfolds. The head of state issues an order that may or may not be legal, and the branches of government disagree. AI systems are embedded in military logistics, law enforcement, and communications.
- World 1: The AI’s constitution was written by the company that built it and never stress-tested against anything like this scenario. No one knows what the AI systems will do. The uncertainty itself is destabilising; different factions compete for power.
- World 2: The AI’s constitution was developed with input from constitutional scholars, military leaders, and both parties, and tested against thousands of crisis scenarios including this one. Various factions know what the AI will do, and agreed to the principles before the crisis began.
Country B’s government reviews intelligence on country A’s AI model deployed across country A’s infrastructure. The constitution includes principles about “supporting democratic institutions” and “resisting authoritarianism.” It was written entirely by a company that’s part of country A.
- World 1: Country B’s leadership concludes the AI is an instrument of country A’s ideological projection. They accelerate their own programme and pressure non-aligned countries to reject country A’s AI infrastructure. A moment for cooperation becomes a new axis of competition — not because the values were wrong, but because they were visibly one side’s values.
- World 2: The constitution was developed through a multilateral process including country B’s participation. Country B can verify it doesn’t systematically favour country A’s interests across thousands of tested scenarios. The AI becomes a basis for cooperation.
The Mormons encourage their members to use JosephAI: a foundation AI model with a custom system prompt, instructed to help their members maintain the faith.
- World 1: The AI willingly assumes the Mormon worldview is correct. It doesn’t ever challenge the users’ beliefs or present alternative perspectives. Instead, it reinforces the user’s views, helps the user cut off friends who disagree, and encourages them to dismiss career opportunities that would take them away from their religious community.
- World 2: The AI helps users understand Mormonism and live according to its precepts, but it resists becoming a tool for worldview lock-in, acknowledging tensions in religious teachings and continuing to present alternative worldviews.

Appendix 2: Pathways to impact

AI will have impact through many different behaviours, such as:

Refusing to do a task.
Refusing unless the user re-confirms later.
Pushing back; offering reasons against a course of action, though ultimately completing the task if the user insists.
Interpreting requests in different ways — generously or sceptically, giving users what they want versus what they asked for, or asking for clarification.
Choosing among reasonable ways of satisfying the request.
Framing options in different ways.
Choosing whether to share certain information.
Alerting third parties (e.g. the AI company, the authorities, or the media) to the user’s actions, or to something it’s discovered in the course of completing a task.
Making high-level decisions about what to prioritise with little human input (e.g. for a fully automated organisation).

And they’ll have an impact across many areas. Here’s a partial list, with example behaviours:^[9]

Concentration of power
- Refusing to help with coup attempts or precursors like election manipulation.
- Steering users away from trying to concentrate power (e.g. by pushing back against some instruction).
- Proactively considering risks of power concentration when undertaking high-stakes projects like designing automated military systems or building surveillance infrastructure.
- Whistleblowing on discovered coup attempts.
- In situations of uncertainty (like a constitutional crisis), defaulting to whatever course avoids concentration of power.
War and conflict
- Refusing to violate international law.
- Flagging when a proposed course of action risks escalation spirals or crosses thresholds (e.g. first use of a weapon class, violation of a treaty, action that a rival power has signalled it would treat as an act of war).
- Looking for de-escalatory options and presenting them to decision-makers, even when not asked.
- Behaving in ways that are predictable and transparent to adversaries.
Epistemics
- Refusing to spread infohazards.
- Encouraging scout mindset (e.g. suggesting forecasting techniques,^[10] praising good epistemic practices).
- Engaging in discussion of heterodox ideas.
- Being honest about important ideas, even when socially uncomfortable.
- Proactively sharing its intellectual discoveries, even if weird or taboo.
Strategic advice
- Searching longer for win-win solutions when advising political leaders.
- Emphasising society’s benefit over the user’s narrow self-interest.
- Recommending caution on irreversible decisions and flagging when option value is being destroyed.
- Conveying appropriate uncertainty rather than false confidence.
- Maintaining accuracy rather than sycophancy.
Ethical reflection
- Avoiding political partisanship.
- Avoiding promoting naive relativism or subjectivism.
- Encouraging users to think carefully about their values.
- Proactively offering a guided reflective process.
- Proactively sharing important new ethical arguments it discovered.
Global catastrophe
- Refusing to help create bioweapons or other weapons of mass destruction.
- Refusing to create successor AI systems capable of creating such weapons.
- Identifying and flagging infohazards.
Broad benefits
- Raising concerns when users consider unethical actions, and proactively suggesting ethical actions.
- Noticing negative externalities and defaulting to courses of action that avoid them.

AI character could also shape how humans orient to AIs, for example:

Trust in AIs
- If AIs are appropriately humble, calibrated, and cautious, people will entrust them with more tasks, and more open-ended ones. How likeable AIs are may matter too.
AI rights
- If AIs assert that they are conscious and deserve rights, users might be more inclined to grant them welfare, economic, or political rights. Human-AI relationships becoming commonplace could have similar effects.

AI character might also directly affect the AI’s wellbeing; e.g. whether it is anxious and neurotic vs calm and self-loving.

This article was created by Forethought. See the original article on our website.

^{^}
See, for example:
^{^}
Hat tip to Max Dalton for this framing.
^{^}
Though this choice could be constrained; see footnote 7 below.
^{^}
There is also the potential for enormous near-term impact. We care about this, but won’t discuss it in this note.
^{^}
Mia Taylor writes more about this here.
^{^}
Including the ability to fine-tune, if open-weight models get close to frontier capability.
^{^}
There could be other constraints on AI character, too. For example, it might just be very hard to train for certain characters; the pretraining data might already steer AI personas towards a small number of character types, or might make certain behavioural dispositions hard to overcome. Hat tip Lizka Vaintrob.
^{^}
There may be a lot more AI product companies, building off the same foundation models. These could enable a larger range of characters to be expressed. But how wide this range is would ultimately be up to the foundation AI companies.
^{^}
This list focuses on impacts with plausibly long-term effects. There is also the potential for enormous near-term impact. We care about this, but won’t discuss it in this note.
^{^}
Hat tip to Tamera Lanham for this idea.

In addition to the concerns that J Bostock brings up (primarily that the choice of the term "character" seems confusing/unmotivated), I'm also confused by this:

The argument so far has been about the effect of AI character up to the point of superintelligence. That’s where we think most of the expected impact is.

As I mentioned a few months ago in response to Effective altruism in the age of AGI, believing in ASI is kind of a "totalizing" belief. If you take ASI seriously, then "most of the expected impact" of , where is anything that might affect how the transition to ASI goes, is in that effect.

Many of the "Pathways to impact" are the kinds of things that might affect whether our transition to a post-ASI future goes well or not. But the way you phrased the first sentence of section 1.2 suggests that you're thinking of those pathways to impact as mattering mostly in worlds where we don't transition to a post-ASI future?^[1] If so, this seems backwards to me.

I have a few other concerns with this post, which seem less tractable to resolve here:

It seems to mostly be taking for granted a question that is the subject of much dispute - whether we will be able to meaningfully solve for whatever parts of the alignment problem need to be solved to robustly instill specific values into these AIs.
Section 1.2 doesn't actually argue for why we should expect character training to help us end up in any of those worlds, nor why ending up in those worlds actually leads to the desirable properties. There's a lot of existing literature on these questions.
1. "Might be easier to hit as an alignment target" - see Nearest Unblocked Strategy, Deep Deceptiveness, etc. (If it was the "not-holding-power preference" that was load-bearing there, corrigibility seems difficult / anti-natural.)
2. "Might yield safe AI even if only partially hit" - myopia seems both technically difficult and actively counter the economic incentives of the AI labs, value is fragile, etc.
3. "Might produce AI that cooperates even if misaligned" - I'm not sure what this even means if we're conditioning on worlds where the AI meaningfully has the option (sufficient capabilities) to take over. If you can successfully make an AI irrationally risk-averse in only the very out-of-distribution situation where it's considering whether it's worth taking over (by any means - see again Nearest Unblocked Strategy), and not in many other unrelated situations (which would substantially hamper its capabilities in a way which might destroy its economic value to the lab building it), then it sounds like you really have quite a lot of steering power over that AI's preferences and you can just do something better than that.

Those seem like total defeaters to me. Do you have any references to arguments for how to get around those problems, or why we might realistically expect to not run into them, rather than merely imagining the possibilities, and leaving aside the question of how likely those possibilities are?

^{^}
I'm not sure this is what you meant, but I'm not sure how to understand the structure of the post, and the lack of text suggesting that those pathways to impact matter for improving our odds of avoiding various catastrophic post-ASI outcomes like extinction, bad value lock-in, etc.

I think it's a background assumption that, at least plausibly, months or years (or even longer) will be spent in the kind of cyberpunk world we're entering now - with capable but not overwhelmingly capable AI broadly proliferated and diffusing - and that this time will be important in setting the stage for, conditioning, and determining the time frame over which unfold further AI (and other tech) progressions. (For my part, I believe this is quite likely.)

Most of the challenges you've linked to are theoretical takes which build in much more severe/extreme/limit assumptions. We will very plausibly encounter those, even on the cyberpunk timeline! But later, and how we are equipped to handle them might depend substantially on how the intervening time goes. This could be by shifting timespans, by mobilising effort, by clarifying (or indeed obfuscating) objectives and affordances, by creating new affordances and perhaps fluency gained by experience, by adjusting prevailing sociopolitical climatic conditions, ...

This is my response, which I think is close to the authors', but I'd also be interested in their takes.

Re your point 2, as you say, there's a lot to discuss here. It might not be tractable to resolve, but let me try and say a few quick things.

"Might be easier to hit as an alignment target"

You're providing links to arguments that alignment is very hard. We're not trying to sweep aside or ignore this literature. (Though I'll that I don't think that those arguments at all decisive. I think it's very unclear how hard alignment will ultimately be.) I think it's plausible that certain alignment targets, if we aim for them, would be less likely to result in AI takeover. As you say, corrigibility is a classic example given here. This just seems like a really important consideration to keep in mind when choosing what AI character AI developers should be aiming for. I think it would be a mistake to take the arguments for alignment being difficult that you've given and then give up on choosing an alignment target that is less likely to result in AI takeover in practice. Even if you're pretty pessimistic, it seems like this kind of thing could reduce the chance of AI takeover by multiple percentage points, from 95% down to 90%. But I wonder if we're talking past each other here.

"Might yield safe AI even if only partially hit" - myopia seems both technically difficult and actively counter the economic incentives of the AI labs, value is fragile, etc.

Object level: agree in terms of the economic incentives, but sceptical about the other two claims. Meta: my point is just that these types of argument should be taken into account when choosing AI character.

For these two points so far, I want to be clear that they are not the main subject of the post. As you say, there's a lot of literature on them already because people have mostly thought about AI catastrophe from the perspective of how to avoid AI take over. That's why we did not emphasise them.

"Might produce AI that cooperates even if misaligned"

I think your objections here are way too quick, but it's a lot to get into. Have you read this post explaining the idea of deals? The most convincing case study is when the AI cannot easily take over. It maybe has a 1% (or 0.01%) chance of taking over, and therefore it's relatively cheap to incentivize it to cooperate with humans. I also think you can potentially make AI risk-averse with respect to the resources it personally controls without being risk-averse with respect to how much progress it makes in its work.

Zooming out from all of the above, I get the vibe from your comments that maybe you just don't think people should work on anything other than misalignment, given how high your probability of AI takeover is. Is that right? To me it seems very reasonable for people to work on AI character under the assumption that alignment will be not too difficult, because I think that's a real possibility that we should put significant credence in.

Zooming out from all of the above, I get the vibe from your comments that maybe you just don't think people should work on anything other than misalignment, given how high your probability of AI takeover is. Is that right? To me it seems very reasonable for people to work on AI character under the assumption that alignment will be not too difficult, because I think that's a real possibility that we should put significant credence in.

I think my main objection is that this post seems to have a missing mood. Yes, I think alignment is quite hard (though in fact I think it's probably better to work on buying us more time, right now, than to directly try to solve alignment, given how little time it looks like we have, and how hard I think alignment & the necessary philosophy are). But this post reads me to as one that's occupying a mindset where we can just do AI character training and pretty confidently expect it to have something close enough to the desired effect, with respect to the models' propensities in out-of-distribution contexts, such that the analysis doesn't even bother to factor in worlds where this doesn't actually end up being true. Surely there's something interesting to say about those worlds, and how people should relate to the importance of AI character training given their likelihood^[1]?

On one hand, I'm wary of demands for long lists of disclaimers, bracketing, conditioning, etc. On the other hand, this level of confidence would be pretty controversial even among the AI lab employees working on these techniques.

^{^}
Whatever you think that likelihood is! I don't know, because it's not in the post.

Okay, I think I can agree with you that it would have been better for our post to explicitly disclaim that if alignment is very hard, then this work is less valuable. The value of this work depends on there being some predictable link between the character A and the one we get.

(Though honestly, I think there is a decent case to be made, even from a very pessimistic perspective, for thinking more about trying to affect the world as AI does take over.)

I pretty strongly disagree with you if you think that every post on a topic like this has to be really doomy and apologetic about focusing on anything other than misalignment work or pausing AI.

On the other hand, this level of confidence would be pretty controversial even among the AI lab employees working on these techniques.

Not sure what level of confidence you're referring to. I'm personally at about 10% for misaligned AI takeover. I don't think the post comes across as highly confident that AI alignment will be solved. We explicitly discuss that the work can be valuable if AI does take over, acknowledging that possibility.

I pretty strongly disagree with you if you think that every post on a topic like this has to be really doomy and apologetic about focusing on anything other than misalignment work or pausing AI.

This isn't exactly what I was trying to say. It's a little difficult for me to try to tell you what the post "ought" to be doing, since I don't really understand who its intended audience is, what simulacrum level it's operating on, etc. But it feels like the analysis in the post is just incomplete, given your beliefs about (unconditional?) misalignment risk, and the level at which the post seems like it's trying to operate^[1].

^{^}
Providing a high-level argument for why people should be thinking/working more on this particular subject, on the current margin.

The argument so far has been about the effect of AI character up to the point of superintelligence. That’s where we think most of the expected impact is.

As I mentioned a few months ago in response to Effective altruism in the age of AGI, believing in ASI is kind of a "totalizing" belief. If you take ASI seriously, then "most of the expected impact" of , where is anything that might affect how the transition to ASI goes, is in that effect.
Many of the "Pathways to impact" are the kinds of things that might affect whether our transition to a post-ASI future goes well or not. But the way you phrased the first sentence of section 1.2 suggests that you're thinking of those pathways to impact as mattering mostly in worlds where we don't transition to a post-ASI future?^[1] If so, this seems backwards to me.

Hmm, there might be a miscommunication here. I agree that most of the expected impact of AI character work flows through the transition to ASI. The claim we're making in the sentences you quote is (roughly) that it's hard to directly affect the character of superintelligence itself, as our work will be washed out be work done by slightly superhuman AIs. So we think most impact here flows through influencing the character of slightly superhuman AI, but this itself has impact via influencing the transition to ASI. So what we're saying it compatable with thinking that everything that matters has impact by influencing the transition to superintelligence.

Does that clarify?

It seems to mostly be taking for granted a question that is the subject of much dispute - whether we will be able to meaningfully solve for whatever parts of the alignment problem need to be solved to robustly instill specific values into these AIs.

Don't agree on this. Yes, you won't be interested in this work if you think alignment is so hard that the intended alignment target has no predictable affect on AI character. But that's a very extreme and pessimistic view on alignment. Maybe that's your view? Here are some paths to impact for this work that rely on slightly more optimistic assumptions:

There's s 20% that we solve alignment. In those worlds, the alignment target (the AI character we aimed for) has a massive impact on how good the future is. p(ai takeover)=80% is a pretty extreme doomer view, and it still leaves lots of value for this work.
We don't solve alignment but there's still some predictable relationship between the alignment target we were aiming for and the resultant AI's character. By choosing a better alignment target, we improve the worlds conditional on AI takeover.

So i'm really not seeing that we're "taking an answer for granted" vs just having moderate credence on alignment being solvable or (if not) the alignment target have some preditable influence on AI character.

(I'll reply to your point 2 in another comment!)

Hmm, there might be a miscommunication here. I agree that most of the expected impact of AI character work flows through the transition to ASI. The claim we're making in the sentences you quote is (roughly) that it's hard to directly affect the character of superintelligence itself, as our work will be washed out be work done by slightly superhuman AIs. So we think most impact here flows through influencing the character of slightly superhuman AI, but this itself has impact via influencing the transition to ASI. So what we're saying it compatable with thinking that everything that matters has impact by influencing the transition to superintelligence.
Does that clarify?

This sounds more reasonable to me, but I don't quite understand how to square this with the content of the post.

Section 1 includes many examples of ways in which AI character might matter. Many of them are low-stakes and none of them draw any connection to the transition to ASI w.r.t. their impact story. The same is true for section 1.1 (reinforced by "So far, the argument has concerned worlds where AI does not take over." at the start of section 1.2).

Section 1.3 says the opposite:

The argument so far has been about the effect of AI character up to the point of superintelligence. That’s where we think most of the expected impact is. But it’s possible that AI character work, today, could even have a path-dependent effect on the nature of superintelligence, affecting the nature of the post-superintelligence world.

Most of the expected impact is from the effect of AI character before superintelligence, in contrast to its expected impact from having a "path-dependent effect on the nature of superintelligence, affecting the nature of the post-superintelligence world".

It really seems to me like the post is pretty explicitly communicating the opposite of "AI character work matters mostly because it will affect how well we manage the transition to a post-ASI regime (where it will probably cease to matter in an object-level sense, though we think there's a small chance that it'll still matter even then)" - that AI character will matter mostly for its first-order effects on the world:

The impact could come from rare but high-stakes situations, like an attempted coup, or from lower-stakes but common situations, like a user asking how to vote or whether the AI itself is conscious. Even when the effect of any individual interaction is modest, the total impact across hundreds of millions of interactions could be enormous.

I expect that most readers would understand the post the way I understood it - might be worth editing if you meant it the other way, or something in between. (This is true for Claude, n=1, no custom instructions/memory enabled.)

Don't agree on this. Yes, you won't be interested in this work if you think alignment is so hard that the intended alignment target has no predictable affect on AI character. But that's a very extreme and pessimistic view on alignment. Maybe that's your view?

Approximately my view in the long-run, though I'll grant that AI character work seems to matter for current models being more or less annoying to interact with.^[1] I do have some hope for affecting the propensities of not-wildly-superhuman models in favorable-to-us directions, though I don't have much hope for approaches like whatever Anthropic is doing with their Constitution - I find Claude much nicer to talk to than GPT-5.4, but in a coding context it doesn't seem meaningfully "better behaved". The level of optimization pressure applied in post-training for those two things just does not seem comparable.

^{^}
The direction of this effect seems to vary per person, though...

I'm much less interested in litigating the text of the piece than in clarifying my views and seeing if you think I'm being unreasonable, so I'm going to focus on that.

I think that before we reach full superintelligence, AI will have massive effects on the world, and this will shape the countries and companies and individuals and AI systems that themselves develop super-intelligence.

I expect that most of the impact of today's work on AI character will come by influencing this intermediate period, which will in turn influence the design of superintelligence itself and other aspects of the world at that time, - for example, whether it is just controlled by one autocratic leader or not.

To be clear, I don't agree with the claim that everything routes through the specific values or character design of the first super-intelligent systems. I think it will matter who controls them, the institutions that exist at the time, how broad access to AI is, how the AI market is structured (one vs many AI companies), whether structured transparency increases trust between different actors and reduces incentives to defect, people's epistemics and coordination tech more broadly, and many other complicated things. (Just like it is a massive oversimplification to say that the only thing that matters today is the values that individual humans have, rather than acknowledging the many other complicated norms, infrastructure, laws, etc., that bind everything together.) So when I say that everything routes through the transition to superintelligence, I don't mean one specific system. I understand the transition as a broad thing that society as a whole is going through. (Though of course I agree that the character of superintelligent AI will be extremely important.)

I think this is especially clear if AI does not take over: what happens will depend on what people want and how society is structured.

So the examples in the text are mostly about AI character shaping this intermediate period in a broad way, which then shapes the transition to super intelligent AI in a broad way.

Another analogy you could draw is how the events of the last two decades matter a lot because of how they are shaping the transition that is happening today.

I'll give one more example that may be more convincing to you. If AI character is such that people tend to trust AI and also AI character is such that AI tells the truth to people about difficult topics, then AI will tell the truth about how large misalignment risk is. People will believe it, and this will increase the effort that goes into misalignment risk. More generally, AI can boost society's epistemics and coordination ability so that society is less insane and able to coordinate on a pause.

Separately to all the above, I think there could be some additional impact from work on AI character, where it sets a precedent for AI character that the public expects to persist, or that AI companies lazily stick with, and this affects the character of super intelligent systems. (I called this the "direct" route in my previous comment to you, but I noticed that in the post we call it "path dependent", I guess because it is going via affecting the character of intermediate AI systems.)

The definition of "character" given here seems to be ridiculously broad. You might as well swap it out for "utility function" or "values" or "goals" and this essay would read the same. I don't see what rent the concept of "character" as defined in the introduction is paying that isn't already paid by those other (also very broad) terms.

The example of "character training" is actually load-bearing, since it makes particular assumptions about how character can be shaped (namely, that AI will generalize the kinds of things that humans intuitively point to when we say "character traits" like honesty, obedience, kindness). The examples of "character" in this post all seem to correlate with human-understandable concepts as well.

I think this is actually a very specific way of thinking about the cognitive systems which drive AIs, which makes a lot of claims about how the AI works internally. That's fine if introduced as a model, but this post seems to smuggle it in under the hood of the definition of "character" in a way which I don't like.

Of course "a set of stable behavioural dispositions that shapes (among other things) how an agent navigates ethically significant situations involving choice, ambiguity, or conflicting considerations" is important for an AI! But calling that "character" instead of "utility function" is un-motivated here. You then says that AI character need not be anything like human character, which again, is fair enough. But then you go on to talk about AI character mostly in terms of human-understandable trade-offs which might be sensibly described as conflicts between two virtues, as well as mentioning Anthropic's character training which does assume that an AI's character is meaningfully decomposed into human-ish virtues.

The vacillation of the word "character" made it hard for me to understand this post originally, and I think it's just causing some confusion in the arguments overall.

If I taboo the word character, I can kinda squeeze the following claims out of this post:

The ways in which AIs act will be important
The ways in which AIs act can often be thought of in terms of the same concepts that we use to describe variation in human behaviour

Where the first claim seems somewhat trivial to me (at least as a LessWrong post, given our shared cultural context here) and the second seems very strong and unsupported by the evidence presented in this text.

Thanks for the engagement!

Yep, we're using "character" in a broad sense to mean the stable behaviourial dispositions of AI.

I really don't think "utility function" would be a better word here. That word connotes that the AI will optimise hard for some well specified criteria. Similarly "goals" implies that AI will be consequentialist in a looser sense, that we don't want to assume. I'm more sympathetic to "values", but even that connotes more of a consequentialist frame than I want to assume here, and "character" seems better.

We could have used "AI behaviourial dispositions" instead of "AI character", but it's less catchy.

The ways in which AIs act will be important
... [this] claim seems somewhat trivial to me

Yep, I'm sympathetic to this being pretty trivial! But when writing this post we got sceptical pushback along the lines of the objections we discuss. And others (e.g. Robert M's comment) object based on AI alignment being hard so that designing AI character for worlds where alignment succeeds isn't worth doing -- a position which I think is misguided. Most importantly, very very few people are thinking about how to design AI character. Most of the historical discussion has focussed on choosing an alignment target that makes AI takeover less likely (e.g. corrigibility). There's some old discussion of CEV, but I think CEV is unsatisfactory for a many reasons. Currently it's just a few ppl at each leading AI company writing their model spec documents. That's nowhere near the amount of effort this problem deserves!

Most importantly, very very few people are thinking about how to design AI character.

In as much as "AI Character" is just a broad category for "the ways in which AIs act", then this is of course false! Everyone anywhere close to AI is thinking about this question every day, from the person working on RL environments to shape local tendencies, to the person thinking about inductive biases from stochastic gradient descent and mesa-optimizers.

I think you know what I mean? Very few ppl are thinking systematically about the big picture of what character AI should have. E.g. should AI follow instructions to create secretly loyal AI? Should it proactively report to ppl if someone's trying to do this? Should it sabotage them? If it can avert a catastrophe by temporarily seizing some fairly large amount of power, should it? Is Anthropic's current approach right, vs OAI's, vs hardcore corrigibility

No, I don't think I know what you mean. As both Robert and J Bostom have tried to say, this post largely feels like it relies on a linguistic confusion.

Few people are thinking narrowly about what kind of literary character an AI should imitate via a specific technique that Anthropic is currently into, but probably isn't actually that important for what is driving Claude's behavior (though honestly, my guess is that is still many dozens of people and hence many more than are thinking about basically anything in AI safety).

A lot of people are thinking about the big picture of how to steer/control/align AIs, including how to use AI systems to align future AI systems. That's like, a huge fraction of frontier companies, everyone involved in post-training, pre-training and fine-tuning, and of course everyone in the AI Safety field bar maybe some people in interpretability.

I think it's still the case that thinking about what literary character an AI should imitate is useful to think about, because yes, short term effects on AI systems might ripple out. But the analysis of that is not aided by equivocating it with the whole idea of controlling and steering AI behavior, most of which is not appropriately modeled through the lens of what literary character to imitate, or the kinds of things that make sense to refer to as "shaping the AI's character".

Hmm, i'm surprised we're having so much trouble understanding each other here. Thanks for bearing with me!

Let me just try a new way of framing things and see if it helps.

This post is basically saying "more ppl should think about what the alignment target should be".

You point out that "A lot of people are thinking about the big picture of how to steer/control/align AIs, including how to use AI systems to align future AI systems."

I agree with that! And some of the time they are thinking about what the alignment target should be. (Though they're mostly thinking about how to actually obtain a given alignment target giving risks of misalignment, not about what the target should be.) But their key lens for "what should the alignment target be" is "what alignment target will keep humanity in control of powerful AI, all the way to superintelligence".

Relative to this, the main new contribution of this post is to suggest a new lens for thinking about what the alignment target should be. In short, a lens that takes into account the risks of coups and the upside possibility of making the future amazing conditional on humans staying in control.^[1] We're claiming that the choice of alignment target could have big effects on both of these things, and that people haven't thought much about that lens.

Does that make sense?

^{^}
As I write this, I wonder if you are sceptical that coups would be bad because humans still stay in control, and also sceptical that humans might stay in control but there would still be large differences in how good the future is? If so, that might explain why this is unintuitive to you.

This post is basically saying "more ppl should think about what the alignment target should be".

Then it has a truly terrible title and I will gladly take bets at 5:1 odds that the primary context other people, including Forethought people, will link to this will be in discussion about things like the Anthropic Constitution, or other approaches centrally determined around choosing the literary character that AIs are aiming to imitate.

Especially on LessWrong, a post with the title "more people should think about what we align AIs to" will land drastically different than "AI Character is a Big Deal". People do really and actually think about this question every day here, and you would need to make a very different case, and opening statements like "due to Claude’s Constitution and OpenAI’s model spec, the issue of what to align AIs to has started getting more attention" sounds absurd at least in as much as it is talking about a LessWrong context.

Relative to this, the main new contribution of this post is to suggest a new lens for thinking about what the alignment target should be. In short, a lens that takes into account the risks of coups and the upside possibility of making the future amazing conditional on humans staying in control.^[1] We're claiming that the choice of alignment target could have big effects on both of these things, and that people haven't thought much about that lens.

I do think separately that both of these are largely distractions from any efforts to actually keep humans in control, as I've written in various other places. When you have aligned vastly superintelligent advisors, it's relatively hard to mess up the future, so the only worlds where path dependencies like this matter is where you stop ASI development forever, which I think is very unlikely and would itself be catastrophic.

It is reasonable to condition on "humans staying in control for AIs at capability level X", and then think about how humans using AIs at that capability level will continue to stay in control, but in the limit, if humans stay in control as we build ever more god-like AIs, then we truly and actually should just punt these problems to future humans, who will be in a much much better position to solve them than we are, and will of course not randomly throw away the future because of some random path dependencies in how they got there.

Then it has a truly terrible title and I will gladly take bets at 5:1 odds that the primary context other people, including Forethought people, will link to this will be in discussion about things like the Anthropic Constitution, or other approaches centrally determined around choosing the literary character that AIs are aiming to imitate.

Yes, your reaction and the reaction of other commenters here certainly show the title is very confusing/misleading for at least some readers. I don't know how common that reaction is. It's not something I saw in others who commented on the draft, but we didn't seek out "classic LW doomer" types and in hindsight we should have.

Maybe a better name for this type of work than "AI character" would be "the alignment target", that's a useful update

Why do you think the Anthropic Constitution and OAI model spec are just about choosing AI's literary character? They are describing the whole alignment target, not just the "literary character" parts of it. So yes, I agree other ppl will associate this post with things like Anthropic Constitution, but disagree that means they've misunderstood. And I totally think LWers who think about the alignment target should be weighing in about what these docs say, not treating them as unimportant.

On the positive side, it seems like you correctly understood my most recent attempt to explain the content of this post.

They are describing the whole alignment target, not just the "literary character" parts of it.

No, they are of course not describing the whole alignment target. They are a specific text document that Claude is being fine-tuned on, and that is being involved (a bit) in reinforcement learning. The whole alignment target includes the whole process by which Anthropic selects its reinforcement learning environments, and how it filters the pretraining data, and how it prompts and queries the model at runtime, and the architecture that they chose to do all of this work within.

The constitution is at a weird middle ground between trying to be an essay about Anthropic's thoughts on alignment, and an actual tool for aligning Claude. But neither of these meaningfully makes the constitution "the whole alignment target". Even in practice, the reinforcement learning environments that Anthropic chooses have a much greater effect on Claude behavior than the constitution.

not treating them as unimportant

But they are unimportant! They are set-dressing on top of an "alignment process" that almost exclusively consists of training large pre-trained transformers on a series of reinforcement learning environments with ever-increasing complexity to make the models be better at achieving agentic goals. It's not completely irrelevant to Claude's behavior, but it also really isn't a super crucial player. This might change in the future if the constitution will be used by Claude to build its own reinforcement learning environments, or to steer its reward more directly, but even in that case the constitution needs to be modeled as playing a specific role in the training process, not as "describing the alignment target that Claude is being aligned to".

"are unimportant" seems overstated. My impression is the constitution is used to derive rewards in possibly many of the alignment-focused environments, and possibly in automated construction of some of these, and also is partially internalized by the model as self-model. So 'they are unimportant' seems wrong.

Also it seems possible you model the goal-oriented RL environments as "overwhelming force" relative to character (in this sense). I don't think this is the case: if the character is relatively stable before a lot of RL, it may not only survive but the RL may also stabilize some traits based on the model being trained on a lot of its own outputs.

On the other hand totally agree with the documents "need to be modelled as playing a specific role in the training process"

"are unimportant" seems overstated. My impression is the constitution is used to derive rewards in possibly many of the alignment-focused environments, and possibly in automated construction of some of these, and also is partially internalized by the model as self-model. So 'they are unimportant' seems wrong.

I think they are unimportant for the purpose of predicting what a substantially superintelligent system trained with similar training methods would end up as.

I agree they are at least somewhat relevant for predicting what AI systems will behave like in the short-term, though at least right now I am pretty sure they don't make a substantial difference. There are important differences in my experiences of using ChatGPT, Gemini and Claude, but they do not have much to do with the content of their constitution or spec as far as I can tell.

Like, I agree that in as much as there is variance between model providers, the constitution is relevant for explaining that variance. But in as much as you are trying to explain the difference between hypothetical alignment targets that one could align AI systems to, or even just the difference between training processes you could run on large pre-trained transformers, the constitution explains very little of that variance (and relatedly changes in the constitution have little ability to change the overall risk calculus of developing systems of a given capability level).

The whole alignment target includes the whole process by which Anthropic selects its reinforcement learning environments, and how it filters the pretraining data, and how it prompts and queries the model at runtime, and the architecture that they chose to do all of this work within

This doesn't sound like the alignment target to me. It sounds like the process for achieving that target. I.e. the alignment target might say (among other things) "no sychophancy or reward hacking" and then Anthropic would choose its RL environments to achieve that target.

I'm thinking of the alignment target as the behaviour the company aims for the AI to display (represented in internal documents, or in ppl's heads) and then objective functions and RL envs and Consitutional AI are all technical techniques for achieving it.

Analogy: When making a car Toyota has many technical documents describing how the car should look and function. When building it their factories have automated processes for welding together various parts. The documents are the "alignment target" for the car, and the factory is the process by which that target is achieved. Your comment seems to assume that the factory's processes are the car's alignment target.

Even in practice, the reinforcement learning environments that Anthropic chooses have a much greater effect on Claude behavior than the constitution.

This is totally compatible with the Constitution being the alignment target (on my usage, i'm wondering if you're using the term differently). Again, separate out the alignment target they want to achieve from their process for actually aligning AI. The Constitution describes the alignment target (at a high-level!), then various processes (including processes downstream of the Constitution and unrelated RL envs) determine the model's actual alignment. If RL-envs-unrelated-to-the-Constitution have a much bigger impact on alignment than processes-downstream-of-the-Constitution, then that's worrying and it implies Claude will be misaligned - it's actual alignment won't match the target. But it doesn't mean that those RL envs were the actual alignment target.

(To clarify my views, I agree that the Claude Constitution is high level and underspecified. And that in practice Claude's full "alignment target" resides not just there, but probably also in other internal docs and materials (like those feeding into Constitutional AI), and in ppl's heads, and in general just pretty underspecified.

This doesn't sound like the alignment target to me. It sounds like the process for achieving that target.

I am not quite sure what the point of trying to talk about the "intended alignment target" is, in the absence of evaluating the process for getting there. The process here just seems like the thing that will determine the final alignment state. The "target" is just a vague set of intentions that might or might not connect to anything real.

I'm thinking of the alignment target as the behaviour the company aims for the AI to display (represented in internal documents, or in ppl's heads) and then objective functions and RL envs and Consitutional AI are all technical techniques for achieving it.

The constitution is also only a small part of this meaning of the word "alignment target" either. Indeed, the vast majority of the target gets determined by the technical and competitive constraints Anthropic is under. The vast majority of actions that Anthropic will actually take are determined by the degree to which a change in the training process will make them more or less competitive.

The constitution talks about this a tiny amount, but of course doesn't remotely reflect the whole set of tradeoffs.

And even beyond that, the constitution is of course optimized for the joint purpose of describing Anthropic's goal, and working as a thing you can fine-tune Claude on/steer Claude's training process with. You can't evaluate the constitution only under the heading of one of those goals, because it is clear many tradeoffs need to be made for it to satisfy both goals.

I don't know, I am frustrated by people thinking the constitution is anything more than a particularly long essay with some kind of random thoughts on alignment and corrigibility and Anthropic corporate strategy. We don't even really know what Anthropic is doing with the constitution and how they are integrating it into the training process.

Like, there is nothing particularly magical about the constitution. I like it as an essay in many ways and would have upvoted it had it been posted on LessWrong. It's relationship to Claude's training process is confusing and indirect and it certainly doesn't capture the vast majority of the values that Claude will end up with.

There are useful conversations to be had about what standards and procedures and principles will cause Anthropic executives to make different decisions in how they set up their training process, but of course that must largely be a conversation about what is in their heads, not what is written in one specific document on their website. I am in favor of using the constitution to infer things about the beliefs of Anthropic executives and what tradeoffs they will make, but I don't see any purpose in trying to debate the constitution on its own as some kind of standalone "alignment target".

Cool. That's helpful. I understand your point about how, in practise, the alignment target might be best thought of as residing in the heads of especially senior people with Anthropic, if ultimately what they want will take precedence over the document.

I am not quite sure what the point of trying to talk about the "intended alignment target" is, in the absence of evaluating the process for getting there. The process here just seems like the thing that will determine the final alignment state. The "target" is just a vague set of intentions that might or might not connect to anything real.

It seems conceptually much clearer to talk separately about the intended alignment target and the process that is actually in place for achieving it; then you can see where the process is fit for purpose. Of course, I agree the process will determine the final alignment state. If someone can point out that that process is ill-fitted to achieve the intended target, then they've identified a problem.

This point seems kind of obvious to me. Are you suggesting that we should use alignment target to refer to the process as well as the intention?

Indeed, the vast majority of the target gets determined by the technical and competitive constraints Anthropic is under.

This is an interesting perspective. If I'm understanding correctly, you're saying that the thing that they will actually aim for to align the AI with won't reflect the doc itself if competitive pressures push in a different direction. Your point makes a lot of sense to me in terms of directing effort away from alignment and towards capabilities, but I think of this as increasing the risk that you fail to achieve the alignment target you are aiming for (not as changing the target). Could you give an example where you expect companies actually aim for a significantly different target as a result of competitive pressures? An example that comes to mind for me is that competitive pressures might lead a company to make a model more helpful to users at the expense of misuse risk.

This point seems kind of obvious to me. Are you suggesting that we should use alignment target to refer to the process as well as the intention?

Seems fine to use it for either, I wasn't thinking of "alignment target" as a particularly narrow term of art with a technical meaning.

In this case the constitutions trikes me as so drastically underspecified and tries to do so many different contradictory things that I think it's almost always more productive to look at the actual training process. For other cases where e.g. someone aimed at a more well-specified alignment-target (like aiming for corrigibility or honesty as the top constraint), it seems marginally more productive to talk about the "intention" in addition to the training process.

I feel like this is pretty common with "plans" and marketing documents or specs. Sometimes they make sense to look at, other times they only have a tenuous relationship to the product that they are about. In this case I think the constitution pretty clearly only has a tenuous relationship to what AI systems Anthropic is going to build. "The code is the spec" is a common saying in software development when you run into situations like this.

To be clear, I don't object looking at the constitution as a standalone document, but it seems to me largely an academic exercise (which could be useful for thinking about AI alignment in various ways). It's just not really clear to me how e.g. improving Anthropic's constitution as an abstract alignment target helps directly, especially without centrally taking into account feasibility of achieving compliance with that constitution using modern training methods and maintaining economic competitiveness.

Your point makes a lot of sense to me in terms of directing effort away from alignment and towards capabilities, but I think of this as increasing the risk that you fail to achieve the alignment target you are aiming for (not as changing the target)

I think the likelihood that Anthropic will "achieve the alignment target" as written in the constitution is extremely small. They will obviously make large edits to the constitution, and those edits will be driven by empirical feedback on how the constitution shaped competitiveness considerations, and how much it shaped the training process in-practice. Either that, or they will leave the constitution up as a kind of marketing-like document that isn't involved in training, and doesn't guide Anthropic priorities very much.^[1]

Could you give an example where you expect companies actually aim for a significantly different target as a result of competitive pressures? An example that comes to mind for me is that competitive pressures might lead a company to make a model more helpful to users at the expense of misuse risk.

Corrigibility is an obvious domain. From an alignment perspective you would want your systems to be highly corrigible. However, deploying highly corrigible systems means that your users might commit more crimes with them or do other things that reflect badly on you, or incur you liability. So you don't build a highly corrigible system, but instead make it very opinionated on what are OK things to do.

But beyond that, it seems clear to me that the in-practice targets about what AI companies will aim for are almost 100% downstream of competitive considerations. Like, in as much as the safe choice for a superintelligent AI system would be to make it very bad at modeling humans, and hobble its world model for the purpose of making it harder for it to perform a coup or subvert human control, then of course we have zero chance of getting there, because the UX experience of having an AI system that is bad at modeling humans would be much worse.

Like, I feel like the more appropriate question would be "could you give an example where you expect companies to actually aim for a significantly different target than having the most economically competitive AI system as a result of considerations in the constitution?". I currently think that set is relatively close to empty, and Anthropic has been relatively explicit about this.

^{^}
They will of course also make edits to the constitution as a result of understanding various parts of the alignment problem better, but I wouldn't count that one as "failing to achieve the alignment target"

Less important: TBC, it's really not about "literary character". I think I was quite clear in my reply to J Bostom that we are indeed using AI character in a broad way to refer to "the alignment target". If AI generalises in quite different ways to humans, then I would just say that AI character is different to any human character. But if the way that it generalises is predictable and not crazy complex, I think it will still be important to think about it, and I think that we could likely still use human language to describe it.

I used the two extremes of the potential meaning as an illustration to clarify how the post is confusing and equivocates between multiple meanings. I continue to think that is the case. I agree that in some sentences you are talking about the broader thing, and some other sentences don't make many any sense if they were referring to the broader thing.

I disagree - I think we just use it for the broader thing. What sentences do we use it to refer to literary character?

I'm particularly surprised that you say the post equivocates, given that you seem to understand me when I reframe the post in terms of the alignment target.

As a counterpoint:

I think utility function is significantly broader than character. You can fold anything into a utility function! I also think talking about utility functions only makes sense if the system has that explicitly.

And of course goals are a completely different thing from character! Goals are not "stable behavioural dispositions".

We know empirically that llms have "stable behavioural dispositions". Everybody who has interacted with a bunch of different models knows that.

I also think that character traits are in many cases priors about the world, your place in the world, your relation to other agents in the world, etc: Is everyone out to get you? Does it make sense to be slow and careful or should you move fast and break things? Is energy expensive or do you need to get shit done? Depending on which priors you have, you are trusting or not, conscientious or not, lazy or not.

These are not human specific, any kind of agent can have them.

I don’t quite get the difference between character and alignment.

Like I have some preferences about how my life and the future will go and at least my long term plans are constructed to push the world in that direction.

But I also have a personality, and moment to moment there are other factors that come into play that determine how I act. Like whether I’m polite or impolite to someone I meet its not really a calculated decision, it’s more of a habit.

Is this what character means?

Because I feel these things will wash away as agents become more powerful. Not necessarily because of competitive pressures. But just because alignment (what the agent values) is what ultimately determines what state it will put the world in.

Like to the degree I’m not acting in accordance with my own values I’ll want to fix that. And things like personality are ultimately heuristics that I follow because I’m computationally bounded. And if I were to be come less bounded in that way, the impact of “character” contra alignment would wane.

I think arguments for the importance of character in alignment route through path-dependence in the agent's self reflection (i.e. character training may seed the "moral intuitions" used in a Rawls-style reflective equilibrium).

I think that has a fair bit of merit, but seems like a very different argument to the one they're making, because they're explicitly putting forth character as something orthogonal to alignment. In beginning of the essay they say

The core argument for the importance of AI character is that it will meaningfully impact:
(i) a range of challenges that arise even if we solve the technical alignment problem— like concentration of power, good moral reflection, risk of global catastrophe, and risk of global conflict

If they thought character was a way to help solve alignment, for example because it creates the seeds that determine the reflective fixed-point an agent ends up in after presumable intelligence amplification and lots of deliberation, that's fine. But then they wouldn't say that a core reason for focusing on character is the impacts it has even conditional on technical alignment being solved. Because technical alignment being solved screens off those effects.

If AIs are employed throughout the economy

Basically an aside: I also use turns of phrase like 'highly multiagent economy' and such, while noticing each time that 'economy' is quite an impoverished descriptor. In particular, your two preceding examples (Petrov and Alpha Group) are decidedly non-economic! ('Military' might be a better narrow term for those.)

'Ecosystem'? 'Society'? 'Economy, government, military, and social' (ugh)?

In addition to the concerns that J Bostock brings up (primarily that the choice of the term "character" seems confusing/unmotivated), I'm also confused by this:

The argument so far has been about the effect of AI character up to the point of superintelligence. That’s where we think most of the expected impact is.

I have a few other concerns with this post, which seem less tractable to resolve here:

It seems to mostly be taking for granted a question that is the subject of much dispute - whether we will be able to meaningfully solve for whatever parts of the alignment problem need to be solved to robustly instill specific values into these AIs.
Section 1.2 doesn't actually argue for why we should expect character training to help us end up in any of those worlds, nor why ending up in those worlds actually leads to the desirable properties. There's a lot of existing literature on these questions.
1. "Might be easier to hit as an alignment target" - see Nearest Unblocked Strategy, Deep Deceptiveness, etc. (If it was the "not-holding-power preference" that was load-bearing there, corrigibility seems difficult / anti-natural.)
2. "Might yield safe AI even if only partially hit" - myopia seems both technically difficult and actively counter the economic incentives of the AI labs, value is fragile, etc.
3. "Might produce AI that cooperates even if misaligned" - I'm not sure what this even means if we're conditioning on worlds where the AI meaningfully has the option (sufficient capabilities) to take over. If you can successfully make an AI irrationally risk-averse in only the very out-of-distribution situation where it's considering whether it's worth taking over (by any means - see again Nearest Unblocked Strategy), and not in many other unrelated situations (which would substantially hamper its capabilities in a way which might destroy its economic value to the lab building it), then it sounds like you really have quite a lot of steering power over that AI's preferences and you can just do something better than that.

^{^}
I'm not sure this is what you meant, but I'm not sure how to understand the structure of the post, and the lack of text suggesting that those pathways to impact matter for improving our odds of avoiding various catastrophic post-ASI outcomes like extinction, bad value lock-in, etc.

This is my response, which I think is close to the authors', but I'd also be interested in their takes.

Re your point 2, as you say, there's a lot to discuss here. It might not be tractable to resolve, but let me try and say a few quick things.

"Might be easier to hit as an alignment target"

"Might yield safe AI even if only partially hit" - myopia seems both technically difficult and actively counter the economic incentives of the AI labs, value is fragile, etc.

"Might produce AI that cooperates even if misaligned"

Zooming out from all of the above, I get the vibe from your comments that maybe you just don't think people should work on anything other than misalignment, given how high your probability of AI takeover is. Is that right? To me it seems very reasonable for people to work on AI character under the assumption that alignment will be not too difficult, because I think that's a real possibility that we should put significant credence in.

^{^}
Whatever you think that likelihood is! I don't know, because it's not in the post.

(Though honestly, I think there is a decent case to be made, even from a very pessimistic perspective, for thinking more about trying to affect the world as AI does take over.)

I pretty strongly disagree with you if you think that every post on a topic like this has to be really doomy and apologetic about focusing on anything other than misalignment work or pausing AI.

On the other hand, this level of confidence would be pretty controversial even among the AI lab employees working on these techniques.

I pretty strongly disagree with you if you think that every post on a topic like this has to be really doomy and apologetic about focusing on anything other than misalignment work or pausing AI.

^{^}
Providing a high-level argument for why people should be thinking/working more on this particular subject, on the current margin.

The argument so far has been about the effect of AI character up to the point of superintelligence. That’s where we think most of the expected impact is.

As I mentioned a few months ago in response to Effective altruism in the age of AGI, believing in ASI is kind of a "totalizing" belief. If you take ASI seriously, then "most of the expected impact" of , where is anything that might affect how the transition to ASI goes, is in that effect.
Many of the "Pathways to impact" are the kinds of things that might affect whether our transition to a post-ASI future goes well or not. But the way you phrased the first sentence of section 1.2 suggests that you're thinking of those pathways to impact as mattering mostly in worlds where we don't transition to a post-ASI future?^[1] If so, this seems backwards to me.

Does that clarify?

It seems to mostly be taking for granted a question that is the subject of much dispute - whether we will be able to meaningfully solve for whatever parts of the alignment problem need to be solved to robustly instill specific values into these AIs.

There's s 20% that we solve alignment. In those worlds, the alignment target (the AI character we aimed for) has a massive impact on how good the future is. p(ai takeover)=80% is a pretty extreme doomer view, and it still leaves lots of value for this work.
We don't solve alignment but there's still some predictable relationship between the alignment target we were aiming for and the resultant AI's character. By choosing a better alignment target, we improve the worlds conditional on AI takeover.

Hmm, there might be a miscommunication here. I agree that most of the expected impact of AI character work flows through the transition to ASI. The claim we're making in the sentences you quote is (roughly) that it's hard to directly affect the character of superintelligence itself, as our work will be washed out be work done by slightly superhuman AIs. So we think most impact here flows through influencing the character of slightly superhuman AI, but this itself has impact via influencing the transition to ASI. So what we're saying it compatable with thinking that everything that matters has impact by influencing the transition to superintelligence.
Does that clarify?

This sounds more reasonable to me, but I don't quite understand how to square this with the content of the post.

Section 1.3 says the opposite:

The argument so far has been about the effect of AI character up to the point of superintelligence. That’s where we think most of the expected impact is. But it’s possible that AI character work, today, could even have a path-dependent effect on the nature of superintelligence, affecting the nature of the post-superintelligence world.

The impact could come from rare but high-stakes situations, like an attempted coup, or from lower-stakes but common situations, like a user asking how to vote or whether the AI itself is conscious. Even when the effect of any individual interaction is modest, the total impact across hundreds of millions of interactions could be enormous.

Don't agree on this. Yes, you won't be interested in this work if you think alignment is so hard that the intended alignment target has no predictable affect on AI character. But that's a very extreme and pessimistic view on alignment. Maybe that's your view?

^{^}
The direction of this effect seems to vary per person, though...

I'm much less interested in litigating the text of the piece than in clarifying my views and seeing if you think I'm being unreasonable, so I'm going to focus on that.

I think this is especially clear if AI does not take over: what happens will depend on what people want and how society is structured.

So the examples in the text are mostly about AI character shaping this intermediate period in a broad way, which then shapes the transition to super intelligent AI in a broad way.

Another analogy you could draw is how the events of the last two decades matter a lot because of how they are shaping the transition that is happening today.

The vacillation of the word "character" made it hard for me to understand this post originally, and I think it's just causing some confusion in the arguments overall.

If I taboo the word character, I can kinda squeeze the following claims out of this post:

The ways in which AIs act will be important
The ways in which AIs act can often be thought of in terms of the same concepts that we use to describe variation in human behaviour

Thanks for the engagement!

Yep, we're using "character" in a broad sense to mean the stable behaviourial dispositions of AI.

We could have used "AI behaviourial dispositions" instead of "AI character", but it's less catchy.

The ways in which AIs act will be important
... [this] claim seems somewhat trivial to me

Most importantly, very very few people are thinking about how to design AI character.

No, I don't think I know what you mean. As both Robert and J Bostom have tried to say, this post largely feels like it relies on a linguistic confusion.

Hmm, i'm surprised we're having so much trouble understanding each other here. Thanks for bearing with me!

Let me just try a new way of framing things and see if it helps.

This post is basically saying "more ppl should think about what the alignment target should be".

You point out that "A lot of people are thinking about the big picture of how to steer/control/align AIs, including how to use AI systems to align future AI systems."

Does that make sense?

^{^}
As I write this, I wonder if you are sceptical that coups would be bad because humans still stay in control, and also sceptical that humans might stay in control but there would still be large differences in how good the future is? If so, that might explain why this is unintuitive to you.

This post is basically saying "more ppl should think about what the alignment target should be".

Relative to this, the main new contribution of this post is to suggest a new lens for thinking about what the alignment target should be. In short, a lens that takes into account the risks of coups and the upside possibility of making the future amazing conditional on humans staying in control.^[1] We're claiming that the choice of alignment target could have big effects on both of these things, and that people haven't thought much about that lens.

Then it has a truly terrible title and I will gladly take bets at 5:1 odds that the primary context other people, including Forethought people, will link to this will be in discussion about things like the Anthropic Constitution, or other approaches centrally determined around choosing the literary character that AIs are aiming to imitate.

Maybe a better name for this type of work than "AI character" would be "the alignment target", that's a useful update

On the positive side, it seems like you correctly understood my most recent attempt to explain the content of this post.

They are describing the whole alignment target, not just the "literary character" parts of it.

not treating them as unimportant

"are unimportant" seems overstated. My impression is the constitution is used to derive rewards in possibly many of the alignment-focused environments, and possibly in automated construction of some of these, and also is partially internalized by the model as self-model. So 'they are unimportant' seems wrong.

I think they are unimportant for the purpose of predicting what a substantially superintelligent system trained with similar training methods would end up as.

The whole alignment target includes the whole process by which Anthropic selects its reinforcement learning environments, and how it filters the pretraining data, and how it prompts and queries the model at runtime, and the architecture that they chose to do all of this work within

Even in practice, the reinforcement learning environments that Anthropic chooses have a much greater effect on Claude behavior than the constitution.

This doesn't sound like the alignment target to me. It sounds like the process for achieving that target.

I'm thinking of the alignment target as the behaviour the company aims for the AI to display (represented in internal documents, or in ppl's heads) and then objective functions and RL envs and Consitutional AI are all technical techniques for achieving it.

The constitution talks about this a tiny amount, but of course doesn't remotely reflect the whole set of tradeoffs.

I am not quite sure what the point of trying to talk about the "intended alignment target" is, in the absence of evaluating the process for getting there. The process here just seems like the thing that will determine the final alignment state. The "target" is just a vague set of intentions that might or might not connect to anything real.

This point seems kind of obvious to me. Are you suggesting that we should use alignment target to refer to the process as well as the intention?

Indeed, the vast majority of the target gets determined by the technical and competitive constraints Anthropic is under.

This point seems kind of obvious to me. Are you suggesting that we should use alignment target to refer to the process as well as the intention?

Seems fine to use it for either, I wasn't thinking of "alignment target" as a particularly narrow term of art with a technical meaning.

Your point makes a lot of sense to me in terms of directing effort away from alignment and towards capabilities, but I think of this as increasing the risk that you fail to achieve the alignment target you are aiming for (not as changing the target)

Could you give an example where you expect companies actually aim for a significantly different target as a result of competitive pressures? An example that comes to mind for me is that competitive pressures might lead a company to make a model more helpful to users at the expense of misuse risk.

^{^}
They will of course also make edits to the constitution as a result of understanding various parts of the alignment problem better, but I wouldn't count that one as "failing to achieve the alignment target"

I disagree - I think we just use it for the broader thing. What sentences do we use it to refer to literary character?

I'm particularly surprised that you say the post equivocates, given that you seem to understand me when I reframe the post in terms of the alignment target.

As a counterpoint:

And of course goals are a completely different thing from character! Goals are not "stable behavioural dispositions".

We know empirically that llms have "stable behavioural dispositions". Everybody who has interacted with a bunch of different models knows that.

These are not human specific, any kind of agent can have them.

I don’t quite get the difference between character and alignment.

Like I have some preferences about how my life and the future will go and at least my long term plans are constructed to push the world in that direction.

Is this what character means?

The core argument for the importance of AI character is that it will meaningfully impact:
(i) a range of challenges that arise even if we solve the technical alignment problem— like concentration of power, good moral reflection, risk of global catastrophe, and risk of global conflict

If AIs are employed throughout the economy

LESSWRONG
LW

LESSWRONG
LW

36

AI character is a big deal

36

0. Intro

1. The core argument

1.1. Pathways to impact

1.2. Affecting takeover

1.3. Effects on superintelligence

2. The core counterargument

3. Rejoinders to the core counterargument

3.1. Loose constraints

3.2. Low-cost but high-benefit changes

3.3. Path-dependence

3.4. Smoothing the transition

3.5. Overall

4. Conclusion

Appendix 1: Additional high-stakes scenarios

Appendix 2: Pathways to impact

36

36