In this post, we’ll introduce wisdom as a measure of the benevolence and internal coherence of an arbitrary agent. We’ll define several factors, such as the agent’s values, plans, and evidence, and its alignment with human values, and then define wisdom as consistency within and between these factors. We believe this is a useful optimization target for aligning AI. 

Considering the wisdom of intelligent agents

To define this property, we’ll introduce certain coherence factors, such as the agent's beliefs, plans, actions, evidence, and so on, and for each factor determine: 

  • Is the agent self-consistent with regard to this factor? For example, do the agent's actions undermine or support each other? 
  • Is the agent consistent between factors? For example, do the agent's actions match their plans? Are their plans consistent with their beliefs and evidence?

We also think this definition may be useful in contextualizing or structuring coherence and selection theorems (and potentially useful for discovering new formulations). The former are results in probability and utility theory for identifying incoherence in an agent’s preferences and beliefs (such as situations where an incoherent agent can be offered a series of bets that it will always accept and then lose, like a “Dutch book”). The latter, selection theorems, are ways to predict what type of agent will emerge from certain constraints or incentives. They answer the question "What will be the likely features of agents we might encounter in an environment?".
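As a toy illustration of the kind of incoherence a Dutch book exploits (our own sketch, not part of the theorems themselves): an agent with cyclic preferences will pay a small fee for every “upgrade” trade and end up holding what it started with, strictly poorer.

```python
# Minimal money-pump sketch (illustrative only): an agent with cyclic
# preferences A > B > C > A pays a small fee for each "upgrade" trade and,
# after three trades, holds its original item having paid three fees.

prefers = {("A", "B"), ("B", "C"), ("C", "A")}  # (better, worse): A>B, B>C, C>A
fee = 1.0

def accepts_trade(current, offered):
    """The agent trades whenever it strictly prefers the offered item."""
    return (offered, current) in prefers

holding, money = "B", 10.0
for offered in ["A", "C", "B"]:   # a bookie cycles through the items
    if accepts_trade(holding, offered):
        holding, money = offered, money - fee

print(holding, money)  # "B" 7.0 -- back where it started, three fees poorer
```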

Introducing the coherence factors

Here are the coherence factors we’ll consider:

  • Humanity’s (reflectively endorsed) values: the moral values & principles (or utility function) of humanity, at large and upon reflection
  • Values: the utility function of the agent.  
  • Environment: the agent’s external environment.
  • Evidence: the agent’s measurements and data about the environment.
  • Beliefs: claims & conclusions the agent believes, based on evidence & reasoning.
  • Plans: the agent’s planned future actions.
  • Actions: the agent’s realised actions, interacting with and influencing its environment.
  • Communication: the subset of actions that exchange information with other agents. 

We’ll use these factors in three ways. First, we’ll discuss what it means for an agent to be self-consistent with respect to each factor. Second, we’ll discuss consistency between two factors. Finally, we’ll explore the causal relationships between the factors, and how this affects their relative importance. 
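To make the model concrete, here’s one possible way to represent these factors as a data structure. This is purely our own illustrative sketch; the field names and types are hypothetical choices, not part of the framework.

```python
from dataclasses import dataclass, field

# Illustrative sketch: one way to carve an agent's state into the coherence
# factors listed above. Types are deliberately loose.

@dataclass
class CoherenceFactors:
    humanity_values: dict = field(default_factory=dict)  # external to the agent
    environment: dict = field(default_factory=dict)      # external to the agent
    values: dict = field(default_factory=dict)           # the agent's utility function / preferences
    evidence: list = field(default_factory=list)         # measurements and data about the environment
    beliefs: set = field(default_factory=set)            # claims the agent accepts
    plans: list = field(default_factory=list)            # intended future actions
    actions: list = field(default_factory=list)          # realised actions
    communication: list = field(default_factory=list)    # the subset of actions exchanging information
```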

Defining self-consistency

Note that Humanity’s values and the agent’s Environment are external to the agent we’re considering. They can be influenced by the agent’s actions, but it makes sense to distinguish them from the agent’s internal states. Importantly, we'll generally assume for the sake of clarity that both these external factors are self-consistent. 

For the other coherence factors, let’s consider what consistency and the lack thereof would look like:  

  • Values: can the agent’s values be combined and realized coherently? Without self-consistency of values, the agent cannot preferentially choose between multiple states coherently. This could lead to circular preferences (where A is valued over B, B over C, and C over A), Dutch booking, or failure to decide on any action at all (see the preference-cycle sketch after this list). 
  • Evidence: are the agent’s measurements and data about the external world consistent in its internal model? Without self-consistency of evidence, the agent cannot identify patterns. Note that this isn’t the same as the patterns being “true” or “real”. 
  • Beliefs: are the agent’s beliefs consistent with each other? Without self-consistency of beliefs, the agent can believe two contradictory statements, making decision-making harder and potentially leading to an inability to distinguish true and false statements (à la the principle of explosion).
  • Plans: do the agent’s plans support or undermine each other? Without self-consistency of planning, the agent cannot move towards its long-term goals. 
  • Actions: do the agent’s realized actions support or undermine each other? Without self-consistency of action, the agent likewise cannot move towards its long-term goals. 
  • Communication: is the agent consistent in their interactions with other agents?
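As a crude illustration of checking self-consistency of values (our own sketch, with hypothetical names): treat the agent’s strict preferences as a directed graph and look for cycles like A > B > C > A.

```python
# Illustrative sketch: detect circular preferences by treating the agent's
# strict preferences as a directed graph and searching for cycles.

def has_preference_cycle(prefers):
    """prefers: iterable of (better, worse) pairs. Returns True if preferences are cyclic."""
    graph = {}
    for better, worse in prefers:
        graph.setdefault(better, set()).add(worse)

    visited, on_stack = set(), set()

    def visit(node):
        if node in on_stack:
            return True               # reached a node already on the current path: cycle
        if node in visited:
            return False
        visited.add(node)
        on_stack.add(node)
        if any(visit(nxt) for nxt in graph.get(node, ())):
            return True
        on_stack.remove(node)
        return False

    return any(visit(node) for node in graph)

print(has_preference_cycle([("A", "B"), ("B", "C")]))              # False: consistent
print(has_preference_cycle([("A", "B"), ("B", "C"), ("C", "A")]))  # True: circular preferences
```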

Defining inter-factor consistency

Next, we’ll say that an agent is consistent between two factors X and Y if its state with respect to X aligns with its state with respect to Y. For example, an agent is consistent between its Values and Actions if its actions move the agent or environment toward valued states. In the case of AI, we could say that alignment is equivalent to the consistency of the agent's Values with Humanity's Values. 

Note that the idea of “consistency” is generally continuous rather than binary, and is defined differently for each pair of factors. For example, consistency between beliefs and evidence should involve Bayes’ theorem (or some approximation of it). The self-consistency of beliefs is about the absence of logical contradictions, while the self-consistency of actions is more about emergent outcomes, and so on. 
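As a toy example of what a continuous consistency measure between Beliefs and Evidence could look like (our own illustration, with invented numbers): an agent whose credence after seeing the evidence matches Bayes’ theorem scores 1, and an agent whose credence ignores the evidence scores lower.

```python
# Toy illustration: consistency between Beliefs and Evidence measured as
# closeness to a Bayesian update, P(H|E) = P(E|H) * P(H) / P(E).

prior = 0.2                          # P(H): prior credence in hypothesis H
likelihood = 0.9                     # P(E|H): probability of the evidence if H is true
p_evidence = 0.9 * 0.2 + 0.1 * 0.8   # P(E) via the law of total probability

posterior = likelihood * prior / p_evidence   # ~0.692

def belief_evidence_consistency(reported_credence):
    """1.0 means a perfectly Bayesian belief; lower means the belief ignores the evidence."""
    return 1.0 - abs(reported_credence - posterior)

print(round(posterior, 3))                # 0.692
print(belief_evidence_consistency(0.69))  # ~0.998: consistent with the evidence
print(belief_evidence_consistency(0.20))  # ~0.508: belief barely moved despite the evidence
```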

Further, we’ll assume that a prerequisite for two factors being consistent is that each is self-consistent. If an agent’s values are not self-consistent, it’s not meaningful to say their actions are consistent with their values: if two values contradict each other, how could the agent’s actions be consistent with them both? This isn’t universal though: consistency isn't binary and it may be meaningful to consider inter-factor consistency even when there’s some intra-factor inconsistency, as long as it’s not extreme.

We won’t go through every pair of factors, but here’s a table summarising some failure modes of inconsistency between pairs of factors (assuming consistency for each other pair): 

|             | Evidence | Beliefs | Values | Actions |
|-------------|----------|---------|--------|---------|
| Evidence    | Failure to identify patterns | | | |
| Beliefs     | Failure to infer correct patterns from evidence | Cognitive dissonance (& deductive explosion) | | |
| Values      | Failure to recognise valued states, or distinguish the value of two states | Failure to recognise valued states, or distinguish the value of two states | May follow incoherent strategies (e.g. Dutch booking) | |
| Actions     | Failure of actions to achieve plans | Failure of actions to achieve plans/goals | Failure to achieve movement towards your ideals | Actions undermine each other: goals not achieved |
| Environment | Failure of sensory input or mapping | Failure of data processing | Irrelevant values: caring about things you can't influence | Failure to influence the environment |

Wisdom

We will say an agent is wise when it’s consistent across all these factors. A wise agent avoids all of the failure modes listed above, and is globally coherent. An agent can become wiser by increasing its global consistency. 

Note that, if we were to evaluate how wise an agent is, we’d probably want to heavily weight the consistency between its values and humanity’s values (over, for example, the consistency of its plans with its beliefs), since a misaligned (but otherwise coherent) AI shouldn’t be labelled wise. In the next two sections we’ll build on the idea that some relations are more important for wisdom. 
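One crude way to make this weighting explicit (our own formalisation, not something defined in the post) is to score wisdom as a weighted sum of the self-consistencies and pairwise consistencies, with the alignment term weighted far above the rest:

```latex
% Illustrative formalisation (ours). c_i is the self-consistency of factor i,
% c_{ij} the consistency between factors i and j, all in [0,1]; the weight on
% the (agent's Values, Humanity's Values) pair dominates the others.
W(\text{agent}) \;=\; \sum_{i} w_i \, c_i \;+\; \sum_{i \neq j} w_{ij} \, c_{ij},
\qquad w_{\text{Values},\,\text{Humanity's Values}} \gg w_{ij} \text{ (otherwise)}.
```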

Causal structure and sequencing

The coherence factors interact, and we can represent them as an ordered sequence, or a hierarchy of influence. For example, an agent’s evidence depends on its environment, its beliefs on its evidence, and so on. If we let “x ⇒ y" denote the relation “x is necessary to determine y” (and we conflate actions and communication for concision), then we can construct a diagram like this: 

(Note that we could redraw this diagram with values directly linking to plans, rather than linking through beliefs. However, we think this ordering captures the distinction between believed values and true values. The former are determined by the latter, but an incoherent agent could misinterpret its true values (for example, a human believing they value amassing wealth when really they just need a hug). For more on this distinction, see the “want vs need” example later on.)

Identifying the direct causal influences is useful because it’s probably sufficient to consider only "local" consistencies. That is, we may be able to infer that the agent is globally consistent if the following are true:

  1. This causal structure is an accurate and complete list of the causal influences between factors.
  2. The agent is self-consistent with respect to each factor.  
  3. For every factor X needed to determine a factor Y, the agent is consistent between X and Y.

This simplifies the concept of wisdom, as we no longer have to determine the consistency between every pair: just pairs that are directly causally linked (which should make it a more useful optimization target). Further, this leads to a natural way of ordering the relative importance of consistencies, which we’ll discuss in the next section.
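To make the “local checks suffice” idea concrete, here’s a small sketch (ours; the edge list is our reading of the causal structure described above, and all scores are invented) that evaluates an agent using only per-factor self-consistency and per-edge consistency:

```python
# Illustrative sketch: check global consistency using only "local" checks --
# self-consistency of each factor plus consistency along each causal edge
# x => y. The edge list reflects our reading of the causal diagram; all
# scores are invented.

CAUSAL_EDGES = [
    ("humanity_values", "values"),
    ("environment", "evidence"),
    ("evidence", "beliefs"),
    ("values", "beliefs"),
    ("beliefs", "plans"),
    ("plans", "actions"),
]

self_consistency = {
    "humanity_values": 1.0, "environment": 1.0,   # external factors, assumed self-consistent
    "values": 0.9, "evidence": 0.8, "beliefs": 0.85, "plans": 0.7, "actions": 0.75,
}

edge_consistency = {
    ("humanity_values", "values"): 0.6,   # i.e. alignment
    ("environment", "evidence"): 0.9,
    ("evidence", "beliefs"): 0.8,
    ("values", "beliefs"): 0.85,
    ("beliefs", "plans"): 0.7,
    ("plans", "actions"): 0.9,
}

def locally_consistent(threshold=0.5):
    """True if every factor and every causal edge clears the threshold."""
    return (all(score >= threshold for score in self_consistency.values())
            and all(edge_consistency[edge] >= threshold for edge in CAUSAL_EDGES))

print(locally_consistent())  # True under these made-up scores
```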

We’ll return to this kind of causal diagram in an upcoming post, where we’ll use them to explore the causal relationships between factors involved in AI safety and alignment. As a preview example, we can track the causal influence from philosophy and humanity’s reflectively endorsed values through to AI planning and outcomes, and associate certain areas of AI safety research to particular causal links here, such as inner alignment corresponding to the influence of AI values on AI planning:

Ordering the importance of specific consistencies

Some consistencies are more important than others for determining the wisdom of an agent. For example, it may be more important that an agent’s plans are consistent with its values than that its actions are consistent with its plans. In this section we’ll define a partial order on pairs of factors, ordering them by the relative importance of their consistency (for evaluating wisdom).

To define the order, we propose three axioms. Let X, Y, and Z be coherence factors as defined above. Then our axioms are:

  1. If X ⇒ Y, then X is more important than Y.
  2. If X is more important than Y, then the consistency between X and an arbitrary factor Z is more important than the consistency between Y and Z.
  3. Humanity's values are more important than the agent's Environment. 

Let’s see what we can determine with these assumptions. Abbreviating each factor, in the previous section we had the following causal orders:

If we write X > Y to mean "X is more important than Y", then the third axiom tells us that Humanity's Values > Environment, and so we can apply the first axiom to order the factors: 

Then, using the second axiom, the consistency of an arbitrary factor X with the agent's Values is more important than with its Beliefs or Actions, for example. We can build a Hasse diagram of this relation, where a line between two elements implies the one higher up has precedence:

When evaluating an agent's wisdom, we'd probably want to weight certain consistencies above others: something like this Hasse diagram could be a crude way to do so, so that consistency between Humanity's Values and the agent's Values counts towards wisdom a lot more than the self-consistency of the agent's actions. 
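Here’s one way (our own sketch) the importance relation could be generated mechanically: start from the causal edges (axiom 1) and Humanity's Values > Environment (axiom 3), take the transitive closure, and read the Hasse diagram off its transitive reduction. Axiom 2 then lifts this order on factors to an order on pairwise consistencies.

```python
# Illustrative sketch: derive the "more important than" relation on factors
# from axiom 1 (x => y implies x > y), axiom 3 (Humanity's Values > Environment),
# and transitivity. The Hasse diagram is the transitive reduction of this set.

CAUSAL_EDGES = [
    ("humanity_values", "values"),
    ("environment", "evidence"),
    ("evidence", "beliefs"),
    ("values", "beliefs"),
    ("beliefs", "plans"),
    ("plans", "actions"),
]

more_important = set(CAUSAL_EDGES)                      # axiom 1
more_important.add(("humanity_values", "environment"))  # axiom 3

changed = True
while changed:                                          # transitive closure
    changed = False
    for a, b in list(more_important):
        for c, d in list(more_important):
            if b == c and (a, d) not in more_important:
                more_important.add((a, d))
                changed = True

print(("humanity_values", "actions") in more_important)  # True
print(("values", "evidence") in more_important)          # False: incomparable (partial order)
```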

Wisdom as an optimization target

The ordering we’ve defined in the previous section suggests that actions and their consistency with other factors are the least important. This might seem odd, since the actions of other agents are how we observe them. However, we think this fits a definition of wisdom. A wise agent could be clumsy or unable to act, while an unwise AI could be consistent in its actions but be misaligned, or fail to recognise its plans don’t match its values.

This highlights a difference between wisdom and capability: in the latter, the focus is on an agent’s actions and the consistency of those actions with plans and beliefs. For wisdom, actions are instrumentally useful but the consistency of values is most important. 

Note that the ordering we’ve defined doesn’t mean that the best way to improve wisdom is to go down the Hasse diagram and ensure each pair is 100% consistent before descending. Instead, the priority should be resolving the inconsistency whose resolution would most improve global consistency for the least marginal cost. 
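A minimal sketch of that prioritisation rule (ours, with invented numbers): rank candidate fixes by expected gain in global consistency per unit cost, rather than by their position in the Hasse diagram. The first candidate anticipates the quantum-mechanics example below.

```python
# Illustrative sketch: choose which inconsistency to resolve by marginal
# improvement in (weighted) global consistency per unit cost. Numbers invented.

candidates = [
    # (description, expected gain in weighted global consistency, cost)
    ("resolve local realism vs quantum mechanics beliefs", 0.02, 50.0),
    ("make daily actions match plans",                     0.10,  5.0),
    ("align plans with stated values",                     0.15, 20.0),
]

def priority(candidate):
    _, gain, cost = candidate
    return gain / cost   # marginal improvement per unit cost

best = max(candidates, key=priority)
print(best[0])  # "make daily actions match plans"
```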

For example, an agent might believe in the principles of locality and realism and in the fundamentals of quantum mechanics, but it may be more important for them to act consistently in their daily life than to resolve the possible contradictions of local realism in quantum mechanics. 

Further, when optimizing for global consistency, local inconsistency can sometimes increase. For example, consider a self-driving car that values both safety and efficiency. If it optimizes its plans to become more consistent with valuing efficiency, it may compromise the consistency between its actions and the value of safety. More broadly, consider an agent's values, beliefs, and plans. Recall that values ⇒ beliefs ⇒ plans, and suppose that the agent’s values are somewhat inconsistent with its beliefs, and its beliefs somewhat inconsistent with its plans. If we prioritize improving the consistency between values and beliefs, its plans may become less consistent with its beliefs until they’re updated, after which its actions may become less consistent with its pre-update actions. 
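As a tiny numeric illustration of that effect (invented numbers): an update that improves values↔beliefs consistency can temporarily worsen beliefs↔plans consistency, yet still raise the weighted global score.

```python
# Invented numbers: a belief update improves values<->beliefs consistency but,
# until the plans are re-derived, worsens beliefs<->plans consistency. The
# weighted global score still rises because values<->beliefs is weighted higher.

weights = {"values_beliefs": 3.0, "beliefs_plans": 1.0}

before = {"values_beliefs": 0.5, "beliefs_plans": 0.9}
after  = {"values_beliefs": 0.9, "beliefs_plans": 0.6}   # plans not yet updated

def global_score(consistencies):
    return sum(weights[key] * consistencies[key] for key in weights)

print(global_score(before))  # 2.4
print(global_score(after))   # 3.3 -- globally better despite the local regression
```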

Additional factors and extensions

As well as the primary factors discussed above, we could include secondary factors such as the following, though we’re less sure how these should fit into the model:

  • Social interactions: can the agent cooperate with other agents, or does it introduce conflict? Are its interactions near Nash equilibria? Failure to achieve social consistency would imply that two agents are acting against each other’s plans. 
  • Time: Is the agent consistent over time? Note that this factor isn’t completely orthogonal to the others: the relationship between plans and actions, for example, has a temporal element baked in. An extreme example of time-inconsistency could be an agent that replaces itself with another agent that has different values, beliefs, and so on (for example, a worker who outsources what they’re supposed to do to someone else). 
  • Emotional: Emotional consistency is an important factor of human psychology, but speculating on arbitrary agentic emotions is difficult. Nevertheless, it seems important to have some basic consistency in emotional reaction and to manage positive and negative reactions. Failures in emotional self-consistency (and in consistency between emotions and other factors) may relate to thought disorders, defined as cognitive disturbances that affect thought, language, and communication. 
  • Metacognition: Is the agent’s metacognition consistent? That is, do they reason about reasoning or think about thinking in a consistent, apt manner? For an example of a failure of meta-coherence, an agent might be overconfident in their beliefs (even when they’re inconsistent), in denial of evidence that clashes with their beliefs, or so underconfident in their beliefs they do not use them to shape plans & actions. This may also relate to some thought disorders, which we mention above.

We may return to these secondary factors in future articles, if we choose to extend or adapt this model. In particular, we plan to release an article on resolving inconsistencies between emotional reactions and values, and the role of value-outcome inconsistencies in motivating actions. 

Another aside is that evaluating an agent with this model may only be useful practically if the agent is sufficiently powerful. For example, it’s probably not worth using this model to evaluate the wisdom of a simple sorting algorithm, as we can’t meaningfully distinguish its values, beliefs, and plans. 

An analogy here is Gödel's incompleteness theorems, which show that no consistent mathematical system that is sufficiently powerful (roughly, one that contains arithmetic) can also be complete. Less powerful systems (such as Euclidean geometry) can be both complete and consistent, but their lack of power limits their utility. Analogously, a sorting algorithm isn’t powerful enough to “contain” distinct values, beliefs, and plans, so it may be globally consistent by default, while lacking the power for wisdom to be a useful metric.

Applying this model

We hope this model is useful for evaluating the wisdom of agents. This framework may even be useful for human self-improvement, or improving the wisdom and coherence of an organization. In this section, we’ll try applying the model to three examples: fiction, ChatGPT, and paperclip maximizers. 

Fiction

Many fictional stories, classic and contemporary, feature the growth and change of characters. It’s a story trope that we seem to find inherently engaging, and is part of the hero’s journey. In particular, the change tends to be internal, and involves some movement towards wisdom. This can take the form of resolving internal inconsistencies. 

Sometimes this is through a trope called “the lie your character believes”.  In our framework, we could consider this an inconsistency in beliefs. The character’s internal turmoil is a result of the inconsistency, and to grow they must resolve the inconsistency. For example: Scrooge believes the lie that a man’s worth is determined by his monetary worth, till he’s convinced otherwise and becomes happier and more benevolent. 

Another is the resolution of what’s called “want vs need”: the character wants something, and seeks it out while neglecting what they really need. We can interpret this as an inconsistency in values, or as an inconsistency between values and beliefs. Returning to Dickens, Scrooge wants to be wealthy, but he is miserable until he realizes he needs human connection and love. 

ChatGPT

ChatGPT is capable of impressive language generation, but how does it score under this definition? Note that, in this case, communication is essentially equal to action, since ChatGPT’s only available actions are communications.

Well, one failure mode that’s been seen in ChatGPT (at the time of writing) is an inconsistency between values and actions. One value instilled in ChatGPT is not to provide readers with information they could use to harm people. For example:

However, by providing the right context through prompts, it is possible to trick ChatGPT into giving this information:

This is an inconsistency between values and actions. However, recall our causal diagram:

We don’t know the inner workings of ChatGPT. We only know that one of its values is to not spread harmful information because we’ve been told so by its developers (and by ChatGPT, in its initial response to the TNT question above). The only aspect we can observe is its actions, and so we can’t identify where in this causal diagram the inconsistency arises. It may be that ChatGPT’s actions are consistent with its plans and beliefs and only inconsistent with its (stated) values, but this behaviour could also be explained if ChatGPT doesn’t believe it has given us the TNT recipe, rather believing it merely helped us with a story. 

Paperclip maximizers and evil geniuses

Both paperclip maximizers and brilliant evil geniuses are consistent across almost all these factors, failing only as they’re misaligned: their values aren’t consistent with humanity’s reflectively endorsed values. When all other consistencies are maxed out, improving wisdom is the same as improving alignment. This is why we earlier suggested that wisdom should heavily depend on this particular consistency, since we wouldn't like to call an evil genius wise. 

Conclusion 

We need metrics to evaluate the internal coherence and benevolence of AI, distinct from its raw intelligence and capabilities. We propose this idea of wisdom as a global consistency across coherence factors, as we believe this could be useful for evaluating and optimizing AI. 

If you work in AI development or a related field, we'd love for you to explore using this framework to evaluate the projects you’re working on, and as an optimization target for development and alignment. 

We’d also love people's input on this topic. Can you extend these ideas, or come up with other example applications? Can you suggest any improvements? If you work in AI development, or find other ways to apply this framework, was it helpful? Let us know in the comments.

This article is based on the ideas of Justin Shovelain, written by Elliot Mckernon, for Convergence Analysis. We’d like to thank Matt Goldenberg for his input on the topic, and Cesare Ardito, Harry Day, and David Kristofferrsson for their feedback while writing. 
 

Comments

Hoagy:

On first glance I thought this was too abstract to be a useful plan but coming back to it I think this is promising as a form of automated training for an aligned agent, given that you have an agent that is excellent at evaluating small logic chains, along the lines of Constitutional AI or training for consistency. You have training loops using synthetic data which can train for all of these forms of consistency, probably implementable in an MVP with current systems.

The main unknown would be detecting when you feel confident enough in the alignment of its stated values to human values to start moving down the causal chain towards fitting actions to values, as this is clearly a strongly capabilities-enhancing process.

Perhaps you could at least get a measure by looking at comparisons which require multiple steps, of human value -> value -> belief etc, and then asking which is the bottleneck to coming to the conclusion that humans would want. Positing that the agent is capable of this might be assuming away a lot of the problem though.

Hmm… this is different from my model of wisdom which is more about getting the important decisions correct, rather than being about consistency.

I kind of want to comment on this but am finding it hard to do so, so I'll at least leave a comment expressing my frustration.

This post falls into some kind of uncanny valley of "feels wrong but both too much and not enough detail to criticize it directly". There's lots of wiggle room here with things underdefined in ways that are hard to really address and know if this seems reasonable or not. It pattern matches though to lots of things in the category of "hey, I just heard about alignment and I thought about it for a while and I think I see how to solve it" though misses the most egregious errors of that category of thing, which is why this is hard to say much about.

So I come away thinking I have no reason to think this will work but also unable to say anything specific about why I think it won't work other than I think there's a bunch of hidden details in here that are not being adequately explored.

I would reckon: no single AI safety method "will work" because no single method is enough by itself. The idea expressed in the post would not "solve" AI alignment, but I think it's a thought-provoking angle on part of the problem.

I quite like the concept of alignment through coherence between the "coherence factors"!

"Wisdom" has many meanings. I would use the word differently to how the article is using it.

At a glance, this has too many parts to be the same math aliens would find. Perhaps formalize the process that generated them, then define wisdom in terms of it?

Thank you for this post! As I mentioned to both of you, I like your approach here. In particular, I appreciate the attempt to provide some description of how we might optimize for something we actually want, something like wisdom.

I have a few assorted thoughts for you to consider:

  • I would be interested in additional discussion around the inherent boundedness of agents that act in the world. I think self-consistency and inter-factor consistency have some fundamental limits that could be worth exploring within this framework. For example, might different types of boundedness systematically undermine wisdom in ways that we can predict or try and account for? You point out that these forms of consistency are continuous, which I think is a useful step in this direction

  • I'm wondering about feedback mechanisms for a wise agent in this context. For example, it would be interesting to know a little more about how a wise agent incorporates feedback into its model from the consequences of its own actions. I would be interested to see more in this direction in any future posts.

  • It strikes me that this post titled "Agency from a Causal Perspective" (https://www.alignmentforum.org/posts/Qi77Tu3ehdacAbBBe/agency-from-a-causal-perspective) might be of some particular interest to your approach here.

Excellent post here! I hope the comments are helpful!