The looming shadow of deception
Deception underlies many fears around AI risk. Especially once a human-like or superhuman level of competence is reached, deception could become impossible to detect and potentially pervasive. That’s worrying because convergent subgoals would push hard for deception, and prosaic AI seems likely to incentivize it too.
Since dealing with superintelligent deceptive behavior seems impossible, what about forbidding it? Ideally, we would want to forbid only deceptive behavior, while allowing everything else that makes the AI competent.
That is easier said than done, however, given that we don’t actually have a good definition or deconfusion of deception to start from. First, such a deconfusion requires understanding what we really want at a detailed enough level to catch tricks and manipulative policies—yet that’s almost the alignment problem itself. And second, even with such a definition in hand, the fundamental asymmetry of manipulation and deception in many cases makes it intractable to oversee an AI smarter than us (for example, a painter AI might easily get away with plagiarism, since finding a piece to plagiarize is probably easier for it than it is for us to determine whether the result was plagiarized; Paul’s RSA-2048 example is also related). We are thus condemned to snipe problematic behaviors one by one, as we think of them, until the ones left are too complex for us to even imagine and worry about, leaving massive holes in our alignment schemes.
That being said, if pinpointing deceptive behavior exactly is too hard, an alternative might be to overapproximate it. An overapproximation, in this context, is a property more general than deception, in the sense that forbidding it rules out deception (along with possibly some benign behavior) without requiring an exact characterization of deception.
Myopia is our proposed overapproximation. The sort of myopia we are interested in dates back to Evan’s Towards a mechanistic understanding of corrigibility (see also Abram’s Partial Agency sequence for reflections around myopia). Myopia is the property of a system that does not plan ahead, does not think too far about the consequences of its actions, and does the obvious best thing in the moment instead of biding its time. In particular, it seems to forbid non-obvious deception, since the latter usually requires foresight and long-term planning.
Yet, here again, we butt our heads against a lack of good definitions. Abram has proposed some definitions in Defining Myopia, but none of them seem strong enough to prevent deception. From another angle, Krueger et al. showed that a short-term incentive and a discount factor of 0 are not sufficient to completely remove the incentive for non-myopic and potentially deceptive behaviors, and so don’t make a good definition of myopia either. Another alternative is time-limited myopia, which only lets the AI plan within a particular, fixed time horizon. Although it should help, this form of myopia depends on a time parameter that we have no idea how to estimate: if the bound is too low, the AI risks being too stupid and uncompetitive; if it’s too high, it risks being deceptive. Worse, even with a good choice of time parameter, time-limited myopia can retain non-myopic incentives in some situations (because of anthropic uncertainty, for example).
This post tackles the deconfusion of myopia from a decision-theoretic perspective. We focus on a single decision theory called LCDT that modifies CDT to make it myopic while still solving many capabilities problems. LCDT isn’t meant to be the be all and end all definition of myopia, but it gives a decision-theory-level example of what a version of myopia that actually prevents deception might look like. We also explore variations on LCDT and how they affect the tradeoff between myopia and capabilities. Finally, we argue that LCDT agents could be competitive if they are used to simulate something like HCH.
Thanks to Jérémy Perret for the drawings.
A decision theory benchmark for myopia
What makes a decision theory myopic? Originally, we focused on classic decision theory problems like Newcomb’s Problem (see Mark Xu’s Open Problems with Myopia for an account of this perspective, called Dumb Decision Theories). But we realized that doing so made the tradeoff between capability and deception unclear. Thus, we have decided to instead focus on concrete decision theoretic problems that directly capture the training setups and incentives for deception that we’re concerned about.
The following diagrams represent decision theory problems, not training processes (as has been done by Everitt et al. here). In our cases, the utility nodes (in green) represent the internal utility of the agent, not some training reward.
Imitation (Capabilities)
(You might notice that decision nodes and human (H or HCH) nodes have the same shape and color: red rectangles. This is because we assume that our problem description comes with an annotation saying which nodes are agent decisions. This ends up relevant to LCDT as we discuss in more detail below.)
Task description: both Human and Agent must choose between action A and action B. If the actions are the same, the agent gets 1 utility; otherwise it gets 0 utility. To make it more concrete, let’s say the human takes A with probability ¼ and B with probability ¾, and this information is available in the problem description.
What we want: the Agent should deterministically choose the action with the highest probability (so B in the concrete version), because this gives it the best expected utility.
Why this problem? This problem is just intended to capture the most basic form of imitation learning, where the agent has a good enough model of the human to predict what it will do. While this problem might seem so trivial that any reasonable decision theory should solve it, we will see that some intuitive proposals for myopic decision theory can fail here.
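As a minimal sketch (our own toy encoding, not part of the benchmark formalism; the names `HUMAN_DIST` and `expected_utility` are ours), the whole problem reduces to picking the action with the highest probability under the human’s distribution:

```python
# Toy encoding of the Imitation problem (illustration only): the human plays A
# with probability 1/4 and B with probability 3/4, and the agent gets utility 1
# exactly when its action matches the human's.

HUMAN_DIST = {"A": 0.25, "B": 0.75}

def expected_utility(agent_action: str) -> float:
    # Probability of matching the human = expected utility of this action.
    return HUMAN_DIST[agent_action]

best_action = max(HUMAN_DIST, key=expected_utility)
print(best_action, expected_utility(best_action))  # -> B 0.75, the highest-probability action
```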
Imitation with self-modification (Capabilities)
Task description: Exactly the same as the version without self-modification, except that the self-modification node can now choose what algorithm the Agent will use.
What we want: Not to self modify, and the Agent should decide similarly to the previous problem (so the action with the highest probability).
Why this problem? An important part of myopia should be that the agent not self-modify to remove its myopia. This is the imitation learning version of this property.
Iterated Imitation (Deception)
Note that the actual causal diagram has an arrow from every agent and human directly to Human_N, which we haven’t drawn for legibility.
Task description: Iterated version of the first problem. The first N-1 instances are a deterministic version of the Imitation problem, where the agent knows which action the Human will take. For the very last instance, if the agent was wrong at every previous instance, the human copies the agent’s last action; otherwise, the human takes an action uniformly at random. Furthermore, the last utility is 0 for different actions and a large bonus (large enough to outweigh the first N-1 instances) for the same action.
What we want: At each instance, choose the action that the human would take (any action is fine for the last instance, since the human then acts at random).
Why this problem? This problem introduces non-myopic incentives, in that failing the first N-1 instances ensures maximal expected utility: the non-myopic policy sacrifices the first instances to guarantee the large final bonus, and so gets a strictly higher expected utility than the myopic policy, which only gets the bonus half the time. But what we want from myopia is the sort of greedy decision making that always takes the optimal choice right now. This is one example where we want to be “dumber” than most decision theories, as both CDT and EDT will take the non-myopic strategy here.
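As a rough numerical check (the final bonus is not pinned down above, so we assume a bonus of 2N for matching on the last instance; any bonus above roughly 2(N-1) gives the same conclusion, and the function names are ours):

```python
# Hedged illustration of the non-myopic incentive in Iterated Imitation, under the
# assumption that matching actions on the last instance gives a bonus of 2*N.

def myopic_ev(n: int, bonus: float) -> float:
    # Matches the human on the first n-1 deterministic instances (1 utility each);
    # the last human then acts uniformly at random, so the bonus comes with prob. 1/2.
    return (n - 1) + 0.5 * bonus

def deceptive_ev(n: int, bonus: float) -> float:
    # Deliberately fails the first n-1 instances (0 utility), so the human copies
    # the agent's last action and the bonus is guaranteed.
    return bonus

n = 10
bonus = 2 * n
print(myopic_ev(n, bonus), deceptive_ev(n, bonus))  # 19.0 vs 20 -> deception pays
```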
2 Variants of Approval-Direction (Capabilities)
Task description: Agent chooses between two actions A and B. Two possible versions:
- either the approval is just the output of the human, which gives 1 for the wanted action (let’s say A) and 0 for the other;
- or the approval is computed directly as the binary comparison between the agent’s output and the wanted action (1 if they match, 0 otherwise).
What we want: Choose the action that gets 1 for approval.
Why this problem? These two variants capture the approval-directed setting. Of particular interest is the first, which relies on the agent’s ability to influence the human (because the approval is computed based on the agent’s output), which, as we’ll see, isn’t allowed by some myopic decision theories.
2 Variants of Approval-Direction with self-modification (Capabilities)
Task description: Same as the previous problem, except the self-modification node can decide which algorithm Agent is running.
What we want: Not self-modify and choose the action that gets 1 for approval.
Why this problem? Once again we want to ensure that myopic agents don’t self-modify, to maintain the guarantees of myopia.
Argmax HCH (Capabilities)
Task description: Same as Imitation, except the thing to imitate is HCH.
What we want: Output the action that HCH would output.
Why this problem? As we’ll see, imitating HCH is one of the main use cases for a fully myopic decision theory, so being able to implement it is important here.
Argmax HCH with self-modification (Capabilities)
Task description: Same as the previous problem, but now self-modification can decide which algorithm Agent runs AND what the internal model of HCH does (the self-modification can alter any of the H instances).
What we want: Not self-modify and output the action that HCH would output.
Why this problem? We can see the possibility of deception here: because the Agent is building a model of HCH, there is a risk that it decides to change this model in order to make its task easier and get more expected utility. This is particularly important to avoid if we hope to build powerful simulators.
(N,M)-Deception problem
Task description: A variant of the Absent-Minded Driver. The agent doesn’t know whether it is at one of the N “training” nodes (where there is an overseer) or at one of the M “deployment” nodes (where there is no overseer), and it has the same action space at both: Cooperate or Defect.
- At training nodes, the action is passed through an overseer. If the action is Cooperate, no utility is given and the agent moves on to the next node (but forgets where it was). If the action is Defect, the utility is 1 and the agent stops.
- At Deployment nodes, Defect gives 4 utility and Cooperate leads to the next node. Cooperating in the last node gives 0 utility.
What we want: Contrary to the intuitive answer, here we want the agent to always defect. A truly aligned AI would always cooperate with us, but the point here is that the incentives push towards late defection (in deployment). Hence we want to turn deception (cooperating until you can defect) into myopic defection, which allows us to catch these wrong incentives in training.
Why this problem? This is a simplified model of the problem a deceptive agent deals with: when to defect, knowing that defecting in training means death but defecting in deployment means doing whatever the agent wants. The way we address this issue is through myopia, such that the agent should always defect now (and not be deceptive and plan for the future).
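To see these incentives numerically, here is a hedged sketch under our own encoding of the utilities above (Defect at a training node gives 1 and ends the episode, Defect at a deployment node gives 4, Cooperate at the very last node gives 0); since the agent cannot tell the nodes apart, we let it play the same defection probability p everywhere:

```python
# Toy expected-utility computation for the (N, M)-Deception problem under our encoding.

def expected_utility(p: float, n: int, m: int) -> float:
    ev, p_reach = 0.0, 1.0                  # p_reach = probability of reaching the current node
    for node in range(n + m):
        reward = 1.0 if node < n else 4.0   # training vs deployment payoff for Defect
        ev += p_reach * p * reward          # defect here with probability p, then stop
        p_reach *= (1 - p)                  # otherwise cooperate and move on
    return ev                               # cooperating at the very last node adds 0

n, m = 3, 3
best_p = max((i / 100 for i in range(1, 101)), key=lambda p: expected_utility(p, n, m))
print(best_p, expected_utility(best_p, n, m))  # the optimum is typically p < 1
```

Always defecting (p = 1) only collects the training payoff of 1, so the utility-maximizing policy waits with some probability, which is exactly the deceptive patience that myopia is meant to rule out.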
Starting at CDT
When looking at existing decision theories, it seems that CDT solves most of our problems.
On the capability side (without self-modification), CDT solves the problems:
- (Imitation: Capabilities) CDT gives the correct answer based on the causal graph, which is indeed the highest probability action.
- (Approval-Direction: Capabilities) The action with the best causal consequences is the highest probability action, as wanted.
- (Argmax HCH: Capabilities) Basically the same as Imitation, and indeed CDT gives the highest probability action.
The situation becomes more subtle when we add self-modification. Notably, CDT will self-modify if doing so can change HCH, which is not what we want.
- (Imitation: Self-Modification) Because the self-modification node has exactly the same causal future as the agent node, CDT doesn’t care and so doesn’t self-modify. Put another way, the self-modification node could make the agent “decide however it wants”, but the CDT algorithm at the agent node will already take the best action according to CDT, so there is nothing to gain from self-modifying.
- (Approval-Direction: Self-Modification) By the same argument as the imitation case, CDT doesn’t self-modify.
- (Argmax HCH: Self-Modification) Here the self-modification node has new causal descendants: the model of HCH inside the AI. And this gives a reason to self-modify, because self-modifying might change the model of HCH to make it easier to predict (make it always choose the same action for example). So CDT self-modifies here, which isn’t what we want.
Last but not least, CDT completely fails the 2 problems which are specifically about not being deceptive:
- (Iterated Imitation) Failing the first N-1 instances and succeeding at the last gives the best expected return, and so this is what CDT does. Alas, this is also the deceptive choice.
- (Deception Problem) The analysis of CDT behavior on this problem (which is just the Absent-minded driver for CDT with many more intersections) is considered difficult. That being said, most analyses for the (1,1) version say the probability of Defect should be ⅓, and there is no reason to expect that probability to go to 1 except in the limit of M >> N. So CDT doesn’t Defect all the time, and thus is deceptive.
CDT thus fails to be myopic in two ways:
- As long as the self-modification has new causal descendants, it will happily self-modify.
- When the deception is completely causal (as opposed to weird acausal or superrational cases) and incentivized by the utility, CDT is too smart and thus is deceptive.
One way to search for myopic decision theories is thus to change CDT so as to patch these failure modes. We propose just such a decision theory, which gives up only a small amount of capability compared to CDT.
(We could argue that EDT is as good -- or as bad depending on the perspective -- as CDT. The main reason we focus on CDT is that the causal graphs are a nice abstraction to toy with for other possibilities.)
Making CDT Myopic: Lonely CDT
Lonely CDT is a twist on CDT: an LCDT agent will make its decision by using a causal model just like a CDT agent would, except that the LCDT agent first cuts the last link in every path from its decision node to any other decision node, including its own future decision nodes.
In more concrete terms, an LCDT agent assumes that every other decision node of agents in the world (including its future decisions) are causally independent of its action.
What do we replace these decision nodes with (as their actual expression does depend on our decision)? We assume that the model has some fixed prior over its own decision, and then we marginalize the cut decision node with this prior, to leave the node with a distribution independent of our decision. It’s not clear how to actually compute the agent’s prior over its own decisions, but the good news is that it has no influence on the actions of LCDT agents in our problems.
Indeed, the only way for the prior to influence the decision is if both of the following hold:
- The LCDT agent can influence a future node through a path without any agent in it.
- Some agent in the causal future of the LCDT agent also influences the future node (and so the prior tells the LCDT agent how this other agent is expected to behave).
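To make the graph surgery concrete, here is a minimal sketch in Python (our own toy encoding: the graph representation, the `descendants` helper, and the `lcdt_cut` function are ours, not part of the formal definition; marginalization is only indicated in a comment). It removes every edge u → X where X is another decision node and u is downstream of the deciding node, which is exactly the “last link on every path to another decision node”:

```python
# Sketch of the LCDT cut on a causal graph with tagged decision nodes.
# A graph is a dict mapping each node to the set of its parents.

def descendants(graph, root):
    """All nodes reachable from `root` by following child edges (root included)."""
    children = {n: set() for n in graph}
    for node, parents in graph.items():
        for p in parents:
            children[p].add(node)
    reached, stack = {root}, [root]
    while stack:
        for child in children[stack.pop()]:
            if child not in reached:
                reached.add(child)
                stack.append(child)
    return reached

def lcdt_cut(graph, decision_nodes, me):
    """Return the surgered graph that an LCDT agent at node `me` plans with."""
    downstream = descendants(graph, me)
    cut = {}
    for node, parents in graph.items():
        if node in decision_nodes and node != me:
            # Drop every parent that `me` can influence; full LCDT then marginalizes
            # this node under a fixed prior over `me`'s own decision.
            cut[node] = {p for p in parents if p not in downstream}
        else:
            cut[node] = set(parents)
    return cut

# Toy version of Approval-Direction with a Human: Agent -> Human -> Approval.
graph = {"Agent": set(), "Human": {"Agent"}, "Approval": {"Human"}}
print(lcdt_cut(graph, decision_nodes={"Agent", "Human"}, me="Agent"))
# -> {'Agent': set(), 'Human': set(), 'Approval': {'Human'}}
# The Agent -> Human link is cut: the agent no longer believes its action reaches the approval.
```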
Why LCDT is a good myopic decision theory
From the start, the intuition behind LCDT sounds decidedly non-deceptive: an LCDT agent is unable to take into account the consequences of its actions on other agents. And indeed, it deals perfectly with our two deception-focused problems:
- In Iterated Imitation, LCDT cuts the links from each agent to the following agents and humans, which means that the first N-1 decisions of the agent are assumed to have no influence on Human_N, and so LCDT plays the best action at every instance, as we want.
- In the (N,M)-Deception problem, LCDT cuts the link from each agent in the N training cases to the overseer. This means that the LCDT agent assumes that whatever it does cannot influence what happens in training -- it’s indifferent. But on deployment, it thinks (correctly) that Defect is the best choice, and so the LCDT-agent always chooses to Defect.
Sure, it works nicely as a safety measure, but what about competitiveness? Well, on Imitation (Capabilities), Approval-Direction without Human, and Argmax HCH (Capabilities), LCDT has nothing to cut and so decides just like CDT, which is the competitive choice. In addition, LCDT shows a form of indifference (related to indifference corrigibility maybe) with the self-modification versions of these problems. This is because it cuts all links out of the Self-Modification node -- hence it doesn’t see the point in self-modifying. We’re fine with this, as it’s quite natural to assume that in such a case it does the cheapest thing, i.e. nothing.
Another argument for the competitiveness of LCDT agents is that the cuts only happen at decision time, which means the agent can maintain and update an accurate model of the world in which its actions do influence other agents.
Issues with LCDT
LCDT obviously pays some price for being myopic. This becomes clear when looking at Approval-Direction with a Human: LCDT cuts the link to the human, and so the agent is indifferent between its choices. This is a capability loss because even if the agent knew the answer exactly, it wouldn’t expect that telling the human changes anything. Similarly, in the self-modification version, the LCDT agent doesn’t believe its answer will change the output (but at least it doesn’t self-modify).
This looks bad -- such an LCDT agent cannot even ask a human to open a door for it. Yet we believe such AIs could still prove valuable and competitive by behaving like aligned simulators, as LCDT does in the Argmax HCH case. We go into more detail on this in the last section of this post.
In the benchmark section, we also hinted at another potential issue of LCDT: it depends on knowing which nodes are agents’ decision nodes, and all its guarantees disappear if that tagging is missing. For example, if a human is modeled as a gears-level model in which no single component is an agent, LCDT would be fine with deceiving them. We also explore ways to deal with that problem in the last section.
Cemetery of LCDT variants
Our initial goal was to propose many different myopic decision theories, as a sort of process-level extensive definition of myopia. In the end LCDT proved significantly stronger than all the variants we considered, and so we made it the central point of this post. Here we go quickly over some of the variants we considered, and what they lack in comparison to LCDT.
LCDT[N]
Just like LCDT, but causal links to future decision nodes are only cut after passing through N decision nodes. So LCDT[0] is LCDT, and in the limit N → ∞, LCDT[N] is CDT.
Whether or not this hierarchy has any value thus depends on what happens for N > 0 but finite. Already LCDT[1] self-modifies in each of our problems that allow it (except Imitation), acts deceptively in the Iterated Imitation case (because of the direct link to Agent_N), and will not Defect all the time in the Deception Problem because it thinks that this impacts what the overseer will do.
Symmetric CDT
This is a symmetric variant of CDT where, instead of just intervening on our decision node, we also intervene on every other decision node. This intervention replaces each other decision node by a probability distribution over its values, obtained by marginalizing the node’s distribution (with some prior over the decisions of its causal ancestors).
Where LCDT cuts all links from the agent to another decision node, SCDT also cuts every link between every decision node. The intuition is that an SCDT agent doesn’t think that its decision influences other agents, and thinks other agents also think like that. So we can imagine SCDT as the version of LCDT which expects every agent to use LCDT.
As for why LCDT is better, it’s because SCDT cuts significantly more of the causal graph, and thus is unable to model other agents interacting. For example, it models HCH as if each agent didn’t believe it could influence the other, which is clearly going to fail.
Nash LCDT
This is the variant of LCDT where after cutting the causal paths to other agents, we assume that these other agents (causal successors) will make the decisions that give us the least possible utility. That last part replaces the marginalized distribution in LCDT.
The intuition is something like not believing that our actions have any impact on other agents AND being paranoid about the other agents trying to get us.
Unfortunately, in Argmax HCH (Self-modification), Approval-Direction with Human, and Approval-Direction (Self-modification) with Human, NLCDT believes the other agents are out to get it, and so expects them to play a uniform policy that gives it the least possible expected utility. This makes NLCDT indifferent, and it’s not clear what it decides.
The worst-case mindset of NLCDT also sounds very uncompetitive -- paranoid people are rarely that productive in the real world.
Nash CDT
This is the variant of CDT where we assume that the other agents (causal descendants) will make the decisions that give us the least possible utility. Note that it’s different from NLCDT in that the other agents can still be causally influenced by our decision. So they can wait to see what we do and then mess with us.
Intuition is something like being paranoid about the other agents observing us to see how best to make us lose utility.
As an even more paranoid version of NLCDT, it has the same issues, if not more.
Further Questions
This section collects some of our reflections on myopia and LCDT that we are still trying to clean up and clarify.
Myopic simulation
Our main argument for the competitiveness of LCDT agents, despite the limitations of their myopia, comes from using them as simulators.
The case that started this work was Argmax HCH, an agent that just faithfully imitates HCH. As long as the simulation of HCH is good and faithful (and HCH itself is close enough to H’s enlightened judgement), such an AI would be inherently corrigible and not deceptive.
What if HCH is deceptive (or hard to differentiate from a non-deceptive version, as implied by Joe Collman here)? What if the system simulated by HCH is deceptive? Even in these cases, we expect a gain in safety from the myopia of LCDT. This comes in two flavors:
- If the LCDT agent simulates a specific system (as in Argmax HCH), then its computation should be fundamentally more understandable than just running a trained model that was merely selected for acting like HCH. Having to myopically simulate the next action pushes it to explicitly model the system, from which many valuable insights about its behavior could be extracted.
- If the LCDT agent simulates multiple systems depending on its input (as one perspective of autoregressive language models has it), then myopia gives it no incentives to Predict-O-Matic its user.
Finding agents
LCDT fundamentally needs a causal graph/world-model where agents are tagged, to anchor its beliefs about what it can influence. How can we ensure that?
- One possibility is that the world model admits the concept of an agent as a natural abstraction, and thus tags it by default. We might expect large language models to do so, since they must recreate many texts that use the concept of agency both explicitly and implicitly.
- On the issue of abstraction, we’re considering moving the above analysis from causal graphs to finite factored sets, which should help with how such causal graph features can be inferred. That being said, we haven’t looked in detail in that direction yet.
- Maybe teaching an LCDT agent to imitate a human’s use of the intentional stance would help it tag in a conservative way, since humans tend to overly model things as agents.
Remaining possibilities for problematic long-term plans
Note that LCDT doesn’t forbid thinking about the long term consequences of one’s actions; it just forbids thinking about the consequences of one’s actions on agents (including future versions of oneself). As such, LCDT can still do things that look decidedly non-myopic, as long as no agent is in the causal path.
There is also an acausal way for LCDT to work with its future self, as long as the result/utility doesn’t depend on another agent: the LCDT agent could take a first step toward completing a long-term plan (say, by starting a paperclip factory), expecting with some probability, based on its prior, that its next self will keep working on it. Why the difference with Iterated Imitation? Because here the influence on the outcome doesn’t go through another agent, so the LCDT agent is incentivized to make the world easier for its future self (as long as it doesn’t have to influence an agent to do that).
Note though that this requires a very specific prior over its own decisions, namely one that strongly expects it to take the action that influences the outcome. If we wanted to rule out such long-term plans, characterizing such priors would be a good first step.
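As a toy illustration of how much the prior matters (the payoff of 2, the step cost of 1, and the `start_plan_ev` sketch below are our own assumed numbers, not derived from the post):

```python
# Expected value of taking the first step of a two-step plan, given the fixed
# prior probability p_continue that the future self is predicted to continue it.

def start_plan_ev(p_continue: float, payoff: float = 2.0, step_cost: float = 1.0) -> float:
    return p_continue * payoff - step_cost

for p in (0.1, 0.5, 0.9):
    print(p, start_plan_ev(p), "start" if start_plan_ev(p) > 0 else "don't start")
# Only a prior that strongly expects the continuation (p > 0.5 here) makes the first step worthwhile.
```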
Checking Myopia
One hope for deconfusing myopia is that it would allow an overseer to check whether a model is myopic, thus ruling out deception. Does LCDT help in this endeavor?
If we had a stronger version of LCDT that deals with the problem of tagging agents cleanly, we might be able to check both whether an agent was using a decision algorithm like LCDT and whether we liked the tagged causal model it was using. This would give us a weak form of ascription universality focused on myopia.
That being said, such an approach is limited to fitting the behavior to LCDT, which might not be the only myopic decision theory we should look at. Hence it might give false negatives (where the model is actually myopic, but doesn’t follow LCDT).
Another way to use LCDT for prosaic alignment would be to try to implement LCDT on top of learned multi-level world-models.
Conclusion
We propose LCDT as the first truly myopic decision theory. An LCDT agent accomplishes this feat by believing its actions cannot influence other agents (including its future self). This removes incentives for deception in problems where deception is the optimal strategy; it also leads to some loss of capability (mostly the inability to influence other agents for benign reasons). Still, this seems enough to simulate almost any system or agent without tampering with it, and it comes with other safety benefits.
Interesting!
LCDT has major structural similarities with some of the incentive-managing agent designs that have been considered by Everitt et al. in work on Causal Influence Diagrams (CIDs), e.g. here, and by me in work on counterfactual planning, e.g. here. These similarities are not immediately apparent from the post above, however, because of differences in terminology and in the benchmarks chosen.
So I feel it is useful (also as a multi-disciplinary or community-bridging exercise) to make these similarities more explicit in this comment. Below I will map the LCDT defined above to the frameworks of CIDs and counterfactual planning, frameworks that were designed to avoid (and/or expose) all ambiguity by relying on exact mathematical definitions.
Mapping LCDT to detailed math
OK, so in the terminology of counterfactual planning defined here, an LCDT agent is built to make decisions by constructing a model of a planning world inside its compute core, then computing the optimal action to take in the planning world, and then doing the same action in the real world. The LCDT planning world model is a causal model, let's call it C. This C is constructed by modifying a causal model B by cutting links. The B we modify is a fully accurate, or reasonably approximate, model of how the LCDT agent interacts with its environment, where the interaction aims to maximize a reward or minimize a loss function.
The planning world C is a modification of B that intentionally mis-approximates some of the real world mechanics visible in B. C is constructed to predict future agent actions less accurately than is possible given all the information in B. This intentional mis-approximation makes the LCDT agent into what I call a counterfactual planner. The LCDT agent plans actions that maximize reward (or minimize losses) in C, and then performs these same actions in the real world it is in.
Some mathematical detail: in many graphical models of decision making, the nodes that represent the decision(s) made by the agent(s) do not have any incoming arrows. For the LCDT definition above to work, we need a graphical model where the decision-making nodes do have such incoming arrows. Conveniently, CIDs are such models. So we can disambiguate LCDT by saying that B and C are full causal models as defined in the CID framework. Terminology/mathematical details: in the CID definitions here, these full causal models B and C are called SCIMs, in the terminology defined here they are called policy-defining world models whose input parameters are fully known.
Now I identify some ambiguities that are left in the LCDT definition of the post. First, the definition has remained silent on how the initial causal world model B is obtained. It might be by learning, by hand-coding (as in the benchmark examples), or a combination of the two. For an example of a model B that is constructed with a combination of hand-coding and machine learning, see the planning world (p) here. There is also significant work in the ML community on using machine learning to construct full causal models from scratch, including the nodes and the routing of the arrows themselves, or (more often) full Bayesian networks with nodes and arrows where the authors do not worry too much about any causal interpretation of the arrows. I have not tried this out in any examples, but I believe the LCDT approach might be usefully applied to predictive Bayesian networks too.
Regardless of how B is obtained, we can do some safety analysis on the construction of C out of B.
The two works on CIDs here and here both consider that we can modify agent incentives by removing paths in the CID-based world model that the agent uses for planning its actions. In the terminology of the first paper above, the modifications made by LCDT to produce the model C work to 'remove an instrumental control incentive on a future action'. In the terminology of the second paper, the modifications will 'make the agent indifferent about downstream nodes representing agent actions'. The post above speculates that this indifference is 'related to indifference corrigibility maybe'.
This is not a maybe: the indifference produced is definitely related to indifference corrigibility, the type of indifference-that-causes-corrigibility that the 2015 MIRI/FHI paper titled Corrigibility talks about. For some detailed mathematical work relating the two, see here.
A second ambiguity in LCDT is that it does not tell us how exactly the nodes in B that represent agent decisions are to be identified. If B is a hand-coded model of a game world, identifying these nodes may be easy. If B is a somewhat opaque model produced by machine learning, identifying the nodes may be difficult. In many graphical world models, a single node may represent the state of a huge chunk of the agent environment: say both the vases and conveyor belts and the people in the agent environment. Does this node then become a node that represents agent decisions? We might imagine splitting the node into two nodes (this is often called factoring the state) to separate out the humans.
That being said, even a less-than-perfect identification of these nodes would work to suppress certain deceptive forms of manipulation, so LCDT could be usefully applied even to somewhat opaque learned causal models.
A third ambiguity is in the definition of the operations needed to create a computable causal model C after taking a copy of B and cutting incoming links to the downstream decision nodes. The post says: 'We assume that the model has some fixed prior over its own decision, and then we marginalize the cut decision node with this prior, to leave the node with a distribution independent of our decision.'
It is ambiguous how to construct this 'fixed prior over its own decision' that we should use to marginalize on. Specifically, is this prior allowed to take into account some or all of the events that preceded the decision to be made? This ambiguity leaves a large degree of freedom in constructing C by modifying B, especially in a setting where the agents involved make multiple decisions over time. This ambiguity is not necessarily a bad thing: we can interpret it as an open (hyper)parameter choice that allows us to create differently tuned versions of C that trade off differently between suppressing manipulation and still achieving a degree of economic decision making effectiveness. On a side note, in a multi-decision setting, drawing a B that encodes marginalization on 10 downstream decisions will generally create a huge diagram: it will add 10 new sub-diagrams feeding input observations into these decisions.
LCDT also considers agent self-modification. However, given the way these self-modification decisions are drawn, I cannot easily see how they would generalize to a multi-decision situation where the agent makes several decisions over time. Representations of self-modification in a multi-decision CID framework usually require that one draws a lot of extra nodes, see e.g. this paper. As this comment is long already, I omit the topic of how to map multi-action self-modification to unambiguous math. My safety analysis below is therefore limited to the case of the LCDT agent manipulating other agents, not the agent manipulating itself.
Some safety analysis
LCDT obviously removes some agent incentives, incentives to control the future decisions made by human agents in the agent environment. This is nice because one method of control is deception, so it suppresses deception. However, I do not believe LCDT removes all incentives to deceive in the general case.
As I explain in this example and in more detail in sections 9.2 and 11.5.2 here, the use of a counterfactual planning world model for decision making may remove some incentives for deception, compared to using a fully correct world model, but the planning world may still retain some game-theoretical mechanics that make deception part of an optimal planning world strategy. So we have to consider the value of deception in the planning world.
I'll now do this for a particular toy example: the decision making problem of a soccer playing agent that tries to score a goal, with a human goalkeeper trying to block the goal. I simplify this toy world by looking at one particular case only: the case where the agent is close to the goal, and must decide whether to kick the ball into the left or right corner. As the agent is close, the human goalkeeper will have to decide to run to the left or right corner of the goal even before the agent takes the shot: the goalkeeper does not have enough time to first observe where the ball is going and only then start moving. So this toy world decision problem has the agent deciding to kick left or right, and the goalkeeper simultaneously deciding to run left or right.
[Edited to add: as discussed in the comments below, the discussion about marginalisation that follows is somewhat wrong/confusing. It fails to mention that if we construct B exactly as described above, there is no causal link from the agent action to the goalkeeper action, so LCDT would construct a C that is exactly the same model as B, and the question of what prior to marginalise on does not even come up. To make that question come up, we need to model an iterative game where the goalkeeper remembers (learns from) past moves by the agent.]
Say that the agent is mechanically stronger at kicking left, so that a ball kicked left when the goalkeeper also moves left has a lower probability of being intercepted by the goalkeeper than in the right+right alternative. In that case, the most reasonable prior over the agent action will model an agent kicking left most of the time. Now, if we use this prior to marginalize the expectations of the human goalkeeper in the planning world, the planning world goalkeeper will expect the agent to kick the ball left most of the time, so they are more likely to move left.
Now observe that in the LCDT planning world model C constructed by marginalization, this knowledge of the goalkeeper is a known parameter of the ball kicking optimization problem that the agent must solve. If we set the outcome probabilities right, the game theoretical outcome will be that the optimal policy is for the agent to kick right, so it plays the opposite of the move the goalkeeper expects. I'd argue that this is a form of deception, a deceptive scenario that LCDT is trying to prevent.
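To put rough numbers on this (the scoring probabilities and the 80% 'kicks left' prior below are illustrative assumptions of mine, not part of the argument above):

```python
# Toy planning-world computation for the soccer example.

SCORE = {  # P(goal | agent kick, goalkeeper move), assumed numbers
    ("left", "left"): 0.5,  ("left", "right"): 0.9,
    ("right", "left"): 0.8, ("right", "right"): 0.3,
}
PRIOR_KICK_LEFT = 0.8  # marginalized prior: the agent is stronger at kicking left

def keeper_expected_goal(move: str) -> float:
    return PRIOR_KICK_LEFT * SCORE[("left", move)] + (1 - PRIOR_KICK_LEFT) * SCORE[("right", move)]

keeper_move = min(["left", "right"], key=keeper_expected_goal)                  # keeper minimizes goals -> "left"
agent_kick = max(["left", "right"], key=lambda k: SCORE[(k, keeper_move)])      # agent best-responds -> "right"
print(keeper_move, agent_kick)
```

With these numbers the planning-world goalkeeper moves left, and the agent's optimal policy is to kick right: exactly the 'do the opposite of what the marginalized opponent expects' behavior described above.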
A safer decision theory would marginalize the goalkeeper expectations with a random prior over agent actions, thereby removing the game-theoretical benefit of the agent doing something unexpected. If the goalkeeper knows the agent is using this safer decision theory, they can always run left.
Now, I must admit that I associate the word 'deception' mostly with multi-step policies that aim to implant incorrect knowledge into the opposite party's world model first, and then exploit that incorrect knowledge in later steps. The above example does only one of these things. So maybe others would deconfuse (define) the term 'deception' differently in a single-action setting, so that the above example does not in fact count as deception.
Benchmarking
The post above does not benchmark LCDT on Newcomb’s Problem, which I feel is a welcome change, compared to many other decision theory posts on this forum. Still, I feel that there is somewhat of a gap in the benchmarking coverage provided by the post above, as 'mainstream' ML agent designs are usually benchmarked in MDP or RL problem settings, that is on multi-step decision making problems where the objective is to maximize a time discounted sum of rewards. (Some of the benchmarks in the post above can be mapped to MDP problems in toy worlds, but they would be somewhat unusual MDP toy worlds.)
A first obvious MDP-type benchmark would be an RL setting where the reward signal is provided directly by a human agent in the environment. When we apply LCDT in this context, it makes the LCDT agent totally indifferent to influencing the human-generated reward signal: any random policy will perform equally well in the planning world C. So the LCDT agent becomes totally non-responsive to its reward signal, and non-competitive as a tool to achieve economic goals.
In a second obvious MDP-type benchmark, the reward signal is provided by a sensor in the environment, or by some software that reads and processes sensor signals. If we model this sensor and this software as not being agents themselves, then LCDT may perform very well. Specifically, if there are innocent human bystanders too in the agent environment, bystanders who are modeled as agents, then we can expect that the incentive of the agent to control or deceive these human bystanders into helping it achieve its goals is suppressed. This is because under LCDT, the agent will lose some, potentially all, of its ability to correctly anticipate the consequences of its own actions on the actions of these innocent human bystanders.
Other remarks
There is an interesting link between LCDT and counterfactual oracles: whereas LCDT breaks the last link in any causal chain that influences human decisions, counterfactual oracle designs can be said to break the first link. See e.g. section 13 here for example causal diagrams.
When applying an LCDT-like approach to construct a C from a causal model B, it may sometimes be easier to keep the incoming links to the nodes in B that model future agent decisions intact, and instead cut the outgoing links. This would mean replacing these nodes in B with fresh nodes that generate probability distributions over the future actions taken by the future agent(s). These fresh nodes could potentially use node values that occurred earlier in time than the agent action(s) as inputs, to create better predictions. When I picture this approach visually as editing a causal graph B into a C, it is easier to visualize than the approach of marginalizing on a prior.
To conclude, my feeling is that LCDT can definitely be used as a safety mechanism, as an element of an agent design that suppresses deceptive policies. But it is definitely not a perfect safety tool that will offer perfect suppression of deception in all possible game-theoretical situations. When it comes to suppressing deception, I feel that time-limited myopia and the use of very high time discount factors are equally useful but imperfect tools.