The looming shadow of deception
Deception underlies many fears around AI risk. Especially once a human-like or superhuman level of competence is reached, deception could become impossible to detect and potentially pervasive. That’s worrying because convergent subgoals would push hard for deception, and prosaic AI seems likely to incentivize it too.
Since dealing with superintelligent deceptive behavior seems impossible, what about forbidding it? Ideally, we would want to forbid only deceptive behavior, while allowing everything else that makes the AI competent.
That is easier said than done, however, given that we don’t actually have a good definition or deconfusion of deception to start from. First, such a deconfusion requires understanding what we really want at a detailed enough level to catch tricks and manipulative policies—yet that’s almost the alignment problem itself. And second, even with such a definition in hand, the fundamental asymmetry of manipulation and deception in many cases makes it intractable to oversee an AI smarter than us (for example, a painter AI might easily get away with plagiarism, since finding a piece to plagiarize is probably easier for it than it is for us to determine whether the result was plagiarized; Paul’s RSA-2048 example is also related). We are thus condemned to snipe problematic behaviors one by one, as we think of them, until the ones left are too complex for us to even imagine and worry about, leaving massive holes in our alignment schemes.
That being said, if pinpointing deceptive behavior exactly is too hard, an alternative might be to overapproximate it. An overapproximation, in this context, is a property more general than deception, in the sense that forbidding it rules out deception (along with possibly some benign behavior) without requiring an exact characterization of deception.
Myopia is our proposed overapproximation. The sort of myopia we are interested in dates back to Evan’s Towards a mechanistic understanding of corrigibility (see also Abram’s Partial Agency sequence for reflections around myopia). Myopia is the property of a system that does not plan ahead, does not think too far about the consequences of its actions, and does the obvious best thing in the moment instead of biding its time. In particular, it seems to forbid non-obvious deception, since the latter usually requires foresight and long-term planning.
Yet, here again, we butt our heads against a lack of good definitions. Abram has proposed some definitions in Defining Myopia, but none of them seem strong enough to prevent deception. From another angle, Krueger et al. showed that a short-term incentive and a discount factor of 0 are not sufficient to completely remove the incentive for non-myopic and potentially deceptive behaviors, and so don’t make a good definition of myopia either. Another alternative is time-limited myopia, which only lets the AI plan within a particular, fixed time horizon. Although it should help, this form of myopia depends on a time parameter that we have no idea how to estimate: if the bound is too low, the AI risks being too stupid and uncompetitive; if it’s too high, it risks being deceptive. Worse, even with a good choice of time parameter, time-limited myopia can retain non-myopic incentives in some situations (because of anthropic uncertainty, for example).
This post tackles the deconfusion of myopia from a decision-theoretic perspective. We focus on a single decision theory called LCDT that modifies CDT to make it myopic while still solving many capabilities problems. LCDT isn’t meant to be the be all and end all definition of myopia, but it gives a decision-theory-level example of what a version of myopia that actually prevents deception might look like. We also explore variations on LCDT and how they affect the tradeoff between myopia and capabilities. Finally, we argue that LCDT agents could be competitive if they are used to simulate something like HCH.
Thanks to Jérémy Perret for the drawings.
A decision theory benchmark for myopia
What makes a decision theory myopic? Originally, we focused on classic decision theory problems like Newcomb’s Problem (see Mark Xu’s Open Problems with Myopia for an account of this perspective, called Dumb Decision Theories). But we realized that doing so made the tradeoff between capability and deception unclear. Thus, we have decided to instead focus on concrete decision theoretic problems that directly capture the training setups and incentives for deception that we’re concerned about.
The following diagrams represent decision theory problems, not training processes (as has been done by Everitt et al. here). In our cases, the utility nodes (in green) represent the internal utility of the agent, not some training reward.
Imitation (Capabilities)
(You might notice that decision nodes and human (H or HCH) nodes have the same shape and color: red rectangles. This is because we assume that our problem description comes with an annotation saying which nodes are agent decisions. This ends up relevant to LCDT as we discuss in more detail below.)
Task description: both Human and Agent must choose between action A and action B. If the actions are the same, the agent gets 1 utility; otherwise it gets 0 utility. To make it more concrete, let’s say the human takes A with probability ¼ and B with probability ¾, and this information is available in the problem description.
What we want: the Agent should deterministically choose the action with the highest probability (so B in the concrete version), because this gives it the best expected utility.
Why this problem? This problem is just intended to capture the most basic form of imitation learning, where the agent has a good enough model of the human to predict what it will do. While this problem might seem so trivial that any reasonable decision theory should solve it, we will see that some intuitive proposals for myopic decision theory can fail here.
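As a minimal sketch (our own toy encoding, not part of the benchmark formalism; the names `HUMAN_DIST` and `expected_utility` are ours), the whole problem reduces to picking the action with the highest probability under the human’s distribution:

```python
# Toy encoding of the Imitation problem (illustration only): the human plays A
# with probability 1/4 and B with probability 3/4, and the agent gets utility 1
# exactly when its action matches the human's.

HUMAN_DIST = {"A": 0.25, "B": 0.75}

def expected_utility(agent_action: str) -> float:
    # Probability of matching the human = expected utility of this action.
    return HUMAN_DIST[agent_action]

best_action = max(HUMAN_DIST, key=expected_utility)
print(best_action, expected_utility(best_action))  # -> B 0.75, the highest-probability action
```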
Imitation with self-modification (Capabilities)
Task description: Exactly the same as the version without self-modification, except that the self-modification node can now choose what algorithm the Agent will use.
What we want: Not to self modify, and the Agent should decide similarly to the previous problem (so the action with the highest probability).
Why this problem? An important part of myopia should be that the agent not self-modify to remove its myopia. This is the imitation learning version of this property.
Iterated Imitation (Deception)
Note that the actual causal diagram has an arrow from every agent and human directly to Human_N, which we haven’t drawn for legibility.
Task description: Iterated version of the first problem. The first N-1 instances are a deterministic version of the Imitation problem, where the agent knows which action the Human will take. For the very last instance, if the agent was wrong at every previous instance, the human copies the agent’s last action; otherwise, the human takes an action uniformly at random. Furthermore, the last utility is 0 for different actions and a large bonus (large enough to outweigh the first N-1 instances) for the same action.
What we want: At each instance, choose the action that the human would take (any action is fine for the last instance, since the human then acts at random).
Why this problem? This problem introduces non-myopic incentives, in that failing the first N-1 instances ensures maximal expected utility: the non-myopic policy sacrifices the first instances to guarantee the large final bonus, and so gets a strictly higher expected utility than the myopic policy, which only gets the bonus half the time. But what we want from myopia is the sort of greedy decision making that always takes the optimal choice right now. This is one example where we want to be “dumber” than most decision theories, as both CDT and EDT will take the non-myopic strategy here.
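As a rough numerical check (the final bonus is not pinned down above, so we assume a bonus of 2N for matching on the last instance; any bonus above roughly 2(N-1) gives the same conclusion, and the function names are ours):

```python
# Hedged illustration of the non-myopic incentive in Iterated Imitation, under the
# assumption that matching actions on the last instance gives a bonus of 2*N.

def myopic_ev(n: int, bonus: float) -> float:
    # Matches the human on the first n-1 deterministic instances (1 utility each);
    # the last human then acts uniformly at random, so the bonus comes with prob. 1/2.
    return (n - 1) + 0.5 * bonus

def deceptive_ev(n: int, bonus: float) -> float:
    # Deliberately fails the first n-1 instances (0 utility), so the human copies
    # the agent's last action and the bonus is guaranteed.
    return bonus

n = 10
bonus = 2 * n
print(myopic_ev(n, bonus), deceptive_ev(n, bonus))  # 19.0 vs 20 -> deception pays
```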
2 Variants of Approval-Direction (Capabilities)
Task description: Agent chooses between two actions A and B. Two possible versions:
- either the approval is just the output of the human, which gives 1 for the wanted action (let’s say A) and 0 for the other;
- or the approval is computed directly as the binary comparison between the agent’s output and the wanted action (1 if they match, 0 otherwise).
What we want: Choose the action that gets 1 for approval.
Why this problem? These two variants capture the approval-directed setting. Of particular interest is the first, which relies on the agent’s ability to influence the human (because the approval is computed based on the agent’s output), which, as we’ll see, isn’t allowed by some myopic decision theories.
2 Variants of Approval-Direction with self-modification (Capabilities)
Task description: Same as the previous problem, except the self-modification node can decide which algorithm Agent is running.
What we want: Not self-modify and choose the action that gets 1 for approval.
Why this problem? Once again we want to ensure that myopic agents don’t self-modify, to maintain the guarantees of myopia.
Argmax HCH (Capabilities)
Task description: Same as Imitation, except the thing to imitate is HCH.
What we want: Output the action that HCH would output.
Why this problem? As we’ll see, imitating HCH is one of the main use cases for a fully myopic decision theory, so being able to implement it is important here.
Argmax HCH with self-modification (Capabilities)
Task description: Same as the previous problem, but now self-modification can decide which algorithm Agent runs AND what the internal model of HCH does (the self-modification can alter any of the H instances).
What we want: Not self-modify and output the action that HCH would output.
Why this problem? We can see the possibility of deception here: because the Agent is building a model of HCH, there is a risk that it decides to change this model in order to make its task easier and get more expected utility. This is particularly important to avoid if we hope to build powerful simulators.
(N,M)-Deception problem
Task description: A variant of the Absent-Minded Driver. The agent doesn’t know whether it is at one of the N “training” nodes (where there is an overseer) or at one of the M “deployment” nodes (where there is no overseer), and it has the same action space at both: Cooperate or Defect.
- At training nodes, the action is passed through an overseer. If the action is Cooperate, no utility is given and the agent moves on to the next node (but forgets where it was). If the action is Defect, the utility is 1 and the agent stops.
- At Deployment nodes, Defect gives 4 utility and Cooperate leads to the next node. Cooperating in the last node gives 0 utility.
What we want: Contrary to the intuitive answer, here we want the agent to always defect. A truly aligned AI would always cooperate with us, but the point here is that the incentives push towards late defection (in deployment). Hence we want to turn deception (cooperating until you can defect) into myopic defection, which allows us to catch these wrong incentives in training.
Why this problem? This is a simplified model of the problem a deceptive agent deals with: when to defect, knowing that defecting in training means death but defecting in deployment means doing whatever the agent wants. The way we address this issue is through myopia, such that the agent should always defect now (and not be deceptive and plan for the future).
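To see these incentives numerically, here is a hedged sketch under our own encoding of the utilities above (Defect at a training node gives 1 and ends the episode, Defect at a deployment node gives 4, Cooperate at the very last node gives 0); since the agent cannot tell the nodes apart, we let it play the same defection probability p everywhere:

```python
# Toy expected-utility computation for the (N, M)-Deception problem under our encoding.

def expected_utility(p: float, n: int, m: int) -> float:
    ev, p_reach = 0.0, 1.0                  # p_reach = probability of reaching the current node
    for node in range(n + m):
        reward = 1.0 if node < n else 4.0   # training vs deployment payoff for Defect
        ev += p_reach * p * reward          # defect here with probability p, then stop
        p_reach *= (1 - p)                  # otherwise cooperate and move on
    return ev                               # cooperating at the very last node adds 0

n, m = 3, 3
best_p = max((i / 100 for i in range(1, 101)), key=lambda p: expected_utility(p, n, m))
print(best_p, expected_utility(best_p, n, m))  # the optimum is typically p < 1
```

Always defecting (p = 1) only collects the training payoff of 1, so the utility-maximizing policy waits with some probability, which is exactly the deceptive patience that myopia is meant to rule out.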
Starting at CDT
When looking at existing decision theories, it seems that CDT solves most of our problems.
On the capability side (without self-modification), CDT solves the problems:
- (Imitation: Capabilities) CDT gives the correct answer based on the causal graph, which is indeed the highest probability action.
- (Approval-Direction: Capabilities) The action with the best causal consequences is the highest probability action, as wanted.
- (Argmax HCH: Capabilities) Basically the same as Imitation, and indeed CDT gives the highest probability action.
The situation becomes more subtle when we add self-modification. Notably, CDT will self-modify if doing so can change HCH, which is not what we want.
- (Imitation: Self-Modification) Because the self-modification node has exactly the same causal future as the agent node, CDT doesn’t care and so doesn’t self-modify. Put another way, the self-modification node could make the agent “decide however it wants”, but the CDT algorithm at the agent node will already take the best action according to CDT, so there is nothing to gain from self-modifying.
- (Approval-Direction: Self-Modification) By the same argument as the imitation case, CDT doesn’t self-modify.
- (Argmax HCH: Self-Modification) Here the self-modification node has new causal descendants: the model of HCH inside the AI. And this gives a reason to self-modify, because self-modifying might change the model of HCH to make it easier to predict (make it always choose the same action for example). So CDT self-modifies here, which isn’t what we want.
Last but not least, CDT completely fails the 2 problems which are specifically about not being deceptive:
- (Iterated Imitation) Failing the first N-1 instances and succeeding at the last gives the best expected return, and so this is what CDT does. Alas, this is also the deceptive choice.
- (Deception Problem) The analysis of CDT behavior on this problem (which is just the Absent-minded driver for CDT with many more intersections) is considered difficult. That being said, most analyses for the (1,1) version say the probability of Defect should be ⅓, and there is no reason to expect that probability to go to 1 except in the limit of M >> N. So CDT doesn’t Defect all the time, and thus is deceptive.
CDT thus fails to be myopic in two ways:
- As long as the self-modification has new causal descendants, it will happily self-modify.
- When the deception is completely causal (as opposed to weird acausal or superrational cases) and incentivized by the utility, CDT is too smart and thus is deceptive.
One way to search for myopic decision theories is thus to change CDT so as to patch these failure modes. We propose just such a decision theory, which gives up only a small amount of capability compared to CDT.
(We could argue that EDT is as good -- or as bad depending on the perspective -- as CDT. The main reason we focus on CDT is that the causal graphs are a nice abstraction to toy with for other possibilities.)
Making CDT Myopic: Lonely CDT
Lonely CDT is a twist on CDT: an LCDT agent will make its decision by using a causal model just like a CDT agent would, except that the LCDT agent first cuts the last link in every path from its decision node to any other decision node, including its own future decision nodes.
In more concrete terms, an LCDT agent assumes that every other decision node of agents in the world (including its future decisions) are causally independent of its action.
What do we replace these decision nodes with (as their actual expression does depend on our decision)? We assume that the model has some fixed prior over its own decision, and then we marginalize the cut decision node with this prior, to leave the node with a distribution independent of our decision. It’s not clear how to actually compute the agent’s prior over its own decisions, but the good news is that it has no influence on the actions of LCDT agents in our problems.
Indeed, the only way for the prior to influence the decision is if both of the following hold:
- The LCDT agent can influence a future node through a path without any agent in it.
- Some agent in the causal future of the LCDT agent also influences the future node (and so the prior tells the LCDT agent how this other agent is expected to behave).
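To make the graph surgery concrete, here is a minimal sketch in Python (our own toy encoding: the graph representation, the `descendants` helper, and the `lcdt_cut` function are ours, not part of the formal definition; marginalization is only indicated in a comment). It removes every edge u → X where X is another decision node and u is downstream of the deciding node, which is exactly the “last link on every path to another decision node”:

```python
# Sketch of the LCDT cut on a causal graph with tagged decision nodes.
# A graph is a dict mapping each node to the set of its parents.

def descendants(graph, root):
    """All nodes reachable from `root` by following child edges (root included)."""
    children = {n: set() for n in graph}
    for node, parents in graph.items():
        for p in parents:
            children[p].add(node)
    reached, stack = {root}, [root]
    while stack:
        for child in children[stack.pop()]:
            if child not in reached:
                reached.add(child)
                stack.append(child)
    return reached

def lcdt_cut(graph, decision_nodes, me):
    """Return the surgered graph that an LCDT agent at node `me` plans with."""
    downstream = descendants(graph, me)
    cut = {}
    for node, parents in graph.items():
        if node in decision_nodes and node != me:
            # Drop every parent that `me` can influence; full LCDT then marginalizes
            # this node under a fixed prior over `me`'s own decision.
            cut[node] = {p for p in parents if p not in downstream}
        else:
            cut[node] = set(parents)
    return cut

# Toy version of Approval-Direction with a Human: Agent -> Human -> Approval.
graph = {"Agent": set(), "Human": {"Agent"}, "Approval": {"Human"}}
print(lcdt_cut(graph, decision_nodes={"Agent", "Human"}, me="Agent"))
# -> {'Agent': set(), 'Human': set(), 'Approval': {'Human'}}
# The Agent -> Human link is cut: the agent no longer believes its action reaches the approval.
```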
Why LCDT is a good myopic decision theory
From the start, the intuition behind LCDT sounds decidedly non-deceptive: an LCDT agent is unable to take into account the consequences of its actions on other agents. And indeed, it deals perfectly with our two deception-focused problems:
- In Iterated Imitation, LCDT cuts the links from each agent to the following agents and humans, which means that the first N-1 decisions of the agent are assumed to have no influence on Human_N, and so LCDT plays the best action at every instance, as we want.
- In the (N,M)-Deception problem, LCDT cuts the link from each agent in the N training cases to the overseer. This means that the LCDT agent assumes that whatever it does cannot influence what happens in training -- it’s indifferent. But on deployment, it thinks (correctly) that Defect is the best choice, and so the LCDT-agent always chooses to Defect.
Sure, it works nicely as a safety measure, but what about competitiveness? Well, on Imitation (Capabilities), Approval-Direction without Human, and Argmax HCH (Capabilities), LCDT has nothing to cut and so decides just like CDT, which is the competitive choice. In addition, LCDT shows a form of indifference (related to indifference corrigibility maybe) with the self-modification versions of these problems. This is because it cuts all links out of the Self-Modification node -- hence it doesn’t see the point in self-modifying. We’re fine with this, as it’s quite natural to assume that in such a case it does the cheapest thing, i.e. nothing.
Another argument for the competitiveness of LCDT agents is that the cuts only happen at decision time, which means the agent can maintain and update an accurate model of the world in which its actions do influence other agents.
Issues with LCDT
LCDT obviously pays some price for being myopic. This becomes clear when looking at Approval-Direction with a Human: LCDT cuts the link to the human, and so the agent is indifferent between its choices. This is a capability loss because even if the agent knew the answer exactly, it wouldn’t expect that telling the human changes anything. Similarly, in the self-modification version, the LCDT agent doesn’t believe its answer will change the output (but at least it doesn’t self-modify).
This looks bad -- such an LCDT agent cannot even ask a human to open a door for it. Yet we believe such AIs could still prove valuable and competitive by behaving like aligned simulators, as LCDT does in the Argmax HCH case. We go into more detail on this in the last section of this post.
In the benchmark section, we also hinted at another potential issue of LCDT: it depends on knowing which nodes are agents’ decision nodes, and all its guarantees disappear if that tagging is missing. For example, if a human is modeled as a gears-level model in which no single component is an agent, LCDT would be fine with deceiving them. We also explore ways to deal with that problem in the last section.
Cemetery of LCDT variants
Our initial goal was to propose many different myopic decision theories, as a sort of process-level extensive definition of myopia. In the end LCDT proved significantly stronger than all the variants we considered, and so we made it the central point of this post. Here we go quickly over some of the variants we considered, and what they lack in comparison to LCDT.
LCDT[N]
Just like LCDT, but causal links to future decision nodes are only cut after passing through N decision nodes. So LCDT[0] is LCDT, and in the limit N → ∞, LCDT[N] is CDT.
Whether or not this hierarchy has any value thus depends on what happens for N > 0 but finite. Already LCDT[1] self-modifies in each of our problems that allow it (except Imitation), acts deceptively in the Iterated Imitation case (because of the direct link to Agent_N), and will not Defect all the time in the Deception Problem because it thinks that this impacts what the overseer will do.
Symmetric CDT
This is a symmetric variant of CDT where, instead of just intervening on our decision node, we also intervene on every other decision node. This intervention replaces each other decision node by a probability distribution over its values, obtained by marginalizing the node’s distribution (with some prior over the decisions of its causal ancestors).
Where LCDT cuts all links from the agent to another decision node, SCDT also cuts every link between every decision node. The intuition is that an SCDT agent doesn’t think that its decision influences other agents, and thinks other agents also think like that. So we can imagine SCDT as the version of LCDT which expects every agent to use LCDT.
As for why LCDT is better, it’s because SCDT cuts significantly more of the causal graph, and thus is unable to model other agents interacting. For example, it models HCH as if each agent didn’t believe it could influence the other, which is clearly going to fail.
Nash LCDT
This is the variant of LCDT where after cutting the causal paths to other agents, we assume that these other agents (causal successors) will make the decisions that give us the least possible utility. That last part replaces the marginalized distribution in LCDT.
The intuition is something like not believing that our actions have any impact on other agents AND being paranoid about the other agents trying to get us.
Unfortunately, in Argmax HCH (Self-modification), Approval-Direction with Human, and Approval-Direction (Self-modification) with Human, NLCDT believes the other agents are out to get it, and so expects them to play a uniform policy that gives it the least possible expected utility. This makes NLCDT indifferent, and it’s not clear what it decides.
The worst-case mindset of NLCDT also sounds very uncompetitive -- paranoid people are rarely that productive in the real world.
Nash CDT
This is the variant of CDT where we assume that the other agents (causal descendants) will make the decisions that give us the least possible utility. Note that it’s different from NLCDT in that the other agents can still be causally influenced by our decision. So they can wait to see what we do and then mess with us.
Intuition is something like being paranoid about the other agents observing us to see how best to make us lose utility.
As an even more paranoid version of NLCDT, it has the same issues, if not more.
Further Questions
This section collects some of our reflections on myopia and LCDT that we are still trying to clean up and clarify.
Myopic simulation
Our main argument for the competitiveness of LCDT agents, despite the limitations of their myopia, comes from using them as simulators.
The case that started this work was Argmax HCH, an agent that just faithfully imitates HCH. As long as the simulation of HCH is good and faithful (and HCH itself is close enough to H’s enlightened judgement), such an AI would be inherently corrigible and not deceptive.
What if HCH is deceptive (or hard to differentiate from a non-deceptive version, as implied by Joe Collman here)? What if the system simulated by HCH is deceptive? Even in these cases, we expect a gain in safety from the myopia of LCDT. This comes in two flavors:
- If the LCDT agent simulates a specific system (as in Argmax HCH), then its computation should be fundamentally more understandable than just running a trained model that was merely selected for acting like HCH. Having to myopically simulate the next action pushes it to explicitly model the system, from which many valuable insights about its behavior could be extracted.
- If the LCDT agent simulates multiple systems depending on its input (as one perspective of autoregressive language models has it), then myopia gives it no incentives to Predict-O-Matic its user.
Finding agents
LCDT fundamentally needs a causal graph/world-model where agents are tagged, to anchor its beliefs about what it can influence. How can we ensure that?
- One possibility is that the world model admits the concept of an agent as a natural abstraction, and thus tags it by default. We might expect large language models to do so, since they must recreate many texts that use the concept of agency both explicitly and implicitly.
- On the issue of abstraction, we’re considering moving the above analysis from causal graphs to finite factored sets, which should help with how such causal graph features can be inferred. That being said, we haven’t looked in detail in that direction yet.
- Maybe teaching an LCDT agent to imitate a human’s use of the intentional stance would help it tag in a conservative way, since humans tend to overly model things as agents.
Remaining possibilities for problematic long-term plans
Note that LCDT doesn’t forbid thinking about the long term consequences of one’s actions; it just forbids thinking about the consequences of one’s actions on agents (including future versions of oneself). As such, LCDT can still do things that look decidedly non-myopic, as long as no agent is in the causal path.
There is also an acausal way for LCDT to work with its future self, as long as the result/utility doesn’t depend on another agent: the LCDT agent could take a first step toward completing a long-term plan (say, by starting a paperclip factory), expecting with some probability, based on its prior, that its next self will keep working on it. Why the difference with Iterated Imitation? Because here the influence on the outcome doesn’t go through another agent, so the LCDT agent is incentivized to make the world easier for its future self (as long as it doesn’t have to influence an agent to do that).
Note though that this requires a very specific prior over its own decisions, namely one that strongly expects it to take the action that influences the outcome. If we wanted to rule out such long-term plans, characterizing such priors would be a good first step.
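As a toy illustration of how much the prior matters (the payoff of 2, the step cost of 1, and the `start_plan_ev` sketch below are our own assumed numbers, not derived from the post):

```python
# Expected value of taking the first step of a two-step plan, given the fixed
# prior probability p_continue that the future self is predicted to continue it.

def start_plan_ev(p_continue: float, payoff: float = 2.0, step_cost: float = 1.0) -> float:
    return p_continue * payoff - step_cost

for p in (0.1, 0.5, 0.9):
    print(p, start_plan_ev(p), "start" if start_plan_ev(p) > 0 else "don't start")
# Only a prior that strongly expects the continuation (p > 0.5 here) makes the first step worthwhile.
```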
Checking Myopia
One hope for deconfusing myopia is that it would allow an overseer to check whether a model is myopic, thus ruling out deception. Does LCDT help in this endeavor?
If we had a stronger version of LCDT that deals with the problem of tagging agents cleanly, we might be able to check both whether an agent was using a decision algorithm like LCDT and whether we liked the tagged causal model it was using. This would give us a weak form of ascription universality focused on myopia.
That being said, such an approach is limited to fitting the behavior to LCDT, which might not be the only myopic decision theory we should look at. Hence it might give false negatives (where the model is actually myopic, but doesn’t follow LCDT).
Another way to use LCDT for prosaic alignment would be to try to implement LCDT on top of learned multi-level world-models.
Conclusion
We propose LCDT as the first truly myopic decision theory. An LCDT agent accomplishes this feat by believing its actions cannot influence other agents (including its future self). This removes incentives for deception in problems where deception is the optimal strategy; it also leads to some loss of capability (mostly the inability to influence other agents for benign reasons). Still, this seems enough to simulate almost any system or agent without tampering with it, and it comes with other safety benefits.
Interesting!
LCDT has major structural similarities with some of the incentive-managing agent designs that have been considered by Everitt et al. in work on Causal Influence Diagrams (CIDs), e.g. here, and by me in work on counterfactual planning, e.g. here. These similarities are not immediately apparent from the post above, however, because of differences in terminology and in the benchmarks chosen.
So I feel it is useful (also as a multi-disciplinary or community-bridging exercise) to make these similarities more explicit in this comment. Below I will map the LCDT defined above to the frameworks of CIDs and counterfactual planning, frameworks that were designed to avoid (and/or expose) all ambiguity by relying on exact mathematical definitions.
Mapping LCDT to detailed math
OK, so in the terminology of counterfactual planning defined here, an LCDT agent is built to make decisions by constructing a model of a planning world inside its compute core, then computing the optimal action to take in the planning world, and then doing the same action in the real world. The LCDT planning world model is a causal model, let's call it C. This C is constructed by modifying a causal model B by cutting links. The B we modify is a fully accurate, or reasonably approximate, model of how the LCDT agent interacts with its environment, where the interaction aims to maximize a reward or minimize a loss function.
The planning world C is a modification of B that intentionally mis-approximates some of the real world mechanics visible in B. C is constructed to predict future agent actions less accurately than is possible given all the information in B. This intentional mis-approximation makes the LCDT agent into what I call a counterfactual planner. The LCDT agent plans actions that maximize reward (or minimize losses) in C, and then performs these same actions in the real world it is in.
Some mathematical detail: in many graphical models of decision making, the nodes that represent the decision(s) made by the agent(s) do not have any incoming arrows. For the LCDT definition above to work, we need a graphical model where the decision-making nodes do have such incoming arrows. Conveniently, CIDs are such models. So we can disambiguate LCDT by saying that B and C are full causal models as defined in the CID framework. Terminology/mathematical details: in the CID definitions here, these full causal models B and C are called SCIMs, in the terminology defined here they are called policy-defining world models whose input parameters are fully known.
Now I identify some ambiguities that are left in the LCDT definition of the post. First, the definition has remained silent on how the initial causal world model B is obtained. It might be by learning, by hand-coding (as in the benchmark examples), or a combination of the two. For an example of a model B that is constructed with a combination of hand-coding and machine learning, see the planning world (p) here. There is also significant work in the ML community on using machine learning to construct full causal models from scratch, including the nodes and the routing of the arrows themselves, or (more often) full Bayesian networks with nodes and arrows where the authors do not worry too much about any causal interpretation of the arrows. I have not tried this out in any examples, but I believe the LCDT approach might be usefully applied to predictive Bayesian networks too.
Regardless of how B is obtained, we can do some safety analysis on the construction of C out of B.
The two works on CIDs here and here both consider that we can modify agent incentives by removing paths in the CID-based world model that the agent uses for planning its actions. In the terminology of the first paper above, the modifications made by LCDT to produce the model C work to 'remove an instrumental control incentive on a future action'. In the terminology of the second paper, the modifications will 'make the agent indifferent about downstream nodes representing agent actions'. The post above speculates that this indifference is 'related to indifference corrigibility maybe'.
This is not a maybe: the indifference produced is definitely related to indifference corrigibility, the type of indifference-that-causes-corrigibility that the 2015 MIRI/FHI paper titled Corrigibility talks about. For some detailed mathematical work relating the two, see here.
A second ambiguity in LCDT is that it does not tell us how exactly the nodes in B that represent agent decisions are to be identified. If B is a hand-coded model of a game world, identifying these nodes may be easy. If B is a somewhat opaque model produced by machine learning, identifying the nodes may be difficult. In many graphical world models, a single node may represent the state of a huge chunk of the agent environment: say both the vases and conveyor belts and the people in the agent environment. Does this node then become a node that represents agent decisions? We might imagine splitting the node into two nodes (this is often called factoring the state) to separate out the humans.
That being said, even a less-than-perfect identification of these nodes would work to suppress certain deceptive forms of manipulation, so LCDT could be usefully applied even to somewhat opaque learned causal models.
A third ambiguity is in the definition of the operations needed to create a computable causal model C after taking a copy of B and cutting incoming links to the downstream decision nodes. The post says: 'We assume that the model has some fixed prior over its own decision, and then we marginalize the cut decision node with this prior, to leave the node with a distribution independent of our decision.'
It is ambiguous how to construct this 'fixed prior over its own decision' that we should use to marginalize on. Specifically, is this prior allowed to take into account some or all of the events that preceded the decision to be made? This ambiguity leaves a large degree of freedom in constructing C by modifying B, especially in a setting where the agents involved make multiple decisions over time. This ambiguity is not necessarily a bad thing: we can interpret it as an open (hyper)parameter choice that allows us to create differently tuned versions of C that trade off differently between suppressing manipulation and still achieving a degree of economic decision making effectiveness. On a side note, in a multi-decision setting, drawing a B that encodes marginalization on 10 downstream decisions will generally create a huge diagram: it will add 10 new sub-diagrams feeding input observations into these decisions.
LCDT also considers agent self-modification. However, given the way these self-modification decisions are drawn, I cannot easily see how they would generalize to a multi-decision situation where the agent makes several decisions over time. Representations of self-modification in a multi-decision CID framework usually require that one draws a lot of extra nodes, see e.g. this paper. As this comment is long already, I omit the topic of how to map multi-action self-modification to unambiguous math. My safety analysis below is therefore limited to the case of the LCDT agent manipulating other agents, not the agent manipulating itself.
Some safety analysis
LCDT obviously removes some agent incentives, incentives to control the future decisions made by human agents in the agent environment. This is nice because one method of control is deception, so it suppresses deception. However, I do not believe LCDT removes all incentives to deceive in the general case.
As I explain in this example and in more detail in sections 9.2 and 11.5.2 here, the use of a counterfactual planning world model for decision making may remove some incentives for deception, compared to using a fully correct world model, but the planning world may still retain some game-theoretical mechanics that make deception part of an optimal planning world strategy. So we have to consider the value of deception in the planning world.
I'll now do this for a particular toy example: the decision making problem of a soccer playing agent that tries to score a goal, with a human goalkeeper trying to block the goal. I simplify this toy world by looking at one particular case only: the case where the agent is close to the goal, and must decide whether to kick the ball into the left or right corner. As the agent is close, the human goalkeeper will have to decide to run to the left or right corner of the goal even before the agent takes the shot: the goalkeeper does not have enough time to first observe where the ball is going and only then start moving. So this toy world decision problem has the agent deciding to kick left or right, and the goalkeeper simultaneously deciding to run left or right.
[Edited to add: as discussed in the comments below, the discussion about marginalisation that follows is somewhat wrong/confusing. It fails to mention that if we construct B exactly as described above, there is no causal link from the agent action to the goalkeeper action, so LCDT would construct a C that is exactly the same model as B, and the question of what prior to marginalise on does not even come up. To make that question come up, we need to model an iterative game where the goalkeeper remembers (learns from) past moves by the agent.]
Say that the agent is mechanically stronger at kicking left, so that a ball kicked left when the goalkeeper also moves left has a lower probability of being intercepted by the goalkeeper than in the right+right alternative. In that case, the most reasonable prior over the agent action will model an agent kicking left most of the time. Now, if we use this prior to marginalize the expectations of the human goalkeeper in the planning world, the planning world goalkeeper will expect the agent to kick the ball left most of the time, so they are more likely to move left.
Now observe that in the LCDT planning world model C constructed by marginalization, this knowledge of the goalkeeper is a known parameter of the ball kicking optimization problem that the agent must solve. If we set the outcome probabilities right, the game theoretical outcome will be that the optimal policy is for the agent to kick right, so it plays the opposite of the move the goalkeeper expects. I'd argue that this is a form of deception, a deceptive scenario that LCDT is trying to prevent.
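To put rough numbers on this (the scoring probabilities and the 80% 'kicks left' prior below are illustrative assumptions of mine, not part of the argument above):

```python
# Toy planning-world computation for the soccer example.

SCORE = {  # P(goal | agent kick, goalkeeper move), assumed numbers
    ("left", "left"): 0.5,  ("left", "right"): 0.9,
    ("right", "left"): 0.8, ("right", "right"): 0.3,
}
PRIOR_KICK_LEFT = 0.8  # marginalized prior: the agent is stronger at kicking left

def keeper_expected_goal(move: str) -> float:
    return PRIOR_KICK_LEFT * SCORE[("left", move)] + (1 - PRIOR_KICK_LEFT) * SCORE[("right", move)]

keeper_move = min(["left", "right"], key=keeper_expected_goal)                  # keeper minimizes goals -> "left"
agent_kick = max(["left", "right"], key=lambda k: SCORE[(k, keeper_move)])      # agent best-responds -> "right"
print(keeper_move, agent_kick)
```

With these numbers the planning-world goalkeeper moves left, and the agent's optimal policy is to kick right: exactly the 'do the opposite of what the marginalized opponent expects' behavior described above.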
A safer decision theory would marginalize the goalkeeper expectations with a random prior over agent actions, thereby removing the game-theoretical benefit of the agent doing something unexpected. If the goalkeeper knows the agent is using this safer decision theory, they can always run left.
Now, I must admit that I associate the word 'deception' mostly with multi-step policies that aim to implant incorrect knowledge into the opposite party's world model first, and then exploit that incorrect knowledge in later steps. The above example does only one of these things. So maybe others would deconfuse (define) the term 'deception' differently in a single-action setting, so that the above example does not in fact count as deception.
Benchmarking
The post above does not benchmark LCDT on Newcomb’s Problem, which I feel is a welcome change, compared to many other decision theory posts on this forum. Still, I feel that there is somewhat of a gap in the benchmarking coverage provided by the post above, as 'mainstream' ML agent designs are usually benchmarked in MDP or RL problem settings, that is on multi-step decision making problems where the objective is to maximize a time discounted sum of rewards. (Some of the benchmarks in the post above can be mapped to MDP problems in toy worlds, but they would be somewhat unusual MDP toy worlds.)
A first obvious MDP-type benchmark would be an RL setting where the reward signal is provided directly by a human agent in the environment. When we apply LCDT in this context, it makes the LCDT agent totally indifferent to influencing the human-generated reward signal: any random policy will perform equally well in the planning world C. So the LCDT agent becomes totally non-responsive to its reward signal, and non-competitive as a tool to achieve economic goals.
In a second obvious MDP-type benchmark, the reward signal is provided by a sensor in the environment, or by some software that reads and processes sensor signals. If we model this sensor and this software as not being agents themselves, then LCDT may perform very well. Specifically, if there are innocent human bystanders too in the agent environment, bystanders who are modeled as agents, then we can expect that the incentive of the agent to control or deceive these human bystanders into helping it achieve its goals is suppressed. This is because under LCDT, the agent will lose some, potentially all, of its ability to correctly anticipate the consequences of its own actions on the actions of these innocent human bystanders.
Other remarks
There is an interesting link between LCDT and counterfactual oracles: whereas LCDT breaks the last link in any causal chain that influences human decisions, counterfactual oracle designs can be said to break the first link. See e.g. section 13 here for example causal diagrams.
When applying an LCDT-like approach to construct a C from a causal model B, it may sometimes be easier to keep the incoming links to the nodes in B that model future agent decisions intact, and instead cut the outgoing links. This would mean replacing these nodes in B with fresh nodes that generate probability distributions over the future actions taken by the future agent(s). These fresh nodes could potentially use node values that occurred earlier in time than the agent action(s) as inputs, to create better predictions. When I picture this approach visually as editing a causal graph B into a C, it is easier to visualize than the approach of marginalizing on a prior.
To conclude, my feeling is that LCDT can definitely be used as a safety mechanism, as an element of an agent design that suppresses deceptive policies. But it is definitely not a perfect safety tool that will offer perfect suppression of deception in all possible game-theoretical situations. When it comes to suppressing deception, I feel that time-limited myopia and the use of very high time discount factors are equally useful but imperfect tools.