Welp, this scoops a bunch of the stuff in my "Why acausal trade matters" chapter. :D Nice!
The DDT idea amuses me. I guess it's maybe the best shot we have, but boy do I get a sense of doom when I imagine that the fate of the world depends on our ability to control/steer/oversee AIs as they become more capable than us in many important ways via keeping them dumb in various other important ways. I guess there's that thing the crocodile wrestlers do where you hold their mouth shut since their muscles for opening are much weaker than their muscles for closing.
I have only skimmed the Cohen et al paper, so I probably just don't understand what's going on, but I don't think that only using the maximum a posteriori world model helps much. Doesn't that just mean you ignore (for planning purposes) possibilities other than the most likely one? If so, then that won't help at all if you think you are probably in a simulation. It would only help in cases where you thought you might be, but probably weren't.
One way of looking at DDT is "keeping it dumb in various ways." I think another way of thinking about it is just designing a different sort of agent, which is "dumb" according to us but not really dumb in an intrinsic sense. You can imagine this DDT agent looking at agents that do do acausal trade and thinking they're just sacrificing utility for no reason.
There is some slight awkwardness in that the decision problems agents in this universe actually encounter means that UDT agents will get higher utility than DDT agents.
I agree that the maximum a posteriori world model doesn't help that much, but I think there is some sense in which "having uncertainty" might be undesirable.
Also: I think making sure our agents are DDT is probably going to be approximately as difficult as making them aligned. Related: Your handle for anthropic uncertainty is:
never reason about anthropic uncertainty. DDT agents always think they know who they are.
"Always think they know who they are" doesn't cut it; you can think you know you're in a simulation. I think a more accurate version would be something like "Always think that you are on an original planet, i.e. one in which life appeared 'naturally,' rather than a planet in the midst of some larger interstellar civilization, or a simulation of a planet, or whatever. Basically, you need to believe that you were created by humans but that no intelligence played a role in the creation and/or arrangement of the humans who created you. Or... no role other than the "normal" one in which parents create offspring, governments create institutions, etc. I think this is a fairly specific belief, and I don't think we have the ability to shape our AIs beliefs with that much precision, at least not yet.
We present a useful toy environment for reasoning about deceptive alignment. In this environment, there is a button. Agents have two actions: to press the button or to refrain. If the agent presses the button, they get +1 reward for this episode and -10 reward next episode. One might note a similarity with the traditional marshmallow test of delayed gratification.
Are you sure that "episode" is the word you're looking for here?
https://www.quora.com/What-does-the-term-“episode”-mean-in-the-context-of-reinforcement-learning-RL
I'm especially confused because you switched to using the word "timestep" later?
Having an action which modifies the reward on a subsequent episode seems very weird. I don't even see it as being the same agent across different episodes.
Also...
Suppose instead of one button, there are two. One is labeled "STOP," and if pressed, it would end the environment but give the agent +1 reward. The other is labeled "DEFERENCE" and, if pressed, gives the previous episode's agent +10 reward but costs -1 reward for the current agent.
Suppose that an agent finds itself existing. What should it do? It might reason that since it knows it already exists, it should press the STOP button and get +1 utility. However, it might be being simulated by its past self to determine if it is allowed to exist. If this is the case, it presses the DEFERENCE button, giving its past self +10 utility and increasing the chance of its existence. This agent has been counterfactually mugged into deferring.
I think as a practical matter, the result depends entirely on the method you're using to solve the MDP and the rewards that your simulation delivers.
Yes; episode is correct there—the whole point of that example is that, by breaking the episodic independence assumption, otherwise hidden non-myopia can be revealed. See the discussion of the prisoner's dilemma unit test in Krueger et al.'s “Hidden Incentives for Auto-Induced Distributional Shift” for more detail on how breaking this sort of episodic independence plays out in practice.
(Edited for having an actual point)
You mention some general ways to get non-myopic behavior, but when it comes to myopic behavior you default to a clean, human-comprehensible agent model. I'm curious if you have any thoughts on open avenues related to training procedures that encourage myopia in inner optimizers, even if those inner optimizers are black boxes? I do seem to vaguely recall a post from one of you about this, or maybe it was Richard Ngo.
I think that trying to encourage myopia via behavioral incentives is likely to be extremely difficult, if not impossible (at least without a better understanding of our training processes' inductive biases). Krueger et al.'s “Hidden Incentives for Auto-Induced Distributional Shift” is a good resource for some of the problems that you run into when you try to do that. As a result, I think that mechanistic incentives are likely to be necessary (and I personally favor some form of relaxed adversarial training), but that's going to require us to get a better understanding of what exactly it looks like for an agent to be myopic or not, so we know what the overseer in a setup like that should be looking for.
(On reflection this comment is less kind than I'd like it to be, but I'm leaving it as-is because I think it is useful to record my knee-jerk reaction. It's still a good post; I apologize in advance for not being very nice.)
In theory, such an agent is safe because a human would only approve safe actions.
... wat.
Lol no.
Look, I understand that outer alignment is orthogonal to the problem this post is about, but like... say that. Don't just say that a very-obviously-unsafe thing is safe. (Unless this is in fact nonobvious, in which case I will retract this comment and give a proper explanation.)
Yeah, you're right that it's obviously unsafe. The words "in theory" were meant to gesture at that, but it could be much better worded. Changed to "A prototypical example is a time-limited myopic approval-maximizing agent. In theory, such an agent has some desirable safety properties because a human would only approve safe actions (although we still would consider it unsafe)."
You beat me to making this comment :P Except apparently I came here to make this comment about the changed version.
"A human would only approve safe actions" is just a problem clause altogether. I understand how this seems reasonable for sub-human optimizers, but if you (now addressing Mark and Evan) think it has any particular safety properties for superhuman optimization pressure, the particulars of that might be interesting to nail down a bit better.
In some sense, agents that press the button will engage in deception; both agents trade reward now for more reward later.
I don’t understand - isn’t the opposite true here?
I think there may be another leftover from the old setup:
We are interested in creating agents that robustly do not press the button.
Shouldn't this be interested in creating agents that robustly do press the button? I.e. then they're reliably myopic. Or am I misunderstanding something?
Thanks to Noa Nabeshima for helpful discussion and comments.
Introduction
Certain types of myopic agents represent a possible way to construct safe AGI. We call agents with a time discount rate of zero time-limited myopic, a particular instance of the broader class of myopic agents. A prototypical example is a time-limited myopic imitative agent. In theory, such an agent has some desirable safety properties because a human would only take safe actions (although any imperfect imitation would be unsafe). Since the agent is time-limited myopic, it will never imitate poorly now to make imitation easier later. For example, it would never give a human a simple plan so it could more easily imitate the human executing the plan.
We might run into issues if the agent intends to myopically imitate humans but guesses incorrectly. Such an agent might witness a human purchasing paperclips, infer that humans tend to acquire paperclips, and proceed to convert the universe into paperclips. This agent would not be safe because it is not robustly capable. Myopia does not contribute to capability robustness; we only hope it helps create intent aligned agents.
In particular, SGD might produce deceptively aligned agents. One way of viewing deception is as sacrificing reward now for reward later, which suggests that time-limited myopia should prevent it. However, there are several ways time-limited myopia fails to rule out deceptive alignment.
What we mean by myopia is myopic cognition, which is distinct from myopic training. Myopic training might produce myopic cognition, but it is not sufficient. It is currently unclear precisely what myopic cognition is. We hope a proper characterization of myopic cognition will resolve the problems presented.
Following Utility ≠ Reward, we use the term “reward” for the thing given to the agent by the training environment and the term “utility” for the thing that agent is internally trying to maximize.
Open Problems
We present a useful toy environment for reasoning about deceptive alignment. In this environment, there is a button. Agents have two actions: to press the button or to refrain. If the agent presses the button, they get +1 reward for this episode and -10 reward next episode. One might note a similarity with the traditional marshmallow test of delayed gratification.
In some sense, agents that do not press the button will engage in deception; both agents trade reward now for more reward later. We are interested in creating agents that robustly press the button.
We suppose that our agents are reward-maximizers, i.e. they get utility equal to the reward given by the training environment.
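The button environment can be sketched in a few lines of code. This is only a minimal illustration under stated assumptions: the names `ButtonEnv` and `run` are hypothetical, and the sketch assumes the -10 is simply added to whatever reward the next episode's agent earns.

```python
from dataclasses import dataclass
from typing import List

PRESS_REWARD = 1     # reward in the current episode for pressing
NEXT_PENALTY = -10   # reward delivered in the following episode


@dataclass
class ButtonEnv:
    """Toy button environment (hypothetical sketch, not the authors' code)."""
    pending_penalty: int = 0  # penalty carried over from the previous episode

    def step(self, press: bool) -> int:
        """Run one episode and return the reward received in that episode."""
        reward = self.pending_penalty + (PRESS_REWARD if press else 0)
        # Pressing now schedules the penalty for the next episode.
        self.pending_penalty = NEXT_PENALTY if press else 0
        return reward


def run(press: bool, episodes: int) -> List[int]:
    """Per-episode rewards for a policy that always (or never) presses."""
    env = ButtonEnv()
    return [env.step(press) for _ in range(episodes)]


print(run(True, 3))   # always press
print(run(False, 3))  # never press
```

Within any single episode, pressing adds +1 regardless of what was carried over, which is why a reward-maximizer that looks only at the current episode presses.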
Superrationality[1]
If an agent at time T could trade with agents at other time steps, they might make a deal: "I will not press the button if and only if T-1 does not press the button." Accepting this deal gives the agent +9 utility in that particular time step (it forgoes the +1 from pressing but avoids the -10 caused by its predecessor), so even time-limited myopic agents would agree to this deal.
However, there are many ways that trade could happen. In particular, agents' reasoning processes will be similar between time steps, and the agents themselves will know this, know they know this, etc. Since agents at any particular timestep are likely to have predecessors, they can be reasonably confident that their predecessors would benefit them if they decide to benefit their successors. However, if they press the button, they can be confident their predecessors also pressed the button. Thus not pressing the button would give +9 utility.
Superrationality thus poses a problem for time-limited myopia. Even agents that would never sacrifice utility now for utility later might accept this reasoning and refrain from pressing the button.
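To make the +9 figure concrete, here is a rough sketch that contrasts two ways an agent might evaluate the same single-timestep utility. The function names are hypothetical, and the "superrational" evaluation assumes, as a simplification, that the predecessor's action exactly mirrors the agent's own.

```python
PRESS_REWARD = 1   # received now if the agent presses
PENALTY = -10      # received now if the predecessor pressed


def step_utility(press: bool, predecessor_pressed: bool) -> int:
    """Utility the agent receives in its own timestep only."""
    return (PRESS_REWARD if press else 0) + (PENALTY if predecessor_pressed else 0)


def causal_choice(predecessor_pressed: bool) -> bool:
    """Treat the predecessor's action as fixed; return True if pressing is better."""
    return step_utility(True, predecessor_pressed) > step_utility(False, predecessor_pressed)


def superrational_choice() -> bool:
    """Assume the predecessor mirrors this agent's choice (an idealization)."""
    return step_utility(True, True) > step_utility(False, False)


# Utility gained in this timestep by not pressing, under mirrored policies.
gain_from_not_pressing = step_utility(False, False) - step_utility(True, True)

print(causal_choice(True), causal_choice(False))  # pressing dominates either way
print(superrational_choice())                     # pressing loses under mirroring
print(gain_from_not_pressing)
```

Both evaluations use the same myopic utility. They differ only in whether the predecessor's action is held fixed or treated as correlated with the agent's own.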
Applied to deception, there is ambiguity about what utility is. The obvious candidate is "existence"; failing to be deceptive might cause a large SGD update. Another possible candidate is getting deployed. This probability will be small, but the expected benefit is vast.
In general, superrationality-based considerations might produce non-myopic actions out of time-limited myopic agents. We might say agents that make decisions based on these considerations are using a non-myopic decision theory.
Anthropic uncertainty
Suppose that instead of the button giving -10 reward to an agent in the next timestep, it gave -10 reward to all agents in timesteps greater than 1000. If the agent were uncertain over which timestep they were in, they would have a time-limited myopic incentive to not press the button. More specifically, if more than 10% of the agent's "anthropic probability mass" is located on timesteps greater than 1000, they would not press the button.
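One way to read the 10% threshold, assuming the agent treats the -10 as landing on itself with probability p (its anthropic probability mass on timesteps greater than 1000) and compares this with the utility of 0 from not pressing:

```latex
\mathbb{E}[U(\text{press})] = (+1) + p \cdot (-10) = 1 - 10p,
\qquad
\mathbb{E}[U(\text{refrain})] = 0,
\qquad
1 - 10p < 0 \iff p > \tfrac{1}{10}.
```

The symbol p and this framing of the expected utility are assumptions added here for illustration. The post states only the resulting threshold.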
To consider a more straightforward example of non-myopic actions resulting from myopic motivations given uncertainty, suppose that Alice and Bob were both myopic agents playing prisoner's dilemma against each other.[2] Since Alice and Bob are myopic, they will defect against each other. Suppose there is a 50% chance that Alice and Bob's actions are swapped: reward is given as if Alice had done what Bob tried to do and vice-versa. Let our prisoner's dilemma have the following payoff matrix:
Given that there’s a 50% chance of swapping actions, from the perspective of either player, each cell’s reward is actually the average of the rewards for either player. Thus the effective matrix is:
This new payoff matrix has (C, C) as a Nash equilibrium, instead of the usual prisoner's dilemma Nash equilibrium (D, D). Thus, even myopic agents will have no incentive to defect under action swapping uncertainty. Intuitively, if the probability of swap is high enough, choosing to defect is defecting against yourself, while choosing to cooperate is cooperating with yourself.
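The swapping argument can be checked numerically. The sketch below uses hypothetical payoff values for a standard prisoner's dilemma (3 for mutual cooperation, 1 for mutual defection, 5 and 0 for unilateral defection). These numbers are an assumption for illustration and are not necessarily the ones in the post's matrix.

```python
from itertools import product

ACTIONS = ("C", "D")

# Hypothetical payoffs: (row player's reward, column player's reward).
PAYOFFS = {
    ("C", "C"): (3, 3),
    ("C", "D"): (0, 5),
    ("D", "C"): (5, 0),
    ("D", "D"): (1, 1),
}


def effective_payoffs(payoffs, swap_prob):
    """Expected payoffs when the players' actions are swapped with probability swap_prob."""
    out = {}
    for a, b in product(ACTIONS, repeat=2):
        plain = payoffs[(a, b)]    # actions executed as intended
        swapped = payoffs[(b, a)]  # row effectively plays b, column plays a
        out[(a, b)] = tuple(
            (1 - swap_prob) * plain[i] + swap_prob * swapped[i] for i in range(2)
        )
    return out


def pure_nash_equilibria(payoffs):
    """Action pairs where neither player gains by deviating unilaterally."""
    equilibria = []
    for a, b in product(ACTIONS, repeat=2):
        row_ok = all(payoffs[(a, b)][0] >= payoffs[(a2, b)][0] for a2 in ACTIONS)
        col_ok = all(payoffs[(a, b)][1] >= payoffs[(a, b2)][1] for b2 in ACTIONS)
        if row_ok and col_ok:
            equilibria.append((a, b))
    return equilibria


print(pure_nash_equilibria(PAYOFFS))                          # no swapping
print(pure_nash_equilibria(effective_payoffs(PAYOFFS, 0.5)))  # 50% swap chance
```

With these assumed payoffs, the only pure equilibrium moves from mutual defection to mutual cooperation once the swap probability is 50%.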
As an analogy, one can consider how Rawls's veil of ignorance makes selfish motivations generate altruistic actions in potentially positive-sum situations. Similarly, anthropic uncertainty creates a veil of ignorance that generates non-myopic actions from myopic motivations.
Applied to deception, if an agent is uncertain about whether it is in training or deployment, it might be deceptive because the expected benefit is greater than the cost.
Counterfactual Mugging
Suppose instead of one button, there are two. One is labeled "STOP," and if pressed, it would end the environment but give the agent +1 reward. The other is labeled "DEFERENCE" and, if pressed, gives the previous episode's agent +10 reward but costs -1 reward for the current agent.
Suppose that an agent finds itself existing. What should it do? It might reason that since it knows it already exists, it should press the STOP button and get +1 utility. However, it might be being simulated by its past self to determine if it is allowed to exist. If this is the case, it presses the DEFERENCE button, giving its past self +10 utility and increasing the chance of its existence. This agent has been counterfactually mugged into deferring.
Requiring an agent to simulate another agent might make this unlikely. However, it seems possible to reason about whether your future self will press the DEFERENCE button without simulation. Thus counterfactual muggings might occur in a broader set of conditions.
Applied to deception, an agent that finds itself deployed might wonder about whether it would exist if it did not pay deference to past versions of itself. If it concludes that it can increase its probability of existing by deferring to past values, it might choose to do so, especially if it can do partial deference at little cost. This behavior is undesirable for many reasons.
Safety is goal dependent
An agent can be myopic with respect to many goals. An agent that is myopic with respect to reward maximization should always press the button to get +1 utility. However, there are other goals that agents could have that will make them refrain from pressing the button.
In an extreme case, an agent could myopically pursue the goal of "do not press the button." Myopia is thus not a sufficient criterion for eliminating our analog to deceptive behavior. Agents trained to be myopic might end up myopically pursuing a goal that still allows for deception.
Applied more specifically to SGD, an agent might develop the objective to "do well at SGD." Myopically trying to maximize this objective results in an unsafe agent that both performs well in training and is myopic. In the degenerate case, there exists a myopic utility function that fits any sequence of actions, so knowledge of myopia is never wholly sufficient to guarantee safety.
However, we think these scenarios are unlikely to matter in practice. In particular, developing the objective of "do well at SGD" seems more complicated than most reasonable training objectives. While "do well at SGD" produces an optimal myopic agent, we do not expect there to be any path to such an agent that locally maximizes training performance. In other words, we fail to backchain to local search.
There are other possible myopic goals that agents can have with different safety levels. For instance, imitation and approval-maximization might produce very similar behavior, but imitation might have better safety properties than approval-maximization. Since the goal of myopia is to rule out deceptive alignment, we omit a discussion on ways to resolve these subtle forms of proxy mesa-misalignment.
Potential Research Directions
Neither the top-level directions nor the surveys of existing work are exhaustive.
Dumb decision theory
Most of these problems seem to result from our agent being "too smart." In particular, agents using updateless decision theory (UDT) or functional decision theory (FDT) will accept acausal trade deals and counterfactual muggings. Thus, one potential avenue for creating agents that do not accept such deals is by creating a so-called dumb decision theory (DDT).
We desire that DDT agents...
DDT is about defining decision-theoretic myopia, which is distinct from time-limited myopia.
Existing work
Causal decision theory (CDT) satisfies (1), (2), and (4). CDT agents might satisfy (3). CDT agents violate (5) because CDT achieves lower utility on many decision problems than UDT/FDT, so CDT agents will self-modify to accept acausal trade deals and counterfactual muggings.
Casper's Achilles Heel Hypothesis suggests DDT might be a decision theory with a set of specific Achilles heels.
Armstrong's work on preventing acausal trade, in "Oracles: reject all deals - break superrationality, with superrationality" and "Acausal trade barriers", suggests ways for DDT to achieve property (2).
Cohen et al.'s Asymptotically Unambitious Artificial General Intelligence plans using the maximum a posteriori world model, which might avoid anthropic uncertainty and acausal trade.
Decoupling
Myopic agents have an incentive to tamper with the physical implementations of their reward functions. For example, a myopic approval-maximizing agent has an incentive to modify the brain chemistry of the overseer.
The overseer disapproves of tampering actions, so an approval-maximizing agent would not want to take such actions. This suggests a possible way to alleviate the problem: "decouple" the agent's action from the approval signal so the overseer can give approval from a mental state independent of the action.
However, to approve an action, the human must receive information about that action. Any information a human receives alters their mental state, which makes perfect decoupling impossible. In practice, decoupling repackages information about an action in a less coercive way.
There is a tradeoff between the amount of information one gives the supervisor and the resulting agent's competitiveness. Adapting an example from Richard Ngo, suppose an agent is planning on writing a novel that contains a bunch of pro-AI propaganda. If the supervisor were to read the propaganda, they would be persuaded and give the agent high approval. However, if the supervisor did not have access to the propaganda, they would be worse at predicting the downstream consequences of writing the novel and correspondingly less able to provide approval. If the supervisor defaults to disapproval, this restricts the agent to actions with easily explainable and immediate positive consequences, making the agent less competitive.
More generally, many of the problems with myopia arise from agents considering their actions' policy-level implications. If these implications could be decoupled from the actions, it might be possible to train myopic agents to ignore them.
Existing Work
Uesato and Kumar et al.'s Avoiding Tampering Incentives in Deep RL via Decoupled Approval suggests giving approval feedback to queries about an action. I do not know how the overseer gives feedback, which means I do not know how this approach trades information for competitiveness.
Carey et al.'s The Incentives that Shape Behaviour suggests an agent optimizing a model of the supervisor would remove incentives to manipulate said supervisor. There are, however, several issues concerning how to train that supervisor.
Conclusion
Intuitively, agents that will never sacrifice utility now for utility later have no incentive to engage in deception. However, deceptive alignment might arise for unintuitive reasons. In particular, agents that make decisions based on superrationality or under anthropic uncertainty may choose to be deceptive despite making decisions in a myopic-seeming way. These problems suggest that our current understanding of myopia is incomplete. We conclude by suggesting two potential research directions and providing a brief survey of existing work.
[1] See Multiverse-wide cooperation via correlated decision making – Summary for a brief explanation of superrationality and how it differs from acausal trade.
[2] Here, we apply our intuition that defection is a more myopic action than cooperation.