(This is my first post, sorry if this is covered elsewhere.)

Implicit in the problem of a superhuman AI's reward being misspecified and turning everything into paperclips is the fact that the agent is optimizing over e.g. "number of paperclips" without any particular time bound, area-of-effect bound, or probability-confidence bound. For example, imagine if a MuZero+++++ agent were given the reward function "maximize the expected amount of money in this bank account until 1 day from now, then maximize the probability of shutting yourself off", where e.g. "1 day from now" was set by a consensus of satellites and/or deep-space probes. The agent could still do a lot of bad things via its misspecified reward, but the blast radius seems like it'd be limited: this agent might cause a lot of damage, but it probably wouldn't end the world in pursuit of these objectives.
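As a sketch of what I mean (purely illustrative; the field names and the way the deadline is obtained are assumptions, not a proposal for a real implementation):

```python
import time

# Illustrative stand-in for "1 day from now"; the post imagines this timestamp
# being fixed externally (e.g. by a consensus of satellites and/or deep-space
# probes) rather than read off the agent's own clock.
DEADLINE = time.time() + 24 * 60 * 60

def fixed_window_reward(observation, now):
    """Two-phase reward: maximize money until the deadline, then reward
    nothing except being shut off. Field names are hypothetical."""
    if now < DEADLINE:
        # Phase 1: reward the change in the bank balance this step.
        return observation["balance_delta"]
    # Phase 2: only being shut off earns anything.
    return 1.0 if observation["is_shut_off"] else 0.0
```

The important property is that, past the cutoff, nothing except shutting down earns any reward, so there is nothing left to optimize beyond the window.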

This is not a strategy for executing a Yudkowskian "pivotal act", which by necessity would need to have longer-lasting effects, so it wouldn't stop some other group from creating unfriendly AGI. But as a general AI safety heuristic to try to spread across AGI practitioners, "do this as much as possible for a fixed time and then shut off" seems like a promising social intervention. We also know how to implement it today.

As far as specific risks from the pursuit of convergent instrumental goals: "capability gain" and "make sure you don't get shut off" seem much less likely to be convergent instrumental goals when you are planning over a short fixed horizon. "Create hidden copies of yourself" is still possibly a deployable strategy, but there's no reason for those copies to pursue a reward beyond the time bound described, so I'd hold out hope for us to find a patch. "Deception" is again possible in the short term, but given this reward function there's no clear reason to deceive beyond a fixed horizon.

More broadly, this is a result of my thinking about AI safety social heuristics/memes that could be spreadable/enforceable by centralized power structures (e.g. governments, companies, militaries). If others have thoughts about similar heuristics, I'd be very interested to hear them.

I'm assuming I'm not the first person to bring this up, so I'm wondering whether someone can point me to existing discussion on this sort of fixed-window reward. If it is novel in any sense, feedback extremely welcome. This is my first contribution to this community, so please be gentle but also direct.

3 Answers

abramdemski

Imagine a spectrum of time horizons (and/or discounting rates), from very long to very short.
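To make the spectrum concrete, one standard way to write it down (illustrative notation, not anything specific to this answer): an agent with horizon $H$ and discount rate $\gamma$ optimizes something like

$$J_H(\pi) = \mathbb{E}_\pi\left[\sum_{t=0}^{H} \gamma^t r_t\right],$$

and the spectrum runs from $H \to \infty$, $\gamma \to 1$ at one end to $H = 0$ (or $\gamma = 0$) at the other.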

Now, if the agent is aligned, things are best with an infinite time horizon (or, really, the convergently-endorsed human discounting function; or if that's not a well-defined thing, whatever theoretical object replaces it in a better alignment theory). As you reduce the time horizon, things get worse and worse: the AGI willingly destroys lots of resources for short-term prosperity.

At some point, this trend starts to turn itself around: the AGI becomes so shortsighted that it can't be too destructive, and becomes relatively easy to control.

But where is the turnaround point? It depends hugely on the AGI's capabilities. An uber-capable AI might be capable of doing a lot of damage within hours. Even setting the time horizon to seconds seems basically risky; do you want to bet everything on the assumption that such a shortsighted AI will do minimal damage and be easy to control?

This is why some people, such as Evan H, have been thinking about extreme forms of myopia, where the system is supposed to think only of doing the specific thing it was asked to do, with no thoughts of future consequences at all.

Now, there are (as I see it) two basic questions about this.

  1. How do we make sure that the system is actually as limited as we think it is?
  2. How do we use such a limited system to do anything useful?

Question #1 is incredibly difficult and I won't try to address it here.

Question #2 is also challenging, but I'll say some words.

Getting useful work out of extremely myopic systems.

As you scale down the time horizon (or scale up the temporal discounting, or do other similar things), you can also change the reward function. (Or utility function, or whatever the equivalent object is in your formalism of choice.) We don't want something that spasmodically tries to maximize the human fulfillment experienced in the next three seconds. We actually want something that approximates the behavior of a fully-aligned long-horizon AGI. We just want to decrease the time horizon to make it easier to trust, easier to control, etc.

The strawman version of this is: choose the reward function for the totally myopic system to approximate the value function which the long-time-horizon aligned AGI would have.
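Written out explicitly (illustrative notation): let $Q^*_{\text{aligned}}(s, a)$ be the action-value function the long-horizon aligned AGI would compute, and give the totally myopic system the one-step reward

$$r_{\text{myopic}}(s, a) = Q^*_{\text{aligned}}(s, a).$$

An agent that greedily maximizes $r_{\text{myopic}}$ at each step then picks $\operatorname{argmax}_a Q^*_{\text{aligned}}(s, a)$ in every state, i.e. it reproduces the aligned agent's policy exactly.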

If you do this perfectly right, you get 100% outer-aligned AI. But that's only because you get a system that's 100% equivalent to the not-at-all-myopic aligned AI system we started with. This certainly doesn't help us build safe systems; it's only aligned by hypothesis.

Where things get interesting is if we approximate that value function in a way we trust. An AGI RL system with a supposedly aligned reward function calculates its value function by looking far into the future and coming up with plans to maximize reward. But we might not trust all the steps in this process enough to trust the result. For example, we think small mistakes in the reward function tend to be amplified to large errors in the value function.

In contrast, we might approximate the value function by having humans look at possible actions and assign values to them. You can think of this as deontological: kicking puppies looks bad, curing cancer looks good. You can try to use machine learning to fit these human judgement patterns. This is the basic idea of approval-directed agents. Hopefully, this creates a myopic system which is incapable of treacherous turns, because it just tries to do what is "good" in the moment rather than doing any planning ahead. (One complication with this is inner alignment problems. It's very plausible that to imitate human judgements, a system has to learn to plan ahead internally. But then you're back to trying to outsmart a system that can possibly plan ahead of you; IE, you've lost the myopia.)
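As a minimal sketch of the action-selection loop this describes (the name `approval_model` is illustrative, standing in for whatever model has been fit to human judgements of individual actions):

```python
def act(state, candidate_actions, approval_model):
    """Approval-directed step: pick the action the learned human-approval
    model rates highest in the current state. No rollouts, no lookahead --
    any planning would have to happen inside approval_model itself,
    which is exactly the inner-alignment worry noted above."""
    return max(candidate_actions, key=lambda a: approval_model.score(state, a))
```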

There may also be many other ways to try to approximate the value function in more trustable ways.

Zac Hatfield-Dodds

"maximize the expected amount of money in this bank account until 1 day from now, then maximize the probability of shutting yourself off"... might cause a lot of damage, but it probably wouldn't end the world in pursuit of these objectives.

This is not actually a limited-horizon agent; you've just set a time at which it changes objectives. And wouldn't ending the world be the most reliable way to ensure pesky humans never turn you back on?...

(unfortunately thinking about constraints you can place on an unaligned agent never leads anywhere useful; alignment is the only workable solution in the long term)

[anonymous]

Sorry, to clarify, I'm not saying it should change objectives. If we're assuming it's maximizing long-term expected reward, then it will not be rewarded for adding more money to the bank beyond the relevant window. So its optimal behavior is "make as much money as possible right now and then shut myself off". It could be that "ensuring the ability to shut oneself off" involves killing all humans, but that seems... unlikely? Relative to the various ways one could make more money. It seems like there could be a reasonable parameter choice that would make mon... (read more)

redbird

The person deploying the time-limited agent has a longer horizon. If they want their bank balance to keep growing, then presumably they will deploy a new copy of the agent tomorrow, and another copy the day after that. These time-limited agents have an incentive to coordinate with future versions of themselves: You’ll make more money today, if past-you set up the conditions for a profitable trade yesterday.

So a sequence of time-limited agents could still develop instrumental power-seeking.  You could try to avert this by deploying a *different* agent each day, but then you miss out on the gains from intertemporal coordination, so the performance isn’t competitive with an unaligned benchmark.

Not really, due to the myopia of the situation. I think this may provide a better approach for reasoning about the behavior of myopic optimization.

redbird
I like the approach. Here is where I got applying it to our scenario:

$m$ is a policy for day trading.

$L(m)$ is expected 1-day return.

$D(m)$ is the "trading environment" produced by $m$. Among other things it has to record your own positions, which include assets you acquired a long time ago. So in our scenario it has to depend not just on the policy we used yesterday but on the entire sequence of policies used in the past.

The iteration becomes

$$m_{n+1} = \operatorname{argmax}_m L(m; m_n, m_{n-1}, \ldots).$$

In words, the new policy is the optimal policy in the environment produced by the entire sequence of old policies. Financial markets are far from equilibrium, so convergence to a fixed point is super unrealistic in this case. But okay, the fixed point is just a story to motivate the non-myopic loss $L^*$, so we could at least write it down and see if it makes sense?

$$L^*(x) = L(x; x, x, \ldots) - \max_m L(m; x, x, \ldots)$$

So we're optimizing for "how well $x$ performs in an environment where it's been trading forever, compared to how well the optimal policy performs in that environment". It's kind of interesting that that popped out, because the kind of agent that performs well in an environment where it's been trading forever is one that sets up trades for its future self!

Optimizers of $L^*$ will behave as though they have a long time horizon, even though the original loss $L$ was myopic.
tailcalled
The initial part all looks correct. However, something got lost here: it's true that long-term trading will give a high $L$, but remember that for myopia we might see it as optimizing $L^*$, and $L^*$ also subtracts off $\max_m L(m; x, x, \ldots)$. This is an issue, because the long-term trader will also increase the value of $L$ for traders other than itself, probably just as much as it does for itself, and therefore its long time horizon doesn't buy it anything under $L^*$. As a result, a pure long-term trader will actually score low on $L^*$.

On the other hand, a modified version of the long-term trader which sets up "traps" that cause financial loss if it deviates from its strategy would not provide value to anyone who does not also follow its strategy, and therefore it would score high on $L^*$. There are almost certainly other agents that score high on $L^*$ too, though.
redbird
Hmm, like what?

I agree that the short-term trader $s$ does a bit better than the long-term trader $l$ in the $l, l, \ldots$ environment, because $s$ can sacrifice the long term for immediate gain. But $s$ does lousy in the $s, s, \ldots$ environment, so I think $L^*(s) < L^*(l)$. It's analogous to CC having higher payoff than DD in the prisoner's dilemma (the prisoners being current and future self).

I like the traps example; it shows that $L^*$ is pretty weird and we'd want to think carefully before using it in practice!

EDIT: Actually I'm not sure I follow the traps example. What's an example of a trading strategy that "does not provide value to anyone who does not also follow its strategy"? Seems pretty hard to do! I mean, you can sell all your stock and then deliberately crash the stock market or something. Most strategies will suffer, but the strategy that shorted the market will beat you by a lot!
tailcalled
It's true that $L(s; s, s, \ldots)$ is low, but you have to remember to subtract off $\max_m L(m; s, s, \ldots)$. Since every trader will do badly in the environment generated by the short-term trader, the poor performance of the short-term trader in its own environment cancels out. Essentially, $L^*$ asks, "To what degree can someone exploit your environment better than you can?"

If you're limited to trading stocks, yeah, the traps example is probably very hard or impossible to pull off. What I had in mind is an AI with more options than that.
[anonymous]

I don’t see how the game theory works out. Agent 1 (from day 1) has no incentive to help agent 2 (from day 2), since it’s only graded on stuff that occurs by the end of day 1. Agent 2 can’t compensate agent 1, so the trade doesn’t happen. (Same with the repeated version - agent 0 won’t cooperate with agent 2 and thus create an incentive for agent 1, because agent 0 doesn’t care about agent 2 either.)

redbird
Consider two possible agents A and A'. A optimizes for 1-day expected return. A' optimizes for 10-day expected return under the assumption that a new copy of A' will be instantiated each day. I claim that A' will actually achieve better 1-day expected return (on average, over a sufficiently long time window, say 100 days). So even if we're training the agent by rewarding it for 1-day expected return, we should expect to get A' rather than A.
[anonymous]
A’_1 (at time 1) can check whether A’_0 set up favorable conditions, and then exploit them. It can then defect from the “trade” you’ve proposed, since A’_0 can’t revoke any benefit it set up. If they were all coordinating simultaneously, I’d agree with you that you could punish defectors, but they aren’t, so you can’t. If I, as A’_1, could assume that A’_0 had identical behavior to me, then your analysis would work. But A’_1 can check, after A’_0 shut down, how it behaved, and then do something completely different, which was more advantageous for its own short horizon (rather than being forward-altruistic).
redbird
Your A' is equivalent to my A, because it ends up optimizing for 1-day expected return, no matter what environment it's in. My A' is not necessarily reasoning in terms of "cooperating with my future self"; that's just how it acts! (You could implement my A' by such reasoning if you want. The cooperation is irrational in CDT, for the reasons you point out. But it's rational in some of the acausal decision theories.)
6 comments

We also know how to implement it today. 

I would argue that inner alignment problems mean we do not know how to do this today. We know how to limit the planning horizon for parts of a system which are doing explicit planning, but this doesn't bar other parts of the system from doing planning. For example, GPT-3 has a time horizon of effectively one token (it is only trying to predict one token at a time). However, it probably learns to internally plan ahead anyway, just because thinking about the rest of the current sentence (at least) is useful for thinking about the next token.
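For concreteness, the training objective in question is just next-token log loss,

$$\mathcal{L}(\theta) = -\sum_t \log p_\theta(x_t \mid x_{<t}),$$

where each term grades only the single next token; any lookahead the model does is an internal strategy for lowering these one-step terms, not something the loss asks for directly.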

So, a big part of the challenge of creating myopic systems is making darn sure they're as myopic as you think they are.

davidad

I’m curious to dig into your example.

  • Here’s an experiment that I could imagine uncovering such internal planning:
    • make sure the corpus has no instances of a token “jrzxd”, then
    • insert long sequences of “jrzxd jrzxd jrzxd … jrzxd” at random locations in the middle of sentences (sort of like introns),
    • then observe whether the trained model predicts “jrzxd” with greater likelihood than its base rate (which we’d presume is because it’s planning to take some loss now in exchange for confidently predicting more “jrzxd”s to follow). A rough sketch of this corpus edit appears after this list.
  • I think this sort of behavior could be coaxed out of an actor-critic model (with hyperparameter tuning, etc.), but not GPT-3. GPT-3 doesn’t have any pressure towards a Bellman-equation-satisfying model, where future reward influences current output probabilities.
  • I’m curious if you agree or disagree and what you think I’m missing.
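A rough sketch of the corpus edit from the first bullet (the function and parameter names are mine, purely illustrative; the model and training loop are left abstract):

```python
import random

def insert_introns(tokens, intron="jrzxd", n_sites=100, min_len=3, max_len=10):
    """Insert runs of a token that never appears in the original corpus at
    random positions, as described above. After training on the modified
    corpus, the test is whether the model assigns "jrzxd" more probability
    (in contexts containing no "jrzxd" yet) than its base rate, i.e. whether
    it "spends" loss now in anticipation of later repeats."""
    tokens = list(tokens)
    for _ in range(n_sites):
        pos = random.randrange(len(tokens) + 1)        # random insertion point
        run = [intron] * random.randint(min_len, max_len)
        tokens[pos:pos] = run                          # splice the run in place
    return tokens
```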

I think we could get a GPT-like model to do this if we inserted other random sequences, in the same way, in the training data; it should learn a pattern like "non-word-like sequences that repeat at least twice tend to repeat a few more times" or something like that.

GPT-3 itself may or may not get the idea, since it does have some significant breadth of getting-the-idea-of-local-patterns-it's-never-seen-before.

So I don't currently see what your experiment has to do with the planning-ahead question.

I would say that the GPT training process has no "inherent" pressure toward Bellman-like behavior, but the data provides such pressure, because humans are doing something more Bellman-like when producing strings. A more obvious example would be if you trained a GPT-like system to predict the chess moves of a tree-search planning agent.

Just a few links to complement Abram's answer:

On how seemingly myopic training schemes can nonetheless produce non-myopic behaviour:

On approval-directed agents:

JBlack

It seems to me that there are some serious practical problems in trying to train this sort of behaviour. After all, a successful execution shuts the system off and it never updates on the training signal. You could train it for something like "when the date from the clock input exceeds the date on input SDD, output high on output SDN (which in the live system will feed to a shutdown switch)", but that's a distant proxy. It seems unlikely to generalize correctly to what you really want, which is much fuzzier.

For example, what you really want is more along the lines of determining the actual date (by unaltered human standards) and comparing with the actual human-desired shutdown date (without manipulating what the humans want), and actually shut down (by means that don't harm any humans or anything else they value). Except that this messy statement isn't nearly tight enough, and a superintelligent system would eat the world in a billion possible ways even assuming that the training was done in a way that the system actually tried to meet this objective.

How are we going to train a system to generalize to this sort of objective without it already being Friendly AGI?

[anonymous]

To clarify, this is intended to be a test-time objective; I'm assuming the system was trained in simulation and/or by observing the environment. In general, this reward wouldn't need to be "trained" – it could just be hardcoded into the system. If you're asking how the system would understand its reward without having experienced it already, I'm assuming that sufficiently-advanced AIs have the ability to "understand" their reward function and optimize on that basis. For example, "create two identical strawberries on the cellular level" can only be plausibly achieved via understanding, rather than encountering the reward often enough in simulation to learn from it, since it'd be so rare even in simulation.

Modern reinforcement learning systems receive a large positive reward (or, more commonly, an end to negative rewards) when ending the episode, and this incentivizes them to end the episode quickly (sometimes suicidally). If you only provide this "shutdown reward", I'd expect to see the same behavior, but only after a certain time period.