
Dagon comments on Continually-adjusted discounted preferences - Less Wrong Discussion

Post author: Stuart_Armstrong 06 March 2015 04:03PM


Comment author: Dagon 07 March 2015 11:06:50AM 0 points

Time travel is about the worst possible example for discussing discount rates and future preferences. Your statements about what you want from an agent with respect to past, current, and future desires pretty much collapse if time travel exists, along with the commonsense definitions of the words "past", "current", and "future".

Additionally, 0.1% is way too high for the probability that significant agent-level time travel exists in our universe. Like hundreds (or more) of orders of magnitude too high. It's quite correct for me to say the probability I assign to it is 0%, since that's what it is to any reasonable rounding precision.

I'd like to hear more about how you think discounting should work in a rational agent, on more conventional topics than time travel.

I tend to think of utility as purely an instantaneous decision-making construct. For me, it's non-comparable across agents AND across time for an agent (because I don't have a good theory of agent identity over time, and because it's not necessary for decision-making). For me, utility is purely the evaluation of the potential future gameboard (universe) conditional on a choice under consideration.

Utility can't be stored, and gets re-evaluated for each decision. Memory and expectation, of course, are stored and carried forward, but that's not utility; that's universe state.

Discounting works by the agent counting on less utility for rewards that come further from the decision/evaluation point. I think it's strictly a heuristic - useful for estimating uncertainty about the future state of the agent (and the rest of the universe) when the agent can't calculate very precisely.

In any case, I'm pretty sure discounting is about the amount of utility for a given future material gain, not about the amount of utility over time.
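To make that concrete, here's a minimal sketch of exponential discounting applied to the utility of a future material gain (the 5% annual rate and the reward sizes are invented for illustration, not anything from the discussion):

```python
# Sketch: discounting as an evaluation-time heuristic. A reward of
# fixed material size is credited less utility the further it sits
# from the decision point. Rate and values are illustrative.

def discounted_utility(reward, years_away, annual_rate=0.05):
    """Utility credited now for a reward `years_away` years in the future."""
    return reward * (1 - annual_rate) ** years_away

print(discounted_utility(100, 0))   # reward now: 100.0
print(discounted_utility(100, 10))  # ten years out: ~59.9
```

Note that the discounting multiplies the utility assigned to the reward; the reward itself (the universe state) is unchanged.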

It's also my belief that self-modifying rational agents will correct their discounting pretty rapidly for cases where it doesn't optimize their goal achievement. Even in humans, you see this routinely: it only takes a little education for most investors to increase their time horizons (i.e. reduce their discount rate for money) by 10-100 times.
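One way to see the size of that effect: an exponential discounter's effective time horizon scales inversely with its discount rate, so cutting the rate by 10-100x lengthens the horizon by roughly the same factor. A tiny sketch (the rates are made up for illustration):

```python
# Illustrative: the "time horizon" of an exponential discounter,
# taken here as the years until a reward's weight falls to 1/e of
# its present value. For small rates, -1/ln(1 - r) is close to 1/r.

def horizon_years(annual_rate):
    """Approximate 1/e time horizon for a given annual discount rate."""
    return 1 / annual_rate

print(horizon_years(0.20))   # impatient investor: about 5 years
print(horizon_years(0.002))  # rate cut 100x: about 500 years
```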

Comment author: Stuart_Armstrong 09 March 2015 11:57:14AM 0 points

Additionally, 0.1% is way too high for the probability that significant agent-level time-travel exists in our universe.

The one person I asked - Anders Sandberg - gave 1% as his first estimate. But for most low probabilities, exponential shrinkage will eventually chew up the difference. 100 orders of magnitude - what's that, an extra 10,000 years?
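The back-of-the-envelope behind that figure: at a constant annual discount rate, the number of years needed to shrink a weight by a given number of orders of magnitude is a simple logarithm. A sketch, assuming an illustrative 2% annual rate (not a rate anyone in the thread proposed):

```python
import math

# Rough check: at a modest annual discount rate, how many years does
# exponential discounting take to shrink a weight by 100 orders of
# magnitude? (The 2% rate is illustrative.)

def years_to_shrink(orders_of_magnitude, annual_rate):
    """Years until (1 - annual_rate)**years = 10**-orders_of_magnitude."""
    return orders_of_magnitude * math.log(10) / -math.log(1 - annual_rate)

print(round(years_to_shrink(100, 0.02)))  # roughly 11,400 years
```

So at rates of a few percent per year, 100 orders of magnitude do indeed cost on the order of an extra 10,000 years.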

Comment author: Stuart_Armstrong 09 March 2015 11:54:05AM 0 points

I'd like to hear more about how you think discounting should work in a rational agent, on more conventional topics than time travel.

I don't think discounting should be used at all; instead, relevant facts about the past and future (eg expected future wealth) should be used to get discount-like effects.

However, there are certain agent designs (AIXI, unbounded utility maximisers, etc...) that might need discounting as a practical tool. In those cases, adding this hack could allow them to discount while reducing the negative effects.

Utility can't be stored, and gets re-evaluated for each decision.

Depends. Utility that sums (eg total hedonistic utilitarianism, a reward-agent made into a utility maximiser, etc...) does accumulate. Some other variants have utility that accumulates non-linearly, and many non-accumulating utilities might have an accumulating component.
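The distinction can be sketched in a few lines (the per-period values and the history are invented purely for illustration):

```python
# Sketch: a summing utility (e.g. total hedonistic utilitarianism, or
# a reward-agent rebuilt as a utility maximiser) accumulates over the
# whole history, while a purely instantaneous utility only scores the
# current state. Values are made up for illustration.

history = [3, 5, 2, 4]  # per-period rewards/hedons over four periods

total_utility = sum(history)         # accumulates across the history: 14
instantaneous_utility = history[-1]  # scores the current state only: 4
print(total_utility, instantaneous_utility)
```

A non-linearly accumulating variant would replace `sum` with some concave or otherwise non-additive function of the history.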