I've written before about the difficulty of distinguishing values from errors, from algorithms, and from context. Now I have to add to that list: How can we distinguish our utility function from the parameters we use to apply it?
In my recent discussion post, "Rationalists don't care about the future", I showed that exponential time-discounting, plus some assumptions about physics and knowledge, leads to not caring about the future. Many people responded by saying that, if I care about the future, this shows that my utility function does not use exponential time-discounting.
This response assumes that the shape of my time-discounting function is part of my utility function. In other words, the way you time-discount is one of your values.
By contrast, Eliezer wrote an earlier post saying that we should use human values, but without time-discounting. Eliezer is aware that humans appear to use time-discounting; so his post implicitly claims that the time-discounting function is not one of our values. It's a parameter for how we implement them.
(Some of the arguments Eliezer used were value-based arguments, suggesting that we can use our values to set the parameters that we use to implement our values... I suspect this recursive approach could introduce bogus solutions, like multiplying both sides of an equation by a variable, or worse; but that would take a longer post to address. I will note that some recursive equations do have unique solutions.)
The program of CEV assumes that a transhuman can use some extrapolated version of values currently used by some humans. If that transhuman has a life expectancy of a billion years, it will likely view time discounting differently. Eliezer's post against time discounting suggests, to me, a God-like view of the universe, in which we eliminate time discounting in the same way (and for the same reasons) that many people want to eliminate space-discounting (not caring about far-away people) in contemporary ethics. This is taking an ethical code that evolved agents have, which is constructed to promote the propagation of those agents' genes, and applying it without reference to any particular set of genes. This is also pretty much what folk-morality says a social moral code is. So the idea that you can apply the same utility function from a radically different context is inherent in CEV, and it is common to much public discourse on ethics, which assumes that you can construct a social morality based on the morality we find in individual agents.
On the other hand, I have argued that assuming that social ethics and individual ethics are the same is either merely sloppy thinking or an evolved (or deliberately constructed) lie. People who believe this would probably subscribe to a social-contract theory of ethics. (This view also has problems, beyond the scope of this post.)
I have one heuristic that I think is pretty good for telling when something is not a value: If it's mathematically wrong, it's an error, not a value. So my inclination is to point out that exponential time-discounting is correct. Every other form of time-discounting is dynamically inconsistent: the mere passage of time can reverse your preference between two fixed future outcomes. You can time-discount exponentially; or you can not time-discount at all, as Eliezer suggested; or you can be in error.
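To make that inconsistency concrete, here is a toy numerical sketch (my own made-up rewards and discount rates, nothing from that other post): under exponential discounting, your ranking of two fixed future rewards never changes as they draw nearer; under a hyperbolic-style discount, it can flip.

```python
# Toy illustration (my own numbers): exponential discounting gives
# time-consistent preferences; hyperbolic discounting does not.

def exponential(delay, delta=0.9):
    return delta ** delay

def hyperbolic(delay, k=1.0):
    return 1.0 / (1.0 + k * delay)

def preferred(discount, sooner=(100, 10), later=(110, 11), elapsed=0):
    """Which reward is preferred after `elapsed` units of time have already passed?"""
    (r1, t1), (r2, t2) = sooner, later
    v1 = r1 * discount(t1 - elapsed)
    v2 = r2 * discount(t2 - elapsed)
    return "sooner" if v1 > v2 else "later"

for elapsed in (0, 9):
    print(elapsed,
          "exponential ->", preferred(exponential, elapsed=elapsed),
          "| hyperbolic ->", preferred(hyperbolic, elapsed=elapsed))
# Exponential: the same choice at both times.
# Hyperbolic: the choice flips as the rewards get closer -- a preference reversal.
```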
But my purpose in this post is not to continue the arguments from that other post. It's to point out this additional challenge in isolating what values are. Is your time-discounting function a value, or a value parameter?
As Manfred said, it does not appear that the results of the paper are affected by time discounting.
Let's be a bit more explicit about this. The model in the paper is that an action (or perhaps a pair (action,context)) is represented by a single natural number; this is provided to the environment and it returns a single natural number; the agent feeds that number into its utility function, and out comes a utility. The agent has measured what the environment does in response to a finite set of actions; what it cares about is the expected utility (over computable environments whose behaviour is consistent with the agent's past observations, with some not-necessarily-computable but not too ill-behaved probability distribution on them) of the environment's response to its actions.
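A minimal sketch of that setup as I read it, with made-up environments, a made-up prior, and a toy utility function standing in for the paper's more careful construction:

```python
# Toy rendering of the setup above (my own toy environments and prior, not the
# paper's construction): an environment maps an action (a natural number) to an
# outcome (a natural number), and the agent scores an action by its expected
# utility over the environments consistent with its past observations.

def expected_utility(action, environments, prior, utility, observations):
    """E[utility(env(action))] over environments consistent with the observations."""
    consistent = [(env, p) for env, p in zip(environments, prior)
                  if all(env(a) == o for a, o in observations)]
    total = sum(p for _, p in consistent)
    return sum(p * utility(env(action)) for env, p in consistent) / total

envs = [lambda a: a + 1, lambda a: 2 * a]   # two toy "environments"
prior = [0.5, 0.5]
observations = [(1, 2)]                     # both environments agree on action 1
print(expected_utility(3, envs, prior, utility=lambda o: o, observations=observations))
```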
The paper says that the expected utilities don't exist, if the utility function is unbounded and computable (or merely bounded below in absolute value by an unbounded computable function).
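To get the flavour of why unboundedness causes trouble, here is my own toy picture (not the paper's argument): give prior weight proportional to 1/n^2 to an environment whose response is the outcome n, and let utility grow linearly in the outcome; the expectation then behaves like the harmonic series.

```python
# Toy illustration of the divergence (my own numbers, not the paper's argument):
# environment n gets unnormalised prior weight 1/n^2 and returns outcome n, and
# utility(outcome) = outcome; the expectation behaves like sum_n 1/n, which diverges.

def truncated_expected_utility(N):
    weights = [1.0 / n**2 for n in range(1, N + 1)]
    z = sum(weights)
    return sum(w * n for w, n in zip(weights, range(1, N + 1))) / z

for N in (10, 1000, 100000):
    print(N, truncated_expected_utility(N))   # grows without bound, roughly like log(N)
```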
(Remark: It seems to me that this isn't necessarily fatal; if the agent cannot exactly repeat previous actions, and if it happens that the expected utility difference between two of the agent's actions exists, then it can still decide between actions. However, (1) the "cannot repeat previous actions" condition seems a bit artificial and (2) I'd guess that the arguments in the paper can be adjusted to show that expected utility differences are also divergent. But I could be wrong, and it would be interesting to know.)
So, anyway. How does time discounting fit into this? It seems to me that this is meant to model the immediate response of the environment to the agent's action; time doesn't come into it at all. And the conclusion is that even then -- even without considering the possible infinite future -- the relevant expectations don't exist.
The pathology described in the paper doesn't seem to me to have anything to do with not discounting in time. Turning the consequences of an action into an infinite stream rather than a single result might make things worse, but it can't possibly make them better.
Actually, that's not quite fair. Here's one way it could make them better. One way to avoid the divergence described in the paper is to have a bounded utility function. That seems troublesome to many people. But it's not so unreasonable to say that the utility you attach to what happens in any bounded region of spacetime should be bounded. So maybe there's some mileage to taking the bounded-utility case of this model (where the expectations all exist happily), then representing an agent's actual deliberations as involving (say) some kind of limit as the spacetime region gets larger, and hoping that lim_{larger spacetime region} E[utility] converges even though E[lim_{larger spacetime region} utility] doesn't. Which might be the case; I haven't thought about it carefully enough to have much idea how plausible that is.
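(For what it's worth, here is a standard toy example, mine rather than anything from the paper, of how much hangs on the order of those two operations, so the hoped-for swap isn't automatic:)

```python
import random

# A standard toy example (mine, not specific to this model): let X_N be N with
# probability 1/N and 0 otherwise.  Then E[X_N] = 1 for every N, so lim_N E[X_N] = 1;
# but X_N -> 0 in probability, so "E[lim X_N]" = 0.  The two orders disagree.

def sample_X(N):
    return N if random.random() < 1.0 / N else 0

for N in (10, 100, 1000):
    draws = [sample_X(N) for _ in range(200000)]
    print(N, sum(draws) / len(draws))   # stays near 1, even though most draws are 0
```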
In that scenario, you might well get saved by exponential time-discounting. Or even something weaker like 1/t^2. (Probably not 1/t, though.) But it seems to me that filling in the details is a big job; and I don't think it can possibly be right to assert that time discounting makes the result of the paper go away, without doing that work.
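Here is the crude arithmetic behind that guess, assuming (my assumption, not anything in the paper) that the utility accrued per unit of time is bounded by some constant u_max:

```python
# Crude check (my own arithmetic, not the paper's): if utility per unit time is
# bounded by u_max, the discounted total is at most u_max * sum_t d(t).
# That bound is finite for exponential and 1/t^2 discounting, but not for 1/t.

def partial_sum(d, N):
    return sum(d(t) for t in range(1, N + 1))

for N in (10**3, 10**6):
    print(N,
          partial_sum(lambda t: 0.99 ** t, N),    # -> 99 (geometric series)
          partial_sum(lambda t: 1.0 / t**2, N),   # -> pi^2 / 6, about 1.645
          partial_sum(lambda t: 1.0 / t, N))      # ~ log(N), keeps growing
```

The geometric and 1/t^2 sums settle down as the horizon grows; the 1/t sum just keeps growing, which is why I doubt 1/t discounting would help.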
Hi, downvoter! If you happen not to be PhilGoetz (whose objections I already know from his reply), could you please let me know what you didn't like about what I wrote? Did I make a mistake or express something unclearly?
Thanks.