Comment author: 01 February 2013 07:56:35PM *  0 points [-]

If it's specified to have a physical implementation, I think infinite-computation AIXI could actually get around dualism by predicting the behavior of its own physical implementation. That is, it computes outcomes as if the output channel (or similar complete output-determiner) is manipulated magically at the next time step, but it computes them using a causal model in which the physical implementation keeps working into the future.

So it wouldn't drop a rock on its head, because even though it thinks it could send the command magically, it can correctly predict the subsequent, causal behavior of the output channel, i.e. silence after getting smooshed by a rock.

This behavior actually does require a non-limit infinity, since the amount of simulation required always grows faster than the simulating power. But I think you can pull it off with approximation schemes - certainly it works in the case of replacing exact simulation of future self-behavior with heuristics like "always silent if rock dropped on head" :D

Comment author: 06 February 2013 09:14:35PM 2 points [-]

Super hard to say without further specification of the approximation method used for the physical implementation.

Comment author: 02 February 2013 07:29:30PM 1 point [-]

Great post!

But I'm not quite sure how I feel about the reformulation in terms of semimeasures instead of deterministic programs. Part of my motivation for the environment-specific utility was to avoid privileging observations over unobservable events in the environment in terms of what the agent can care about. So I would only consider the formulation in terms of semimeasures to be satisfactory if the semimeasures are specific enough that the correct semimeasure plus the observation sequence is enough information to determine everything that's happening in the environment.

I really dislike the discounting approach, because it doesn't respect the given utility function and makes the agent miss out on potentially infinite amounts of utility.

If we're going to allow infinite episodic utilities, we'll need some way of comparing how big different nonconvergent series are. At that point, the utility calculation will not look like an infinite sum in the conventional sense. In a sense, I agree that discounting is inelegant because it treats different time steps differently, but on the other hand, I think asymmetry between different time steps is somewhat inevitable. For instance, presumably we want the agent to value getting a reward of 1 on even steps and 0 on odd steps higher than getting a reward of 1 on steps that are divisible by three and 0 on non-multiples of three. But those reward sequences can be placed in 1-to-1 correspondence. This fact causes me to grudgingly give up on time-symmetric utility functions.
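The two reward sequences above can be made concrete. Here is a small sketch (the horizon `N` and variable names are invented for illustration): both sequences have countably many rewarded steps, so a bijection between them exists, yet their running averages converge to different values, which is exactly the asymmetry a time-symmetric comparison would throw away.

```python
# Two divergent reward series that a bijection can map onto each other,
# yet whose running (Cesaro) averages differ.

N = 30_000  # finite horizon for the approximation

# Reward 1 on even steps, 0 on odd steps.
evens = [1 if t % 2 == 0 else 0 for t in range(N)]
# Reward 1 on multiples of three, 0 otherwise.
thirds = [1 if t % 3 == 0 else 0 for t in range(N)]

# Both total rewards diverge as N grows, but the running averages differ:
avg_evens = sum(evens) / N    # approaches 1/2
avg_thirds = sum(thirds) / N  # approaches 1/3

# Yet the rewarded time steps can be put in 1-to-1 correspondence:
# the k-th even number 2k pairs with the k-th multiple of three 3k.
# Any comparison invariant under such reorderings must rank them equally.
pairing = [(2 * k, 3 * k) for k in range(5)]  # first few pairs of the bijection
```

So a criterion like "compare limiting averages" recovers the intuitive ranking, but only by treating time steps asymmetrically, i.e. by caring about *when* rewards arrive rather than just the multiset of rewards.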

one has to be careful to only use the computable subset of the set of all infinite strings $ao_{1:\infty}$

Why?

Comment author: 06 February 2013 09:11:57PM 0 points [-]

So I would only consider the formulation in terms of semimeasures to be satisfactory if the semimeasures are specific enough that the correct semimeasure plus the observation sequence is enough information to determine everything that's happening in the environment.

Can you give an example of a situation in which that would not be the case? I think the semimeasure AIXI and the deterministic-programs AIXI are pretty much equivalent; am I overlooking something here?

If we're going to allow infinite episodic utilities, we'll need some way of comparing how big different nonconvergent series are.

I think we need that even without infinite episodic utilities. I still think there might be possibilities involving surreal numbers, but I haven't found the time yet to develop this idea further.

Why?

Because otherwise we definitely end up with an unenumerable utility function, and every approximation will be blind between infinitely many futures with infinitely large utility differences, I think. The set of all binary strings of infinite length is uncountable, and how would we feed that into an enumerable/computable function? Your approach avoids that by using policies p and q that are by definition computable.

Comment author: 04 February 2013 07:53:22PM *  2 points [-]

I believe I have a workable solution to the duality problem, which is essentially a special case of the Orseau-Ring framework, viewed slightly differently.

Consider a specific computer architecture M, equipped with an input channel for receiving inputs ostensibly from the environment (although the environment doesn't appear explicitly in the formalism) and possibly special instructions for self-reprogramming (although the latter is semi-redundant, as will become clear in the following). This architecture has a state space Sigma (typically M is a universal Turing machine, so Sigma is countable, but it is also possible to consider models with finite RAM, in which case M is a finite state automaton), with some state transitions s1 -> s2 being "legitimate" while others are not (note that s1 doesn't determine s2 uniquely, since the input from the environment can be arbitrary). Consider also a utility function U defined on arbitrary (possibly "illegitimate") infinite histories of M, i.e. functions N -> Sigma. Then an "agent" is simply an initial state of M: s in Sigma, regarded as a "program". The intelligence of s is defined to be its expected utility, assuming the dynamics of M are described by a certain stochastic process.

If we stop here, without specifying this stochastic process, we get more or less an equivalent formulation of the Orseau-Ring framework. By analogy with Legg-Hutter, it is natural to assume this stochastic process is governed by the Solomonoff semimeasure. But if we do this, we probably won't be able to get any meaningful near-optimal agents, since we would need to write a program without knowing how the computer works. My suggestion is deforming the Solomonoff semimeasure by assigning weight 0 < p < 1 to state transitions s1 -> s2 which are illegal in terms of M. This should make the near-optimal agents sophisticated: since p < 1, they can rely to some extent on our computer architecture M.
On the other hand, p > 0, so these agents have to take possible wireheading into account. In particular, they can make positive use of wireheading to reprogram themselves even if the basic architecture M doesn't allow it, assuming of course they are placed in a universe in which it is possible.
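A toy sketch of the deformation (the two-state machine, the legal-transition set, and the value of p are all invented for illustration; in the full proposal this factor would multiply the Solomonoff weight of each hypothesis):

```python
# Toy machine with states {0, 1}. Hypothetical legal transition relation of M:
LEGAL = {(0, 1), (1, 0)}
P_ILLEGAL = 0.1  # the deformation weight 0 < p < 1 for illegal transitions

def history_weight(history, p_illegal=P_ILLEGAL):
    """Weight of a state history: each transition s1 -> s2 that is illegal
    in terms of M is penalized by a factor p_illegal; legal transitions
    keep weight 1."""
    w = 1.0
    for s1, s2 in zip(history, history[1:]):
        if (s1, s2) not in LEGAL:
            w *= p_illegal
    return w

# A fully legal history keeps full weight; histories that break M's
# semantics (wireheading, self-reprogramming outside M) are not
# impossible, just geometrically discounted per illegal step.
w_legal = history_weight([0, 1, 0, 1])   # all transitions legal
w_broken = history_weight([0, 0, 1, 1])  # two illegal steps, weight p^2
```

The design intent, as I read it: p < 1 lets a near-optimal agent mostly trust M when planning, while p > 0 forces it to assign nonzero probability to its own hardware being overridden.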

Comment author: 06 February 2013 08:57:55PM 0 points [-]

I think you are proposing to have some hypotheses privileged in the beginning of Solomonoff induction, but not too much, because the uncertainty helps fight wireheading by providing knowledge about the existence of an idealized, "true" utility function and world model. Is that a correct summary? (Just trying to test whether I understand what you mean.)

In particular they can make positive use of wire-heading to reprogram themselves even if the basic architecture M doesn't allow it

Can you explain this more?

## Save the princess: A tale of AIXI and utility functions

14 01 February 2013 03:38PM

"Intelligence measures an agent's ability to achieve goals in a wide range of environments." (Shane Legg) [1]

A little while ago I tried to equip Hutter's universal agent, AIXI, with a utility function, so that instead of taking cues about its goals from the environment, the agent is equipped with intrinsic preferences over possible future observations.

The universal AIXI agent is defined to receive reward from the environment through its perception channel. This idea originates from the field of reinforcement learning, where an algorithm is observed and then rewarded by a person if this person approves of the outputs. It is less appropriate as a model of AGI capable of autonomy, with no clear master watching over it in real time to choose between carrot and stick. A sufficiently smart agent that is rewarded whenever a human called Bob pushes a button will most likely figure out that instead of furthering Bob's goals it can also threaten or deceive Bob into pushing the button, or get Bob replaced with a more compliant human. The reward framework does not ensure that Bob gets his will; it only ensures that the button gets pressed. So instead I will consider agents who have preferences over the future, that is, they act not to gain reward from the environment, but to cause the future to be a certain way. The agent itself will look at the observation and decide how rewarding it is.

Von Neumann and Morgenstern proved that a preference ordering that is complete, transitive, continuous, and satisfies independence can be described using a real-valued utility function. These assumptions are mostly accepted as necessary constraints on a normatively rational agent; I will therefore assume without significant loss of generality that the agent's preferences are described by a utility function.

This post is related to previous discussion about universal agents and utility functions on LW.

Comment author: 04 January 2013 08:47:17PM 0 points [-]

That looks like it only discusses interpersonal utility comparisons. I don't see anything about intrapersonal utility comparison in the book description.

Comment author: 05 January 2013 02:01:03PM 1 point [-]

They just do interpersonal comparisons; lots of their ideas generalize to intrapersonal comparisons though.

Comment author: 04 January 2013 02:35:49PM *  1 point [-]

I recommend the book "Fair Division and Collective Welfare" by H. J. Moulin, it discusses some of these problems and several related others.

Comment author: 20 December 2012 09:18:30PM 1 point [-]

Oh right. But you still want the probability weighting to be inside the sum, so you would actually need $U(\dot{y}\dot{x}_{<k}yx_{k:m_k})$ weighted inside the sum.

Comment author: 20 December 2012 09:36:35PM 2 points [-]

True. :)

Comment author: 19 December 2012 08:08:35PM *  0 points [-]

If you already choose the policy ... then you cannot choose an y_k in the argmax.

The argmax comes before choosing a policy. In ${\displaystyle\arg\max_{y_{k}\in Y}\sup_{p\in P:\,p(\dot{x}_{<k})=\dot{y}_{<k}y_{k}}\ldots}$, there is already a value for y_k before you consider all the policies such that p(x_<k) = y_<k y_k.

Also for the Solomonoff prior you must sum over all programs

Didn't I do that?

Could you maybe expand on the proof of Lemma 1 a little bit?

Look at any finite observation sequence. There exists some action you could output in response to that sequence that would allow you to get arbitrarily close to the supremum expected utility with suitable responses to the other finite observation sequences (for instance, you could get within 1/2 of the supremum). Now look at another finite observation sequence. There exists some action you could output in response to that, without changing your response to the previous finite observation sequence, such that you can get arbitrarily close to the supremum (within 1/4). Look at a third finite observation sequence. There exists some action you could output in response to that, without changing your responses to the previous 2, that would allow you to get within 1/8 of the supremum. And keep going in some fashion that will eventually consider every finite observation sequence. At each step n, you will be able to specify a policy that gets you within 2^-n of the supremum, and these policies converge to the policy that the agent actually implements.
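A compact way to write down this construction (the notation $V$, $V^{*}$, and the enumeration $h_1, h_2, \ldots$ are introduced here for illustration, not taken from the post):

```latex
% Enumerate the finite observation sequences as h_1, h_2, h_3, ...
% Let V(p) denote the expected utility of policy p, and V* = sup_p V(p).
% At step n, having already fixed the responses p(h_1), ..., p(h_{n-1}),
% choose the action p(h_n) so that
\[
  \sup \bigl\{\, V(q) \;:\; q(h_i) = p(h_i) \text{ for all } i \le n \,\bigr\}
  \;\ge\; V^{*} - 2^{-n}.
\]
% Each step is possible because the previous supremum was within 2^{-(n-1)}
% of V*. The limit policy p answers every finite observation sequence, and
% its finite-stage approximations come within 2^{-n} of the supremum.
```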

I hope that helps. If you still don't know what I mean, could you describe where you're stuck?

Comment author: 20 December 2012 09:35:12PM 0 points [-]

I get that now, thanks.

Comment author: 19 December 2012 08:13:03PM 1 point [-]

True, the U(program, action sequence) framework can be implemented within the U(action/observation sequence) framework, although you forgot to multiply by 2^-l(q) when describing how. I also don't really like the finite look-ahead (until m_k) method, since it is dynamically inconsistent.

This solves wireheading only if we can specify which environments contain wireheaded (non-dualistic) agents, delusion boxes, etc..

Not sure what you mean by that.

Comment author: 20 December 2012 06:29:57PM 2 points [-]

you forgot to multiply by 2^-l(q)

I think then you would count that twice, wouldn't you? Because my original formula already contains the Solomonoff probability...

Comment author: 20 December 2012 06:25:49PM 0 points [-]

Let's stick with delusion boxes for now, because assuming that we can read off from the environment whether the agent has wireheaded breaks dualism. So even if we specify utility directly over environments, we still need to master the task of specifying which action/environment combinations contain delusion boxes to evaluate them correctly. It is still the same problem, just phrased differently.
