Comment author: 01 February 2013 07:56:35PM *  0 points [-]

If it's specified to have a physical implementation, I think infinite-computation AIXI could actually get around dualism by predicting the behavior of its own physical implementation. That is, it computes outcomes as if the output channel (or similar complete output-determiner) is manipulated magically at the next time step, but it computes them using a causal model in which the physical implementation keeps working into the future.

So it wouldn't drop a rock on its head, because even though it thinks it could send the command magically, it can correctly predict the subsequent, causal behavior of the output channel, i.e. silence after getting smooshed by a rock.

This behavior actually does require a non-limit infinity, since the amount of simulation required always grows faster than the simulating power. But I think you can pull it off with approximation schemes - certainly it works in the case of replacing exact simulation of future self-behavior with heuristics like "always silent if rock dropped on head" :D

Comment author: 06 February 2013 09:14:35PM 2 points [-]

Super hard to say without further specification of the approximation method used for the physical implementation.

Comment author: 02 February 2013 07:29:30PM 1 point [-]

Great post!

But I'm not quite sure how I feel about the reformulation in terms of semimeasures instead of deterministic programs. Part of my motivation for the environment-specific utility was to avoid privileging observations over unobservable events in the environment in terms of what the agent can care about. So I would only consider the formulation in terms of semimeasures to be satisfactory if the semimeasures are specific enough that the correct semimeasure plus the observation sequence is enough information to determine everything that's happening in the environment.

I really dislike the discounting approach, because it doesn't respect the given utility function and makes the agent miss out on potentially infinite amounts of utility.

If we're going to allow infinite episodic utilities, we'll need some way of comparing how big different nonconvergent series are. At that point, the utility calculation will not look like an infinite sum in the conventional sense. In a sense, I agree that discounting is inelegant because it treats different time steps differently, but on the other hand, I think asymmetry between different time steps is somewhat inevitable. For instance, presumably we want the agent to value getting a reward of 1 on even steps and 0 on odd steps higher than getting a reward of 1 on steps that are divisible by three and 0 on non-multiples of three. But those reward sequences can be placed in 1-to-1 correspondence. This fact causes me to grudgingly give up on time-symmetric utility functions.
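The two reward sequences above can be made concrete. Here is a small sketch (the horizon `N` and variable names are invented for illustration): both sequences have countably many rewarded steps, so a bijection between them exists, yet their running averages converge to different values, which is exactly the asymmetry a time-symmetric comparison would throw away.

```python
# Two divergent reward series that a bijection can map onto each other,
# yet whose running (Cesaro) averages differ.

N = 30_000  # finite horizon for the approximation

# Reward 1 on even steps, 0 on odd steps.
evens = [1 if t % 2 == 0 else 0 for t in range(N)]
# Reward 1 on multiples of three, 0 otherwise.
thirds = [1 if t % 3 == 0 else 0 for t in range(N)]

# Both total rewards diverge as N grows, but the running averages differ:
avg_evens = sum(evens) / N    # approaches 1/2
avg_thirds = sum(thirds) / N  # approaches 1/3

# Yet the rewarded time steps can be put in 1-to-1 correspondence:
# the k-th even number 2k pairs with the k-th multiple of three 3k.
# Any comparison invariant under such reorderings must rank them equally.
pairing = [(2 * k, 3 * k) for k in range(5)]  # first few pairs of the bijection
```

So a criterion like "compare limiting averages" recovers the intuitive ranking, but only by treating time steps asymmetrically, i.e. by caring about *when* rewards arrive rather than just the multiset of rewards.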

one has to be careful to only use the computable subset of the set of all infinite strings $ao_{1:\infty}$

Why?

Comment author: 06 February 2013 09:11:57PM 0 points [-]

So I would only consider the formulation in terms of semimeasures to be satisfactory if the semimeasures are specific enough that the correct semimeasure plus the observation sequence is enough information to determine everything that's happening in the environment.

Can you give an example of a situation in which that would not be the case? I think the semimeasure AIXI and the deterministic-programs AIXI are pretty much equivalent; am I overlooking something here?

If we're going to allow infinite episodic utilities, we'll need some way of comparing how big different nonconvergent series are.

I think we need that even without infinite episodic utilities. I still think there might be possibilities involving surreal numbers, but I haven't found the time yet to develop this idea further.

Why?

Because otherwise we definitely end up with an unenumerable utility function, and every approximation will be blind between infinitely many futures with infinitely large utility differences, I think. The set of all binary strings of infinite length is uncountable, and how would we feed that into an enumerable/computable function? Your approach avoids that by using policies p and q that are by definition computable.

Comment author: 04 February 2013 07:53:22PM *  2 points [-]

I believe I have a workable solution to the duality problem, which is essentially a special case of the Orseau-Ring framework, viewed slightly differently.

Consider a specific computer architecture M, equipped with an input channel for receiving inputs ostensibly from the environment (although the environment doesn't appear explicitly in the formalism) and possibly special instructions for self-reprogramming (although the latter is semi-redundant, as will become clear in the following). This architecture has a state space Sigma (typically M is a universal Turing machine, so Sigma is countable, but it is also possible to consider models with finite RAM, in which case M is a finite state automaton), with some state transitions s1 -> s2 being "legitimate" while others are not (note that s1 doesn't determine s2 uniquely, since the input from the environment can be arbitrary). Consider also a utility function U defined on arbitrary (possibly "illegitimate") infinite histories of M, i.e. functions N -> Sigma. Then an "agent" is simply an initial state of M: s in Sigma, regarded as a "program". The intelligence of s is defined to be its expected utility, assuming the dynamics of M are described by a certain stochastic process.

If we stop here, without specifying this stochastic process, we get more or less an equivalent formulation of the Orseau-Ring framework. By analogy with Legg-Hutter, it is natural to assume this stochastic process is governed by the Solomonoff semimeasure. But if we do this, we probably won't be able to get any meaningful near-optimal agents, since we would need to write a program without knowing how the computer works. My suggestion is deforming the Solomonoff semimeasure by assigning weight 0 < p < 1 to state transitions s1 -> s2 which are illegal in terms of M. This should make the near-optimal agents sophisticated: since p < 1, they can rely to some extent on our computer architecture M.
On the other hand, p > 0, so these agents have to take possible wireheading into account. In particular, they can make positive use of wireheading to reprogram themselves even if the basic architecture M doesn't allow it, assuming of course they are placed in a universe in which it is possible.
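A toy sketch of the deformation (the two-state machine, the legal-transition set, and the value of p are all invented for illustration; in the full proposal this factor would multiply the Solomonoff weight of each hypothesis):

```python
# Toy machine with states {0, 1}. Hypothetical legal transition relation of M:
LEGAL = {(0, 1), (1, 0)}
P_ILLEGAL = 0.1  # the deformation weight 0 < p < 1 for illegal transitions

def history_weight(history, p_illegal=P_ILLEGAL):
    """Weight of a state history: each transition s1 -> s2 that is illegal
    in terms of M is penalized by a factor p_illegal; legal transitions
    keep weight 1."""
    w = 1.0
    for s1, s2 in zip(history, history[1:]):
        if (s1, s2) not in LEGAL:
            w *= p_illegal
    return w

# A fully legal history keeps full weight; histories that break M's
# semantics (wireheading, self-reprogramming outside M) are not
# impossible, just geometrically discounted per illegal step.
w_legal = history_weight([0, 1, 0, 1])   # all transitions legal
w_broken = history_weight([0, 0, 1, 1])  # two illegal steps, weight p^2
```

The design intent, as I read it: p < 1 lets a near-optimal agent mostly trust M when planning, while p > 0 forces it to assign nonzero probability to its own hardware being overridden.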

Comment author: 06 February 2013 08:57:55PM 0 points [-]

I think you are proposing to have some hypotheses privileged in the beginning of Solomonoff induction, but not too much, because the uncertainty helps fight wireheading by providing knowledge about the existence of an idealized, "true" utility function and world model. Is that a correct summary? (Just trying to test whether I understand what you mean.)

In particular they can make positive use of wire-heading to reprogram themselves even if the basic architecture M doesn't allow it

Can you explain this more?

## Save the princess: A tale of AIXI and utility functions

14 01 February 2013 03:38PM

"Intelligence measures an agent's ability to achieve goals in a wide range of environments." (Shane Legg) [1]

A little while ago I tried to equip Hutter's universal agent, AIXI, with a utility function, so that instead of taking cues about its goals from the environment, the agent is equipped with intrinsic preferences over possible future observations.

The universal AIXI agent is defined to receive reward from the environment through its perception channel. This idea originates from the field of reinforcement learning, where an algorithm is observed and then rewarded by a person if this person approves of the outputs. It is less appropriate as a model of AGI capable of autonomy, with no clear master watching over it in real time to choose between carrot and stick. A sufficiently smart agent that is rewarded whenever a human called Bob pushes a button will most likely figure out that instead of furthering Bob's goals it can also threaten or deceive Bob into pushing the button, or get Bob replaced with a more compliant human. The reward framework does not ensure that Bob gets his will; it only ensures that the button gets pressed. So instead I will consider agents who have preferences over the future, that is, they act not to gain reward from the environment, but to cause the future to be a certain way. The agent itself will look at the observation and decide how rewarding it is.

Von Neumann and Morgenstern proved that a preference ordering that is complete, transitive, continuous, and satisfies independence can be described using a real-valued utility function. These assumptions are mostly accepted as necessary constraints on a normatively rational agent; I will therefore assume without significant loss of generality that the agent's preferences are described by a utility function.

This post is related to previous discussion about universal agents and utility functions on LW.

Comment author: 04 January 2013 08:47:17PM 0 points [-]

That looks like it only discusses interpersonal utility comparisons. I don't see anything about intrapersonal utility comparison in the book description.

Comment author: 05 January 2013 02:01:03PM 1 point [-]

They just do interpersonal comparisons; lots of their ideas generalize to intrapersonal comparisons though.

Comment author: 04 January 2013 02:35:49PM *  1 point [-]

I recommend the book "Fair Division and Collective Welfare" by H. J. Moulin, it discusses some of these problems and several related others.

Comment author: 20 December 2012 09:18:30PM 1 point [-]

Oh right. But you still want the probability weighting to be inside the sum, so you would actually need $U(\dot{y}\dot{x}_{<k}yx_{k:m_k})$ weighted inside the sum.

Comment author: 20 December 2012 09:36:35PM 2 points [-]

True. :)

Comment author: 19 December 2012 08:08:35PM *  0 points [-]

If you already choose the policy ... then you cannot choose an y_k in the argmax.

The argmax comes before choosing a policy. In ${\displaystyle\arg\max_{y_{k}\in Y}\sup_{p\in P:\,p(\dot{x}_{<k})=\dot{y}_{<k}y_{k}}\ldots}$, there is already a value for y_k before you consider all the policies such that p(x_<k) = y_<k y_k.

Also for the Solomonoff prior you must sum over all programs

Didn't I do that?

Could you maybe expand on the proof of Lemma 1 a little bit?

Look at any finite observation sequence. There exists some action you could output in response to that sequence that would allow you to get arbitrarily close to the supremum expected utility with suitable responses to the other finite observation sequences (for instance, you could get within 1/2 of the supremum). Now look at another finite observation sequence. There exists some action you could output in response to that, without changing your response to the previous finite observation sequence, such that you can get arbitrarily close to the supremum (within 1/4). Look at a third finite observation sequence. There exists some action you could output in response to that, without changing your responses to the previous 2, that would allow you to get within 1/8 of the supremum. And keep going in some fashion that will eventually consider every finite observation sequence. At each step n, you will be able to specify a policy that gets you within 2^-n of the supremum, and these policies converge to the policy that the agent actually implements.
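A compact way to write down this construction (the notation $V$, $V^{*}$, and the enumeration $h_1, h_2, \ldots$ are introduced here for illustration, not taken from the post):

```latex
% Enumerate the finite observation sequences as h_1, h_2, h_3, ...
% Let V(p) denote the expected utility of policy p, and V* = sup_p V(p).
% At step n, having already fixed the responses p(h_1), ..., p(h_{n-1}),
% choose the action p(h_n) so that
\[
  \sup \bigl\{\, V(q) \;:\; q(h_i) = p(h_i) \text{ for all } i \le n \,\bigr\}
  \;\ge\; V^{*} - 2^{-n}.
\]
% Each step is possible because the previous supremum was within 2^{-(n-1)}
% of V*. The limit policy p answers every finite observation sequence, and
% its finite-stage approximations come within 2^{-n} of the supremum.
```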

I hope that helps. If you still don't know what I mean, could you describe where you're stuck?

Comment author: 20 December 2012 09:35:12PM 0 points [-]

I get that now, thanks.

Comment author: 19 December 2012 08:13:03PM 1 point [-]

True, the U(program, action sequence) framework can be implemented within the U(action/observation sequence) framework, although you forgot to multiply by 2^-l(q) when describing how. I also don't really like the finite look-ahead (until m_k) method, since it is dynamically inconsistent.

This solves wireheading only if we can specify which environments contain wireheaded (non-dualistic) agents, delusion boxes, etc..

Not sure what you mean by that.

Comment author: 20 December 2012 06:29:57PM 2 points [-]

you forgot to multiply by 2^-l(q)

I think then you would count that twice, wouldn't you? Because my original formula already contains the Solomonoff probability...

Comment author: 20 December 2012 06:25:49PM 0 points [-]

Let's stick with delusion boxes for now, because assuming that we can read off from the environment whether the agent has wireheaded breaks dualism. So even if we specify utility directly over environments, we still need to master the task of specifying which action/environment combinations contain delusion boxes to evaluate them correctly. It is still the same problem, just phrased differently.
