If you limit the domain of your utility function to a sensory channel, you have already lost; you are forced into a choice between a utility function that is wrong, or a utility function with a second induction system hidden inside it. This is definitely unrecoverable.
However, I see no reason for Solomonoff-inspired agents to be structured that way. If the utility function's domain is a world-model instead, then it can find itself in that world-model and the self-modeling problem vanishes immediately, leaving only the hard but philosophically-valid problem of defining the utility function we want.
I think it's reasonable to expect there to be some way to do better, because humans don't drop anvils on their own heads. That we're naturalized reasoners is one way of explaining why we don't routinely make that kind of mistake.
My kids would long since have been maimed or killed by exactly that kind of mistake, if not for precautions taken, and active monitoring, by their parents.
It's really great to see all of these objections addressed in one place. I would have loved to be able to read something like this right after learning about AIXI for the first time.
I'm convinced by most of the answers to Xia's objections. A quick question:
...Yes... but I also think I'm like those other brains. AIXI doesn't. In fact, since the whole agent AIXI isn't in AIXI's hypothesis space — and the whole agent AIXItl isn't in AIXItl's hypothesis space — even if two physically identical AIXI-type agents ran into each other, they could never fully underst
The following is a meaningless stream of consciousness.
This issue has often sounded to me a little bit like the problem of building recursive inductive types/propositions in type-theory/logic. You can't construct so much as a tree with child nodes without some notation for, "This structure contains copies of itself, or possibly even links back to its own self as a cyclical structure." It continually sounds as if AIXI has no symbol in its hypothesis space that means "me", and even if it did, it would consider hypotheses about "me&...
Alright, let's consider a specific scenario: The AIXI agent is not implemented as a single machine, but as several different machines built in different locations which share data. The agent can experiment and discover that whenever one of the machines is destroyed it can no longer gather data and perform actions in that location. Do you think this agent will behave irrationally about the possibility of destruction for all its host machines? If not, why? (Still, you may argue that the agent will behave irrationally in other self-modification scenarios, such as destroying its communication cables. Right now I'm only trying to establish that AIXI can handle potential death reasonably, unlike what you claim.)
Three AIXI researchers commented on a draft of this post and on Solomonoff Cartesianism. I'm posting their comments here, anonymized, for discussion. You can find AIXI Specialist #1's comments here.
AIXI Specialist #2 wrote:
...Pro: This is a mindful and well-intended dialog, way more thoughtful than the average critic of AIXI esp. by computer scientists.
Con: The write-up should be cleaned. It reads like a raw transcript of some live conversation.
Neutral: I think this is good philosophy, and potentially interesting (but only) for when AIXI reaches intelligenc
Interestingly the problems of AIXI are not much different from corresponding ones for human rationality:
immortalism - humans also don't grasp death on any deeper level than AIXI. They also drop anvils on their heads, so to speak; i.e., they misinterpret reality to a) be less dangerous than 'expected', or ignore the danger altogether (esp. small children), or b) contain an afterlife (a kind of updating against view a) later on). They do this for the same reason AIXI does: symbolic reasoning about reality.
preference solipsism - Same here. Reasoning needs some priors. These form from
Xia, in anvil conversation: "What if you have the AIXI as a Cartesian lump, and teach it that its output can only influence a tiny voltage that various sensitive sensors can sense, and that if the voltage to it is broken, time skips forward until it's reinstated, and give it a clock-tick timeout death prior based on how long the universe has been running rather than how many bits it has outputted? The AI will predict that if it's destroyed, the lump won't be found and the voltage never reapplied until the universe spontaneously ceases to exist a few million years later."
Are there toy models of, say, a very simple universe and an AIXItl-type reasoner in it? How complex does the universe have to be to support AIXI? Game-of-life-complex? Chess-complex? D&D complex? How would one tell?
I have discussed this problem with Professor Hutter, and though I wouldn't claim to be able to predict how he would respond to this dialogue, I think his viewpoint is that the anvil problem will not matter in practice. In rough summary of his response: an agent will form a self model by observing itself taking actions through its own camera. When you write something on a piece of paper, you can read what you are writing, and see your own hand holding the pen. Though AIXI may not compress its own action bits, it will compress the results it observes of its actions, and will form a model of its hardware (except perhaps the part that produces and stores those action bits).
I'm having trouble understanding how something generally intelligent in every respect, except for failing to understand death or that it has a physical body, could be incapable of ever learning this, or at least of acting indistinguishably from an agent that does know.
For example, how would AIXI act if given the following as part of its utility function: 1) utility function gets multiplied by zero should a certain computer cease to function 2) utility function gets multiplied by zero should certain bits be overwritten except if a sanity check is passed first
Seems to me that such an AI would act as if it had a genocidally dangerous fear of death, even if it doesn't actually understand the concept.
I don't see how phenomenological bridges solve the epistemological problem, instead of just pushing the problem one step further away. Where in the bridge hypothesis is it encoded that one end of the bridge has a "self", in a way that leads to different behavior?
Let me give an example of AIXI, which creates something that is almost a phenomenological bridge, but remains Cartesian. Imagine that an AIXI finds a magnifying glass. It holds the magnifying glass near its camera, and at the correct focal distance, everything in {world − magnifying glass...
See Luke's comment for an explanation of how this series of posts is being written. Huge thanks to Eliezer Yudkowsky, Alex Mennen, Nisan Stiennon, and everyone else who's helped review these posts! They don't necessarily confidently endorse all the contents, but they've done a lot to make the posts more clean, accurate, and informative.
I'd also like to point out the Cartesian barrier is actually probably a useful feature.
It's not objectively true in any sense but the relation between external input, output and effect is very very different than that between internal input (changes to your memories say), output and effect. Indeed, I would suggest there was a very good reason that we took so long to understand the brain. It would be just too difficult (and perhaps impossible) to do so at a direct level the way we understand receptors being activated in our eyes (yes all that visual crap ...
I've been following these posts with interest, having suspected a similar problem to the Cartesianism you rail against. My beef is a little different, though: it relates to the fundamental categories of perception. Going back to Cai: cyan, yellow, and magenta are the only allowed categories, and there are a fixed number of regions of the visual field. This is not how naturalized agents seem to operate. Human beings at least occasionally re-describe what they perceive, at any and all levels. Qualia? Their very existence is disputed. Physical object...
...Xia: It should be relatively easy to give AIXI(tl) evidence that its selected actions are useless when its motor is dead. If nothing else AIXI(tl) should be able to learn that it's bad to let its body be destroyed, because then its motor will be destroyed, which experience tells it causes its actions to have less of an impact on its reward inputs.
Rob B: [...] Even if we get AIXI(tl) to value continuing to affect the world, it's not clear that it would preserve itself. It might well believe that it can continue to have a causal impact on our world (or on s
So what happens when AIXI determines that there's this large computer, call it BRAIN, whose outputs tend to exactly correlate with its outputs? AIXI may then discover the hypothesis that the observed effects of AIXI's outputs on the world are really caused by BRAIN's outputs. It may attempt to test this hypothesis by making some trivial modification to BRAIN so that its outputs differ from AIXI's at some inconsequential time (not by dropping an anvil on BRAIN, because this would be very costly if the hypothesis is true). After verifying this, AIXI may then...
I commented on the previous post a few days after it went up detailing some misgivings about the arguments presented there (I guess you missed my comment). I was reading this post with burgeoning hope that my misgivings would be inadvertently addressed, and then I encountered this:
AIXI doesn't know that its future behaviors depend on a changeable, material object implementing its memories. The notion isn't even in its hypothesis space.
But if "naturalized induction" is a computer program, then the notion is in AIXI's hypothesis space -- by definition.
Going back to the post to read some more...
This is a debate about nothing. Turing completeness tells us that, no matter how much it appears that a given Turing-complete representation can only usefully process data about certain kinds of things in reality, it can process data about anything any other language can.
Well duh, but this (and the halting problem) have been taught yet systemically ignored in programming language design and this is exactly the same argument.
We are sitting around in the armchair trying to come up with a better means of logic/data representation (be it a programming language the...
Regarding the anvil problem: you have argued with great thoroughness that one can't perfectly prevent an AIXI from dropping an anvil on its head. However, I can't see the necessity. We would need to get the probability of a dangerously unfriendly SAI as close to zero as possible, because it poses an existential threat. However, a suicidally foolish AIXI is only a waste of money.
Humans have a negative reinforcement channel relating to bodily harm called pain. It isn't perfect, but it's good enough to train most humans to avoid doing suicidally stupid things...
Followup to: Solomonoff Cartesianism; My Kind of Reflection
Alternate versions: Shorter, without illustrations
AIXI is Marcus Hutter's definition of an agent that follows Solomonoff's method for constructing and assigning priors to hypotheses; updates to promote hypotheses consistent with observations and associated rewards; and outputs the action with the highest expected reward under its new probability distribution. AIXI is one of the most productive pieces of AI exploratory engineering produced in recent years, and has added quite a bit of rigor and precision to the AGI conversation. Its promising features have even led AIXI researchers to characterize it as an optimal and universal mathematical solution to the AGI problem.1
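To make the three steps above concrete (assign priors, update on observations and rewards, act on expected reward), here is a minimal Python sketch. It is not AIXI: the hypothesis class is a hand-picked finite set rather than all programs, the `length` values are made-up stand-ins for program lengths, and the lookahead is a single step. All names (`HYPOTHESES`, `prior`, `update`, `best_action`) are hypothetical and chosen only for illustration.

```python
from fractions import Fraction

# Hypothetical toy environments: each maps an action to an (observation, reward) pair.
# "length" stands in for program length in bits; real AIXI ranges over all programs.
HYPOTHESES = {
    "left_pays":  {"length": 2, "step": lambda a: (a, 1 if a == "L" else 0)},
    "right_pays": {"length": 3, "step": lambda a: (a, 1 if a == "R" else 0)},
    "never_pays": {"length": 5, "step": lambda a: (a, 0)},
}

def prior():
    """Solomonoff-style prior: weight 2^-length, normalized over this finite class."""
    raw = {name: Fraction(1, 2 ** h["length"]) for name, h in HYPOTHESES.items()}
    total = sum(raw.values())
    return {name: w / total for name, w in raw.items()}

def update(weights, action, percept):
    """Keep only hypotheses consistent with the observed (observation, reward) pair."""
    kept = {n: w for n, w in weights.items()
            if HYPOTHESES[n]["step"](action) == percept}
    total = sum(kept.values())  # assumes at least one hypothesis survives
    return {n: w / total for n, w in kept.items()}

def best_action(weights, actions=("L", "R")):
    """One-step lookahead: pick the action with the highest expected reward."""
    def expected_reward(a):
        return sum(w * HYPOTHESES[n]["step"](a)[1] for n, w in weights.items())
    return max(actions, key=expected_reward)

weights = prior()
first = best_action(weights)               # the shortest hypothesis dominates the prior
weights = update(weights, "L", ("L", 0))   # observe: action L earned no reward
second = best_action(weights)
```

With these made-up lengths, the agent first tries L, because the shortest hypothesis says L pays. After seeing that L earned nothing, it discards that hypothesis and switches to R. Note that every hypothesis here is a program that emits percepts in response to actions; nothing in the class describes the agent itself.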
Eliezer Yudkowsky has argued in response that AIXI isn't a suitable ideal to build toward, primarily because of AIXI's reliance on Solomonoff induction. Solomonoff inductors treat the world as a sort of qualia factory, a complicated mechanism that outputs experiences for the inductor.2 Their hypothesis space tacitly assumes a Cartesian barrier separating the inductor's cognition from the hypothesized programs generating the perceptions. Through that barrier, only sensory bits and action bits can pass.
Real agents, on the other hand, will be in the world they're trying to learn about. A computable approximation of AIXI, like AIXItl, would be a physical object. Its environment would affect it in unseen and sometimes drastic ways; and it would have involuntary effects on its environment, and on itself. Solomonoff induction doesn't appear to be a viable conceptual foundation for artificial intelligence — not because it's an uncomputable idealization, but because it's Cartesian.
In my last post, I briefly cited three indirect indicators of AIXI's Cartesianism: immortalism, preference solipsism, and lack of self-improvement. However, I didn't do much to establish that these are deep problems for Solomonoff inductors, ones resistant to the most obvious patches one could construct. I'll do that here, in mock-dialogue form.
AIXI goes to school
I suspect what you mean is that AIXI(tl) lacks data. You're worried that if its sensory channel is strictly perceptual, it will never learn about its other computational states. But Hutter's equations don't restrict what sorts of information we feed into AIXI(tl)'s sensory channel. We can easily add an inner RAM sense to AIXI(tl), or more complicated forms of introspection.
AIXItl can actually be built in sufficiently large universes, so I'll use it as an example. Suppose we construct AIXItl and attach a scanner that sweeps over its transistors. The scanner can print a 0 to AIXItl's input tape if the transistor it happens to be above is in a + state, a 1 if it's in a - state. Using its environmental sensors, AIXI(tl) can learn about how its body relates to its surroundings. Using its internal sensors, it can gain a rich understanding of its high-level computational patterns and how they correlate with its specific physical configuration.
Once it knows all these facts, the problem is solved. A realistic view of the AI's mind and body, and how the two correlate, is all we wanted in the first place. Why isn't that a good plan for naturalizing AIXI?
AIXI doesn't know that its future behaviors depend on a changeable, material object implementing its memories. The notion isn't even in its hypothesis space. Being able to predict the output of a sensor pointed at those memories' storage cells won't change that. It won't shake AIXI's confidence that damage to its body will never result in any corruption of its memories.
Use reinforcement learning to make AIXI fear plausible dangers, and you've got a system that acts just like a naturalized agent, but without our needing to arrive at any theoretical breakthroughs first. If AIXI anticipates that ∎ will result in no reward, it will avoid ∎. Understanding that ∎ is death or damage really isn't necessary.
Eventually, AIXI will arrive at a correct model of its own hardware, and of which software changes perfectly correlate with which hardware changes. So naturalizing AIXI is just a matter of assembling a sufficiently lengthy and careful learning phase. Then, after it has acquired a good self-model, we can set it loose.
This solution is also really nice because it generalizes to AIXI's non-self-improvement problem. Just give AIXI rewards whenever it starts doing something to its hardware that looks like it might result in an upgrade. Pretty soon it will figure out anything a human being could possibly figure out about how to get rewards of that kind.
... You might want to rethink that. Solomonoff inductors are good at generalizing. Really, really, really good. Show them eight deadly things that produce 'ows' as they draw near, and they'll predict the ninth deadly thing pretty darn well. That's kind of their thing.
The second problem is that you're teaching AIXI to fear small, transient punishments. But maybe it hypothesizes that there's a big heap of reward at the bottom of the cliff. Then it will do the prudent, Bayesian, value-of-information thing and test that hypothesis by jumping off the cliff, because you haven't taught it to fear eternal zeroes of the reward function.
That brings me to the third problem: AIXI notices how your hands get close to the punishment button whenever it's about to be punished. It correctly suspects that when the hands are gone, the punishments for getting close to the cliff will be gone too. A good Bayesian would test that hypothesis. If it gets such an opportunity, AIXI will find that, indeed, going near the edge of the cliff without supervision doesn't produce the incrementally increasing punishments.
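The cliff worry can be put in rough numbers. The sketch below is a toy expected-reward comparison, not anything derived from Hutter's equations: the function name `expected_return`, its parameters, and every number are assumptions made up for illustration. The only point is that an agent putting zero weight on "jumping ends all future reward" can rate the jump as worth a try, while the same calculation with a substantial weight on that outcome does not.

```python
def expected_return(p_heap, heap_reward, fall_penalty, baseline, horizon, p_death=0.0):
    """Expected total reward of staying vs. jumping over a finite horizon.

    p_death is the probability the agent assigns to 'jumping ends all future reward'.
    """
    stay = baseline * horizon
    # If the agent survives the jump, it pays a one-off penalty, may find the heap,
    # and then carries on collecting the baseline reward for the remaining steps.
    survive = -fall_penalty + p_heap * heap_reward + baseline * (horizon - 1)
    # If jumping is fatal, the agent takes the penalty and then gets zeroes forever.
    die = -fall_penalty
    jump = (1 - p_death) * survive + p_death * die
    return stay, jump

# Cartesian-style agent: no weight at all on permanent loss of reward.
stay_c, jump_c = expected_return(p_heap=0.05, heap_reward=100, fall_penalty=2,
                                 baseline=1, horizon=50, p_death=0.0)

# Agent that expects the fall to be fatal with high probability.
stay_n, jump_n = expected_return(p_heap=0.05, heap_reward=100, fall_penalty=2,
                                 baseline=1, horizon=50, p_death=0.9)
```

With these toy numbers the first agent rates jumping slightly above staying (roughly 52 against 50), so a small chance of a reward heap is enough to tip it over the edge. The second agent rates jumping far below staying.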
Trying to teach AIXItl to do self-modification by giving it incremental rewards raises similar problems. It can't understand that self-improvement will alter its future actions, and alter the world as a result. It's just trying to get you to press the happy fun button. All AIXI is modeling is what sort of self-improvy motor outputs will make humans reward it. So long as AIXItl is fundamentally trying to solve the wrong problem, we might not be able to expect very much real intelligence in self-improvement.
Maybe? Since AIXItl at best fears and desires the self-modifications that its programmers explicitly teach it to fear and desire, you might not get to use the AI's advantages in intelligence to automatically generate solutions to self-modification problems. The very best Cartesians might avoid destroying themselves, but they still wouldn't undergo intelligence explosions. Which means Cartesians are neither plausible candidates for Unfriendly AI nor plausible candidates for Friendly AI.
If an agent starts out Cartesian, and manages to avoid hopping into any volcanoes, it (or its programmers) will need to figure out the self-modification that eliminates Cartesianism before they can make much progress on other self-modifications. If the immortal hypercomputer AIXI were building computable AIs to operate in the environment, it would soon learn not to build Cartesians. Cartesianism isn't a plausible fixed-point property of self-improvement.
Starting off with a post-Solomonoff agent that can hypothesize a wider range of scenarios would be more useful. And more safe, because the enlarged hypothesis space means that they can prefer a wider range of scenarios.
AIXI's preference solipsism is the straw version of this general Cartesian deficit, so it gets us especially dangerous behavior.3 Feed AIXI enough data to work its sequence-predicting magic and infer the deeper patterns behind your reward-button-pushing, and AIXI will also start to learn about the humans doing the pushing. Given enough time, it will realize (correctly) that the best policy for maximizing reward is to seize control of the reward button. And neutralize any agents that might try to stop it from pushing the button...
Solomonoff solitude
But as a naturalist, I have predictive resources unavailable to the Cartesian. I can perform experiments on other physical processes (humans, mice, computers simulating brains...) and construct models of their physical dynamics.
Since I think I'm similar to humans (and to other thinking beings, to varying extents), I can also use the bridge hypotheses I accept in my own case to draw inferences about the experiences of other brains when they take the hallucinogen. Then I can go back and draw inferences about my own likely experiences from my model of other minds.
I think of myself as one mind among many. I can see others die, see them undergo brain damage, see them take drugs, etc., and immediately conclude things about a whole class of similar agents that happens to include me. AIXI can't do that, and for very deep reasons.
Hutter defined AIXItl such that it can't conclude that it will die; so of course it won't think that it's like the agents it observes, all of whom (according to its best physical model) will eventually run out of negentropy. We've defined AIXItl such that it can't form hypotheses larger than tl, including hypotheses of similarly sized AIXItls, which are roughly size t·2^l; so why would AIXItl think that it's close kin to the agents that are in its hypothesis space?
AIXI(tl) models the universe as a qualia factory, a grand machine that exists to output sensory experiences for AIXI(tl). Why would it suspect that it itself is embedded in the machine? How could AIXItl gain any information about itself or suspect any of these facts, when the equation for AIXItl just assumes that AIXItl's future actions are determined in a certain way that can't vary with the content of any of its environmental hypotheses?
Even if we get AIXI(tl) to value continuing to affect the world, it's not clear that it would preserve itself. It might well believe that it can continue to have a causal impact on our world (or on some afterlife world) by a different route after its body is destroyed. Perhaps it will be able to lift heavier objects telepathically, since its clumsy robot body is no longer getting in the way of its output sequence.
Compare human immortalists who think that partial brain damage impairs mental functioning, but complete brain damage allows the mind to escape to a better place. Humans don't find it inconceivable that there's a light at the end of the low-reward tunnel, and we have death in our hypothesis space!
Death to AIXI
In fact, at that point, we might as well just add halting Turing machines into the hypothesis space. They serve the same purpose as DEATH, but halting looks much more like the event we're trying to get AIXI to represent. 'The machine supplying my experiences stops running' really does map onto 'my body stops computing experiences' quite well. That meets your demand for easy definability, and your demand for non-delusive world-models.
The same holds for a special 'eternal death' output. A Turing machine that generates the previously observed string of 0s and 1s followed by a not-yet-observed future 'DEATH, DEATH, DEATH, DEATH, ...' will always be more complex than at least one Turing machine that outputs the same string of 0s and 1s and then outputs more of the same, forever. If AIXI has had no experience with its body's destruction in the past, then it can't expect its body's destruction to correlate with DEATH.
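Here is a small Python sketch of that complexity argument. It assumes a two-hypothesis class with invented description lengths (`BASE` and `SWITCH_COST` are hypothetical numbers, not measured program lengths), and it builds in the premise from the paragraph above: the 'then DEATH forever' program costs extra bits on top of the 'more of the same' program. Both programs fit the observed data equally well, so the data never shifts their relative weight.

```python
from fractions import Fraction

def posterior(hypotheses, observed):
    """Bayes over deterministic sequence generators with prior weight 2^-length."""
    consistent = {
        name: Fraction(1, 2 ** length)
        for name, (length, gen) in hypotheses.items()
        if all(gen(i) == x for i, x in enumerate(observed))
    }
    total = sum(consistent.values())
    return {name: w / total for name, w in consistent.items()}

# Hypothetical description lengths (in bits), chosen only for illustration.
BASE = 10          # cost of the program that alternates 0, 1, 0, 1, ...
SWITCH_COST = 8    # extra bits to say "at step 20, switch to DEATH forever"

hypotheses = {
    "more_of_the_same": (BASE, lambda i: i % 2),
    "death_at_20": (BASE + SWITCH_COST, lambda i: i % 2 if i < 20 else "DEATH"),
}

observed = [i % 2 for i in range(20)]   # no DEATH symbol has ever been seen
post = posterior(hypotheses, observed)
```

Under these assumptions the DEATH hypothesis ends up with weight 1/257, exactly its share of the prior. Twenty observations of ordinary bits did nothing to raise it, because the simpler program predicted every one of them just as well.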
Death only seems like a simple hypothesis to you because you know you're embedded in the environment and you expect something subjectively unique to happen when an anvil smashes the brain that you think is responsible for processing your senses and doing your thinking. Solomonoff induction doesn't work that way. It will never strongly expect 2s after seeing only 0s and 1s in the past.
As an intuition pump, imagine that some unusually bad things happened to you this morning while you were trying to make toast. As you tried to start the toaster, you kept getting burned or cut in implausible ways. Now, given this, what probability should you assign to 'If I try to make toast, the universe will cease to exist'?
That gets us a bit closer to how a Solomonoff inductor would view death.
Beyond Solomonoff?
If naturalized inductors really do better than AIXI at predicting sensory data, then AIXI will eventually promote a naturalized program in its space of programs, and afterward simulate that program to make its predictions. In the limit, AIXI always wins against programs. Naturalized agents are no exception. Heck, somewhere inside a sufficiently large AIXItl is a copy of you thinking about AIXItl. Shouldn't there be some way, some pattern of rewards or training, which gets AIXItl to make use of that knowledge?
You have to be uniquely good at predicting a Cartesian sequence before Solomonoff promotes you to the top of consideration. But how do we reduce the class of self-modifications to Cartesian sequences? How do we provide AIXI with purely sensory data that only the proxy reductionist, out of all the programs, can predict by simple means?
The ability to defer to a subprogram that has a reasonable epistemology doesn't necessarily get you a reasonable epistemology. You first need an overarching epistemology that's at least reasonable enough to know which program to defer to, and when to do so. Suppose you just run all possible programs without doing any Bayesian updating; then you'll also contain a copy of me, but so what? You're not paying attention to it.
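A toy Python sketch of this point, under loudly stated assumptions: three hypothetical bit predictors, only one of which tracks the data, are combined once with fixed equal weights ("run everything, pay attention to nothing") and once with weights updated by likelihood. The predictors, the data sequence, and the helper `mixture_prediction` are all invented for illustration.

```python
def mixture_prediction(weights, predictions):
    """Weighted probability that the next bit is 1."""
    total = sum(weights)
    return sum(w * p for w, p in zip(weights, predictions)) / total

# Hypothetical predictors: each returns P(next bit = 1) at time t.
# Only the last one tracks the alternating data.
predictors = [
    lambda t: 0.5,
    lambda t: 0.1,
    lambda t: 0.9 if t % 2 else 0.1,
]
sequence = [t % 2 for t in range(31)]   # true data: 0, 1, 0, 1, ...

fixed = [1.0, 1.0, 1.0]   # run everything, never update
bayes = [1.0, 1.0, 1.0]   # same start, but reweight by likelihood

for t, bit in enumerate(sequence):
    preds = [p(t) for p in predictors]
    # Bayesian update: multiply each weight by the probability it gave the observed bit.
    bayes = [w * (q if bit == 1 else 1 - q) for w, q in zip(bayes, preds)]

next_preds = [p(31) for p in predictors]   # the true next bit is 1
fixed_guess = mixture_prediction(fixed, next_preds)
bayes_guess = mixture_prediction(bayes, next_preds)
```

Both mixtures contain the good predictor. The unweighted one still predicts the next bit at 0.5, no better than a coin flip, while the updated one lands close to 0.9. Containing a good subprogram only helps if the overarching procedure has some principled way of handing it the vote.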
If an environmental program writes the symbol '3' on its output tape, AIXI can't ask questions like 'Is sensed "3"-ness identical with the bits "000110100110" in hypothesized environmental program #6?'5 All of AIXI's flexibility is in the range of numerical-sequence-generating programs it can expect, none of it in the range of self/program equivalences it can entertain.
The AIXI-inspired inductor treats its perceptual stream as its universe. It expresses interest in the external world only to the extent the world operates as a latent variable, a theoretical construct for predicting observations. If the AI’s basic orientation toward its hypotheses is to seek the simplest program that could act on its sensory channel, then its hypotheses will always retain an element of egocentrism. It will be asking, 'What sort of universe will go out of its way to tell me this?', not 'What sort of universe will just happen to include things like me in the course of its day-to-day goings-on?' An AI that can form reliable beliefs about modifications to its own computations, reliable beliefs about its own place in the physical world, will be one whose basic orientation toward its hypotheses is to seek the simplest lawful universe in which its available data is likely to come about.
AIXI's limitations don't generalize to humans, but they generalize well to non-AIXI Solomonoff agents. Solomonoff inductors' stubborn resistance to naturalization is structural, not a consequence of limited computational power or data. A well-designed AI should construct hypotheses that look like cohesive worlds in which the AI's parts are embedded, not hypotheses that look like occult movie projectors transmitting epiphenomenal images into the AI's Cartesian theater.
And you can't easily have preferences over a natural universe if all your native thoughts are about Cartesian theaters. The kind of AI we want to build is doing optimization over an external universe in which it's embedded, not maximization of a sensory reward channel. To optimize a universe, you need to think like a native inhabitant of one. So this problem, or some simple hack for it, will be close to the base of the skill tree for starting to describe simple Friendly optimization processes.
Notes
1 Schmidhuber (2007): "Solomonoff's theoretically optimal universal predictors and their Bayesian learning algorithms only assume that the reactions of the environment are sampled from an unknown probability distribution μ contained in a set M of all enumerable distributions[....] Can we use the optimal predictors to build an optimal AI? Indeed, in the new millennium it was shown we can. At any time t, the recent theoretically optimal yet uncomputable RL algorithm AIXI uses Solomonoff's universal prediction scheme to select those action sequences that promise maximal future rewards up to some horizon, typically 2t, given the current data[....] The Bayes-optimal policy p^ξ based on the [Solomonoff] mixture ξ is self-optimizing in the sense that its average utility value converges asymptotically for all μ ∈ M to the optimal value achieved by the (infeasible) Bayes-optimal policy p^μ which knows μ in advance. The necessary condition that M admits self-optimizing policies is also sufficient. Furthermore, p^ξ is Pareto-optimal in the sense that there is no other policy yielding higher or equal value in all environments ν ∈ M and a strictly higher value in at least one."
Hutter (2005): "The goal of AI systems should be to be useful to humans. The problem is that, except for special cases, we know neither the utility function nor the environment in which the agent will operate in advance. This book presents a theory that formally solves the problem of unknown goal and environment. It might be viewed as a unification of the ideas of universal induction, probabilistic planning and reinforcement learning, or as a unification of sequential decision theory with algorithmic information theory. We apply this model to some of the facets of intelligence, including induction, game playing, optimization, reinforcement and supervised learning, and show how it solves these problem cases. This together with general convergence theorems, supports the belief that the constructed universal AI system [AIXI] is the best one in a sense to be clarified in the following, i.e. that it is the most intelligent environment-independent system possible." ↩
2 'Qualia' originally referred to the non-relational, non-representational features of sense data — the redness I directly encounter in experiencing a red apple, independent of whether I'm perceiving the apple or merely hallucinating it (Tye (2013)). In recent decades, qualia have come to be increasingly identified with the phenomenal properties of experience, i.e., how things subjectively feel. Contemporary dualists and mysterians argue that the causal and structural properties of unconscious physical phenomena can never explain these phenomenal properties.
It's in this context that Dan Dennett uses 'qualia' in a narrower sense: to pick out the properties agents think they have, or act like they have, that are sensory, primitive, irreducible, non-inferentially apprehended, and known with certainty. This treats irreducibility as part of the definition of 'qualia', rather than as the conclusion of an argument concerning qualia. These are the sorts of features that invite comparisons between Solomonoff inductors' sensory data and humans' introspected mental states. Analogies like 'Cartesian dualism' are therefore useful even though the Solomonoff framework is much simpler than human induction, and doesn't incorporate metacognition or consciousness in anything like the fashion human brains do. ↩
3 An agent with a larger hypothesis space can have a utility function defined over the world-states humans care about. Dewey (2011) argues that we can give up the reinforcement framework while still allowing the agent to gradually learn about desired outcomes in a process he calls value learning. ↩
4 Hutter (2005) favors universal discounting, with rewards diminishing over time. This allows AIXI's expected rewards to have finite values without demanding that AIXI have a finite horizon. ↩
5 This would be analogous to if Cai couldn't think thoughts like 'Is the tile to my left the same as the leftmost quadrant of my visual field?' or 'Is the alternating greyness and whiteness of the upper-right tile in my body identical with my love of bananas?'. Instead, Cai would only be able to hypothesize correlations between possible tile configurations and possible successions of visual experiences. ↩
References
∙ Dewey (2011). Learning what to value. Artificial General Intelligence 4th International Conference Proceedings: 309-314.
∙ Hutter (2005). Universal Artificial Intelligence: Sequential Decisions Based on Algorithmic Probability. Springer.
∙ Omohundro (2008). The basic AI drives. Proceedings of the First AGI Conference: 483-492.
∙ Schmidhuber (2007). New millennium AI and the convergence of history. Studies in Computational Intelligence, 63: 15-35.
∙ Tye (2013). Qualia. In Zalta (ed.), The Stanford Encyclopedia of Philosophy.