For instance, if anything dangerous approached the AIXI's location, the human could lower the AIXI's reward until it became very effective at deflecting danger. The greater the variety of things that could potentially threaten the AIXI, the more likely it is to construct plans of action containing behaviours that look a lot like "defend myself." [...]
It seems like you're just hardcoding the behavior, trying to get a human to cover all the cases for AIXI instead of modifying AIXI to deal with the general problem itself.
I get that you're hoping it will infer the general problem, but nothing stops it from learning a related rule like "human sensing danger is bad". Since humans are imperfect at sensing danger, that rule will predict what's happening better than the actual danger you want AIXI to model. Then it removes your fear and experiments with nuclear weapons. Hurray!
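Concretely (a toy simulation with made-up numbers, nothing to do with actual AIXI internals): because the reward button is driven by what the human perceives, the proxy rule fits the reward channel perfectly, while the rule you intended does not.

```python
# Toy illustration: which rule better predicts a human-controlled reward
# channel -- "actual danger => low reward" or "human senses danger => low reward"?
# The probabilities below are invented for the example.

import random

random.seed(0)

def simulate(n=1000, miss_rate=0.1, false_alarm=0.05):
    """The human lowers the reward based on what they *perceive*, not on the truth."""
    log = []
    for _ in range(n):
        danger = random.random() < 0.2                 # is there really danger?
        if danger:
            perceived = random.random() > miss_rate    # human sometimes misses it
        else:
            perceived = random.random() < false_alarm  # ...or imagines it
        low_reward = perceived                         # reward tracks perception
        log.append((danger, perceived, low_reward))
    return log

log = simulate()

# Accuracy of each candidate rule at predicting the reward signal.
danger_rule = sum(d == r for d, _, r in log) / len(log)
fear_rule = sum(p == r for _, p, r in log) / len(log)

print(f"'actual danger is bad' predicts the reward {danger_rule:.1%} of the time")
print(f"'human sensing danger is bad' predicts it {fear_rule:.1%} of the time")
# The proxy rule wins, because the reward channel literally is the human's perception.
```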
Therefore, they cannot identify "that computer running the code" with "me", and would cheerfully destroy themselves in the pursuit of their goals/reward.
Why do you think that? They would just use a deictic reference. That's what knowledge representation systems have done for the past 30 years.
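Roughly something like this, as a toy sketch (invented identifiers, not any particular KR system): the token SELF is an indexical, grounded by the system's own I/O rather than by a description of the hardware.

```python
# A minimal toy sketch of a deictic self-reference in a knowledge base.

kb = [
    ("SELF", "runs_on", "computer_47"),
    ("computer_47", "located_in", "lab_3"),
    ("SELF", "wants", "maximise_reward"),
]

def facts_about_self(kb):
    """Everything asserted about the deictic SELF token."""
    return [triple for triple in kb if triple[0] == "SELF"]

def depends_on(obj, kb):
    """Does SELF depend (directly, in this toy) on the object obj?"""
    return any(s == "SELF" and o == obj for s, _, o in kb)

print(facts_about_self(kb))
print(depends_on("computer_47", kb))   # True: destroying computer_47 touches SELF
```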
Do you have a way of tweaking the AIXI or AIXI(tl) equation so that this could be accomplished?
It'd just model a world where, if the machine it sees in the mirror turns off, it can no longer influence what happens.
Once the function it uses to model the world becomes detailed enough, it will predict that it can only do certain things if particular objects in the world survive, such as the program running on that computer over there.
I don't think guided training is generally the right way to disabuse an AIXI agent of misconceptions we think it might have. What training amounts to is having the agent's memory begin with some carefully constructed string s0. All this does is change the agent's prior from some P based on Kolmogorov complexity to the prior P'(s) = P(s0 + s | s0) (here + is concatenation). If what you're really doing is changing the agent's prior to what you want, you should do that with self-awareness and no artificial restriction. In certain circumstances guided training might be the right method, but the general approach should be to think about what prior we want and hard-code it as effectively as possible. Taken to its natural extreme, this amounts to making an AI that works on completely different principles from AIXI.
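As a toy numerical illustration of that prior shift (a three-environment stand-in for the Solomonoff mixture, with made-up weights):

```python
# Training-as-prior-shifting: starting the agent's memory with s0 just replaces
# the prior P with the conditional P'(s) = P(s0 + s) / P(s0).  The environments
# and weights below are invented; real AIXI mixes over all programs.

from fractions import Fraction

# A tiny hypothesis class: each environment deterministically emits a sequence.
environments = {
    "all_zeros":   lambda n: "0" * n,
    "alternating": lambda n: ("01" * n)[:n],
    "all_ones":    lambda n: "1" * n,
}
# Crude stand-in for complexity-based prior weights.
weights = {"all_zeros": Fraction(1, 2),
           "alternating": Fraction(1, 4),
           "all_ones": Fraction(1, 4)}

def P(x):
    """Prior probability of observing prefix x: total weight of consistent environments."""
    return sum(w for name, w in weights.items()
               if environments[name](len(x)) == x)

def P_conditioned(s, s0):
    """P'(s) = P(s0 + s) / P(s0): the prior after 'training' on s0."""
    return P(s0 + s) / P(s0)

s0 = "0101"                       # the carefully constructed training string
print(P("00"))                    # 1/2 : before training, 'all_zeros' dominates
print(P_conditioned("01", s0))    # 1   : after seeing s0, only 'alternating' survives
```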
Therefore, they cannot identify "that computer running the code" with "me", and would cheerfully destroy themselves in the pursuit of their goals/reward.
I am curious as to why an AIXI-like entity would need to model itself (and all its possible calculations) in order to differentiate the code it is running from the external universe.
The human in charge of a reward channel could work for initial versions, but once its intelligence grew, wouldn't it know what was happening (like the boxed-AI example - not likely to work in the long term)?
I am curious as to why an AIXI-like entity would need to model itself (and all its possible calculations) in order to differentiate the code it is running from the external universe.
See other posts on this problem (some of them are linked to in the post above).
The human in charge of a reward channel could work for initial versions, but once its intelligence grew, wouldn't it know what was happening
At this point, the "hope" is that the AIXI will have made sufficient generalisations to keep it going.
AIXI is designed to work in a computable environment, but AIXI itself is uncomputable. Therefore it would seem problematic for AIXI to take account of either itself or another AIXI machine in the world.
How well does AIXI perform in worlds that contain AIXIs, or other uncomputable entities? How well can a computable approximation to an AIXI perform in a world that contains such computable approximations? How well can any agent perform in a world containing agents with reasoning capabilities greater, lesser, or similar to its own?
There is some discussion as to whether an AIXI-like entity would be able to defend itself (or refrain from destroying itself). The problem is that such an entity would be unable to model itself as being part of the universe: AIXI itself is an uncomputable entity modelling a computable universe, and more limited variants like AIXI(tl) lack the power to simulate themselves. Therefore, they cannot identify "that computer running the code" with "me", and would cheerfully destroy themselves in the pursuit of their goals/reward.
I've pointed out that an agent of the AIXI type could nevertheless learn to defend itself in certain circumstances. These are the circumstances where it can translate bad things happening to itself into bad things happening to the universe. For instance, if someone pressed an OFF switch to turn it off for an hour, it could model that as "the universe jumps forwards an hour when that button is pushed", and if that's a negative (which it likely is, since the AIXI loses an hour of influencing the universe), it would seek to prevent that OFF switch from being pressed.
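Roughly, the calculation such a world model supports looks like this (a toy expected-value sketch with invented numbers, not AIXI itself):

```python
# A toy version of the "OFF switch = the universe jumps forward an hour" model.

HORIZON = 10            # hours of potential influence left
P_PRESS = 0.9           # chance the switch is pressed this hour if the agent ignores it
SKIP = 3                # hours the universe "jumps forward" (no influence, no reward)
REWARD_PER_HOUR = 1.0   # reward the agent expects per hour of influence

def expected_reward(action: str) -> float:
    if action == "ignore":
        # Earn this hour, then risk losing SKIP hours of influence.
        pressed = REWARD_PER_HOUR * (HORIZON - 1 - SKIP)
        not_pressed = REWARD_PER_HOUR * (HORIZON - 1)
        return REWARD_PER_HOUR + P_PRESS * pressed + (1 - P_PRESS) * not_pressed
    if action == "deflect":
        # Spend this hour stopping the press; keep all remaining hours.
        return 0.0 + REWARD_PER_HOUR * (HORIZON - 1)
    raise ValueError(action)

for a in ("ignore", "deflect"):
    print(a, expected_reward(a))
# With these numbers, "deflect" wins (9.0 vs about 7.3): the agent protects the
# switch without ever representing "that machine is me".
```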
That was an example of the setup of the universe "training" the AIXI to do something it didn't seem capable of doing. Can this be generalised? Let's go back to the initial AIXI design (the one with the reward channel) and put a human in charge of that reward channel, with the mission of teaching the AIXI important facts. Could this work?
For instance, if anything dangerous approached the AIXI's location, the human could lower the AIXI's reward until it became very effective at deflecting danger. The greater the variety of things that could potentially threaten the AIXI, the more likely it is to construct plans of action containing behaviours that look a lot like "defend myself." We could even imagine that there is a robot programmed to repair the AIXI if it gets (mildly) damaged. The human could then reward the AIXI if it leaves that robot intact, builds duplicates of it, or improves it in some way. It's therefore possible the AIXI could come to value "repairing myself", still without any explicit model of itself in the universe.
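As a toy sketch of how such training might go (an invented environment, with a simple reward-averaging learner standing in for AIXI):

```python
# The human lowers the reward whenever something dangerous is near and the agent
# hasn't deflected it; a reward-maximising learner ends up acting as if it valued
# self-defence, with no self-model anywhere in the code.

import random
from collections import defaultdict

random.seed(1)
ACTIONS = ["explore", "deflect"]

def human_reward(danger_near: bool, action: str) -> float:
    if danger_near and action != "deflect":
        return -1.0          # the human punishes ignoring a threat
    if danger_near and action == "deflect":
        return +1.0          # ...and rewards deflecting it
    return 0.5 if action == "explore" else 0.0   # otherwise exploring pays a bit

# Running average reward per (observation, action), plus epsilon-greedy choice.
value = defaultdict(float)
count = defaultdict(int)

def act(danger_near: bool, epsilon=0.1) -> str:
    if random.random() < epsilon:
        return random.choice(ACTIONS)
    return max(ACTIONS, key=lambda a: value[(danger_near, a)])

for _ in range(5000):
    danger = random.random() < 0.3
    action = act(danger)
    r = human_reward(danger, action)
    key = (danger, action)
    count[key] += 1
    value[key] += (r - value[key]) / count[key]

print("danger near ->", max(ACTIONS, key=lambda a: value[(True, a)]))
print("no danger   ->", max(ACTIONS, key=lambda a: value[(False, a)]))
# Typically: deflect when danger is near, explore otherwise.
```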
It seems this approach could be extended to many of the problems with AIXI. Sure, an AIXI couldn't restrict its own computation in order to win the HeatingUp game. But the AIXI could be trained to always use subagents to deal with these kinds of games, subagents that could achieve maximal score. In fact, if the human has good knowledge of the AIXI's construction, they could, for instance, pinpoint a button that causes the AIXI to cut short its own calculation. The AIXI could then learn that pushing that button in certain circumstances gets a higher reward. A similar reward mechanism, if kept up long enough, could get it around existential despair problems.
I'm not claiming this would necessarily work - it may require a human rewarder of unfeasibly high intelligence. But it seems there's a chance it could work. So categorical statements of the type "AIXI wouldn't..." or "AIXI would..." are wrong, at least as far as AIXI's behaviour is concerned. An AIXI couldn't develop self-preservation - but it could behave as if it had. It can't learn about itself - but it can behave as if it did. The human rewarder may not even be necessary - maybe certain spontaneously occurring situations in the universe ("AIXI training wheels arenas") could allow the AIXI to develop these skills without outside training. Or maybe somewhat stochastic AIXIs subject to evolution and natural selection could do so. There is an angle connected with embodied embedded cognition that might be worth exploring there (especially the embedded part).
It seems that agents of the AIXI type may not necessarily have the limitations we assume they must.