Can AIXI be trained to do anything a human can?
There is some discussion as to whether an AIXI-like entity would be able to defend itself (or refrain from destroying itself). The problem is that such an entity would be unable to model itself as being part of the universe: AIXI itself is an uncomputable entity modelling a computable universe, and more limited variants like AIXI(tl) lack the power to simulate themselves. Therefore, they cannot identify "that computer running the code" with "me", and would cheerfully destroy themselves in the pursuit of their goals/reward.
I've pointed out that agents of the AIXI type could nevertheless learn to defend themselves in certain circumstances: namely, circumstances where they could translate bad things happening to themselves into bad things happening to the universe. For instance, if someone pressed an OFF switch to turn the AIXI off for an hour, it could model that as "the universe jumps forwards an hour when that button is pushed", and if that's a negative (which it likely is, since the AIXI loses an hour of influencing the universe), it would seek to prevent that OFF switch from being pressed.
That was an example of the setup of the universe "training" the AIXI to do something that it didn't seem it could do. Can this be generalised? Let's go back to the initial AIXI design (the one with the reward channel) and put a human in charge of that reward channel with the mission of teaching the AIXI important facts. Could this work?
For instance, if anything dangerous approached the AIXI's location, the human could lower the AIXI's reward, until it became very effective at deflecting danger. The greater the variety of things that could potentially threaten the AIXI, the more likely it is to construct plans of action containing behaviours that look a lot like "defend myself." We could even imagine that there is a robot programmed to repair the AIXI if it gets (mildly) damaged. The human could then reward the AIXI if it leaves that robot intact, builds duplicates of it, or improves it in some way. It's therefore possible the AIXI could come to value "repairing myself", still without any explicit model of itself in the universe.
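As a rough illustration (every observation field, threshold, and reward size below is invented for the sake of the sketch, not part of any actual AIXI setup), the human rewarder's policy could be summarised as a simple reward-shaping function:

```python
# A minimal sketch of the rewarder's policy described above; the field names
# and numbers are made up purely for illustration.
def shaped_reward(observation, base_reward):
    reward = base_reward
    if observation.get("danger_distance_m", float("inf")) < 10.0:
        reward -= 5.0   # lower the reward whenever anything dangerous gets close
    if observation.get("repair_robot_intact", True):
        reward += 1.0   # small bonus for leaving the repair robot alone
    if observation.get("repair_robot_improved", False):
        reward += 2.0   # larger bonus for duplicating or improving it
    return reward
```

The point is only that the shaping lives entirely in the rewarder's head: the AIXI never needs a model of itself, just a reward signal that happens to correlate with its own safety.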
It seems this approach could be extended to many of the problems with AIXI. Sure, an AIXI couldn't restrict its own computation in order to win the HeatingUp game. But the AIXI could be trained to always use subagents to deal with these kinds of games, subagents that could achieve maximal score. In fact, if the human has good knowledge of the AIXI's construction, it could, for instance, pinpoint a button that causes the AIXI to cut short its own calculation. The AIXI could then learn that pushing that button in certain circumstances would get a higher reward. A similar reward mechanism, if kept up long enough, could get it around existential despair problems.
I'm not claiming this would necessarily work - it may require a human rewarder of unfeasibly large intelligence. But it seems there's a chance that it could work. So categorical statements of the type "AIXI wouldn't..." or "AIXI would..." seem wrong, at least as far as AIXI's behaviour is concerned. An AIXI couldn't develop self-preservation - but it could behave as if it had. It can't learn about itself - but it can behave as if it did. The human rewarder may not even be necessary - maybe certain spontaneously occurring situations in the universe ("AIXI training wheels arenas") could allow the AIXI to develop these skills without outside training. Or maybe somewhat stochastic AIXIs subject to evolution and natural selection could do so. There is an angle connected with embodied embedded cognition that might be worth exploring there (especially the embedded part).
It seems that agents of the AIXI type may not necessarily have the limitations we assume they must.
Reinforcement and Short-Term Rewards as Anti-Akratic Techniques
Related: Time and Effort Discounting, Akrasia, Hyperbolic Discounting, and Picoeconomics, The Power of Reinforcement, Basics of Animal Reinforcement, Basics of Human Reinforcement
I built a robot that feeds me candy when I get work done, to try to solve my akrasia problem. And, so far, it seems like it might actually work.
Naturally, the story starts with procrastination. I finish things the night before they're due. Or, sometimes, I don't. I'd like to fix that. One theory explains procrastination as a result of discounting, the idea that human brains discount long-term rewards in favor of short-term ones. For instance, my brain prefers watching Neon Genesis Evangelion now over nearly missing my project deadline in a few days. The same principle applies to consequences, and there are already tools like BeeMinder that are built to combat it. Its tagline, "bring long-term consequences near," is a very concise description of a clever way to short-circuit discounting. It's very interesting, but I'm not really comfortable with paying money as a consequence. Instead, I'm going to try a similar technique: bringing long-term rewards near.
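To make the discounting story concrete (the formula is the standard hyperbolic discounting form, but the amounts and discount rates below are made up), a small reward available right now can easily outweigh a much larger one a few days away:

```python
# Illustration only: hyperbolic discounting values a reward of size `amount`
# delayed by `delay_days` at amount / (1 + k * delay_days).
def hyperbolic_value(amount, delay_days, k=1.0):
    return amount / (1 + k * delay_days)

watch_anime_now = hyperbolic_value(amount=10, delay_days=0)            # 10.0
meet_deadline   = hyperbolic_value(amount=100, delay_days=3)           # 25.0
# With a steeper (equally made-up) discount rate, the far reward loses to the
# near one, which is the discounting story behind procrastination:
meet_deadline_steep = hyperbolic_value(amount=100, delay_days=3, k=5)  # 6.25
```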
There are already a lot of techniques for bringing long-term rewards near. Generally, they're called reinforcement. The classic reward in reinforcement is candy, which seems like a good idea: I like it, and I'm more than willing to abuse my youthful metabolism for productivity. And, in fact, there is a wide variety of folk solutions of that sort - advice to reward yourself with some candy once your work is done. I've tried those already, but they never seem to work out for me - I always seem to wind up cheating. I need to do something trickier.
CFAR describes reinforcement in a very striking way in some of their course materials: they call it "training your inner pigeon." Not only is that a nice, snappy turn of phrase, it illustrates the problem with attempting to self-administer rewards very nicely. Did Skinner's pigeons self-administer their rewards? No, of course they didn't. I shouldn't expect my inner pigeon to, either. So, my next step is to build a robot that gives me candy when I get stuff done.
Why do I think I can keep from cheating on the machine, when I couldn't restrain myself from cheating on regular old bags of candy? Well, I'm far from certain; it's my biggest worry with the project, in fact. But I am reasonably confident, because the machine will give me an easy way to establish a Schelling fence. Where taking a handful of candy out of the bag is sometimes right and sometimes wrong, taking a handful of candy out of the hopper is always wrong, since the machine will dispense the candy when I deserve it. Precommitting to never take candy out of the machine seems like it'll be a lot easier than precommitting to only sometimes take candy out of the bag.
Now, the description "robot" for my machine is a bit fanciful. It's actually an automatic dog feeder, modified and connected to the Internet. It has a small screen mounted on the front, which tells me how many rewards I've earned. If I've got any, I can press a button on the screen to dispense them. Not counting parts I already owned, the device cost me around $50 to build. To provide the data, I linked the device to an earlier productivity hack that I already had around, a custom webapp integrating a task list with a Pomodoro timer.
Rewards are given based on a few simple rules. When I finish a task early, it gives me the number of days early in rewards; if I finish tasks out of order, it gives me the nearer task's number of rewards, so I've got an incentive to finish tasks in order. I also get an extra reward for my first Pomodoro in a week for each of my projects, so that I have an incentive not to forget old projects. The system can also take away rewards. If I get distracted during a Pomodoro, I lose a reward. I'm blocked from redeeming rewards if I have a task within a day of its deadline. If I finish a task more than a day late, I lose any rewards in the system.
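In rough Python (the names here are invented for illustration, not lifted from the actual webapp, and the Pomodoro bonus and distraction penalty are omitted), the deadline rules look something like this:

```python
from dataclasses import dataclass
from datetime import date

# Sketch of the deadline rules described above; all identifiers are hypothetical.
@dataclass
class Task:
    name: str
    deadline: date
    done: bool = False

def reward_for_finishing(task, open_tasks, today):
    if (task.deadline - today).days < -1:
        return None  # finished more than a day late: forfeit everything banked
    nearest = min(open_tasks, key=lambda t: t.deadline)
    credited = task if task is nearest else nearest  # out of order pays the nearer task's count
    return max((credited.deadline - today).days, 0)

def can_redeem(open_tasks, today):
    # redemption is blocked while any open task is within a day of its deadline
    return all((t.deadline - today).days > 1 for t in open_tasks)
```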
Results have been mixed so far. My greatest concern seems to have been unjustified: I haven't cheated on the machine once. However, it seems like the rules need some more work. The system has definitely helped some, but there are a lot of problems that could be improved.
The system doesn't account for the difficulty of tasks, meaning that I get more reward for less effort if I do easier work. As a result, I've done all of the reading up to next Tuesday for my literature class, but my Computer Science assignment due on Friday is unfinished, and my "research" for an exceptionally abhorrent humanities course is languishing on the vine.
The point of the system was to bring long-term rewards near, but there are a lot of circumstances in which it doesn't seem to bring them quite near enough. For deadlined tasks, I get no rewards until I've actually completed the task; if I think a task will take me more than a day to finish, that's more than a day of work which earns me no short-term rewards. This gets even worse if I happen to have a long task (or, many short tasks) that have reached the day before their deadline. Then, I don't get any rewards until I finish all of those tasks. While this is quite motivating, it's still a long-term motivation, i.e. it doesn't work very well.
I deliberately built the system to encourage doing tasks in order, but this seems to have backfired a little bit. Since I would be giving up rewards, I don't want to work on a task that's due later if there's another that's due sooner. However, if I really don't want to do the nearer task, I'll end up wasting time, since I get no rewards for that either way. Nyan_sandwich describes a similar failure mode in his Akrasia Case Study: if I know I have something more urgent to do, but I don't want to do it, I wind up procrastinating instead of doing less urgent things.
I get sick of candy more quickly than I expected. The portion my machine emits (about a small handful) tends to stop motivating me after about four in a day. Additionally, I seem to be entirely incapable of pacing myself; if a reward is in the system, I tend not to wait very long before using it. This has crippled all of the rules involving taking away rewards - unless the rewards are blocked, they don't stick around in the system long enough to be taken away.
Not all of the things I want to change are a result of problems, though. There are a wide variety of interesting improvements I could make. Many of these are expansions: aside from my task list, what else can I connect to? Can I track note-taking in class? Can I set it up to reward continuing effort towards a task, like writing a few hundred words a day? Can I use it to create new, more rational habits? There are all kinds of possibilities to consider. If you've got anything you'd like to suggest, let me know - I'm open to anything interesting.
There are also a lot of techniques to research; I'm sure the program isn't nearly as effective as it could be. Operant conditioning techniques like variable-ratio schedules might help improve performance per candy. Or, I could look into gamification, basically a form of applied human operant conditioning; it's not a standard tool on the site, but if you've ever watched an experience bar rise, you know what I'm talking about. Again, if you happen to have some relevant ideas, let me know.
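For instance (purely a sketch, with an illustrative ratio), moving from a fixed to a variable-ratio schedule would mean the dispenser only sometimes pays out for a completed unit of work, so the candy arrives after an unpredictable number of completions:

```python
import random

# Sketch of a variable-ratio schedule: instead of one reward per earned unit,
# dispense with probability 1/ratio. The ratio of 3 is arbitrary.
def variable_ratio_dispense(ratio=3, rng=random.random):
    return rng() < 1.0 / ratio

# e.g. over 30 completed Pomodoros, roughly one in three triggers the feeder
payouts = sum(variable_ratio_dispense() for _ in range(30))
```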
Obviously, I'm going to be making some rule changes in the near future. Expect another post in a few weeks about what's changed and how the changes have worked out for me.
Also, does anyone want to help me think of a good name for the system? Right now it's called the "extrinsic motivator." While descriptive, this name isn't snappy at all.
Goals vs. Rewards
Related: Terminal Values and Instrumental Values, Applying behavioral psychology on myself
Recently I asked myself, what do I want? My immediate response was that I wanted to be less stressed, particularly for financial reasons. So I started to affirm to myself that my goal was to become wealthy, and also to become less stressed. But then in a fit of cognitive dissonance, I realized that both money and relaxation are most easily considered in terms of being rewards, not goals. I was oddly surprised by the fact that there is a distinction between the two concepts to begin with.
It later occurred to me to wonder if some things work better when framed as goals rather than as rewards. Freedom, long life, good relationships, and productivity seemed like likely candidates. I can't quite see them as rewards because a) I feel everyone innately deserves and should have them (even though they might have to work for them), and b) they don't quite give the kind of fuzzies that motivate immediate action.
These two kinds of positive motivation seem to work in psychologically dissimilar ways. Money for example is more like chocolate, something one has immediate instinctive motive to obtain and consume. Freedom of speech is more along the lines of having enough air to breathe. A person needs and perhaps inherently deserves to have at least a little bit of it all the time, and as a general rule will have a constant background motive to ensure that it stays available. It's a longer-term form of motivation.
A reward seems to be something where you receive immediate fuzzies when you achieve it. Getting paid, getting a pat on the back, getting your posts and comments upvoted... Things where you might consider them more or less optional in the grander scheme of things, yet they tend to trigger an immediate sense of positive anticipation before the event which is reinforced by a sense of satisfaction after. Actually writing a good post or comment, actually doing a good job, being a good spouse or friend -- these are surely related, but are goals in and of themselves. The mental picture for a goal is one of achieving, as opposed to receiving.
One thing that seems likely to me is that the presence of shared goals (and the communication thereof) tends to be a good way to generate long-term social bonds. Rewards, by contrast, seem better suited to deliberately steering behavior in more specific respects. Both are thus important elements of social signaling within a tribe, but serve different underlying purposes.
As an example I have the transhumanist goal of eliminating the current limitations of the human lifespan, and tend to have an affinity for people who also internalize that goal. But someone who does not embrace that goal on a deep level may still display specific behavior that I consider helpful for that goal, e.g. displaying comprehension of its internal logic or having a tolerant attitude towards actions I think need to be taken. I'm probably somewhat less likely to form a long-term relationship with that person than if they were identifiable as a fellow transhumanist, but I am still likely to upvote their comments or otherwise signal approval in ways that don't demand too much long term commitment.
The distinctions I've drawn here between a goal and a reward might not apply directly to non-human intelligences. In fact it might be misleading in the more generalized context to call a reward something other than a goal (it is at least an implicit goal or value). However the distinction still seems like something that could be relevant for instrumental rationality and personal development. Our brains process the two forms of motivational anticipation in different ways. It may be that a part of the akrasia problem -- failure to take action towards a goal -- actually relates to a failure to properly categorize a given motive, and hence failure to process it usefully.
Thanks to the early commenters for their feedback: TheOtherDave, nornagest, endoself, David Gerard, nazgulnarsil, and Normal Anomaly. Hopefully this expanded version is more clear.