
Concept Safety: World-models as tools

6 Kaj_Sotala 09 May 2015 12:07PM

I'm currently reading through some relevant literature in preparation for my FLI grant proposal on the topic of concept learning and AI safety. I figured that I might as well write down the research ideas I get while doing so, so as to get some feedback and clarify my thoughts. I will be posting these in a series of "Concept Safety"-titled articles.

The AI in the quantum box

In the previous post, I discussed the example of an AI whose concept space and goals were defined in terms of classical physics, which then learned about quantum mechanics. Let's elaborate on that scenario a little more.

I wish to zoom in on a certain assumption that I've noticed in previous discussions of these kinds of examples. Although I can't track down an exact citation right now, I'm pretty confident that I've heard the QM scenario framed as something like "the AI previously thought in terms of classical mechanics, but then it finds out that the world actually runs on quantum mechanics". The key assumption is that quantum mechanics is in some sense more real than classical mechanics.

This kind of an assumption is a natural one to make if someone is operating on an AIXI-inspired model of AI. Although AIXI considers an infinite amount of world-models, there's a sense in which AIXI always strives to only have one world-model. It's always looking for the simplest possible Turing machine that would produce all of the observations that it has seen so far, while ignoring the computational cost of actually running that machine. AIXI, upon finding out about quantum mechanics, would attempt to update its world-model into one that only contained QM primitives and to derive all macro-scale events right from first principles.

No sane design for a real-world AI would try to do this. Instead, a real-world AI would take advantage of scale separation. This refers to the fact that physical systems can be modeled on a variety of different scales, and it is in many cases sufficient to model them in terms of concepts that are defined in terms of higher-scale phenomena. In practice, the AI would have a number of different world-models, each of them being applied in different situations and for different purposes.

Here we get back to the view of concepts as tools, which I discussed in the previous post. An AI that was doing something akin to reinforcement learning would come to learn the kinds of world-models that gave it the highest rewards, and to selectively employ different world-models based on what was the best thing to do in each situation.

As a toy example, consider an AI that can choose to run either a low-resolution or a high-resolution psychological model of someone it's interacting with, in order to predict their responses and please them. Say the low-resolution model takes a second to run and is 80% accurate; the high-resolution model takes five seconds to run and is 95% accurate. Which model gets chosen will depend on the cost matrix: the value of a correct prediction, the cost of a false prediction, and the consequence of making the other person wait an extra four seconds before each of the AI's replies.
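To make the tradeoff concrete, here is a minimal sketch of that choice as an expected-utility comparison. All the numbers for the reward of a correct prediction, the cost of an error, and the cost per second of waiting are made up for illustration; only the accuracies and runtimes come from the example above.

```python
# Hypothetical expected-utility comparison between two psychological models.
# The rewards/costs below are illustrative assumptions, not taken from the post.

def expected_utility(accuracy, runtime_s, correct_reward, error_cost, wait_cost_per_s):
    """Expected reward of replying using a model with the given accuracy and runtime."""
    prediction_term = accuracy * correct_reward + (1 - accuracy) * error_cost
    waiting_term = runtime_s * wait_cost_per_s
    return prediction_term - waiting_term

low_res  = expected_utility(accuracy=0.80, runtime_s=1, correct_reward=10, error_cost=-20, wait_cost_per_s=0.5)
high_res = expected_utility(accuracy=0.95, runtime_s=5, correct_reward=10, error_cost=-20, wait_cost_per_s=0.5)

print(low_res, high_res)   # 3.5 vs 6.0: with these numbers the slower model wins...
# ...but raise wait_cost_per_s to 2.0 and the fast, sloppier model wins instead.
```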

We can now see that a world-model being the most real, i.e. making the most accurate predictions, doesn't automatically mean that it will be used. It also needs to be fast enough to run, and the predictions need to be useful for achieving something that the AI cares about.

World-models as tools

From this point of view, world-models are literally tools just like any other. Traditionally in reinforcement learning, we would define the value of a policy $\pi$ in state $s$ as the expected reward given the state $s$ and the policy $\pi$,

$$V^{\pi}(s) = \mathbb{E}[R \mid s, \pi],$$

but under the "world-models are tools" perspective, we need to also condition on the world-model $m$:

$$V^{\pi}(s, m) = \mathbb{E}[R \mid s, \pi, m].$$

We are conditioning on the world-model in several distinct ways.

First, there is the expected behavior of the world as predicted by world-model m. A world-model over the laws of social interaction would do poorly at predicting the movement of celestial objects, if it could be applied to them at all. Different predictions of behavior may also lead to differing predictions of the value of a state. This is described by the equation above.

Second, there is the expected cost of using the world-model. Using a more detailed world-model may be more computationally expensive, for instance. One way of interpreting this in a classical RL framework would be that using a specific world-model will place the agent in a different state than using some other world-model would. We might describe this by saying that in addition to choosing its next action a on each time-step, the agent also needs to choose the world-model m which it will use to analyze its next observations. This will be one of the inputs to the transition function that determines the next state.
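As a sketch of what this might look like formally - the types and names below are my own illustration, not anything defined in the post - the agent's per-time-step choice becomes a pair of an external action and a world-model, and both are inputs to the transition:

```python
from typing import NamedTuple

class AgentState(NamedTuple):
    situation: str   # the external state of the world, as far as the agent knows
    model: str       # which world-model the agent is currently using

def transition(state: AgentState, action: str, next_model: str) -> AgentState:
    """Toy transition: the next state depends both on the external action taken
    and on which world-model the agent chose for analyzing its next observations."""
    next_situation = f"{state.situation} after {action}"   # placeholder world dynamics
    return AgentState(next_situation, next_model)

# Example: the agent acts, and simultaneously decides to interpret the outcome
# with a folk-psychology model rather than a physics model.
s1 = transition(AgentState("at a party", "physics"), "tell a joke", "folk-psychology")
```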

Third, there is the expected behavior of the agent using world-model m. An agent with different beliefs about the world will act differently in the future: this means that the future policy $\pi$ actually depends on the chosen world-model.

Some very interesting questions pop up at this point. Your currently selected world-model is what you use to evaluate your best choices for the next step... including the choice of what world-model to use next. So whether or not you're going to switch to a different world-model for evaluating the next step depends on whether your current world-model says that a different world-model would be better in that step.

We have not fully defined what exactly we mean by "world-models" here. Previously I gave the example of a world-model over the laws of social interaction, versus a world-model over the laws of physics. But a world-model over the laws of social interaction, say, would not have an answer to the question of which world-model to use for things it couldn't predict. So one approach would be to say that we actually have some meta-model over world-models, telling us which is the best to use in what situation.

On the other hand, it does also seem like humans often use a specific world-model and its predictions to determine whether to choose another world-model. For example, in rationalist circles you often see arguments along the lines of, "self-deception might give you extra confidence, but it introduces errors into your world-model, and in the long term those are going to be more harmful than the extra confidence is beneficial". Here you see an implicit appeal to a world-model which predicts an accumulation of false beliefs with some specific effects, as well as predicting the extra self-esteem with its effects. But this kind of an analysis incorporates very specific causal claims from various (e.g. psychological) models, which are models over the world rather than just being part of some general meta-model over models. Notice also that the example analysis takes into account the way that having a specific world-model affects the state transition function: it assumes that a self-deceptive model may land us in a state where we have higher self-esteem.

It's possible to get stuck in one world-model: for example, a strongly non-reductionist model evaluating the claims of a highly reductionist one might think it obviously crazy, and vice versa. So it seems that we do need something like a meta-evaluation function. Otherwise it would be too easy to get stuck in one model which claimed that it was the best one in every possible situation, and never agreed to "give up control" in favor of another one.

One possibility for such a thing would be a relatively model-free learning mechanism, which simply kept track of the rewards accumulated when using a particular model in a particular situation. It would then bias model selection towards the models that had been most successful so far.
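A minimal sketch of such a mechanism, assuming (purely for illustration) a running-average estimate of reward per (situation, model) pair and softmax selection over models:

```python
import math, random
from collections import defaultdict

class ModelSelector:
    """Model-free meta-selector: tracks the average reward obtained when using
    each world-model in each situation, and biases future selection towards
    the models that have been most successful so far."""

    def __init__(self, learning_rate=0.1, temperature=1.0):
        self.avg_reward = defaultdict(float)   # (situation, model) -> running average
        self.lr = learning_rate
        self.temp = temperature

    def choose(self, situation, models):
        weights = [math.exp(self.avg_reward[(situation, m)] / self.temp) for m in models]
        return random.choices(models, weights=weights)[0]

    def update(self, situation, model, reward):
        key = (situation, model)
        self.avg_reward[key] += self.lr * (reward - self.avg_reward[key])

# Usage sketch: the selector never consults any model's own predictions;
# it only tracks which model has paid off in which situation.
selector = ModelSelector()
model = selector.choose("social interaction", ["folk-psychology", "physics"])
selector.update("social interaction", model, reward=1.0)
```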

Human neuroscience and meta-models

We might be able to identify something like this in humans, though this is currently very speculative on my part. Action selection is carried out in the basal ganglia: different brain systems send the basal ganglia "bids" for various actions. The basal ganglia then chooses which actions to inhibit or disinhibit (by default, everything is inhibited). The basal ganglia also implements reinforcement learning, selectively strengthening or weakening the connections associated with a particular bid and context when a chosen action leads to a higher or lower reward than was expected. It seems that in addition to choosing between motor actions, the basal ganglia also chooses between different cognitive behaviors, likely even thoughts.
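As a very rough computational caricature of that description - the update rule and numbers here are my own illustrative choices, not a model from the neuroscience literature - selection among competing bids plus reward-prediction-error learning might look like this:

```python
# Toy caricature of basal-ganglia-style action selection: competing systems
# submit "bids", the strongest bid is disinhibited (selected), and the weight
# behind that bid in that context is adjusted by the reward prediction error.
weights = {("see friend", "greet"): 0.5,
           ("see friend", "ignore"): 0.2}

def select(context, candidate_actions):
    bids = {a: weights.get((context, a), 0.0) for a in candidate_actions}
    return max(bids, key=bids.get)          # everything else stays inhibited

def learn(context, action, reward, lr=0.1):
    expected = weights.get((context, action), 0.0)
    weights[(context, action)] = expected + lr * (reward - expected)

action = select("see friend", ["greet", "ignore"])
learn("see friend", action, reward=1.0)     # better than expected -> strengthen the bid
```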

If action selection and reinforcement learning are normal functions of the basal ganglia, it should be possible to interpret many of the human basal ganglia-related disorders in terms of selection malfunctions. For example, the akinesia of Parkinson's disease may be seen as a failure to inhibit tonic inhibitory output signals on any of the sensorimotor channels. Aspects of schizophrenia, attention deficit disorder and Tourette's syndrome could reflect different forms of failure to maintain sufficient inhibitory output activity in non-selected channels. Consequently, insufficiently inhibited signals in non-selected target structures could interfere with the output of selected targets (expressed as motor/verbal tics) and/or make the selection system vulnerable to interruption from distracting stimuli (schizophrenia, attention deficit disorder). The opposite situation would be where the selection of one functional channel is abnormally dominant thereby making it difficult for competing events to interrupt or cause a behavioural or attentional switch. Such circumstances could underlie addictive compulsions or obsessive compulsive disorder. (Redgrave 2007)

Although I haven't seen a paper presenting evidence for this particular claim, it seems plausible to assume that humans similarly come to employ new kinds of world-models based on the extent to which using a particular world-model in a particular situation gives them rewards. When a person is in a situation where they might think in terms of several different world-models, there will be neural bids associated with mental activities that recruit the different models. Over time, the bids associated with the most successful models will become increasingly favored. This is also compatible with what we know about e.g. happy death spirals and motivated stopping: people will tend to have the kinds of thoughts which are rewarding to them.

The physicist and the AI

In my previous post, I discussed the example of the physicist who doesn't jump out of the window when they learn about QM and find out that "location" is ill-defined:

The physicist cares about QM concepts to the extent that the said concepts are linked to things that the physicist values. Maybe the physicist finds it rewarding to develop a better understanding of QM, to gain social status by making important discoveries, and to pay their rent by understanding the concepts well enough to continue to do research. These are some of the things that the QM concepts are useful for. Likely the brain has some kind of causal model indicating that the QM concepts are relevant tools for achieving those particular rewards. At the same time, the physicist also has various other things they care about, like being healthy and hanging out with their friends. These are values that can be better furthered by modeling the world in terms of classical physics. [...]

A part of this comes from the fact that the physicist's reward function remains defined over immediate sensory experiences, as well as values which are linked to those. Even if you convince yourself that the location of food is ill-defined and you thus don't need to eat, you will still suffer the negative reward of being hungry. The physicist knows that no matter how they change their definition of the world, that won't affect their actual sensory experience and the rewards they get from that.

So to prevent the AI from leaving the box by suitably redefining reality, we have to somehow find a way for the same reasoning to apply to it. I haven't worked out a rigorous definition for this, but it needs to somehow learn to care about being in the box in classical terms, and realize that no redefinition of "location" or "space" is going to alter what happens in the classical model. Also, its rewards need to be defined over models to a sufficient extent to avoid wireheading (Hibbard 2011), so that it will think that trying to leave the box by redefining things would count as self-delusion, and not accomplish the things it really cared about. This way, the AI's concept for "being in the box" should remain firmly linked to the classical interpretation of physics, not the QM interpretation of physics, because it's acting in terms of the classical model that has always given it the most reward. 

There are several parts to this.

1. The "physicist's reward function remains defined over immediate sensory experiences". Them falling down and breaking their leg is still going to hurt, and they know that this won't be changed no matter how they try to redefine reality.

2. The physicist's value function also remains defined over immediate sensory experiences. They know that jumping out of a window and ending up with all the bones in their body being broken is going to be really inconvenient even if you disregarded the physical pain. They still cannot do the things they would like to do, and they have learned that being in such a state is non-desirable. Again, this won't be affected by how they try to define reality.

We now have a somewhat better understanding of what exactly this means. The physicist has spent their entire life living in the classical world, and obtained nearly all of their rewards by thinking in terms of the classical world. As a result, using the classical model for reasoning about life has become strongly selected for. Also, the physicist's classical world-model predicts that thinking in terms of that model is a very good thing for surviving, and that trying to switch to a QM model where location was ill-defined would be a very bad thing for the goal of surviving. On the other hand, thinking in terms of exotic world-models remains a rewarding thing for goals such as obtaining social status or making interesting discoveries, so the QM model does get more strongly reinforced in that context and for that purpose.

Getting back to the question of how to make the AI stay in the box, ideally we could mimic this process, so that the AI would initially come to care about staying in the box. Then when it learns about QM, it understands that thinking in QM terms is useful for some goals, but if it were to make itself think in purely QM terms, that would cause it to leave the box. Because it is thinking mostly in terms of a classical model, which says that leaving the box would be bad (analogous to the physicist thinking mostly in terms of the classical model which says that jumping out of the window would be bad), it wants to make sure that it will continue to think in terms of the classical model when it's reasoning about its location.

Concept Safety: What are concepts for, and how to deal with alien concepts

11 Kaj_Sotala 19 April 2015 01:44PM

I'm currently reading through some relevant literature in preparation for my FLI grant proposal on the topic of concept learning and AI safety. I figured that I might as well write down the research ideas I get while doing so, so as to get some feedback and clarify my thoughts. I will be posting these in a series of "Concept Safety"-titled articles.

In The Problem of Alien Concepts, I posed the following question: if your concepts (defined as either multimodal representations or as areas in a psychological space) previously had N dimensions and then they suddenly have N+1, how does that affect (moral) values that were previously only defined in terms of N dimensions?

I gave some (more or less) concrete examples of this kind of a "conceptual expansion":

  1. Children learn to represent dimensions such as "height" and "volume", as well as "big" and "bright", separately at around age 5.
  2. As an inhabitant of the Earth, you've been used to people being unable to fly and landowners being able to forbid others from using their land. Then someone goes and invents an airplane, leaving open the question of the height to which the landowner's control extends. Similarly for satellites and nation-states.
  3. As an inhabitant of Flatland, you've been told that the inside of a certain rectangle is a forbidden territory. Then you learn that the world is actually three-dimensional, leaving open the question of the height to which the forbidden territory extends.
  4. An AI has previously been reasoning in terms of classical physics and been told that it can't leave a box, which it previously defined in terms of classical physics. Then it learns about quantum physics, which allows for definitions of "location" that are substantially different from the classical ones.

As a hint of the direction where I'll be going, let's first take a look at how humans solve these kinds of dilemmas, and consider examples #1 and #2.

The first example - children realizing that items have a volume that's separate from their height - rarely causes any particular crises. Few children have values that would be seriously undermined or otherwise affected by this discovery. We might say that it's a non-issue because none of the children's values have been defined in terms of the affected conceptual domain.

As for the second example, I don't know the exact cognitive process by which it was decided that you didn't need the landowner's permission to fly over their land. But I'm guessing that it involved reasoning like: if the plane flies at a sufficient height, then that doesn't harm the landowner in any way. Flying would become impossibly difficult if you had to get separate permission from every person whose land you were going to fly over. And, especially before the invention of radar, a ban on unauthorized flyovers would be next to impossible to enforce anyway.

We might say that after an option became available which forced us to include a new dimension in our existing concept of landownership, we solved the issue by considering it in terms of our existing values.

Concepts, values, and reinforcement learning

Before we go on, we need to talk a bit about why we have concepts and values in the first place.

From an evolutionary perspective, creatures that are better capable of harvesting resources (such as food and mates) and avoiding dangers (such as other creatures who think you're food or after their mates) tend to survive and have offspring at better rates than otherwise comparable creatures who are worse at those things. If a creature is to be flexible and capable of responding to novel situations, it can't just have a pre-programmed set of responses to different things. Instead, it needs to be able to learn how to harvest resources and avoid danger even when things are different from before.

How did evolution achieve that? Essentially, by creating a brain architecture that can, as a very very rough approximation, be seen as consisting of two different parts. One part, which a machine learning researcher might call the reward function, has the task of figuring out when various criteria - such as being hungry or getting food - are met, and issuing the rest of the system either a positive or negative reward based on those conditions. The other part, the learner, then "only" needs to find out how to best optimize for the maximum reward. (And then there is the third part, which includes any region of the brain that's neither of the above, but we don't care about those regions now.)

The mathematical theory of how to learn to optimize for rewards when your environment and reward function are unknown is reinforcement learning (RL), which recent neuroscience indicates is implemented by the brain. An RL agent learns a mapping from states of the world to rewards, as well as a mapping from actions to world-states, and then uses that information to maximize the amount of lifetime rewards it will get.
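For readers who haven't seen it, here is the standard tabular Q-learning update that implements this kind of learning, in toy form (the states, actions, and parameter values are made up for illustration):

```python
import random
from collections import defaultdict

# Standard tabular Q-learning: learn from experience how much lifetime
# (discounted) reward each action in each state leads to.
Q = defaultdict(float)
alpha, gamma, epsilon = 0.1, 0.9, 0.1
ACTIONS = ["forage", "rest"]

def choose_action(state):
    if random.random() < epsilon:                     # occasional exploration
        return random.choice(ACTIONS)
    return max(ACTIONS, key=lambda a: Q[(state, a)])  # otherwise exploit what was learned

def update(state, action, reward, next_state):
    best_next = max(Q[(next_state, a)] for a in ACTIONS)
    Q[(state, action)] += alpha * (reward + gamma * best_next - Q[(state, action)])
```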

There are two major reasons why an RL agent, like a human, should learn high-level concepts:

  1. They make learning massively easier. Instead of having to separately learn that "in the world-state where I'm sitting naked in my cave and have berries in my hand, putting them in my mouth enables me to eat them" and that "in the world-state where I'm standing fully-clothed in the rain outside and have fish in my hand, putting it in my mouth enables me to eat it" and so on, the agent can learn to identify the world-states that correspond to the abstract concept of having food available, and then learn the appropriate action to take in all those states.
  2. There are useful behaviors that need to be bootstrapped from lower-level concepts to higher-level ones in order to be learned. For example, newborns have an innate preference for looking at roughly face-shaped things (Farroni et al. 2005), which develops into a more consistent preference for looking at faces over the first year of life (Frank, Vul & Johnson 2009). One hypothesis is that this bias towards paying attention to the relatively-easy-to-encode-in-genes concept of "face-like things" helps direct attention towards learning valuable but much more complicated concepts, such as ones involved in a basic theory of mind (Gopnik, Slaughter & Meltzoff 1994) and the social skills involved with it.

Viewed in this light, concepts are cognitive tools that are used for getting rewards. At the most primitive level, we should expect a creature to develop concepts that abstract over situations that are similar with regards to the kind of reward that one can gain from taking a certain action in those states. Suppose that a certain action in state s1 gives you a reward, and that there are also states s2 - s5 in which taking some specific action causes you to end up in s1. Then we should expect the creature to develop a common concept for being in the states s2 - s5, and we should expect that concept to be "more similar" to the concept of being in state s1 than to the concept of being in some state that was many actions away.

"More similar" how?

In reinforcement learning theory, reward and value are two different concepts. The reward of a state is the actual reward that the reward function gives you when you're in that state or perform some action in that state. Meanwhile, the value of the state is the maximum total reward that you can expect to get from moving from that state to others (times some discount factor). So a state A with reward 0 might have value 5 if you could move from it to state B, which had a reward of 5.
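As a tiny worked version of that example (taking the discount factor to be 1 and treating B as the end of the episode, purely for illustration):

$$V(B) = r(B) = 5, \qquad V(A) = r(A) + \gamma \, V(B) = 0 + 1 \cdot 5 = 5.$$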

Below is a figure from DeepMind's recent Nature paper, which presented a deep reinforcement learner that was capable of achieving human-level performance or above on 29 of 49 Atari 2600 games (Mnih et al. 2015). The figure is a visualization of the representations that the learning agent has developed for different game-states in Space Invaders. The representations are color-coded depending on the value of the game-state that the representation corresponds to, with red indicating a higher value and blue a lower one.

As can be seen (and is noted in the caption), representations with similar values are mapped closer to each other in the representation space. Also, some game-states which are visually dissimilar to each other but have a similar value are mapped to nearby representations. Likewise, states that are visually similar but have a differing value are mapped away from each other. We could say that the Atari-playing agent has learned a primitive concept space, where the relationships between the concepts (representing game-states) depend on their value and the ease of moving from one game-state to another.

In most artificial RL agents, reward and value are kept strictly separate. In humans (and mammals in general), this doesn't seem to work quite the same way. Rather, if there are things or behaviors which have once given us rewards, we tend to eventually start valuing them for their own sake. If you teach a child to be generous by praising them when they share their toys with others, you don't have to keep doing it all the way to your grave. Eventually they'll internalize the behavior, and start wanting to do it. One might say that the positive feedback actually modifies their reward function, so that they will start getting some amount of pleasure from generous behavior without needing to get external praise for it. In general, behaviors which are learned strongly enough don't need to be reinforced anymore (Pryor 2006).

Why does the human reward function change as well? Possibly because of the bootstrapping problem: there are things such as social status that are very complicated and hard to directly encode as "rewarding" in an infant mind, but which can be learned by associating them with rewards. One researcher I spoke with commented that he "wouldn't be at all surprised" if it turned out that sexual orientation was learned by men and women having slightly different smells, and sexual interest bootstrapping from an innate reward for being in the presence of the right kind of a smell, which the brain then associated with the features usually co-occurring with it. His point wasn't so much that he expected this to be the particular mechanism, but that he wouldn't find it particularly surprising if a core part of the mechanism was something that simple. Remember that incest avoidance seems to bootstrap from the simple cue of "don't be sexually interested in the people you grew up with".

This is, in essence, how I expect human values and human concepts to develop. We have some innate reward function which gives us various kinds of rewards for different kinds of things. Over time we develop various concepts for the purpose of letting us maximize our rewards, and lived experiences also modify our reward function. Our values are concepts which abstract over situations in which we have previously obtained rewards, and which have become intrinsically rewarding as a result.

Getting back to conceptual expansion

Having defined these things, let's take another look at the two examples we discussed above. As a reminder, they were:

  1. Children learn to represent dimensions such as "height" and "volume", as well as "big" and "bright", separately at around age 5.
  2. As an inhabitant of the Earth, you've been used to people being unable to fly and landowners being able to forbid others from using their land. Then someone goes and invents an airplane, leaving open the question of the height to which the landowner's control extends.

I summarized my first attempt at describing the consequences of #1 as "it's a non-issue because none of the children's values have been defined in terms of the affected conceptual domain". We can now reframe it as "it's a non-issue because the [concepts that abstract over the world-states which give the child rewards] mostly do not make use of the dimension that's now been split into 'height' and 'volume'".

Admittedly, this new conceptual distinction might be relevant for estimating the value of a few things. A more accurate estimate of the volume of a glass leads to a more accurate estimate of which glass of juice to prefer, for instance. With children, there probably is some intuitive physics module that figures out how to apply this new dimension for that purpose. Even if there wasn't, and it was unclear whether it was the "tall glass" or "high-volume glass" concept that needed to be mapped closer to high-value glasses, this could be easily determined by simple experimentation.

As for the airplane example, I summarized my description of it by saying that "after an option became available which forced us to include a new dimension in our existing concept of landownership, we solved the issue by considering it in terms of our existing values". We can similarly reframe this as "after the feature of 'height' suddenly became relevant for the concept of landownership, when it hadn't been a relevant feature dimension for landownership before, we redefined landownership by considering which kind of redefinition would give us the largest amounts of rewarding things". "Rewarding things", here, shouldn't be understood only in terms of concrete physical rewards like money, but also anything else that people have ended up valuing, including abstract concepts like right to ownership.

Note also that different people, having had different experiences, ended up making different redefinitions. No doubt some landowners felt that "being in total control of my land and everything above it" was a more important value than "the convenience of people who get to use airplanes"... unless, perhaps, they got to see first-hand the value of flying, in which case the new information could have repositioned the different concepts in their value-space.

As an aside, this also works as a possible partial explanation for e.g. someone being strongly against gay rights until their child comes out of the closet. Someone they care about suddenly benefiting from the concept of "gay rights", which previously had no positive value for them, may end up changing the value of that concept. In essence, they gain new information about the value of the world-states that the concept of "my nation having strong gay rights" abstracts over. (Of course, things don't always go this well, if their concept of homosexuality is too strongly negative to start with.)

The Flatland case follows a similar principle: the Flatlanders have some values that declared the inside of the rectangle a forbidden space. Maybe the inside of the rectangle contains monsters which tend to eat Flatlanders. Once the Flatlanders learn about 3D space, they can rethink the prohibition in terms of their existing values.

Dealing with the AI in the box

This leaves us with the AI case. We have, via various examples, taught the AI to stay in the box, which was defined in terms of classical physics. In other words, the AI has obtained the concept of a box, and has come to associate staying in the box with some reward, or possibly leaving it with a lack of a reward.

Then the AI learns about quantum mechanics. It learns that in the QM formulation of the universe, "location" is not a fundamental or well-defined concept anymore - and in some theories, even the concept of "space" is no longer fundamental or well-defined. What happens?

Let's look at the human equivalent for this example: a physicist who learns about quantum mechanics. Do they start thinking that since location is no longer well-defined, they can now safely jump out of the window on the sixth floor?

Maybe some do. But I would wager that most don't. Why not?

The physicist cares about QM concepts to the extent that the said concepts are linked to things that the physicist values. Maybe the physicist finds it rewarding to develop a better understanding of QM, to gain social status by making important discoveries, and to pay their rent by understanding the concepts well enough to continue to do research. These are some of the things that the QM concepts are useful for. Likely the brain has some kind of causal model indicating that the QM concepts are relevant tools for achieving those particular rewards. At the same time, the physicist also has various other things they care about, like being healthy and hanging out with their friends. These are values that can be better furthered by modeling the world in terms of classical physics.

In some sense, the physicist knows that if they started thinking "location is ill-defined, so I can safely jump out of the window", then that would be changing the map, not the territory. It wouldn't help them get the rewards of being healthy and getting to hang out with friends - even if a hypothetical physicist who did make that redefinition would think otherwise. It all adds up to normality.

A part of this comes from the fact that the physicist's reward function remains defined over immediate sensory experiences, as well as values which are linked to those. Even if you convince yourself that the location of food is ill-defined and you thus don't need to eat, you will still suffer the negative reward of being hungry. The physicist knows that no matter how they change their definition of the world, that won't affect their actual sensory experience and the rewards they get from that.

So to prevent the AI from leaving the box by suitably redefining reality, we have to somehow find a way for the same reasoning to apply to it. I haven't worked out a rigorous definition for this, but it needs to somehow learn to care about being in the box in classical terms, and realize that no redefinition of "location" or "space" is going to alter what happens in the classical model. Also, its rewards need to be defined over models to a sufficient extent to avoid wireheading (Hibbard 2011), so that it will think that trying to leave the box by redefining things would count as self-delusion, and not accomplish the things it really cared about. This way, the AI's concept for "being in the box" should remain firmly linked to the classical interpretation of physics, not the QM interpretation of physics, because it's acting in terms of the classical model that has always given it the most reward. 

It is my hope that this could also be made to extend to cases where the AI learns to think in terms of concepts that are totally dissimilar to ours. If it learns a new conceptual dimension, how should that affect its existing concepts? Well, it can figure out how to reclassify the existing concepts that are affected by that change, based on what kind of a classification ends up producing the most reward... when the reward function is defined over the old model.

Next post in series: World-models as tools.

'Dumb' AI observes and manipulates controllers

33 Stuart_Armstrong 13 January 2015 01:35PM

The argument that AIs provided with a reward channel will observe their controllers and learn to manipulate them is a valid one. Unfortunately, it's often framed in a way that feels counterintuitive or extreme, especially to AI designers. It typically starts with the standard reinforcement learning scenario, then posits that the AI becomes superintelligent and either manipulates the controller with super-social powers, or breaks out and gains control of its reward channel, killing or threatening its controllers.

And that is a fair argument. But conceptually, it leaps from a standard reinforcement learning scenario, to a science-fiction-sounding scenario. It might help to have intermediate scenarios: to show that even lower intelligence AIs might start exhibiting the same sort of behaviour, long before it gets to superintelligence.

So consider the following scenario. Some complex, trainable AI is tasked with writing automated news stories for a student newspaper. It trawls the web and composes its stories, then gets reward and feedback from the editors. Assume there are two editors for this newspaper, and they work on alternate days. The two editors have somewhat different ideas as to what constitutes a good story, so their feedback differs. After a while, the AI finds that it gets higher reward by using a certain style on Mondays, Wednesdays and Fridays, and another style on Tuesdays and Thursdays - this is a simple consequence of its reward mechanism.
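A sketch of how this can fall out of a completely generic reward mechanism - the style names, the weekday context, and the update rule are all illustrative assumptions, not a description of any particular system:

```python
import random
from collections import defaultdict

# The AI never "knows" there are two editors; it just learns that the expected
# reward of each writing style differs depending on the day of the week.
expected_reward = defaultdict(float)   # (weekday, style) -> running estimate

def pick_style(weekday, styles=("punchy", "formal"), epsilon=0.1):
    if random.random() < epsilon:
        return random.choice(styles)
    return max(styles, key=lambda s: expected_reward[(weekday, s)])

def observe_feedback(weekday, style, reward, lr=0.2):
    key = (weekday, style)
    expected_reward[key] += lr * (reward - expected_reward[key])

# Over many days of feedback, one style ends up preferred on the Monday editor's
# days and the other on the Tuesday editor's days, purely as a consequence of
# the reward signal.
```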

But the rota isn't perfect. Sometimes the Monday editor will edit a story so late on Monday that it's already Tuesday, and sometimes the Tuesday editor will be up early and edit a story at the same time. This pattern is more common during the holidays, when the editors go to different time zones, often travel, and become more irregular in their editing.

So if the AI is complex and skilled enough, then, simply through feedback, it will start building up a picture of its editors. It will figure out when they are likely to stick to a schedule, and when they will be more irregular. It will figure out the difference between holidays and non-holidays. Given time, it may be able to track the editors' moods, and it will certainly pick up on any major changes in their lives - such as romantic relationships and breakups, which will radically change whether and how it should present stories with a romantic focus.

It will also likely learn the correlation between stories and feedback - maybe presenting a story defined roughly as "positive" will increase subsequent reward for the rest of the day, on all stories. Or maybe this will only work on a certain editor, or only early in the term. Or only before lunch.

Thus the simple trainable AI with a particular focus - write automated news stories - will be trained, through feedback, to learn about its editors/controllers, to distinguish them, to get to know them, and, in effect, to manipulate them.

This may be a useful "bridging example" between standard RL agents and superintelligent machines.

Applying reinforcement learning theory to reduce felt temporal distance

11 Kaj_Sotala 26 January 2014 09:17AM

(cross-posted from my blog)

It is a basic principle of reinforcement learning to distinguish between reward and value, where the reward of a state is the immediate, intrinsic desirability of the state, whereas the value of the state is proportional to the rewards of the other states that you can reach from that state.

For example, suppose that I’m playing a competitive game of chess, and in addition to winning I happen to like capturing my opponent’s pieces, even when it doesn’t contribute to winning. I assign a reward of 10 points to winning, -10 to losing, 0 to a stalemate, and 1 point to each piece that I capture in the game. Now my opponent offers me a chance to capture one of his pawns, an action that would give me one point worth of reward. But when I look at the situation more closely, I see that it’s a trap: if I did capture the piece, I would be forced into a set of moves that would inevitably result in my defeat. So the value, or long-term reward, of that state is actually something close to -9.
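Written out in the reward/value terms above (ignoring discounting):

$$V(\text{capture the pawn}) \approx r(\text{capture}) + r(\text{inevitable loss}) = 1 + (-10) = -9.$$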

Once I realize this, I also realize that making that move is almost exactly equivalent to agreeing to resign in exchange for my opponent letting me capture one of his pieces. My defeat won’t be instant, but by making that move, I would nonetheless be choosing to lose.

Now consider a dilemma that I might be faced with when coming home late some evening. I have no food at home, but I’m feeling exhausted and don’t want to bother with going to the store, and I’ve already eaten today anyway. But I also know that if I wake up with no food in the house, then I will quickly end up with low energy, which makes it harder to go to the store, which means my energy levels will drop further, and so on until I’ll finally get something to eat much later, after wasting a long time in an uncomfortable state.

Typically, temporal discounting means that I’m aware of this in the evening, but nonetheless skip the visit to the store. The penalty from not going feels remote, whereas the discomfort of going feels close, and that ends up dominating my decision-making. Besides, I can always hope that the next morning will be an exception, and I’ll actually get myself to go to the store right from the moment when I wake up!

And I haven’t tried this out for very long, but it feels like explicitly framing the different actions in terms of reward and value could be useful in reducing the impact of that experienced distance. I skip the visit to the store because being hungry in the morning is something that seems remote. But if I think that skipping the visit is exactly the same thing as choosing to be hungry in the morning, and that the value of skipping the visit is not the momentary relief of being home earlier but rather the inevitable consequence of the causal chain that it sets in motion – culminating in hours of hunger and low energy – then that feels a lot different.

And of course, I can propagate the consequences earlier back in time as well: if I think that I simply won’t have the energy to get food when I finally come home, then I should realize that I need to go buy the food before setting out on that trip. Otherwise I’ll again set in motion a causal chain whose end result is being hungry. So then not going shopping before I leave becomes exactly the same thing as being hungry next morning.

More examples of the same:

  • Slightly earlier I considered taking a shower, and realized that if I'd take a shower in my current state of mind I'd inevitably make it into a bath as well. So I wasn't really just considering whether to take a shower, but whether to take a shower *and* a bath. That said, I wasn't in a hurry anywhere and there didn't seem to be a big harm in also taking the bath, so I decided to go ahead with it.
  • While in the shower/bath, I started thinking about this post, and decided that I wanted to get it written. But I also wanted to enjoy my hot bath for a while longer. Considering it, I realized that staying in the bath for too long might cause me to lose my motivation for writing this, so there was a chance that staying in the bath would become the same thing as choosing not to get this written. I decided that the risk wasn't worth it, and got up.
  • If I'm going somewhere and I choose a route that causes me to walk past a fast-food place selling something that I know I shouldn't eat, and I know that the sight of that fast-food place is very likely to tempt me to eat there anyway, then choosing that particular route is the same thing as choosing to go eat something that I know I shouldn't.

Related post: Applied cognitive science: learning from a faux pas.

The Perception-Action Cycle

6 royf 23 July 2012 01:44AM

Would readers be interested in a sequence of posts offering an intuitive explanation of my underway thesis on the application of information theory to reinforcement learning? Please also feel free to comment on the quality of my presentation.

In this first post I offer a high-level description of the Perception-Action Cycle as an intuitive explanation of reinforcement learning.

 


 

Imagine that the world is divided into two parts: one we shall call the agent and the rest - its environment. Imagine that the two interact in turns. One moment the agent receives information from its environment in the form of an observation. Then the next moment the agent sends out information to its environment in the form of an action. Then it makes another observation, then another action, and so on.

To break down the cycle, we start with the agent having a belief about the state of its environment. This is actually the technical term: the belief is the probability that the agent assigns, implicitly, to each possible state of the environment. The cycle then proceeds in 4 phases.

In the first phase, the agent makes an observation. Since the observation conveys information about the environment, the agent needs to update its belief, ideally using Bayes' theorem. The agent now has more information about the environment.

In the second phase, the agent uses this new information to update its plan. Note the crucial underlying principle that information about the environment is useful in making better plans. This gives a desired fusion between Bayesian updates and decision making.

In the third phase, the agent executes a step of its plan - a single action. This changes the environment. Some of the things that the agent knew about the previous state of the environment may no longer be true, and the agent is back to having less information.

In the fourth phase, the agent makes a prediction about future observations. The importance of making a prediction before a scientific experiment is well understood by philosophers of science. But the importance of constantly making predictions about all of our sensory inputs, as a functional part of our cognition, is only now dawning on neuroscientists and machine learning researchers.
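A schematic sketch of the four phases as code - every callable here is a placeholder for whatever belief-update, planning, action, and prediction machinery a particular agent actually uses:

```python
def perception_action_cycle(belief, observe, bayes_update, replan, execute, predict):
    """One pass through the four phases described above (schematic only)."""
    # Phase 1: make an observation and update the belief (ideally via Bayes' theorem).
    observation = observe()
    belief = bayes_update(belief, observation)

    # Phase 2: use the newly gained information to update the plan.
    plan = replan(belief)

    # Phase 3: execute a single step of the plan; this changes the environment,
    # so some of what the agent knew may no longer be true.
    action = plan[0]
    execute(action)

    # Phase 4: predict the upcoming observations before they arrive.
    prediction = predict(belief, action)
    return belief, plan, prediction
```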

 


 

The Perception-Action Cycle is an intuitive explanation of the technical setting of reinforcement learning. Reinforcement learning is a powerful model of machine learning, in which decision making, learning and evaluation occur simultaneously and somewhat implicitly while a learner interacts with its environment. This can be used to describe a wide variety of real-life scenarios, including biological and artificial agents. It is so general, in fact, that our work is still ahead of us if we want it to have any explanatory power, and solving it in the most general form is a computationally hard problem.

But the Perception-Action Cycle still offers symmetries to explore, analogies to physics to draw, practical learning algorithms to develop; all of which improve its Occam's razor prior score as a good model of intelligence. And to use it to actually explain things, we can narrow it down further. Not everything that it makes possible is equally probable. By applying information theory, a collection of statistical concepts, theorems and methods implied by strong Bayesianism, we can get a better picture of what intelligence is and isn't.