## Concept Safety: World-models as tools

5 09 May 2015 12:07PM

I'm currently reading through some relevant literature for preparing my FLI grant proposal on the topic of concept learning and AI safety. I figured that I might as well write down the research ideas I get while doing so, so as to get some feedback and clarify my thoughts. I will posting these in a series of "Concept Safety"-titled articles.

The AI in the quantum box

In the previous post, I discussed the example of an AI whose concept space and goals were defined in terms of classical physics, which then learned about quantum mechanics. Let's elaborate on that scenario a little more.

I wish to zoom in on a certain assumption that I've noticed in previous discussions of these kinds of examples. Although I couldn't track down an exact citation right now, I'm pretty confident that I've heard the QM scenario framed as something like "the AI previously thought in terms of classical mechanics, but then it finds out that the world actually runs on quantum mechanics". The key assumption being that quantum mechanics is in some sense more real than classical mechanics.

This kind of an assumption is a natural one to make if someone is operating on an AIXI-inspired model of AI. Although AIXI considers an infinite amount of world-models, there's a sense in which AIXI always strives to only have one world-model. It's always looking for the simplest possible Turing machine that would produce all of the observations that it has seen so far, while ignoring the computational cost of actually running that machine. AIXI, upon finding out about quantum mechanics, would attempt to update its world-model into one that only contained QM primitives and to derive all macro-scale events right from first principles.

No sane design for a real-world AI would try to do this. Instead, a real-world AI would take advantage of scale separation. This refers to the fact that physical systems can be modeled on a variety of different scales, and it is in many cases sufficient to model them in terms of concepts that are defined in terms of higher-scale phenomena. In practice, the AI would have a number of different world-models, each of them being applied in different situations and for different purposes.

Here we get back to the view of concepts as tools, which I discussed in the previous post. An AI that was doing something akin to reinforcement learning would come to learn the kinds of world-models that gave it the highest rewards, and to selectively employ different world-models based on what was the best thing to do in each situation.

As a toy example, consider an AI that can choose to run a low-resolution or a high-resolution psychological model of someone it's interacting with, in order to predict their responses and please them. Say the low-resolution model takes a second to run and is 80% accurate; the high-resolution model takes five seconds to run and is 95% accurate. Which model will be chosen as the one to be used will depend on the cost matrix of making a correct prediction, making a false prediction, and the consequence of making the other person wait for an extra four seconds before the AI's each reply.

We can now see that a world-model being the most real, i.e. making the most accurate predictions, doesn't automatically mean that it will be used. It also needs to be fast enough to run, and the predictions need to be useful for achieving something that the AI cares about.

World-models as tools

From this point of view, world-models are literally tools just like any other. Traditionally in reinforcement learning, we would define the value of a policy $\pi$ in state s as the expected reward given the state s and the policy $\pi$,

$V^{\pi}(s)=E[R|s,\pi]$

but under the "world-models are tools" perspective, we need to also condition on the world-model m,

$V^{\pi}(s,m)=E[R|s,\pi,m]$ .

We are conditioning on the world-model in several distinct ways.

First, there is the expected behavior of the world as predicted by world-model m. A world-model over the laws of social interaction would do poorly at predicting the movement of celestial objects, if it could be applied to them at all. Different predictions of behavior may also lead to differing predictions of the value of a state. This is described by the equation above.

Second, there is the expected cost of using the world-model. Using a more detailed world-model may be more computationally expensive, for instance. One way of interpreting this in a classical RL framework would be that using a specific world-model will place the agent in a different state than using some other world-model. We might describe by saying that in addition to the agent choosing its next action a on each time-step, the agent also needs to choose the world-model m which it will use to analyze its next observations. This will be one of the inputs for the transition function $\delta$ to the next state.

$\delta:s\times(a,m)\rightarrow\(s,m)$

Third, there is the expected behavior of the agent using world-model m. An agent with different beliefs about the world will act differently in the future: this means that the future policy $\pi_{f}$ actually depends on the chosen world-model.

$P(\pi_{f})=P(\pi|m)$

Some very interesting questions pop up at this point. Your currently selected world-model is what you use to evaluate your best choices for the next step... including the choice of what world-model to use next. So whether or not you're going to switch to a different world-model for evaluating the next step depends on whether your current world-model says that a different world-model would be better in that step.

We have not fully defined what exactly we mean by "world-models" here. Previously I gave the example of a world-model over the laws of social interaction, versus a world-model over the laws of physics. But a world-model over the laws of social interaction, say, would not have an answer to the question of which world-model to use for things it couldn't predict. So one approach would be to say that we actually have some meta-model over world-models, telling us which is the best to use in what situation.

On the other hand, it does also seem like humans often use a specific world-model and its predictions to determine whether to choose another world-model. For example, in rationalist circles you often see arguments to the line of, "self-deception might give you extra confidence, but it introduces errors into your world-model, and in the long term those are going to be more harmful than the extra confidence is beneficial". Here you see an implicit appeal to a world-model which predicts an accumulation of false beliefs with some specific effects, as well as predicting the extra self-esteem with its effects. But this kind of an analysis incorporates very specific causal claims from various (e.g. psychological) models, which are models over the world rather than just being part of some general meta-model over models. Notice also that the example analysis takes into account the way that having a specific world-model affects the state transition function: it assumes that a self-deceptive model may land us in a state where we have a higher self-esteem.

It's possible to get stuck in one world-model: for example, a strongly non-reductionist model evaluating the claims of a highly reductionist one might think it obviously crazy, and vice versa. So it seems that we do need something like a meta-evaluation function. Otherwise it would be too easy to get stuck in one model which claimed that it was the best one in every possible situation, and never agreed to "give up control" in favor of another one.

One possibility for such a thing would be a relatively model-free learning mechanism, which just kept track of the rewards accumulated when using a particular model in a particular situation. It would then bias the selection of the model towards the direction of the model that had been the most successful so far.

Human neuroscience and meta-models

We might be able to identify something like this in humans, though this is currently very speculative on my part. Action selection is carried out in the basal ganglia: different brain systems send the basal ganglia "bids" for various actions. The basal ganglia then chooses which actions to inhibit or disinhibit (by default, everything is inhibited). The basal ganglia also implements reinforcement learning, selectively strengthening or weakening the connections associated with a particular bid and context when a chosen action leads to a higher or lower reward than was expected. It seems that in addition to choosing between motor actions, the basal ganglia also chooses between different cognitive behaviors, likely even thoughts

If action selection and reinforcement learning are normal functions of the basal ganglia, it should be possible to interpret many of the human basal ganglia-related disorders in terms of selection malfunctions. For example, the akinesia of Parkinson's disease may be seen as a failure to inhibit tonic inhibitory output signals on any of the sensorimotor channels. Aspects of schizophrenia, attention deficit disorder and Tourette's syndrome could reflect different forms of failure to maintain sufficient inhibitory output activity in non-selected channels. Conseqently, insufficiently inhibited signals in non-selected target structures could interfere with the output of selected targets (expressed as motor/verbal tics) and/or make the selection system vulnerable to interruption from distracting stimuli (schizophrenia, attention deficit disorder). The opposite situation would be where the selection of one functional channel is abnormally dominant thereby making it difficult for competing events to interrupt or cause a behavioural or attentional switch. Such circumstances could underlie addictive compulsions or obsessive compulsive disorder. (Redgrave 2007)

Although I haven't seen a paper presenting evidence for this particular claim, it seems plausible to assume that humans similarly come to employ new kinds of world-models based on the extent to which using a particular world-model in a particular situation gives them rewards. When a person is in a situation where they might think in terms of several different world-models, there will be neural bids associated with mental activities that recruit the different models. Over time, the bids associated with the most successful models will become increasingly favored. This is also compatible with what we know about e.g. happy death spirals and motivated stopping: people will tend to have the kinds of thoughts which are rewarding to them.

The physicist and the AI

In my previous post, when discussing the example of the physicist who doesn't jump out of the window when they learn about QM and find out that "location" is ill-defined:

The physicist cares about QM concepts to the extent that the said concepts are linked to things that the physicist values. Maybe the physicist finds it rewarding to develop a better understanding of QM, to gain social status by making important discoveries, and to pay their rent by understanding the concepts well enough to continue to do research. These are some of the things that the QM concepts are useful for. Likely the brain has some kind of causal model indicating that the QM concepts are relevant tools for achieving those particular rewards. At the same time, the physicist also has various other things they care about, like being healthy and hanging out with their friends. These are values that can be better furthered by modeling the world in terms of classical physics. [...]

A part of this comes from the fact that the physicist's reward function remains defined over immediate sensory experiences, as well as values which are linked to those. Even if you convince yourself that the location of food is ill-defined and you thus don't need to eat, you will still suffer the negative reward of being hungry. The physicist knows that no matter how they change their definition of the world, that won't affect their actual sensory experience and the rewards they get from that.

So to prevent the AI from leaving the box by suitably redefining reality, we have to somehow find a way for the same reasoning to apply to it. I haven't worked out a rigorous definition for this, but it needs to somehow learn to care about being in the box in classical terms, and realize that no redefinition of "location" or "space" is going to alter what happens in the classical model. Also, its rewards need to be defined over models to a sufficient extent to avoid wireheading (Hibbard 2011), so that it will think that trying to leave the box by redefining things would count as self-delusion, and not accomplish the things it really cared about. This way, the AI's concept for "being in the box" should remain firmly linked to the classical interpretation of physics, not the QM interpretation of physics, because it's acting in terms of the classical model that has always given it the most reward.

There are several parts to this.

1. The "physicist's reward function remains defined over immediate sensory experiences". Them falling down and breaking their leg is still going to hurt, and they know that this won't be changed no matter how they try to redefine reality.

2. The physicist's value function also remains defined over immediate sensory experiences. They know that jumping out of a window and ending up with all the bones in their body being broken is going to be really inconvenient even if you disregarded the physical pain. They still cannot do the things they would like to do, and they have learned that being in such a state is non-desirable. Again, this won't be affected by how they try to define reality.

We now have a somewhat better understanding of what exactly this means. The physicist has spent their entire life living in the classical world, and obtained nearly all of their rewards by thinking in terms of the classical world. As a result, using the classical model for reasoning about life has become strongly selected for. Also, the physicist's classical world-model predicts that thinking in terms of that model is a very good thing for surviving, and that trying to switch to a QM model where location was ill-defined would be a very bad thing for the goal of surviving. On the other hand, thinking in terms of exotic world-models remains a rewarding thing for goals such as obtaining social status or making interesting discoveries, so the QM model does get more strongly reinforced in that context and for that purpose.

Getting back to the question of how to make the AI stay in the box, ideally we could mimic this process, so that the AI would initially come to care about staying in the box. Then when it learns about QM, it understands that thinking in QM terms is useful for some goals, but if it were to make itself think in purely QM terms, that would cause it to leave the box. Because it is thinking mostly in terms of a classical model, which says that leaving the box would be bad (analogous to the physicist thinking mostly in terms of the classical model which says that jumping out of the window would be bad), it wants to make sure that it will continue to think in terms of the classical model when it's reasoning about its location.

## Concept Safety: What are concepts for, and how to deal with alien concepts

10 19 April 2015 01:44PM

I'm currently reading through some relevant literature for preparing my FLI grant proposal on the topic of concept learning and AI safety. I figured that I might as well write down the research ideas I get while doing so, so as to get some feedback and clarify my thoughts. I will posting these in a series of "Concept Safety"-titled articles.

In The Problem of Alien Concepts, I posed the following question: if your concepts (defined as either multimodal representations or as areas in a psychological space) previously had N dimensions and then they suddenly have N+1, how does that affect (moral) values that were previously only defined in terms of N dimensions?

I gave some (more or less) concrete examples of this kind of a "conceptual expansion":

1. Children learn to represent dimensions such as "height" and "volume", as well as "big" and "bright", separately at around age 5.
2. As an inhabitant of the Earth, you've been used to people being unable to fly and landowners being able to forbid others from using their land. Then someone goes and invents an airplane, leaving open the question of the height to which the landowner's control extends. Similarly for satellites and nation-states.
3. As an inhabitant of Flatland, you've been told that the inside of a certain rectangle is a forbidden territory. Then you learn that the world is actually three-dimensional, leaving open the question of the height of which the forbidden territory extends.
4. An AI has previously been reasoning in terms of classical physics and been told that it can't leave a box, which it previously defined in terms of classical physics. Then it learns about quantum physics, which allow for definitions of "location" which are substantially different from the classical ones.

As a hint of the direction where I'll be going, let's first take a look at how humans solve these kinds of dilemmas, and consider examples #1 and #2.

The first example - children realizing that items have a volume that's separate from their height - rarely causes any particular crises. Few children have values that would be seriously undermined or otherwise affected by this discovery. We might say that it's a non-issue because none of the children's values have been defined in terms of the affected conceptual domain.

As for the second example, I don't know the exact cognitive process by which it was decided that you didn't need the landowner's permission to fly over their land. But I'm guessing that it involved reasoning like: if the plane flies at a sufficient height, then that doesn't harm the landowner in any way. Flying would become impossible difficult if you had to get separate permission from every person whose land you were going to fly over. And, especially before the invention of radar, a ban on unauthorized flyovers would be next to impossible to enforce anyway.

We might say that after an option became available which forced us to include a new dimension in our existing concept of landownership, we solved the issue by considering it in terms of our existing values.

Concepts, values, and reinforcement learning

Before we go on, we need to talk a bit about why we have concepts and values in the first place.

From an evolutionary perspective, creatures that are better capable of harvesting resources (such as food and mates) and avoiding dangers (such as other creatures who think you're food or after their mates) tend to survive and have offspring at better rates than otherwise comparable creatures who are worse at those things. If a creature is to be flexible and capable of responding to novel situations, it can't just have a pre-programmed set of responses to different things. Instead, it needs to be able to learn how to harvest resources and avoid danger even when things are different from before.

How did evolution achieve that? Essentially, by creating a brain architecture that can, as a very very rough approximation, be seen as consisting of two different parts. One part, which a machine learning researcher might call the reward function, has the task of figuring out when various criteria - such as being hungry or getting food - are met, and issuing the rest of the system either a positive or negative reward based on those conditions. The other part, the learner, then "only" needs to find out how to best optimize for the maximum reward. (And then there is the third part, which includes any region of the brain that's neither of the above, but we don't care about those regions now.)

The mathematical theory of how to learn to optimize for rewards when your environment and reward function are unknown is reinforcement learning (RL), which recent neuroscience indicates is implemented by the brain. An RL agent learns a mapping from states of the world to rewards, as well as a mapping from actions to world-states, and then uses that information to maximize the amount of lifetime rewards it will get.

There are two major reasons why an RL agent, like a human, should learn high-level concepts:

1. They make learning massively easier. Instead of having to separately learn that "in the world-state where I'm sitting naked in my cave and have berries in my hand, putting them in my mouth enables me to eat them" and that "in the world-state where I'm standing fully-clothed in the rain outside and have fish in my hand, putting it in my mouth enables me to eat it" and so on, the agent can learn to identify the world-states that correspond to the abstract concept of having food available, and then learn the appropriate action to take in all those states.
2. There are useful behaviors that need to be bootstrapped from lower-level concepts to higher-level ones in order to be learned. For example, newborns have an innate preference for looking at roughly face-shaped things (Farroni et al. 2005), which develops into a more consistent preference for looking at faces over the first year of life (Frank, Vul & Johnson 2009). One hypothesis is that this bias towards paying attention to the relatively-easy-to-encode-in-genes concept of "face-like things" helps direct attention towards learning valuable but much more complicated concepts, such as ones involved in a basic theory of mind (Gopnik, Slaughter & Meltzoff 1994) and the social skills involved with it.

Viewed in this light, concepts are cognitive tools that are used for getting rewards. At the most primitive level, we should expect a creature to develop concepts that abstract over situations that are similar with regards to the kind of reward that one can gain from taking a certain action in those states. Suppose that a certain action in state s1 gives you a reward, and that there are also states s2 - s5 in which taking some specific action causes you to end up in s1. Then we should expect the creature to develop a common concept for being in the states s2 - s5, and we should expect that concept to be "more similar" to the concept of being in state s1 than to the concept of being in some state that was many actions away.

"More similar" how?

In reinforcement learning theory, reward and value are two different concepts. The reward of a state is the actual reward that the reward function gives you when you're in that state or perform some action in that state. Meanwhile, the value of the state is the maximum total reward that you can expect to get from moving that state to others (times some discount factor). So a state A with reward 0 might have value 5 if you could move from it to state B, which had a reward of 5.

Below is a figure from DeepMind's recent Nature paper, which presented a deep reinforcement learner that was capable of achieving human-level performance or above on 29 of 49 Atari 2600 games (Mnih et al. 2015). The figure is a visualization of the representations that the learning agent has developed for different game-states in Space Invaders. The representations are color-coded depending on the value of the game-state that the representation corresponds to, with red indicating a higher value and blue a lower one.

As can be seen (and is noted in the caption), representations with similar values are mapped closer to each other in the representation space. Also, some game-states which are visually dissimilar to each other but have a similar value are mapped to nearby representations. Likewise, states that are visually similar but have a differing value are mapped away from each other. We could say that the Atari-playing agent has learned a primitive concept space, where the relationships between the concepts (representing game-states) depend on their value and the ease of moving from one game-state to another.

In most artificial RL agents, reward and value are kept strictly separate. In humans (and mammals in general), this doesn't seem to work quite the same way. Rather, if there are things or behaviors which have once given us rewards, we tend to eventually start valuing them for their own sake. If you teach a child to be generous by praising them when they share their toys with others, you don't have to keep doing it all the way to your grave. Eventually they'll internalize the behavior, and start wanting to do it. One might say that the positive feedback actually modifies their reward function, so that they will start getting some amount of pleasure from generous behavior without needing to get external praise for it. In general, behaviors which are learned strongly enough don't need to be reinforced anymore (Pryor 2006).

Why does the human reward function change as well? Possibly because of the bootstrapping problem: there are things such as social status that are very complicated and hard to directly encode as "rewarding" in an infant mind, but which can be learned by associating them with rewards. One researcher I spoke with commented that he "wouldn't be at all surprised" if it turned out that sexual orientation was learned by men and women having slightly different smells, and sexual interest bootstrapping from an innate reward for being in the presence of the right kind of a smell, which the brain then associated with the features usually co-occurring with it. His point wasn't so much that he expected this to be the particular mechanism, but that he wouldn't find it particularly surprising if a core part of the mechanism was something that simple. Remember that incest avoidance seems to bootstrap from the simple cue of "don't be sexually interested in the people you grew up with".

This is, in essence, how I expect human values and human concepts to develop. We have some innate reward function which gives us various kinds of rewards for different kinds of things. Over time we develop a various concepts for the purpose of letting us maximize our rewards, and lived experiences also modify our reward function. Our values are concepts which abstract over situations in which we have previously obtained rewards, and which have become intrinsically rewarding as a result.

Getting back to conceptual expansion

Having defined these things, let's take another look at the two examples we discussed above. As a reminder, they were:

1. Children learn to represent dimensions such as "height" and "volume", as well as "big" and "bright", separately at around age 5.
2. As an inhabitant of the Earth, you've been used to people being unable to fly and landowners being able to forbid others from using their land. Then someone goes and invents an airplane, leaving open the question of the height to which the landowner's control extends.

I summarized my first attempt at describing the consequences of #1 as "it's a non-issue because none of the children's values have been defined in terms of the affected conceptual domain". We can now reframe it as "it's a non-issue because the [concepts that abstract over the world-states which give the child rewards] mostly do not make use of the dimension that's now been split into 'height' and 'volume'".

Admittedly, this new conceptual distinction might be relevant for estimating the value of a few things. A more accurate estimate of the volume of a glass leads to a more accurate estimate of which glass of juice to prefer, for instance. With children, there probably is some intuitive physics module that figures out how to apply this new dimension for that purpose. Even if there wasn't, and it was unclear whether it was the "tall glass" or "high-volume glass" concept that needed be mapped closer to high-value glasses, this could be easily determined by simple experimentation.

As for the airplane example, I summarized my description of it by saying that "after an option became available which forced us to include a new dimension in our existing concept of landownership, we solved the issue by considering it in terms of our existing values". We can similarly reframe this as "after the feature of 'height' suddenly became relevant for the concept of landownership, when it hadn't been a relevant feature dimension for landownership before, we redefined landownership by considering which kind of redefinition would give us the largest amounts of rewarding things". "Rewarding things", here, shouldn't be understood only in terms of concrete physical rewards like money, but also anything else that people have ended up valuing, including abstract concepts like right to ownership.

Note also that different people, having different experiences, ended up making redefinitions. No doubt some landowners felt that the "being in total control of my land and everything above it" was a more important value than "the convenience of people who get to use airplanes"... unless, perhaps, they got to see first-hand the value of flying, in which case the new information could have repositioned the different concepts in their value-space.

As an aside, this also works as a possible partial explanation for e.g. someone being strongly against gay rights until their child comes out of the closet. Someone they care about suddenly benefiting from the concept of "gay rights", which previously had no positive value for them, may end up changing the value of that concept. In essence, they gain new information about the value of the world-states that the concept of "my nation having strong gay rights" abstracts over. (Of course, things don't always go this well, if their concept of homosexuality is too strongly negative to start with.)

The Flatland case follows a similar principle: the Flatlanders have some values that declared the inside of the rectangle a forbidden space. Maybe the inside of the rectangle contains monsters which tend to eat Flatlanders. Once they learn about 3D space, they can rethink about it in terms of their existing values.

Dealing with the AI in the box

This leaves us with the AI case. We have, via various examples, taught the AI to stay in the box, which was defined in terms of classical physics. In other words, the AI has obtained the concept of a box, and has come to associate staying in the box with some reward, or possibly leaving it with a lack of a reward.

Then the AI learns about quantum mechanics. It learns that in the QM formulation of the universe, "location" is not a fundamental or well-defined concept anymore - and in some theories, even the concept of "space" is no longer fundamental or well-defined. What happens?

Let's look at the human equivalent for this example: a physicist who learns about quantum mechanics. Do they start thinking that since location is no longer well-defined, they can now safely jump out of the window on the sixth floor?

Maybe some do. But I would wager that most don't. Why not?

The physicist cares about QM concepts to the extent that the said concepts are linked to things that the physicist values. Maybe the physicist finds it rewarding to develop a better understanding of QM, to gain social status by making important discoveries, and to pay their rent by understanding the concepts well enough to continue to do research. These are some of the things that the QM concepts are useful for. Likely the brain has some kind of causal model indicating that the QM concepts are relevant tools for achieving those particular rewards. At the same time, the physicist also has various other things they care about, like being healthy and hanging out with their friends. These are values that can be better furthered by modeling the world in terms of classical physics.

In some sense, the physicist knows that if they started thinking "location is ill-defined, so I can safely jump out of the window", then that would be changing the map, not the territory. It wouldn't help them get the rewards of being healthy and getting to hang out with friends - even if a hypothetical physicist who did make that redefinition would think otherwise. It all adds up to normality.

A part of this comes from the fact that the physicist's reward function remains defined over immediate sensory experiences, as well as values which are linked to those. Even if you convince yourself that the location of food is ill-defined and you thus don't need to eat, you will still suffer the negative reward of being hungry. The physicist knows that no matter how they change their definition of the world, that won't affect their actual sensory experience and the rewards they get from that.

So to prevent the AI from leaving the box by suitably redefining reality, we have to somehow find a way for the same reasoning to apply to it. I haven't worked out a rigorous definition for this, but it needs to somehow learn to care about being in the box in classical terms, and realize that no redefinition of "location" or "space" is going to alter what happens in the classical model. Also, its rewards need to be defined over models to a sufficient extent to avoid wireheading (Hibbard 2011), so that it will think that trying to leave the box by redefining things would count as self-delusion, and not accomplish the things it really cared about. This way, the AI's concept for "being in the box" should remain firmly linked to the classical interpretation of physics, not the QM interpretation of physics, because it's acting in terms of the classical model that has always given it the most reward.

It is my hope that this could also be made to extend to cases where the AI learns to think in terms of concepts that are totally dissimilar to ours. If it learns a new conceptual dimension, how should that affect its existing concepts? Well, it can figure out how to reclassify the existing concepts that are affected by that change, based on what kind of a classification ends up producing the most reward... when the reward function is defined over the old model.

Next post in series: World-models as tools.

## Concept Safety: The problem of alien concepts

14 17 April 2015 02:09PM

I'm currently reading through some relevant literature for preparing my FLI grant proposal on the topic of concept learning and AI safety. I figured that I might as well write down the research ideas I get while doing so, so as to get some feedback and clarify my thoughts. I will posting these in a series of "Concept Safety"-titled articles.

In the previous post in this series, I talked about how one might get an AI to have similar concepts as humans. However, one would intuitively assume that a superintelligent AI might eventually develop the capability to entertain far more sophisticated concepts than humans would ever be capable of having. Is that a problem?

Just what are concepts, anyway?

To answer the question, we first need to define what exactly it is that we mean by a "concept", and why exactly more sophisticated concepts would be a problem.

Unfortunately, there isn't really any standard definition of this in the literature, with different theorists having different definitions. Machery even argues that the term "concept" doesn't refer to a natural kind, and that we should just get rid of the whole term. If nothing else, this definition from Kruschke (2008) is at least amusing:

Models of categorization are usually designed to address data from laboratory experiments, so “categorization” might be best defined as the class of behavioral data generated by experiments that ostensibly study categorization.

Because I don't really have the time to survey the whole literature and try to come up with one grand theory of the subject, I will for now limit my scope and only consider two compatible definitions of the term.

Definition 1: Concepts as multimodal neural representations. I touched upon this definition in the last post, where I mentioned studies indicating that the brain seems to have shared neural representations for e.g. the touch and sight of a banana. Current neuroscience seems to indicate the existence of brain areas where representations from several different senses are combined together into higher-level representations, and where the activation of any such higher-level representation will also end up activating the lower sense modalities in turn. As summarized by Man et al. (2013):

Briefly, the Damasio framework proposes an architecture of convergence-divergence zones (CDZ) and a mechanism of time-locked retroactivation. Convergence-divergence zones are arranged in a multi-level hierarchy, with higher-level CDZs being both sensitive to, and capable of reinstating, specific patterns of activity in lower-level CDZs. Successive levels of CDZs are tuned to detect increasingly complex features. Each more-complex feature is defined by the conjunction and configuration of multiple less-complex features detected by the preceding level. CDZs at the highest levels of the hierarchy achieve the highest level of semantic and contextual integration, across all sensory modalities. At the foundations of the hierarchy lie the early sensory cortices, each containing a mapped (i.e., retinotopic, tonotopic, or somatotopic) representation of sensory space. When a CDZ is activated by an input pattern that resembles the template for which it has been tuned, it retro-activates the template pattern of lower-level CDZs. This continues down the hierarchy of CDZs, resulting in an ensemble of well-specified and time-locked activity extending to the early sensory cortices.

On this account, my mental concept for "dog" consists of a neural activation pattern making up the sight, sound, etc. of some dog - either a generic prototypical dog or some more specific dog. Likely the pattern is not just limited to sensory information, either, but may be associated with e.g. motor programs related to dogs. For example, the program for throwing a ball for the dog to fetch. One version of this hypothesis, the Perceptual Symbol Systems account, calls such multimodal representations simulators, and describes them as follows (Niedenthal et al. 2005):

A simulator integrates the modality-specific content of a category across instances and provides the ability to identify items encountered subsequently as instances of the same category. Consider a simulator for the social category, politician. Following exposure to different politicians, visual information about how typical politicians look (i.e., based on their typical age, sex, and role constraints on their dress and their facial expressions) becomes integrated in the simulator, along with auditory information for how they typically sound when they talk (or scream or grovel), motor programs for interacting with them, typical emotional responses induced in interactions or exposures to them, and so forth. The consequence is a system distributed throughout the brain’s feature and association areas that essentially represents knowledge of the social category, politician.

The inclusion of such "extra-sensory" features helps understand how even abstract concepts could fit this framework: for example, one's understanding of the concept of a derivative might be partially linked to the procedural programs one has developed while solving derivatives. For a more detailed hypothesis of how abstract mathematics may emerge from basic sensory and motor programs and concepts, I recommend Lakoff & Nuñez (2001).

Definition 2: Concepts as areas in a psychological space. This definition, while being compatible with the previous one, looks at concepts more "from the inside". Gärdenfors (2000) defines the basic building blocks of a psychological conceptual space to be various quality dimensions, such as temperature, weight, brightness, pitch, and the spatial dimensions of height, width, and depth. These are psychological in the sense of being derived from our phenomenal experience of certain kinds of properties, rather than the way in which they might exist in some objective reality.

For example, one way of modeling the psychological sense of color is via a color space defined by the quality dimensions of hue (represented by the familiar color circle), chromaticness (saturation), and brightness.

The second phenomenal dimension of color is chromaticness (saturation), which ranges from grey (zero color intensity) to increasingly greater intensities. This dimension is isomorphic to an interval of the real line. The third dimension is brightness which varies from white to black and is thus a linear dimension with two end points. The two latter dimensions are not totally independent, since the possible variation of the chromaticness dimension decreases as the values of the brightness dimension approaches the extreme points of black and white, respectively. In other words, for an almost white or almost black color, there can be very little variation in its chromaticness. This is modeled by letting that chromaticness and brightness dimension together generate a triangular representation ... Together these three dimensions, one with circular structure and two with linear, make up the color space. This space is often illustrated by the so called color spindle

This kind of a representation is different from the physical wavelength representation of color, where e.g. the hue is mostly related to the wavelength of the color. The wavelength representation of hue would be linear, but due to the properties of the human visual system, the psychological representation of hue is circular.

Gärdenfors defines two quality dimensions to be integral if a value cannot be given for an object on one dimension without also giving it a value for the other dimension: for example, an object cannot be given a hue value without also giving it a brightness value. Dimensions that are not integral with each other are separable. A conceptual domain is a set of integral dimensions that are separable from all other dimensions: for example, the three color-dimensions form the domain of color.

From these definitions, Gärdenfors develops a theory of concepts where more complicated conceptual spaces can be formed by combining lower-level domains. Concepts, then, are particular regions in these conceptual spaces: for example, the concept of "blue" can be defined as a particular region in the domain of color. Notice that the notion of various combinations of basic perceptual domains making more complicated conceptual spaces possible fits well together with the models discussed in our previous definition. There more complicated concepts were made possible by combining basic neural representations for e.g. different sensory modalities.

The origin of the different quality dimensions could also emerge from the specific properties of the different simulators, as in PSS theory.

Thus definition #1 allows us to talk about what a concept might "look like from the outside", with definition #2 talking about what the same concept might "look like from the inside".

Interestingly, Gärdenfors hypothesizes that much of the work involved with learning new concepts has to do with learning new quality dimensions to fit into one's conceptual space, and that once this is done, all that remains is the comparatively much simpler task of just dividing up the new domain to match seen examples.

For example, consider the (phenomenal) dimension of volume. The experiments on "conservation" performed by Piaget and his followers indicate that small children have no separate representation of volume; they confuse the volume of a liquid with the height of the liquid in its container. It is only at about the age of five years that they learn to represent the two dimensions separately. Similarly, three- and four-year-olds confuse high with tall, big with bright, and so forth (Carey 1978).

The problem of alien concepts

With these definitions for concepts, we can now consider what problems would follow if we started off with a very human-like AI that had the same concepts as we did, but then expanded its conceptual space to allow for entirely new kinds of concepts. This could happen if it self-modified to have new kinds of sensory or thought modalities that it could associate its existing concepts with, thus developing new kinds of quality dimensions.

An analogy helps demonstrate this problem: suppose that you're operating in a two-dimensional space, where a rectangle has been drawn to mark a certain area as "forbidden" or "allowed". Say that you're an inhabitant of Flatland. But then you suddenly become aware that actually, the world is three-dimensional, and has a height dimension as well! That raises the question of, how should the "forbidden" or "allowed" area be understood in this new three-dimensional world? Do the walls of the rectangle extend infinitely in the height dimension, or perhaps just some certain distance in it? If just a certain distance, does the rectangle have a "roof" or "floor", or can you just enter (or leave) the rectangle from the top or the bottom? There doesn't seem to be any clear way to tell.

As a historical curiosity, this dilemma actually kind of really happened when airplanes were invented: could landowners forbid airplanes from flying over their land, or was the ownership of the land limited to some specific height, above which the landowners had no control? Courts and legislation eventually settled on the latter answer. A more AI-relevant example might be if one was trying to limit the AI with rules such as "stay within this box here", and the AI then gained an intuitive understanding of quantum mechanics, which might allow it to escape from the box without violating the rule in terms of its new concept space.

More generally, if previously your concepts had N dimensions and now they have N+1, you might find something that fulfilled all the previous criteria while still being different from what we'd prefer if we knew about the N+1th dimension.

In the next post, I will present some (very preliminary and probably wrong) ideas for solving this problem.

Next post in series: What are concepts for, and how to deal with alien concepts.

## Concept Safety: Producing similar AI-human concept spaces

28 14 April 2015 08:39PM

I'm currently reading through some relevant literature for preparing my FLI grant proposal on the topic of concept learning and AI safety. I figured that I might as well write down the research ideas I get while doing so, so as to get some feedback and clarify my thoughts. I will posting these in a series of "Concept Safety"-titled articles.

A frequently-raised worry about AI is that it may reason in ways which are very different from us, and understand the world in a very alien manner. For example, Armstrong, Sandberg & Bostrom (2012) consider the possibility of restricting an AI via "rule-based motivational control" and programming it to follow restrictions like "stay within this lead box here", but they raise worries about the difficulty of rigorously defining "this lead box here". To address this, they go on to consider the possibility of making an AI internalize human concepts via feedback, with the AI being told whether or not some behavior is good or bad and then constructing a corresponding world-model based on that. The authors are however worried that this may fail, because

Humans seem quite adept at constructing the correct generalisations – most of us have correctly deduced what we should/should not be doing in general situations (whether or not we follow those rules). But humans share a common of genetic design, which the OAI would likely not have. Sharing, for instance, derives partially from genetic predisposition to reciprocal altruism: the OAI may not integrate the same concept as a human child would. Though reinforcement learning has a good track record, it is neither a panacea nor a guarantee that the OAIs generalisations agree with ours.

Addressing this, a possibility that I raised in Sotala (2015) was that possibly the concept-learning mechanisms in the human brain are actually relatively simple, and that we could replicate the human concept learning process by replicating those rules. I'll start this post by discussing a closely related hypothesis: that given a specific learning or reasoning task and a certain kind of data, there is an optimal way to organize the data that will naturally emerge. If this were the case, then AI and human reasoning might naturally tend to learn the same kinds of concepts, even if they were using very different mechanisms. Later on the post, I will discuss how one might try to verify that similar representations had in fact been learned, and how to set up a system to make them even more similar.

Word embedding

A particularly fascinating branch of recent research relates to the learning of word embeddings, which are mappings of words to very high-dimensional vectors. It turns out that if you train a system on one of several kinds of tasks, such as being able to classify sentences as valid or invalid, this builds up a space of word vectors that reflects the relationships between the words. For example, there seems to be a male/female dimension to words, so that there's a "female vector" that we can add to the word "man" to get "woman" - or, equivalently, which we can subtract from "woman" to get "man". And it so happens (Mikolov, Yih & Zweig 2013) that we can also get from the word "king" to the word "queen" by adding the same vector to "king". In general, we can (roughly) get to the male/female version of any word vector by adding or subtracting this one difference vector!

Why would this happen? Well, a learner that needs to classify sentences as valid or invalid needs to classify the sentence "the king sat on his throne" as valid while classifying the sentence "the king sat on her throne" as invalid. So including a gender dimension on the built-up representation makes sense.

But gender isn't the only kind of relationship that gets reflected in the geometry of the word space. Here are a few more:

It turns out (Mikolov et al. 2013) that with the right kind of training mechanism, a lot of relationships that we're intuitively aware of become automatically learned and represented in the concept geometry. And like Olah (2014) comments:

It’s important to appreciate that all of these properties of W are side effects. We didn’t try to have similar words be close together. We didn’t try to have analogies encoded with difference vectors. All we tried to do was perform a simple task, like predicting whether a sentence was valid. These properties more or less popped out of the optimization process.

This seems to be a great strength of neural networks: they learn better ways to represent data, automatically. Representing data well, in turn, seems to be essential to success at many machine learning problems. Word embeddings are just a particularly striking example of learning a representation.

It gets even more interesting, for we can use these for translation. Since Olah has already written an excellent exposition of this, I'll just quote him:

We can learn to embed words from two different languages in a single, shared space. In this case, we learn to embed English and Mandarin Chinese words in the same space.

We train two word embeddings, Wen and Wzh in a manner similar to how we did above. However, we know that certain English words and Chinese words have similar meanings. So, we optimize for an additional property: words that we know are close translations should be close together.

Of course, we observe that the words we knew had similar meanings end up close together. Since we optimized for that, it’s not surprising. More interesting is that words we didn’t know were translations end up close together.

In light of our previous experiences with word embeddings, this may not seem too surprising. Word embeddings pull similar words together, so if an English and Chinese word we know to mean similar things are near each other, their synonyms will also end up near each other. We also know that things like gender differences tend to end up being represented with a constant difference vector. It seems like forcing enough points to line up should force these difference vectors to be the same in both the English and Chinese embeddings. A result of this would be that if we know that two male versions of words translate to each other, we should also get the female words to translate to each other.

Intuitively, it feels a bit like the two languages have a similar ‘shape’ and that by forcing them to line up at different points, they overlap and other points get pulled into the right positions.

After this, it gets even more interesting. Suppose you had this space of word vectors, and then you also had a system which translated images into vectors in the same space. If you have images of dogs, you put them near the word vector for dog. If you have images of Clippy you put them near word vector for "paperclip". And so on.

You do that, and then you take some class of images the image-classifier was never trained on, like images of cats. You ask it to place the cat-image somewhere in the vector space. Where does it end up?

You guessed it: in the rough region of the "cat" words. Olah once more:

This was done by members of the Stanford group with only 8 known classes (and 2 unknown classes). The results are already quite impressive. But with so few known classes, there are very few points to interpolate the relationship between images and semantic space off of.

The Google group did a much larger version – instead of 8 categories, they used 1,000 – around the same time (Frome et al. (2013)) and has followed up with a new variation (Norouzi et al. (2014)). Both are based on a very powerful image classification model (from Krizehvsky et al. (2012)), but embed images into the word embedding space in different ways.

The results are impressive. While they may not get images of unknown classes to the precise vector representing that class, they are able to get to the right neighborhood. So, if you ask it to classify images of unknown classes and the classes are fairly different, it can distinguish between the different classes.

Even though I’ve never seen a Aesculapian snake or an Armadillo before, if you show me a picture of one and a picture of the other, I can tell you which is which because I have a general idea of what sort of animal is associated with each word. These networks can accomplish the same thing.

These algorithms made no attempt of being biologically realistic in any way. They didn't try classifying data the way the brain does it: they just tried classifying data using whatever worked. And it turned out that this was enough to start constructing a multimodal representation space where a lot of the relationships between entities were similar to the way humans understand the world.

How useful is this?

"Well, that's cool", you might now say. "But those word spaces were constructed from human linguistic data, for the purpose of predicting human sentences. Of course they're going to classify the world in the same way as humans do: they're basically learning the human representation of the world. That doesn't mean that an autonomously learning AI, with its own learning faculties and systems, is necessarily going to learn a similar internal representation, or to have similar concepts."

This is a fair criticism. But it is mildly suggestive of the possibility that an AI that was trained to understand the world via feedback from human operators would end up building a similar conceptual space. At least assuming that we chose the right learning algorithms.

When we train a language model to classify sentences by labeling some of them as valid and others as invalid, there's a hidden structure implicit in our answers: the structure of how we understand the world, and of how we think of the meaning of words. The language model extracts that hidden structure and begins to classify previously unseen things in terms of those implicit reasoning patterns. Similarly, if we gave an AI feedback about what kinds of actions counted as "leaving the box" and which ones didn't, there would be a certain way of viewing and conceptualizing the world implied by that feedback, one which the AI could learn.

Comparing representations

"Hmm, maaaaaaaaybe", is your skeptical answer. "But how would you ever know? Like, you can test the AI in your training situation, but how do you know that it's actually acquired a similar-enough representation and not something wildly off? And it's one thing to look at those vector spaces and claim that there are human-like relationships among the different items, but that's still a little hand-wavy. We don't actually know that the human brain does anything remotely similar to represent concepts."

Here we turn, for a moment, to neuroscience.

Multivariate Cross-Classification (MVCC) is a clever neuroscience methodology used for figuring out whether different neural representations of the same thing have something in common. For example, we may be interested in whether the visual and tactile representation of a banana have something in common.

We can test this by having several test subjects look at pictures of objects such as apples and bananas while sitting in a brain scanner. We then feed the scans of their brains into a machine learning classifier and teach it to distinguish between the neural activity of looking at an apple, versus the neural activity of looking at a banana. Next we have our test subjects (still sitting in the brain scanners) touch some bananas and apples, and ask our machine learning classifier to guess whether the resulting neural activity is the result of touching a banana or an apple. If the classifier - which has not been trained on the "touch" representations, only on the "sight" representations - manages to achieve a better-than-chance performance on this latter task, then we can conclude that the neural representation for e.g. "the sight of a banana" has something in common with the neural representation for "the touch of a banana".

A particularly fascinating experiment of this type is that of Shinkareva et al. (2011), who showed their test subjects both the written words for different tools and dwellings, and, separately, line-drawing images of the same tools and dwellings. A machine-learning classifier was both trained on image-evoked activity and made to predict word-evoked activity and vice versa, and achieved a high accuracy on category classification for both tasks. Even more interestingly, the representations seemed to be similar between subjects. Training the classifier on the word representations of all but one participant, and then having it classify the image representation of the left-out participant, also achieved a reliable (p<0.05) category classification for 8 out of 12 participants. This suggests a relatively similar concept space between humans of a similar background.

We can now hypothesize some ways of testing the similarity of the AI's concept space with that of humans. Possibly the most interesting one might be to develop a translation between a human's and an AI's internal representations of concepts. Take a human's neural activation when they're thinking of some concept, and then take the AI's internal activation when it is thinking of the same concept, and plot them in a shared space similar to the English-Mandarin translation. To what extent do the two concept geometries have similar shapes, allowing one to take a human's neural activation of the word "cat" to find the AI's internal representation of the word "cat"? To the extent that this is possible, one could probably establish that the two share highly similar concept systems.

One could also try to more explicitly optimize for such a similarity. For instance, one could train the AI to make predictions of different concepts, with the additional constraint that its internal representation must be such that a machine-learning classifier trained on a human's neural representations will correctly identify concept-clusters within the AI. This might force internal similarities on the representation beyond the ones that would already be formed from similarities in the data.

Next post in series: The problem of alien concepts.

## [link] Thoughts on defining human preferences

6 31 March 2015 10:08AM

Abstract: Discussion of how we might want to define human preferences, particularly in the context of building an AI intended to learn and implement those preferences. Starts with actual arguments about the applicability of the VNM utility theorem, then towards the end gets into hypotheses that are less well defended but possibly more important. At the very end, suggests that current hypothesizing about AI safety might be overemphasizing “discovering our preferences” over “creating our preferences”.

## Status - is it what we think it is?

20 30 March 2015 09:37PM

I was re-reading the chapter on status in Impro (excerpt), and I noticed that Johnstone seemed to be implying that different people are comfortable at different levels of status: some prefer being high status and others prefer being low status. I found this peculiar, because the prevailing notion in the rationalistsphere seems to be that everyone's constantly engaged in status games aiming to achieve higher status. I've even seen arguments to the effect that a true post-scarcity society is impossible, because status is zero-sum and there will always be people at the bottom of the status hierarchy.

But if some people preferred to have low status, this whole dilemma might be avoided, if a mix of statuses could be find that left everyone happy.

First question - is Johnstone's "status" talking about the same thing as our "status"? He famously claimed that "status is something you do, not something that you are", and that

I should really talk about dominance and submission, but I'd create a resistance. Students who will agree readily to raising or lowering their status may object if asked to 'dominate' or 'submit'.

Viewed via this lens, it makes sense that some people would prefer being in a low status role: if you try to take control of the group, you become subject to various status challenges, and may be held responsible for the decisions you make. It's often easier to remain low status and let others make the decisions.

But there's still something odd about saying that one would "prefer to be low status", at least in the sense in which we usually use the term. Intuitively, a person may be happy being low status in the sense of not being dominant, but most people are still likely to desire something that feels kind of like status in order to be happy. Something like respect, and the feeling that others like them. And a lot of the classical "status-seeking behaviors" seem to be about securing the respect of others. In that sense, there seems to be something intuitive true in the "everyone is engaged in status games and wants to be higher-status" claim.

So I think that there are two different things that we call "status" which are related, but worth distinguishing.

1) General respect and liking. This is "something you have", and is not inherently zero-sum. You can achieve it by doing things that are zero-sum, like being the best fan fiction writer in the country, but you can also do it by things like being considered generally friendly and pleasant to be around. One of the lessons that I picked up from The Charisma Myth was that you can be likable by just being interested in the other person and displaying body language that signals your interest in the other person.

Basically, this is "do other people get warm fuzzies from being around you / hearing about you / consuming your work", and is not zero-sum because e.g. two people who both have great social skills and show interest in you can both produce the same amount of warm fuzzies, independent of each other's existence.

But again, specific sources of this can be zero-sum: if you respect someone a lot for their art, but then run across into even better art and realize that the person you previously admired is pretty poor in comparison, that can reduce the respect you feel for them. It's just that there are also other sources of liking which aren't necessarily zero-sum.

2) Dominance and control of the group. It's inherently zero-sum because at most one person can have absolute say on the decisions of the group. This is "something you do": having the respect and liking of the people in the group (see above) makes it easier for you to assert dominance and makes the others more willing to let you do so, but you can also voluntarily abstain from using that power and leave the decisions to others. (Interestingly, in some cases this can even increase the extent to which you are liked, which translates to a further boost in the ability to control the group, if you so desired.)

---

Morendil and I previously suggested a definition of status as "the general purpose ability to influence a group", but I think that definition was somewhat off in conflating the two senses above.

I've always had the vague feeling that the "everyone can't always be happy because status is zero-sum" claim felt off in some sense that I was unable to properly articulate, but this seems to resolve the issue. If this model were true, it would also make me happy, because it would imply that we can avoid zero-sum status fights while still making everybody content.

## Bayes Academy Development Report 2 - improved data visualization

7 18 December 2014 10:11PM

See here for the previous update if you missed / forgot it.

In this update, no new game content, but new graphics.

I wasn’t terribly happy about the graphical representation of the various nodes in the last update. Especially in the first two networks, if you didn’t read the descriptions of the nodes carefully, it was very easy to just click your way through them without really having a clue of what the network was actually doing. Needless to say, for a game that’s supposed to teach how the networks function, this is highly non-optimal.

Here’s the representation that I’m now experimenting with: the truth table of the nodes is represented graphically inside the node. The prior variable at the top doesn’t really have a truth table, it’s just true or false. The “is” variable at the bottom is true if its parent is true, and false if its parent is false.

You may remember that in the previous update, unobservable nodes were represented in grayscale. I ended up dropping that, because that would have been confusing in this representation: if the parent is unobservable, should the blobs representing its truth values in the child node be in grayscale as well? Both “yes” and “no” answers felt confusing.

Instead the observational state of a node is now represented by its border color. Black for unobservable, gray for observable, no border for observed. The metaphor is supposed to be something like, a border is a veil of ignorance blocking us from seeing the node directly, but if the veil is gray it’s weak enough to be broken, whereas a black veil is strong enough to resist a direct assault. Or something.

When you observe a node, not only does its border disappear, but the truth table entries that get reduced to a zero probability disappear, to be replaced by white boxes. I experimented with having the eliminated entries still show up in grayscale, so you could e.g. see that the “is” node used to contain the entry for (false -> false), but felt that this looked clearer.

The “or” node at the bottom is getting a little crowded, but hopefully not too crowded. Since we know that its value is “true”, the truth table entry showing (false, false -> false) shows up in all whites. It’s also already been observed, so it starts without a border.

After we observe that there’s no monster behind us, the “or” node loses its entries for (monster, !waiting -> looks) and (monster, waiting -> looks), leaving only (!monster, waiting -> looks): meaning that the boy must be waiting for us to answer.

This could still be made clearer: currently the network updates instantly. I’m thinking about adding a brief animation where the “monster” variable would first be revealed as false, which would then propagate an update to the values of “looks at you” (with e.g. the red tile in “monster” blinking at the same time as the now-invalid truth table entries, and when the tiles stopped blinking, those now-invalid entries would have disappeared), and that would in turn propagate the update to the “waiting” node, deleting the red color from it. But I haven’t yet implemented this.

The third network is where things get a little tricky. The “attacking” node is of type “majority vote” - i.e. it’s true if at least two of its parents are true, and false otherwise. That would make for a truth table with eight entries, each holding four blobs each, and we could already see the “or” node in the previous screen being crowded. I’m not quite sure of what to do here. At this moment I’m thinking of just leaving the node as is, and displaying more detailed information in the sidebar.

Here’s another possible problem. Just having the truth table entries works fine to make it obvious where the overall probability of the node comes from… for as long as the valid values of the entries are restricted to “possible” and “impossible”. Then you can see at a glance that, say, of the three possible entries, two would make this node true and one would make this false, so there’s a ⅔ chance of it being true.

But in this screen, that has ceased to be the case. The “attacking” node has a 75% chance of being true, meaning that, for instance, the “is / block” node’s “true -> true” entry also has a 75% chance of being the right one. This isn’t reflected in the truth table visualization. I thought of adding small probability bars under each truth table entry, or having the size of the truth table blobs reflect their probability, but then I’d have to make the nodes even bigger, and it feels like it would easily start looking cluttered again. But maybe it’d be the right choice anyway? Or maybe just put the more detailed information in the sidebar? I’m not sure of the best thing to do here.

If anyone has good suggestions, I would be grateful to get advice from people who have more of a visual designer gene than I do!

12 02 December 2014 03:44AM

The philosopher John Danaher has posted a list of all the posts that he's written on the topic of robotics and AI. Below is the current version of the list: he says that he will keep updating the page as he writes more.

• The Singularity: Overview and Framework: This was my first attempt to provide a general overview and framework for understanding the debate about the technological singularity. I suggested that the debate could be organised around three main theses: (i) the explosion thesis -- which claims that there will be an intelligence explosion; (ii) the unfriendliness thesis -- which claims that an advanced artificial intelligence is likely to be "unfriendly"; and (iii) the inevitability thesis -- which claims that the creation of an unfriendly AI will be difficult to avoid, if not inevitable.
• The Singularity: Overview and Framework Redux: This was my second attempt to provide a general overview and framework for understanding the debate about the technological singularity. I tried to reduce the framework down to two main theses: (i) the explosion thesis and (ii) the unfriendliness thesis.
• AIs and the Decisive Advantage Thesis: Many people claim that an advanced artificial intelligence would have decisive advantages over human intelligences. Is this right? In this post, I look at Kaj Sotala's argument to that effect.
• Is there a case for robot slaves? - If robots can be persons -- in the morally thick sense of "person" -- then surely it would be wrong to make them cater to our every whim? Or would it? Steve Petersen argues that the creation of robot slaves might be morally permissible. In this post, I look at what he has to say.
• The Ethics of Robot Sex: A reasonably self-explanatory title. This post looks at the ethical issues that might arise from the creation of sex robots.
• Will sex workers be replaced by robots? A Precis: A short summary of a longer article examining the possibility of sex workers being replaced by robots. Contrary to the work of others, I suggest that sex work might be resilient to the phenomenon of technological unemployment.
• Bostrom on Superintelligence (2) The Instrumental Convergence Thesis: The second part in my series on Bostrom's book. This one examines the instrumental convergence thesis, according to which an intelligent agent, no matter what its final goals may be, is likely to converge upon certain instrumental goals that are unfriendly to human beings.

## My experience of the recent CFAR workshop

29 27 November 2014 04:17PM

Originally posted at my blog.

---

I just got home from a four-day rationality workshop in England that was organized by the Center For Applied Rationality (CFAR). It covered a lot of content, but if I had to choose a single theme that united most of it, it was listening to your emotions.

That might sound like a weird focus for a rationality workshop, but cognitive science has shown that the intuitive and emotional part of the mind (”System 1”) is both in charge of most of our behavior, and also carries out a great deal of valuable information-processing of its own (it’s great at pattern-matching, for example). Much of the workshop material was aimed at helping people reach a greater harmony between their System 1 and their verbal, logical System 2. Many of people’s motivational troubles come from the goals of their two systems being somehow at odds with each other, and we were taught to have our two systems have a better dialogue with each other, harmonizing their desires and making it easier for information to cross from one system to the other and back.

To give a more concrete example, there was the technique of goal factoring. You take a behavior that you often do but aren’t sure why, or which you feel might be wasted time. Suppose that you spend a lot of time answering e-mails that aren’t actually very important. You start by asking yourself: what’s good about this activity, that makes me do it? Then you try to listen to your feelings in response to that question, and write down what you perceive. Maybe you conclude that it makes you feel productive, and it gives you a break from tasks that require more energy to do.

Next you look at the things that you came up with, and consider whether there’s a better way to accomplish them. There are two possible outcomes here. Either you conclude that the behavior is an important and valuable one after all, meaning that you can now be more motivated to do it. Alternatively, you find that there would be better ways of accomplishing all the goals that the behavior was aiming for. Maybe taking a walk would make for a better break, and answering more urgent e-mails would provide more value. If you were previously using two hours per day on the unimportant e-mails, possibly you could now achieve more in terms of both relaxation and actual productivity by spending an hour on a walk and an hour on the important e-mails.

At this point, you consider your new plan, and again ask yourself: does this feel right? Is this motivating? Are there any slight pangs of regret about giving up my old behavior? If you still don’t want to shift your behavior, chances are that you still have some motive for doing this thing that you have missed, and the feelings of productivity and relaxation aren’t quite enough to cover it. In that case, go back to the step of listing motives.

Or, if you feel happy and content about the new direction that you’ve chosen, victory!

Notice how this technique is all about moving information from one system to another. System 2 notices that you’re doing something but it isn’t sure why that is, so it asks System 1 for the reasons. System 1 answers, ”here’s what I’m trying to do for us, what do you think?” Then System 2 does what it’s best at, taking an analytic approach and possibly coming up with better ways of achieving the different motives. Then it gives that alternative approach back to System 1 and asks, would this work? Would this give us everything that we want? If System 1 says no, System 2 gets back to work, and the dialogue continues until both are happy.

Again, I emphasize the collaborative aspect between the two systems. They’re allies working for common goals, not enemies. Too many people tend towards one of two extremes: either thinking that their emotions are stupid and something to suppress, or completely disdaining the use of logical analysis. Both extremes miss out on the strengths of the system that is neglected, and make it unlikely for the person to get everything that they want.

As I was heading back from the workshop, I considered doing something that I noticed feeling uncomfortable about. Previous meditation experience had already made me more likely to just attend to the discomfort rather than trying to push it away, but inspired by the workshop, I went a bit further. I took the discomfort, considered what my System 1 might be trying to warn me about, and concluded that it might be better to err on the side of caution this time around. Finally – and this wasn’t a thing from the workshop, it was something I invited on the spot – I summoned a feeling of gratitude and thanked my System 1 for having been alert and giving me the information. That might have been a little overblown, since neither system should actually be sentient by itself, but it still felt like a good mindset to cultivate.

Although it was never mentioned in the workshop, what comes to mind is the concept of wu-wei from Chinese philosophy, a state of ”effortless doing” where all of your desires are perfectly aligned and everything comes naturally. In the ideal form, you never need to force yourself to do something you don’t want to do, or to expend willpower on an unpleasant task. Either you want to do something and do, or don’t want to do it, and don’t.

A large number of the workshop’s classes – goal factoring, aversion factoring and calibration, urge propagation, comfort zone expansion, inner simulation, making hard decisions, Hamming questions, againstness – were aimed at more or less this. Find out what System 1 wants, find out what System 2 wants, dialogue, aim for a harmonious state between the two. Then there were a smaller number of other classes that might be summarized as being about problem-solving in general.

The classes about the different techniques were interspersed with ”debugging sessions” of various kinds. In the beginning of the workshop, we listed different bugs in our lives – anything about our lives that we weren’t happy with, with the suggested example bugs being things like ”every time I talk to so-and-so I end up in an argument”, ”I think that I ‘should’ do something but don’t really want to”, and ”I’m working on my dissertation and everything is going fine – but when people ask me why I’m doing a PhD, I have a hard time remembering why I wanted to”. After we’d had a class or a few, we’d apply the techniques we’d learned to solving those bugs, either individually, in pairs, or small groups with a staff member or volunteer TA assisting us. Then a few more classes on techniques and more debugging, classes and debugging, and so on.

The debugging sessions were interesting. Often when you ask someone for help on something, they will answer with direct object-level suggestions – if your problem is that you’re underweight and you would like to gain some weight, try this or that. Here, the staff and TAs would eventually get to the object-level advice as well, but first they would ask – why don’t you want to be underweight? Okay, you say that you’re not completely sure but based on the other things that you said, here’s a stupid and quite certainly wrong theory of what your underlying reasons for it might be, how does that theory feel like? Okay, you said that it’s mostly on the right track, so now tell me what’s wrong with it? If you feel that gaining weight would make you more attractive, do you feel that this is the most effective way of achieving that?

Only after you and the facilitator had reached some kind of consensus of why you thought that something was a bug, and made sure that the problem you were discussing was actually the best way to address to reasons, would it be time for the more direct advice.

At first, I had felt that I didn’t have very many bugs to address, and that I had mostly gotten reasonable advice for them that I might try. But then the workshop continued, and there were more debugging sessions, and I had to keep coming up with bugs. And then, under the gentle poking of others, I started finding the underlying, deep-seated problems, and some things that had been motivating my actions for the last several months without me always fully realizing it. At the end, when I looked at my initial list of bugs that I’d come up with in the beginning, most of the first items on the list looked hopelessly shallow compared to the later ones.

Often in life you feel that your problems are silly, and that you are affected by small stupid things that ”shouldn’t” be a problem. There was none of that at the workshop: it was tacitly acknowledged that being unreasonably hindered by ”stupid” problems is just something that brains tend to do.  Valentine, one of the staff members, gave a powerful speech about ”alienated birthrights” – things that all human beings should be capable of engaging in and enjoying, but which have been taken from people because they have internalized beliefs and identities that say things like ”I cannot do that” or ”I am bad at that”. Things like singing, dancing, athletics, mathematics, romantic relationships, actually understanding the world, heroism, tackling challenging problems. To use his analogy, we might not be good at these things at first, and may have to grow into them and master them the way that a toddler grows to master her body. And like a toddler who’s taking her early steps, we may flail around and look silly when we first start doing them, but these are capacities that – barring any actual disabilities – are a part of our birthright as human beings, which anyone can ultimately learn to master.

Then there were the people, and the general atmosphere of the workshop. People were intelligent, open, and motivated to work on their problems, help each other, and grow as human beings. After a long, cognitively and emotionally exhausting day at the workshop, people would then shift to entertainment ranging from wrestling to telling funny stories of their lives to Magic: the Gathering. (The game of ”bunny” was an actual scheduled event on the official agenda.) And just plain talk with each other, in a supportive, non-judgemental atmosphere. It was the people and the atmosphere that made me the most reluctant to leave, and I miss them already.

Would I recommend CFAR’s workshops to others? Although my above description may sound rather gushingly positive, my answer still needs to be a qualified ”mmmaybe”. The full price tag is quite hefty, though financial aid is available and I personally got a very substantial scholarship, with the agreement that I would pay it at a later time when I could actually afford it.

Still, the biggest question is, will the changes from the workshop stick? I feel like I have gained a valuable new perspective on emotions, a number of useful techniques, made new friends, strengthened my belief that I can do the things that I really set my mind on, and refined the ways by which I think of the world and any problems that I might have – but aside for the new friends, all of that will be worthless if it fades away in a week. If it does, I would have to judge even my steeply discounted price as ”not worth it”. That said, the workshops do have a money-back guarantee if you’re unhappy with the results, so if it really feels like it wasn’t worth it, I can simply choose to not pay. And if all the new things do end up sticking, it might still turn out that it would have been worth paying even the full, non-discounted price.

CFAR does have a few ways by which they try to make the things stick. There will be Skype follow-ups with their staff, for talking about how things have been going since the workshop. There is a mailing list for workshop alumni, and the occasional events, though the physical events are very US-centric (and in particular, San Francisco Bay Area-centric).

The techniques that we were taught are still all more or less experimental, and are being constantly refined and revised according to people’s experiences. I have already been thinking of a new skill that I had been playing with for a while before the workshop, and which has a bit of that ”CFAR feel” – I will aim to have it written up soon and sent to the others, and maybe it will eventually make its way to the curriculum of a future workshop. That should help keep me engaged as well.

We shall see. Until then, as they say in CFAR – to victory!

## Bayes Academy: Development report 1

47 19 November 2014 10:35PM

Some of you may remember me proposing a game idea that went by the name of The Fundamental Question. Some of you may also remember me talking a lot about developing an educational game about Bayesian Networks for my MSc thesis, but not actually showing you much in the way of results.

Insert the usual excuses here. But thanks to SSRIs and mytomatoes.com and all kinds of other stuff, I'm now finally on track towards actually accomplishing something. Here's a report on a very early prototype.

This game has basically two goals: to teach its players something about Bayesian networks and probabilistic reasoning, and to be fun. (And third, to let me graduate by giving me material for my Master's thesis.)

We start with the main character stating that she is nervous. Hitting any key, the player proceeds through a number of lines of internal monologue:

I am nervous.

I’m standing at the gates of the Academy, the school where my brother Opin was studying when he disappeared. When we asked the school to investigate, they were oddly reluctant, and told us to drop the issue.

The police were more helpful at first, until they got in contact with the school. Then they actually started threatening us, and told us that we would get thrown in prison if we didn’t forget about Opin.

That was three years ago. Ever since it happened, I’ve been studying hard to make sure that I could join the Academy once I was old enough, to find out what exactly happened to Opin. The answer lies somewhere inside the Academy gates, I’m sure of it.

Now I’m finally 16, and facing the Academy entrance exams. I have to do everything I can to pass them, and I have to keep my relation to Opin a secret, too.

???: “Hey there.”

Eep! Someone is talking to me! Is he another applicant, or a staff member? Wait, let me think… I’m guessing that applicant would look a lot younger than staff members! So, to find that out… I should look at him!

[You are trying to figure out whether the voice you heard is a staff member or another applicant. While you can't directly observe his staff-nature, you believe that he'll look young if he's an applicant, and like an adult if he's a staff member. You can look at him, and therefore reveal his staff-nature, by right-clicking on the node representing his apperance.]

Here is our very first Bayesian Network! Well, it's not really much of a network: I'm starting with the simplest possible case in order to provide an easy start for the player. We have one node that cannot be observed ("Student", its hidden nature represented by showing it in greyscale), and an observable node ("Young-looking") whose truth value is equal to that of the Student node. All nodes are binary random variables, either true or false.

According to our current model of the world, "Student" has a 50% chance of being true, so it's half-colored in white (representing the probability of it being true) and half-colored in black (representing the probability of it being false). "Young-looking" inherits its probability directly. The player can get a bit of information about the two nodes by left-clicking on them.

The game also offers alternate color schemes for colorblind people who may have difficulties distinguishing red and green.

Now we want to examine the person who spoke to us. Let's look at him, by right-clicking on the "Young-looking" node.

Not too many options here, because we're just getting started. Let's click on "Look at him", and find out that he is indeed young, and thus a student.

This was the simplest type of minigame offered within the game. You are given a set of hidden nodes whose values you're tasked with discovering by choosing which observable nodes to observe. Here the player had no way to fail, but later on, the minigames will involve a time limit and too many observable nodes to inspect within that time limit. It then becomes crucial to understand how probability flows within a Bayesian network, and which nodes will actually let you know the values of the hidden nodes.

The story continues!

Short for an adult, face has boyish look, teenagerish clothes... yeah, he looks young!

He's a student!

...I feel like I’m overthinking things now.

...he’s looking at me.

I’m guessing he’s either waiting for me to respond, or there’s something to see behind me, and he’s actually looking past me. If there isn’t anything behind me, then I know that he must be waiting for me to respond.

Maybe there's a monster behind me, and he's paralyzed with fear! I should check that possibility before it eats me!

[You want to find out whether the boy is waiting for your reply or staring at a monster behind you. You know that he's looking at you, and your model of the world suggests that he will only look in your direction if he's waiting for you to reply, or if there's a monster behind you. So if there's no monster behind you, you know that he's waiting for you to reply!]

Slightly more complicated network, but still, there's only one option here. Oops, apparently the "Looks at you" node says it's an observable variable that you can right-click to observe, despite the fact that it's already been observed. I need to fix that.

Anyway, right-clicking on "Attacking monster" brings up a "Look behind you" option, which we'll choose.

You see nothing there. Besides trees, that is.

Boy: “Um, are you okay?”

“Yeah, sorry. I just… you were looking in my direction, and I wasn’t sure of whether you were expecting me to reply, or whether there was a monster behind me.”

Boy: “You thought that there was a reasonable chance for a monster to be behind you?”

I’m embarrassed to admit it, but I’m not really sure of what the probability of a monster having snuck up behind me really should have been.

My studies have entirely focused on getting into this school, and Monsterology isn’t one of the subjects on the entrance exam!

I just went with a 50-50 chance since I didn’t know any better.

'Okay, look. Monsterology is my favorite subject. Monsters avoid the Academy, since it’s surrounded by a mystical protective field. There’s no chance of them getting even near! 0 percent chance.'

'Oh. Okay.'

[Your model of the world has been updated! The prior of the variable 'Monster Near The Academy' is now 0%.]

Then stuff happens and they go stand in line for the entrance exam or something. I haven't written this part. Anyway, then things get more exciting, for a wild monster appears!

Stuff happens

AAAAAAH! A MONSTER BEHIND ME!

Huh, the monster is carrying a sword.

Well, I may not have studied Monsterology, but I sure did study fencing!

[You draw your sword. Seeing this, the monster rushes at you.]

He looks like he's going to strike. But is it really a strike, or is it a feint?

If it's a strike, I want to block and counter-attack. But if it's a feint, that leaves him vulnerable to my attack.

I have to choose wisely. If I make the wrong choice, I may be dead.

What did my master say? If the opponent has at least two of dancing legs, an accelerating midbody, and ferocious eyes, then it's an attack!

Otherwise it's a feint! Quick, I need to read his body language before it's too late!

Now get to the second type of minigame! Here, you again need to discover the values of some number of hidden variables within a time limit, but here it is in order to find out the consequences of your decision. In this one, the consequence is simple - either you live or you die. I'll let the screenshot and tutorial text speak for themselves:

[Now for some actual decision-making! The node in the middle represents the monster's intention to attack (or to feint, if it's false). Again, you cannot directly observe his intention, but on the top row, there are things about his body language that signal his intention. If at least two of them are true, then he intends to attack.]

[Your possible actions are on the bottom row. If he intends to attack, then you want to block, and if he intends to feint, you want to attack. You need to inspect his body language and then choose an action based on his intentions. But hurry up! Your third decision must be an action, or he'll slice you in two!]

In reality, the top three variables are not really independent of each other. We want to make sure that the player can always win this battle despite only having three actions. That's two actions for inspecting variables, and one action for actually making a decision. So this battle is rigged: either the top three variables are all true, or they're all false.

...actually, now that I think of it, the order of the variables is wrong. Logically, the body language should be caused by the intention to attack, and not vice versa, so the arrows should point from the intention to body language. I'll need to change that. I got these mixed up because the prototypical exemplar of a decision minigame is one where you need to predict someone's reaction from their personality traits, and there the personality traits do cause the reaction. Anyway, I want to get this post written before I go to bed, so I won't change that now.

Right-clicking "Dancing legs", we now see two options besides "Never mind"!

We can find out the dancingness of the enemy's legs by thinking about our own legs - we are well-trained, so our legs are instinctively mirroring our opponent's actions to prevent them from getting an advantage over us - or by just instinctively feeling where they are, without the need to think about them! Feeling them would allow us to observe this node without spending an action.

Unfortunately, feeling them has "Fencing 2" as a prerequisite skill, and we don't have that. Neither could we have them, in this point of the game. The option is just there to let the player know that there are skills to be gained in this game, and make them look forward to the moment when they can actually gain that skill. As well as giving them an idea of how the skill can be used.

Anyway, we take a moment to think of our legs, and even though our opponent gets closer to us in that time, we realize that our legs our dancing! So his legs must be dancing as well!

With our insider knowledge, we now know that he's attacking, and we could pick "Block" right away. But let's play this through. The network has automatically recalculated the probabilities to reflect our increased knowledge, and is now predicting a 75% chance for our enemy to be attacking, and for "Blocking" to thus be the right decision to make.

Next we decide to find out what his eyes say, by matching our gaze with his. Again, there would be a special option that cost us no time - this time around, one enabled by Empathy 1 - but we again don't have that option.

Except that his gaze is so ferocious that we are forced to look away! While we are momentarily distracted, he closes the distance, ready to make his move. But now we know what to do... block!

Success!

Now the only thing that remains to do is to ask our new-found friend for an explanation.

"You told me there was a 0% chance of a monster near the academy!"

Boy: “Ehh… yeah. I guess I misremembered. I only read like half of our course book anyway, it was really boring.”

“Didn’t you say that Monsterology was your favorite subject?”

Boy: “Hey, that only means that all the other subjects were even more boring!”

“. . .”

I guess I shouldn’t put too much faith on what he says.

[Your model of the world has been updated! The prior of the variable 'Monster Near The Academy' is now 50%.]

[Your model of the world has been updated! You have a new conditional probability variable: 'True Given That The Boy Says It's True', 25%]

And that's all for now. Now that the basic building blocks are in place, future progress ought to be much faster.

Notes:

As you might have noticed, my "graphics" suck. A few of my friends have promised to draw art, but besides that, the whole generic Java look could go. This is where I was originally planning to put in the sentence "and if you're a Java graphics whiz and want to help fix that, the current source code is conveniently available at GitHub", but then getting things to his point took longer than I expected and I didn't have the time to actually figure out how the whole Eclipse-GitHub integration works. I'll get to that soon. Github link here!

I also want to make the nodes more informative - right now they only show their marginal probability. Ideally, clicking on them would expand them to a representation where you could visually see what components their probability composed of. I've got some scribbled sketches of what this should look like for various node types, but none of that is implemented yet.

I expect some of you to also note that the actual Bayes theorem hasn't shown up yet, at least in no form resembling the classic mammography problem. (It is used implicitly in the network belief updates, though.) That's intentional - there will be a third minigame involving that form of the theorem, but somehow it felt more natural to start this way, to give the player a rough feeling of how probability flows through Bayesian networks. Admittedly I'm not sure of how well that's happening so far, but hopefully more minigames should help the player figure it out better.

What's next? Once the main character (who needs a name) manages to get into the Academy, there will be a lot of social scheming, and many mysteries to solve in order for her to find out just what did happen to her brother... also, I don't mind people suggesting things, such as what could happen next, and what kinds of network configurations the character might face in different minigames.

(Also, everything that you've seen might get thrown out and rewritten if I decide it's no good. Let me know what you think of the stuff so far!)

View more: Next