Less Wrong is a community blog devoted to refining the art of human rationality.

Concept Safety: What are concepts for, and how to deal with alien concepts

10 Kaj_Sotala 19 April 2015 01:44PM

I'm currently reading through some relevant literature for preparing my FLI grant proposal on the topic of concept learning and AI safety. I figured that I might as well write down the research ideas I get while doing so, so as to get some feedback and clarify my thoughts. I will be posting these in a series of "Concept Safety"-titled articles.

In The Problem of Alien Concepts, I posed the following question: if your concepts (defined as either multimodal representations or as areas in a psychological space) previously had N dimensions and then they suddenly have N+1, how does that affect (moral) values that were previously only defined in terms of N dimensions?

I gave some (more or less) concrete examples of this kind of a "conceptual expansion":

  1. Children learn to represent dimensions such as "height" and "volume", as well as "big" and "bright", separately at around age 5.
  2. As an inhabitant of the Earth, you've been used to people being unable to fly and landowners being able to forbid others from using their land. Then someone goes and invents an airplane, leaving open the question of the height to which the landowner's control extends. Similarly for satellites and nation-states.
  3. As an inhabitant of Flatland, you've been told that the inside of a certain rectangle is a forbidden territory. Then you learn that the world is actually three-dimensional, leaving open the question of the height to which the forbidden territory extends.
  4. An AI has previously been reasoning in terms of classical physics and been told that it can't leave a box, which it previously defined in terms of classical physics. Then it learns about quantum physics, which allows for definitions of "location" which are substantially different from the classical ones.

As a hint of the direction where I'll be going, let's first take a look at how humans solve these kinds of dilemmas, and consider examples #1 and #2.

The first example - children realizing that items have a volume that's separate from their height - rarely causes any particular crises. Few children have values that would be seriously undermined or otherwise affected by this discovery. We might say that it's a non-issue because none of the children's values have been defined in terms of the affected conceptual domain.

As for the second example, I don't know the exact cognitive process by which it was decided that you didn't need the landowner's permission to fly over their land. But I'm guessing that it involved reasoning like: if the plane flies at a sufficient height, then that doesn't harm the landowner in any way. Flying would become impossibly difficult if you had to get separate permission from every person whose land you were going to fly over. And, especially before the invention of radar, a ban on unauthorized flyovers would be next to impossible to enforce anyway.

We might say that after an option became available which forced us to include a new dimension in our existing concept of landownership, we solved the issue by considering it in terms of our existing values.

Concepts, values, and reinforcement learning

Before we go on, we need to talk a bit about why we have concepts and values in the first place.

From an evolutionary perspective, creatures that are better capable of harvesting resources (such as food and mates) and avoiding dangers (such as other creatures who think you're food or after their mates) tend to survive and have offspring at better rates than otherwise comparable creatures who are worse at those things. If a creature is to be flexible and capable of responding to novel situations, it can't just have a pre-programmed set of responses to different things. Instead, it needs to be able to learn how to harvest resources and avoid danger even when things are different from before.

How did evolution achieve that? Essentially, by creating a brain architecture that can, as a very very rough approximation, be seen as consisting of two different parts. One part, which a machine learning researcher might call the reward function, has the task of figuring out when various criteria - such as being hungry or getting food - are met, and issuing the rest of the system either a positive or negative reward based on those conditions. The other part, the learner, then "only" needs to find out how to best optimize for the maximum reward. (And then there is the third part, which includes any region of the brain that's neither of the above, but we don't care about those regions now.)

The mathematical theory of how to learn to optimize for rewards when your environment and reward function are unknown is reinforcement learning (RL), which recent neuroscience indicates is implemented by the brain. An RL agent learns a mapping from states of the world to rewards, as well as a mapping from actions to world-states, and then uses that information to maximize the amount of lifetime rewards it will get.

There are two major reasons why an RL agent, like a human, should learn high-level concepts:

  1. They make learning massively easier. Instead of having to separately learn that "in the world-state where I'm sitting naked in my cave and have berries in my hand, putting them in my mouth enables me to eat them" and that "in the world-state where I'm standing fully-clothed in the rain outside and have fish in my hand, putting it in my mouth enables me to eat it" and so on, the agent can learn to identify the world-states that correspond to the abstract concept of having food available, and then learn the appropriate action to take in all those states.
  2. There are useful behaviors that need to be bootstrapped from lower-level concepts to higher-level ones in order to be learned. For example, newborns have an innate preference for looking at roughly face-shaped things (Farroni et al. 2005), which develops into a more consistent preference for looking at faces over the first year of life (Frank, Vul & Johnson 2009). One hypothesis is that this bias towards paying attention to the relatively-easy-to-encode-in-genes concept of "face-like things" helps direct attention towards learning valuable but much more complicated concepts, such as ones involved in a basic theory of mind (Gopnik, Slaughter & Meltzoff 1994) and the social skills involved with it.
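The first point can be illustrated with a minimal sketch in Python. Everything in it is an assumption made for illustration: the state dictionaries, the hand-written `concept_of` function (a real agent would have to learn this abstraction rather than be handed it), and the averaging `learn` helper. It only shows why a learner keyed on the concept of "food available" can act sensibly in a world-state it has never seen, while a learner keyed on raw world-states cannot.

```python
from collections import defaultdict

def concept_of(state):
    """Hypothetical abstraction: collapse a raw world-state into one high-level feature."""
    return "food_available" if state["holding"] in {"berries", "fish"} else "no_food"

def learn(experiences, key):
    """Average the observed reward for each (key(state), action) pair."""
    totals, counts = defaultdict(float), defaultdict(int)
    for state, action, reward in experiences:
        totals[(key(state), action)] += reward
        counts[(key(state), action)] += 1
    return {pair: totals[pair] / counts[pair] for pair in totals}

def raw_key(state):
    """Treat every distinct combination of details as its own state."""
    return tuple(sorted(state.items()))

# Made-up experience: eating in the cave, with and without berries in hand.
experiences = [
    ({"place": "cave", "clothed": False, "holding": "berries"}, "eat", 1.0),
    ({"place": "cave", "clothed": False, "holding": "nothing"}, "eat", 0.0),
]
# A world-state the agent has never been in before.
new_state = {"place": "outside", "clothed": True, "holding": "fish"}

raw = learn(experiences, key=raw_key)
abstract = learn(experiences, key=concept_of)

print((raw_key(new_state), "eat") in raw)          # the raw learner knows nothing here
print(abstract[(concept_of(new_state), "eat")])    # the concept-level estimate carries over
```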

Viewed in this light, concepts are cognitive tools that are used for getting rewards. At the most primitive level, we should expect a creature to develop concepts that abstract over situations that are similar with regards to the kind of reward that one can gain from taking a certain action in those states. Suppose that a certain action in state s1 gives you a reward, and that there are also states s2 - s5 in which taking some specific action causes you to end up in s1. Then we should expect the creature to develop a common concept for being in the states s2 - s5, and we should expect that concept to be "more similar" to the concept of being in state s1 than to the concept of being in some state that was many actions away.

"More similar" how?

In reinforcement learning theory, reward and value are two different concepts. The reward of a state is the actual reward that the reward function gives you when you're in that state or perform some action in that state. Meanwhile, the value of the state is the maximum total reward that you can expect to get by moving from that state to others (times some discount factor). So a state A with reward 0 might have value 5 if you could move from it to state B, which had a reward of 5.
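The A and B example can be written out as a small sketch in Python. The function name `state_values`, the dictionaries, and the `gamma` discount parameter are all illustrative assumptions, and the sketch assumes a deterministic world with no cycles. It also follows the convention of the example above, where a state's value counts only the reward reachable in later states; many textbook formulations fold the state's own reward into its value instead.

```python
def state_values(rewards, transitions, gamma=1.0, sweeps=50):
    """Estimate V(s) = max over successors s' of gamma * (r(s') + V(s')).

    States without successors keep a value of 0.
    """
    values = {state: 0.0 for state in rewards}
    for _ in range(sweeps):
        for state, successors in transitions.items():
            if successors:
                values[state] = max(
                    gamma * (rewards[nxt] + values[nxt]) for nxt in successors
                )
    return values

rewards = {"A": 0.0, "B": 5.0}
transitions = {"A": ["B"], "B": []}   # from A you can move to B; B leads nowhere

values = state_values(rewards, transitions)
print(rewards["A"], values["A"])      # A gives no reward itself, yet it has value

discounted = state_values(rewards, transitions, gamma=0.9)
print(discounted["A"])                # a discount factor below 1 shrinks the value
```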

Below is a figure from DeepMind's recent Nature paper, which presented a deep reinforcement learner that was capable of achieving human-level performance or above on 29 of 49 Atari 2600 games (Mnih et al. 2015). The figure is a visualization of the representations that the learning agent has developed for different game-states in Space Invaders. The representations are color-coded depending on the value of the game-state that the representation corresponds to, with red indicating a higher value and blue a lower one.

As can be seen (and is noted in the caption), representations with similar values are mapped closer to each other in the representation space. Also, some game-states which are visually dissimilar to each other but have a similar value are mapped to nearby representations. Likewise, states that are visually similar but have a differing value are mapped away from each other. We could say that the Atari-playing agent has learned a primitive concept space, where the relationships between the concepts (representing game-states) depend on their value and the ease of moving from one game-state to another.

In most artificial RL agents, reward and value are kept strictly separate. In humans (and mammals in general), this doesn't seem to work quite the same way. Rather, if there are things or behaviors which have once given us rewards, we tend to eventually start valuing them for their own sake. If you teach a child to be generous by praising them when they share their toys with others, you don't have to keep doing it all the way to your grave. Eventually they'll internalize the behavior, and start wanting to do it. One might say that the positive feedback actually modifies their reward function, so that they will start getting some amount of pleasure from generous behavior without needing to get external praise for it. In general, behaviors which are learned strongly enough don't need to be reinforced anymore (Pryor 2006).

Why does the human reward function change as well? Possibly because of the bootstrapping problem: there are things such as social status that are very complicated and hard to directly encode as "rewarding" in an infant mind, but which can be learned by associating them with rewards. One researcher I spoke with commented that he "wouldn't be at all surprised" if it turned out that sexual orientation was learned by men and women having slightly different smells, and sexual interest bootstrapping from an innate reward for being in the presence of the right kind of a smell, which the brain then associated with the features usually co-occurring with it. His point wasn't so much that he expected this to be the particular mechanism, but that he wouldn't find it particularly surprising if a core part of the mechanism was something that simple. Remember that incest avoidance seems to bootstrap from the simple cue of "don't be sexually interested in the people you grew up with".

This is, in essence, how I expect human values and human concepts to develop. We have some innate reward function which gives us various kinds of rewards for different kinds of things. Over time we develop various concepts for the purpose of letting us maximize our rewards, and lived experiences also modify our reward function. Our values are concepts which abstract over situations in which we have previously obtained rewards, and which have become intrinsically rewarding as a result.

Getting back to conceptual expansion

Having defined these things, let's take another look at the two examples we discussed above. As a reminder, they were:

  1. Children learn to represent dimensions such as "height" and "volume", as well as "big" and "bright", separately at around age 5.
  2. As an inhabitant of the Earth, you've been used to people being unable to fly and landowners being able to forbid others from using their land. Then someone goes and invents an airplane, leaving open the question of the height to which the landowner's control extends.

I summarized my first attempt at describing the consequences of #1 as "it's a non-issue because none of the children's values have been defined in terms of the affected conceptual domain". We can now reframe it as "it's a non-issue because the [concepts that abstract over the world-states which give the child rewards] mostly do not make use of the dimension that's now been split into 'height' and 'volume'".

Admittedly, this new conceptual distinction might be relevant for estimating the value of a few things. A more accurate estimate of the volume of a glass leads to a more accurate estimate of which glass of juice to prefer, for instance. With children, there probably is some intuitive physics module that figures out how to apply this new dimension for that purpose. Even if there wasn't, and it was unclear whether it was the "tall glass" or "high-volume glass" concept that needed to be mapped closer to high-value glasses, this could be easily determined by simple experimentation.
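As a rough sketch of what such "simple experimentation" could look like, the Python below checks which of the two newly separated dimensions tracks the reward better. The glass measurements and rewards are made-up numbers, and `ordering_agreement` is a hypothetical helper rather than a claim about how children actually do this.

```python
from itertools import combinations

def ordering_agreement(samples, feature):
    """Fraction of glass pairs where more of `feature` also meant more reward."""
    agree = total = 0
    for a, b in combinations(samples, 2):
        if a[feature] == b[feature] or a["reward"] == b["reward"]:
            continue  # ties tell us nothing about the ordering
        total += 1
        agree += (a[feature] > b[feature]) == (a["reward"] > b["reward"])
    return agree / total

# Made-up experiments: reward here happens to be proportional to the juice drunk.
glasses = [
    {"height": 15, "volume": 200, "reward": 2.0},
    {"height": 8, "volume": 350, "reward": 3.5},
    {"height": 12, "volume": 300, "reward": 3.0},
    {"height": 10, "volume": 150, "reward": 1.5},
]

scores = {f: ordering_agreement(glasses, f) for f in ("height", "volume")}
best_dimension = max(scores, key=scores.get)
print(scores, best_dimension)
```

With these numbers the "high-volume glass" concept is the one that ends up mapped close to high-value glasses.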

As for the airplane example, I summarized my description of it by saying that "after an option became available which forced us to include a new dimension in our existing concept of landownership, we solved the issue by considering it in terms of our existing values". We can similarly reframe this as "after the feature of 'height' suddenly became relevant for the concept of landownership, when it hadn't been a relevant feature dimension for landownership before, we redefined landownership by considering which kind of redefinition would give us the largest amounts of rewarding things". "Rewarding things", here, shouldn't be understood only in terms of concrete physical rewards like money, but also anything else that people have ended up valuing, including abstract concepts like the right to ownership.

Note also that different people, having different experiences, ended up making different redefinitions. No doubt some landowners felt that "being in total control of my land and everything above it" was a more important value than "the convenience of people who get to use airplanes"... unless, perhaps, they got to see first-hand the value of flying, in which case the new information could have repositioned the different concepts in their value-space.

As an aside, this also works as a possible partial explanation for e.g. someone being strongly against gay rights until their child comes out of the closet. Someone they care about suddenly benefiting from the concept of "gay rights", which previously had no positive value for them, may end up changing the value of that concept. In essence, they gain new information about the value of the world-states that the concept of "my nation having strong gay rights" abstracts over. (Of course, things don't always go this well, if their concept of homosexuality is too strongly negative to start with.)

The Flatland case follows a similar principle: the Flatlanders have some values that declared the inside of the rectangle a forbidden space. Maybe the inside of the rectangle contains monsters which tend to eat Flatlanders. Once they learn about 3D space, they can rethink it in terms of their existing values.

Dealing with the AI in the box

This leaves us with the AI case. We have, via various examples, taught the AI to stay in the box, which was defined in terms of classical physics. In other words, the AI has obtained the concept of a box, and has come to associate staying in the box with some reward, or possibly leaving it with a lack of a reward.

Then the AI learns about quantum mechanics. It learns that in the QM formulation of the universe, "location" is not a fundamental or well-defined concept anymore - and in some theories, even the concept of "space" is no longer fundamental or well-defined. What happens?

Let's look at the human equivalent for this example: a physicist who learns about quantum mechanics. Do they start thinking that since location is no longer well-defined, they can now safely jump out of the window on the sixth floor?

Maybe some do. But I would wager that most don't. Why not?

The physicist cares about QM concepts to the extent that the said concepts are linked to things that the physicist values. Maybe the physicist finds it rewarding to develop a better understanding of QM, to gain social status by making important discoveries, and to pay their rent by understanding the concepts well enough to continue to do research. These are some of the things that the QM concepts are useful for. Likely the brain has some kind of causal model indicating that the QM concepts are relevant tools for achieving those particular rewards. At the same time, the physicist also has various other things they care about, like being healthy and hanging out with their friends. These are values that can be better furthered by modeling the world in terms of classical physics.

In some sense, the physicist knows that if they started thinking "location is ill-defined, so I can safely jump out of the window", then that would be changing the map, not the territory. It wouldn't help them get the rewards of being healthy and getting to hang out with friends - even if a hypothetical physicist who did make that redefinition would think otherwise. It all adds up to normality.

A part of this comes from the fact that the physicist's reward function remains defined over immediate sensory experiences, as well as values which are linked to those. Even if you convince yourself that the location of food is ill-defined and you thus don't need to eat, you will still suffer the negative reward of being hungry. The physicist knows that no matter how they change their definition of the world, that won't affect their actual sensory experience and the rewards they get from that.

So to prevent the AI from leaving the box by suitably redefining reality, we have to somehow find a way for the same reasoning to apply to it. I haven't worked out a rigorous definition for this, but it needs to somehow learn to care about being in the box in classical terms, and realize that no redefinition of "location" or "space" is going to alter what happens in the classical model. Also, its rewards need to be defined over models to a sufficient extent to avoid wireheading (Hibbard 2011), so that it will think that trying to leave the box by redefining things would count as self-delusion, and not accomplish the things it really cared about. This way, the AI's concept for "being in the box" should remain firmly linked to the classical interpretation of physics, not the QM interpretation of physics, because it's acting in terms of the classical model that has always given it the most reward. 

It is my hope that this could also be made to extend to cases where the AI learns to think in terms of concepts that are totally dissimilar to ours. If it learns a new conceptual dimension, how should that affect its existing concepts? Well, it can figure out how to reclassify the existing concepts that are affected by that change, based on what kind of a classification ends up producing the most reward... when the reward function is defined over the old model.

High impact from low impact

5 Stuart_Armstrong 17 April 2015 04:01PM

Part of the problem with a reduced impact AI is that it will, by definition, only have a reduced impact.

Some of the designs try and get around the problem by allowing a special "output channel" on which impact can be large. But that feels like cheating. Here is a design that accomplishes the same without using that kind of hack.

Imagine there is an asteroid that will hit the Earth, and we have a laser that could destroy it. But we need to aim the laser properly, so we need coordinates. There is a reduced impact AI that is motivated to give the coordinates correctly, but also motivated to have reduced impact - and saving the planet from an asteroid with certainty is not reduced impact.

Now imagine that instead there are two AIs, X and Y. By abuse of notation, let ¬X refer to the event that the output signal from X is scrambled away from the original output.

Then we ask X to give us the x-coordinates for the laser, under the assumption of ¬Y (that AI Y's signal will be scrambled). Similarly, we ask Y to give us the y-coordinates of the laser, under the assumption ¬X.

Then X will reason "since ¬Y, the laser will certainly miss its target, as the y-coordinates will be wrong. Therefore it is reduced impact to output the correct x-coordinates, so I shall." Similarly, Y will output the right y-coordinates, the laser will fire and destroy the asteroid, having a huge impact, hooray!
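A toy version of X's reasoning, written in Python, might look like the following. This is only a sketch under heavy assumptions: the integer coordinates, the `TARGET` value, and the idea of measuring "impact" as the chance that the laser hits are all invented for illustration, and nothing here models how a real reduced impact AI would evaluate impact.

```python
TARGET = (3, 7)  # hypothetical correct (x, y) coordinates for the laser

def laser_hits(x_out, y_out, target=TARGET):
    """The laser destroys the asteroid only if both coordinates are right."""
    return (x_out, y_out) == target

def impact_assumed_by_x(x_out, scrambled_ys):
    """Chance of a hit as X sees it, given ¬Y (Y's output replaced by noise)."""
    return sum(laser_hits(x_out, y) for y in scrambled_ys) / len(scrambled_ys)

# Under ¬Y the y-signal is scrambled away from the correct value.
scrambled_ys = [y for y in range(10) if y != TARGET[1]]

# X's belief: even the correct x-coordinate never leads to a hit.
x_belief = impact_assumed_by_x(TARGET[0], scrambled_ys)

# Reality: Y's signal is not scrambled, so both correct outputs arrive.
actual_hit = laser_hits(TARGET[0], TARGET[1])
print(x_belief, actual_hit)
```

The symmetric calculation for Y under ¬X is the same with the roles of the coordinates swapped.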

The approach is not fully general yet, because we can have "subagent problems". X could create an agent that behaves nicely given ¬Y (the assumption it was given), but completely crazily given Y (the reality). But it shows how we could get high impact from slight tweaks to reduced impact.

EDIT: For those worried about lying to the AIs, do recall http://lesswrong.com/r/discussion/lw/lyh/utility_vs_probability_idea_synthesis/ and http://lesswrong.com/lw/ltf/false_thermodynamic_miracles/

Concept Safety: The problem of alien concepts

14 Kaj_Sotala 17 April 2015 02:09PM

I'm currently reading through some relevant literature for preparing my FLI grant proposal on the topic of concept learning and AI safety. I figured that I might as well write down the research ideas I get while doing so, so as to get some feedback and clarify my thoughts. I will be posting these in a series of "Concept Safety"-titled articles.

In the previous post in this series, I talked about how one might get an AI to have similar concepts as humans. However, one would intuitively assume that a superintelligent AI might eventually develop the capability to entertain far more sophisticated concepts than humans would ever be capable of having. Is that a problem?

Just what are concepts, anyway?

To answer the question, we first need to define what exactly it is that we mean by a "concept", and why exactly more sophisticated concepts would be a problem.

Unfortunately, there isn't really any standard definition of this in the literature, with different theorists having different definitions. Machery even argues that the term "concept" doesn't refer to a natural kind, and that we should just get rid of the whole term. If nothing else, this definition from Kruschke (2008) is at least amusing:

Models of categorization are usually designed to address data from laboratory experiments, so “categorization” might be best defined as the class of behavioral data generated by experiments that ostensibly study categorization.

Because I don't really have the time to survey the whole literature and try to come up with one grand theory of the subject, I will for now limit my scope and only consider two compatible definitions of the term.

Definition 1: Concepts as multimodal neural representations. I touched upon this definition in the last post, where I mentioned studies indicating that the brain seems to have shared neural representations for e.g. the touch and sight of a banana. Current neuroscience seems to indicate the existence of brain areas where representations from several different senses are combined together into higher-level representations, and where the activation of any such higher-level representation will also end up activating the lower sense modalities in turn. As summarized by Man et al. (2013):

Briefly, the Damasio framework proposes an architecture of convergence-divergence zones (CDZ) and a mechanism of time-locked retroactivation. Convergence-divergence zones are arranged in a multi-level hierarchy, with higher-level CDZs being both sensitive to, and capable of reinstating, specific patterns of activity in lower-level CDZs. Successive levels of CDZs are tuned to detect increasingly complex features. Each more-complex feature is defined by the conjunction and configuration of multiple less-complex features detected by the preceding level. CDZs at the highest levels of the hierarchy achieve the highest level of semantic and contextual integration, across all sensory modalities. At the foundations of the hierarchy lie the early sensory cortices, each containing a mapped (i.e., retinotopic, tonotopic, or somatotopic) representation of sensory space. When a CDZ is activated by an input pattern that resembles the template for which it has been tuned, it retro-activates the template pattern of lower-level CDZs. This continues down the hierarchy of CDZs, resulting in an ensemble of well-specified and time-locked activity extending to the early sensory cortices.

On this account, my mental concept for "dog" consists of a neural activation pattern making up the sight, sound, etc. of some dog - either a generic prototypical dog or some more specific dog. Likely the pattern is not just limited to sensory information, either, but may be associated with e.g. motor programs related to dogs. For example, the program for throwing a ball for the dog to fetch. One version of this hypothesis, the Perceptual Symbol Systems account, calls such multimodal representations simulators, and describes them as follows (Niedenthal et al. 2005):

A simulator integrates the modality-specific content of a category across instances and provides the ability to identify items encountered subsequently as instances of the same category. Consider a simulator for the social category, politician. Following exposure to different politicians, visual information about how typical politicians look (i.e., based on their typical age, sex, and role constraints on their dress and their facial expressions) becomes integrated in the simulator, along with auditory information for how they typically sound when they talk (or scream or grovel), motor programs for interacting with them, typical emotional responses induced in interactions or exposures to them, and so forth. The consequence is a system distributed throughout the brain’s feature and association areas that essentially represents knowledge of the social category, politician.

The inclusion of such "extra-sensory" features helps understand how even abstract concepts could fit this framework: for example, one's understanding of the concept of a derivative might be partially linked to the procedural programs one has developed while solving derivatives. For a more detailed hypothesis of how abstract mathematics may emerge from basic sensory and motor programs and concepts, I recommend Lakoff & Nuñez (2001).

Definition 2: Concepts as areas in a psychological space. This definition, while being compatible with the previous one, looks at concepts more "from the inside". Gärdenfors (2000) defines the basic building blocks of a psychological conceptual space to be various quality dimensions, such as temperature, weight, brightness, pitch, and the spatial dimensions of height, width, and depth. These are psychological in the sense of being derived from our phenomenal experience of certain kinds of properties, rather than the way in which they might exist in some objective reality.

For example, one way of modeling the psychological sense of color is via a color space defined by the quality dimensions of hue (represented by the familiar color circle), chromaticness (saturation), and brightness.

The second phenomenal dimension of color is chromaticness (saturation), which ranges from grey (zero color intensity) to increasingly greater intensities. This dimension is isomorphic to an interval of the real line. The third dimension is brightness which varies from white to black and is thus a linear dimension with two end points. The two latter dimensions are not totally independent, since the possible variation of the chromaticness dimension decreases as the values of the brightness dimension approaches the extreme points of black and white, respectively. In other words, for an almost white or almost black color, there can be very little variation in its chromaticness. This is modeled by letting that chromaticness and brightness dimension together generate a triangular representation ... Together these three dimensions, one with circular structure and two with linear, make up the color space. This space is often illustrated by the so called color spindle

This kind of a representation is different from the physical wavelength representation of color, where e.g. the hue is mostly related to the wavelength of the color. The wavelength representation of hue would be linear, but due to the properties of the human visual system, the psychological representation of hue is circular.

Gärdenfors defines two quality dimensions to be integral if a value cannot be given for an object on one dimension without also giving it a value for the other dimension: for example, an object cannot be given a hue value without also giving it a brightness value. Dimensions that are not integral with each other are separable. A conceptual domain is a set of integral dimensions that are separable from all other dimensions: for example, the three color-dimensions form the domain of color.

From these definitions, Gärdenfors develops a theory of concepts where more complicated conceptual spaces can be formed by combining lower-level domains. Concepts, then, are particular regions in these conceptual spaces: for example, the concept of "blue" can be defined as a particular region in the domain of color. Notice that the notion of various combinations of basic perceptual domains making more complicated conceptual spaces possible fits well together with the models discussed in our previous definition. There more complicated concepts were made possible by combining basic neural representations for e.g. different sensory modalities.
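To make "a concept is a region in a conceptual space" a bit more tangible, here is a minimal Python sketch of the color domain. It is an illustration under assumptions, not Gärdenfors's own formalism: the `ColorRegion` class is hypothetical, hue is measured in degrees around the color circle in the style of common hue-saturation-brightness color pickers, the numeric boundaries for "blue" and "red" are rough guesses, and the regions are simple boxes that ignore the triangular chromaticness-brightness constraint described in the quote above.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class ColorRegion:
    """A concept modeled as a box-shaped region of the color domain."""
    hue_start: float        # degrees on the color circle
    hue_end: float          # may be smaller than hue_start if the arc wraps past 360
    min_chroma: float       # 0 = grey, 1 = fully saturated
    min_brightness: float   # 0 = black
    max_brightness: float   # 1 = white

    def contains(self, hue, chroma, brightness):
        hue %= 360
        if self.hue_start <= self.hue_end:
            in_hue = self.hue_start <= hue <= self.hue_end
        else:  # hue is circular, so an arc can cross the 360/0 boundary
            in_hue = hue >= self.hue_start or hue <= self.hue_end
        in_brightness = self.min_brightness <= brightness <= self.max_brightness
        return in_hue and chroma >= self.min_chroma and in_brightness

# Illustrative boundaries only.
blue = ColorRegion(hue_start=200, hue_end=260, min_chroma=0.3,
                   min_brightness=0.2, max_brightness=0.9)
red = ColorRegion(hue_start=340, hue_end=20, min_chroma=0.3,
                  min_brightness=0.2, max_brightness=0.9)

print(blue.contains(240, 0.8, 0.5))   # a saturated mid-brightness blue
print(red.contains(350, 0.8, 0.5), red.contains(10, 0.8, 0.5))  # both sides of the wrap
print(blue.contains(240, 0.05, 0.5))  # almost grey, so outside the "blue" region
```

The wrap-around branch is where the circular structure of psychological hue shows up: a linear wavelength axis would need no such case.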

The origin of the different quality dimensions could also emerge from the specific properties of the different simulators, as in PSS theory.

Thus definition #1 allows us to talk about what a concept might "look like from the outside", with definition #2 talking about what the same concept might "look like from the inside".

Interestingly, Gärdenfors hypothesizes that much of the work involved with learning new concepts has to do with learning new quality dimensions to fit into one's conceptual space, and that once this is done, all that remains is the comparatively much simpler task of just dividing up the new domain to match seen examples.

For example, consider the (phenomenal) dimension of volume. The experiments on "conservation" performed by Piaget and his followers indicate that small children have no separate representation of volume; they confuse the volume of a liquid with the height of the liquid in its container. It is only at about the age of five years that they learn to represent the two dimensions separately. Similarly, three- and four-year-olds confuse high with tall, big with bright, and so forth (Carey 1978).

The problem of alien concepts

With these definitions for concepts, we can now consider what problems would follow if we started off with a very human-like AI that had the same concepts as we did, but then expanded its conceptual space to allow for entirely new kinds of concepts. This could happen if it self-modified to have new kinds of sensory or thought modalities that it could associate its existing concepts with, thus developing new kinds of quality dimensions.

An analogy helps demonstrate this problem: suppose that you're operating in a two-dimensional space, where a rectangle has been drawn to mark a certain area as "forbidden" or "allowed". Say that you're an inhabitant of Flatland. But then you suddenly become aware that actually, the world is three-dimensional, and has a height dimension as well! That raises the question of, how should the "forbidden" or "allowed" area be understood in this new three-dimensional world? Do the walls of the rectangle extend infinitely in the height dimension, or perhaps just some certain distance in it? If just a certain distance, does the rectangle have a "roof" or "floor", or can you just enter (or leave) the rectangle from the top or the bottom? There doesn't seem to be any clear way to tell.

As a historical curiosity, this dilemma actually kind of really happened when airplanes were invented: could landowners forbid airplanes from flying over their land, or was the ownership of the land limited to some specific height, above which the landowners had no control? Courts and legislation eventually settled on the latter answer. A more AI-relevant example might be if one was trying to limit the AI with rules such as "stay within this box here", and the AI then gained an intuitive understanding of quantum mechanics, which might allow it to escape from the box without violating the rule in terms of its new concept space.

More generally, if previously your concepts had N dimensions and now they have N+1, you might find something that fulfilled all the previous criteria while still being different from what we'd prefer if we knew about the N+1th dimension.
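The Flatland case can be made concrete with a small sketch. The rectangle coordinates, the function names and the 2.0 height cap are all assumptions chosen for illustration; the point is only that two equally natural extensions of the same two-dimensional rule disagree once a third coordinate exists.

```python
# Hypothetical forbidden rectangle: (x_min, y_min, x_max, y_max).
RECT = (0.0, 0.0, 4.0, 3.0)

def forbidden_2d(x, y):
    """The original rule, defined only in terms of two dimensions."""
    x_min, y_min, x_max, y_max = RECT
    return x_min <= x <= x_max and y_min <= y <= y_max

def forbidden_infinite_walls(x, y, z):
    """Extension A: the walls extend without limit along the new axis."""
    return forbidden_2d(x, y)

def forbidden_finite_height(x, y, z, height=2.0):
    """Extension B: the rule only applies up to an (assumed) height."""
    return forbidden_2d(x, y) and 0.0 <= z <= height

# A point directly "above" the rectangle.
point = (2.0, 1.0, 5.0)
print(forbidden_infinite_walls(*point))  # True
print(forbidden_finite_height(*point))   # False
```

Both extensions agree with the original rule for every point at height zero, so nothing in the old two-dimensional data can tell them apart.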

In the next post, I will present some (very preliminary and probably wrong) ideas for solving this problem.

Next post in series: What are concepts for, and how to deal with alien concepts.

Concept Safety: Producing similar AI-human concept spaces

27 Kaj_Sotala 14 April 2015 08:39PM

I'm currently reading through some relevant literature for preparing my FLI grant proposal on the topic of concept learning and AI safety. I figured that I might as well write down the research ideas I get while doing so, so as to get some feedback and clarify my thoughts. I will be posting these in a series of "Concept Safety"-titled articles.

A frequently-raised worry about AI is that it may reason in ways which are very different from us, and understand the world in a very alien manner. For example, Armstrong, Sandberg & Bostrom (2012) consider the possibility of restricting an AI via "rule-based motivational control" and programming it to follow restrictions like "stay within this lead box here", but they raise worries about the difficulty of rigorously defining "this lead box here". To address this, they go on to consider the possibility of making an AI internalize human concepts via feedback, with the AI being told whether or not some behavior is good or bad and then constructing a corresponding world-model based on that. The authors are however worried that this may fail, because

Humans seem quite adept at constructing the correct generalisations – most of us have correctly deduced what we should/should not be doing in general situations (whether or not we follow those rules). But humans share a common of genetic design, which the OAI would likely not have. Sharing, for instance, derives partially from genetic predisposition to reciprocal altruism: the OAI may not integrate the same concept as a human child would. Though reinforcement learning has a good track record, it is neither a panacea nor a guarantee that the OAIs generalisations agree with ours.

Addressing this, a possibility that I raised in Sotala (2015) was that the concept-learning mechanisms in the human brain are actually relatively simple, and that we could replicate the human concept learning process by replicating those rules. I'll start this post by discussing a closely related hypothesis: that given a specific learning or reasoning task and a certain kind of data, there is an optimal way to organize the data that will naturally emerge. If this were the case, then AI and human reasoning might naturally tend to learn the same kinds of concepts, even if they were using very different mechanisms. Later on in the post, I will discuss how one might try to verify that similar representations had in fact been learned, and how to set up a system to make them even more similar.

Word embedding

"Left panel shows vector offsets for three word pairs illustrating the gender relation. Right panel shows a different projection, and the singular/plural relation for two words. In high-dimensional space, multiple relations can be embedded for a single word." (Mikolov et al. 2013)

A particularly fascinating branch of recent research relates to the learning of word embeddings, which are mappings of words to very high-dimensional vectors. It turns out that if you train a system on one of several kinds of tasks, such as being able to classify sentences as valid or invalid, this builds up a space of word vectors that reflects the relationships between the words. For example, there seems to be a male/female dimension to words, so that there's a "female vector" that we can add to the word "man" to get "woman" - or, equivalently, which we can subtract from "woman" to get "man". And it so happens (Mikolov, Yih & Zweig 2013) that we can also get from the word "king" to the word "queen" by adding the same vector to "king". In general, we can (roughly) get to the male/female version of any word vector by adding or subtracting this one difference vector!

Why would this happen? Well, a learner that needs to classify sentences as valid or invalid needs to classify the sentence "the king sat on his throne" as valid while classifying the sentence "the king sat on her throne" as invalid. So including a gender dimension on the built-up representation makes sense.
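To make the vector arithmetic concrete, here is a minimal sketch. The three-dimensional vectors are hand-made assumptions, not learned embeddings (real ones have hundreds of dimensions and only approximately satisfy these relations); the sketch only shows what "adding the female vector to king" means mechanically.

```python
import math

# Hand-made toy vectors (an assumption for illustration only).
# Axes, loosely: [royalty, female, person].
vectors = {
    "man":    [0.0, 0.0, 1.0],
    "woman":  [0.0, 1.0, 1.0],
    "king":   [1.0, 0.0, 1.0],
    "queen":  [1.0, 1.0, 1.0],
    "throne": [1.0, 0.0, 0.0],
}

def add(a, b):
    return [x + y for x, y in zip(a, b)]

def sub(a, b):
    return [x - y for x, y in zip(a, b)]

def nearest(target, exclude):
    """Return the word whose vector is closest to target."""
    candidates = (w for w in vectors if w not in exclude)
    return min(candidates, key=lambda w: math.dist(vectors[w], target))

# The "female vector" is the offset from "man" to "woman".
female_offset = sub(vectors["woman"], vectors["man"])

# Adding the same offset to "king" lands on "queen".
result = nearest(add(vectors["king"], female_offset), exclude={"king"})
print(result)  # queen
```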

But gender isn't the only kind of relationship that gets reflected in the geometry of the word space. Here are a few more:

It turns out (Mikolov et al. 2013) that with the right kind of training mechanism, a lot of relationships that we're intuitively aware of become automatically learned and represented in the concept geometry. And as Olah (2014) comments:

It’s important to appreciate that all of these properties of W are side effects. We didn’t try to have similar words be close together. We didn’t try to have analogies encoded with difference vectors. All we tried to do was perform a simple task, like predicting whether a sentence was valid. These properties more or less popped out of the optimization process.

This seems to be a great strength of neural networks: they learn better ways to represent data, automatically. Representing data well, in turn, seems to be essential to success at many machine learning problems. Word embeddings are just a particularly striking example of learning a representation.

It gets even more interesting, for we can use these for translation. Since Olah has already written an excellent exposition of this, I'll just quote him:

We can learn to embed words from two different languages in a single, shared space. In this case, we learn to embed English and Mandarin Chinese words in the same space.

We train two word embeddings, Wen and Wzh in a manner similar to how we did above. However, we know that certain English words and Chinese words have similar meanings. So, we optimize for an additional property: words that we know are close translations should be close together.

Of course, we observe that the words we knew had similar meanings end up close together. Since we optimized for that, it’s not surprising. More interesting is that words we didn’t know were translations end up close together.

In light of our previous experiences with word embeddings, this may not seem too surprising. Word embeddings pull similar words together, so if an English and Chinese word we know to mean similar things are near each other, their synonyms will also end up near each other. We also know that things like gender differences tend to end up being represented with a constant difference vector. It seems like forcing enough points to line up should force these difference vectors to be the same in both the English and Chinese embeddings. A result of this would be that if we know that two male versions of words translate to each other, we should also get the female words to translate to each other.

Intuitively, it feels a bit like the two languages have a similar ‘shape’ and that by forcing them to line up at different points, they overlap and other points get pulled into the right positions.
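The "similar shape" intuition can be sketched with a deliberately simplified stand-in. The quoted work trains the two embeddings jointly; the sketch below instead takes two already-built toy spaces and fits a single rotation from a few known translation pairs (a two-dimensional Procrustes fit), which is a different technique swapped in for brevity. All vectors and the `zh_` labels are made-up assumptions.

```python
import math

# Toy 2-D "English" space and a toy "Chinese" space that happens to be the
# same shape rotated by 90 degrees. Both are assumptions for illustration.
en = {"man": (1.0, 0.0), "woman": (1.0, 1.0),
      "king": (3.0, 0.0), "queen": (3.0, 1.0)}
zh = {"zh_man": (0.0, 1.0), "zh_woman": (-1.0, 1.0),
      "zh_king": (0.0, 3.0), "zh_queen": (-1.0, 3.0)}

# Translation pairs we are assumed to know in advance ("queen" is held out).
known_pairs = [("man", "zh_man"), ("woman", "zh_woman"), ("king", "zh_king")]

def fit_rotation(pairs):
    """Least-squares rotation angle taking English anchors onto Chinese ones."""
    dot = sum(en[a][0] * zh[b][0] + en[a][1] * zh[b][1] for a, b in pairs)
    cross = sum(en[a][0] * zh[b][1] - en[a][1] * zh[b][0] for a, b in pairs)
    return math.atan2(cross, dot)

def rotate(v, theta):
    c, s = math.cos(theta), math.sin(theta)
    return (c * v[0] - s * v[1], s * v[0] + c * v[1])

theta = fit_rotation(known_pairs)
mapped = rotate(en["queen"], theta)

# The held-out word lands next to its translation, although that pair was
# never used when fitting the rotation.
translation = min(zh, key=lambda w: math.dist(zh[w], mapped))
print(translation)  # zh_queen
```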

After this, it gets even more interesting. Suppose you had this space of word vectors, and then you also had a system which translated images into vectors in the same space. If you have images of dogs, you put them near the word vector for "dog". If you have images of Clippy, you put them near the word vector for "paperclip". And so on.

You do that, and then you take some class of images the image-classifier was never trained on, like images of cats. You ask it to place the cat-image somewhere in the vector space. Where does it end up? 

You guessed it: in the rough region of the "cat" words. Olah once more:

This was done by members of the Stanford group with only 8 known classes (and 2 unknown classes). The results are already quite impressive. But with so few known classes, there are very few points to interpolate the relationship between images and semantic space off of.

The Google group did a much larger version – instead of 8 categories, they used 1,000 – around the same time (Frome et al. (2013)) and has followed up with a new variation (Norouzi et al. (2014)). Both are based on a very powerful image classification model (from Krizhevsky et al. (2012)), but embed images into the word embedding space in different ways.

The results are impressive. While they may not get images of unknown classes to the precise vector representing that class, they are able to get to the right neighborhood. So, if you ask it to classify images of unknown classes and the classes are fairly different, it can distinguish between the different classes.

Even though I’ve never seen a Aesculapian snake or an Armadillo before, if you show me a picture of one and a picture of the other, I can tell you which is which because I have a general idea of what sort of animal is associated with each word. These networks can accomplish the same thing.

These algorithms made no attempt at being biologically realistic in any way. They didn't try classifying data the way the brain does it: they just tried classifying data using whatever worked. And it turned out that this was enough to start constructing a multimodal representation space where a lot of the relationships between entities were similar to the way humans understand the world.

How useful is this?

"Well, that's cool", you might now say. "But those word spaces were constructed from human linguistic data, for the purpose of predicting human sentences. Of course they're going to classify the world in the same way as humans do: they're basically learning the human representation of the world. That doesn't mean that an autonomously learning AI, with its own learning faculties and systems, is necessarily going to learn a similar internal representation, or to have similar concepts."

This is a fair criticism. But it is mildly suggestive of the possibility that an AI that was trained to understand the world via feedback from human operators would end up building a similar conceptual space. At least assuming that we chose the right learning algorithms.

When we train a language model to classify sentences by labeling some of them as valid and others as invalid, there's a hidden structure implicit in our answers: the structure of how we understand the world, and of how we think of the meaning of words. The language model extracts that hidden structure and begins to classify previously unseen things in terms of those implicit reasoning patterns. Similarly, if we gave an AI feedback about what kinds of actions counted as "leaving the box" and which ones didn't, there would be a certain way of viewing and conceptualizing the world implied by that feedback, one which the AI could learn.

Comparing representations

"Hmm, maaaaaaaaybe", is your skeptical answer. "But how would you ever know? Like, you can test the AI in your training situation, but how do you know that it's actually acquired a similar-enough representation and not something wildly off? And it's one thing to look at those vector spaces and claim that there are human-like relationships among the different items, but that's still a little hand-wavy. We don't actually know that the human brain does anything remotely similar to represent concepts."

Here we turn, for a moment, to neuroscience.

From Kaplan, Man & Greening (2015): "In this example, subjects either see or touch two classes of objects, apples and bananas. (A) First, a classifier is trained on the labeled patterns of neural activity evoked by seeing the two objects. (B) Next, the same classifier is given unlabeled data from when the subject touches the same objects and makes a prediction. If the classifier, which was trained on data from vision, can correctly identify the patterns evoked by touch, then we conclude that the representation is modality invariant."

Multivariate Cross-Classification (MVCC) is a clever neuroscience methodology used for figuring out whether different neural representations of the same thing have something in common. For example, we may be interested in whether the visual and tactile representation of a banana have something in common.

We can test this by having several test subjects look at pictures of objects such as apples and bananas while sitting in a brain scanner. We then feed the scans of their brains into a machine learning classifier and teach it to distinguish between the neural activity of looking at an apple, versus the neural activity of looking at a banana. Next we have our test subjects (still sitting in the brain scanners) touch some bananas and apples, and ask our machine learning classifier to guess whether the resulting neural activity is the result of touching a banana or an apple. If the classifier - which has not been trained on the "touch" representations, only on the "sight" representations - manages to achieve a better-than-chance performance on this latter task, then we can conclude that the neural representation for e.g. "the sight of a banana" has something in common with the neural representation for "the touch of a banana".
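The logic of the procedure can be sketched in a few lines. Everything here is an assumption for illustration: the three-number "activity patterns" are invented rather than taken from any scan, and a nearest-centroid rule stands in for whatever classifier an actual study would use. The structure is what matters: train on one modality, test on the other, compare against chance.

```python
import math

# Invented "neural activity" patterns (assumption: 3 features per scan).
# Training data: activity evoked by SEEING each object.
vision = {
    "apple":  [(1.0, 0.2, 0.1), (0.9, 0.1, 0.2)],
    "banana": [(0.1, 0.9, 0.2), (0.2, 1.0, 0.1)],
}
# Test data: activity evoked by TOUCHING each object. The third feature is
# shifted, standing in for a modality-specific component.
touch = [("apple", (0.8, 0.3, 0.9)), ("banana", (0.2, 0.8, 1.0))]

def centroid(patterns):
    n = len(patterns)
    return tuple(sum(p[i] for p in patterns) / n
                 for i in range(len(patterns[0])))

# "Training": one centroid per object, computed from vision data only.
centroids = {label: centroid(ps) for label, ps in vision.items()}

def classify(pattern):
    return min(centroids, key=lambda lab: math.dist(centroids[lab], pattern))

# Cross-classification: the vision-trained classifier labels touch data.
correct = sum(classify(p) == label for label, p in touch)
accuracy = correct / len(touch)
chance = 1 / len(centroids)
print(accuracy > chance)  # True
```

In this toy data the cross-modal accuracy beats chance only because the first two features were constructed to be shared between sight and touch, which is exactly the property the real experiment is testing for.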

A particularly fascinating experiment of this type is that of Shinkareva et al. (2011), who showed their test subjects both the written words for different tools and dwellings, and, separately, line-drawing images of the same tools and dwellings. A machine-learning classifier was both trained on image-evoked activity and made to predict word-evoked activity and vice versa, and achieved a high accuracy on category classification for both tasks. Even more interestingly, the representations seemed to be similar between subjects. Training the classifier on the word representations of all but one participant, and then having it classify the image representation of the left-out participant, also achieved a reliable (p<0.05) category classification for 8 out of 12 participants. This suggests a relatively similar concept space between humans of a similar background.

We can now hypothesize some ways of testing the similarity of the AI's concept space with that of humans. Possibly the most interesting one might be to develop a translation between a human's and an AI's internal representations of concepts. Take a human's neural activation when they're thinking of some concept, and then take the AI's internal activation when it is thinking of the same concept, and plot them in a shared space similar to the English-Mandarin translation. To what extent do the two concept geometries have similar shapes, allowing one to take a human's neural activation of the word "cat" to find the AI's internal representation of the word "cat"? To the extent that this is possible, one could probably establish that the two share highly similar concept systems.

One could also try to more explicitly optimize for such a similarity. For instance, one could train the AI to make predictions of different concepts, with the additional constraint that its internal representation must be such that a machine-learning classifier trained on a human's neural representations will correctly identify concept-clusters within the AI. This might force internal similarities on the representation beyond the ones that would already be formed from similarities in the data.

Next post in series: The problem of alien concepts.

Anti-Pascaline satisficer

3 Stuart_Armstrong 14 April 2015 06:49PM

It occurred to me that the anti-Pascaline agent design could be used as part of a satisficer approach.

The obvious way to reduce dangerous optimisation pressure is to give the agent a bounded utility function with an easily achievable bound, such as a utility linear in paperclips that maxes out at 10.

The problem with this is that, if the entity is a maximiser (which it might become), it can never be sure that it's achieved its goals. Even after building 10 paperclips, and an extra 2 to be sure, and an extra 20 to be really sure, and an extra 3^^^3 to be really really sure, and extra cameras to count them, with redundant robots patrolling the cameras to make sure that they're all behaving well, etc... There's still an ε chance that it might have just dreamed this, say, or that its memory is faulty. So it has a current utility of (1-ε)10, and can increase this by reducing ε - hence by building even more paperclips.

Hum... ε, you say? This seems a place where the anti-Pascaline design could help. Here we would use it at the lower bound of utility. It currently has probability ε of having utility < 10 (i.e. it has not built 10 paperclips) and (1-ε) of having utility = 10. Therefore an anti-Pascaline agent with ε lower bound would round this off to 10, discounting the unlikely event that it has been deluded, and thus it has no need to build more paperclips or paperclip counting devices.
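A small numerical sketch of the difference follows. It is a simplification made up for illustration, not the actual anti-Pascaline design: here "rounding off" is modelled as dropping every outcome whose probability is at or below a cutoff and renormalising the rest, and the deluded case is assumed to have utility 0, which matches the (1-ε)10 figure above.

```python
def expected_utility(outcomes):
    """Plain expected utility over (probability, utility) pairs."""
    return sum(p * u for p, u in outcomes)

def anti_pascaline_utility(outcomes, cutoff):
    """Assumed simplification: ignore outcomes with probability <= cutoff,
    then renormalise over what is left."""
    kept = [(p, u) for p, u in outcomes if p > cutoff]
    total = sum(p for p, _ in kept)
    return sum(p * u for p, u in kept) / total

eps = 1e-6
# With probability eps the agent is deluded and has utility 0 (assumption);
# otherwise it really has built its 10 paperclips.
outcomes = [(1 - eps, 10.0), (eps, 0.0)]

plain = expected_utility(outcomes)                    # slightly below 10
rounded = anti_pascaline_utility(outcomes, cutoff=eps)  # approximately 10

print(plain < 10.0)
print(abs(rounded - 10.0) < 1e-9)
```

The plain maximiser can still gain by shrinking ε, whereas the rounded value is already at the bound, so extra paperclips or cameras add nothing.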

Note that this is an un-optimising approach, not an anti-optimising one, so the agent may still build more paperclips anyway - it just has no pressure to do so.

Un-optimised vs anti-optimised

6 Stuart_Armstrong 14 April 2015 06:30PM

This post contains no new insights; it just puts together some old insights in a format I hope is clearer.

Most satisficers are unoptimised (above the satisficing level): they have a limited drive to optimise and transform the universe. They may still end up optimising the universe anyway: they have no penalty for doing so (and sometimes it's a good idea for them). But if they can lazily achieve their goal, then they're ok with that too. So they simply have low optimisation pressure.

A safe "satisficer" design (or a reduced impact AI design) needs to be not only un-optimised, but specifically anti-optimised. It has to be set up so that "go out and optimise the universe" scores worse than "be lazy and achieve your goal". The problem is that these terms are undefined (as usual), that there are many minor actions that can optimise the universe (such as creating a subagent), and the approach has to be safe against all possible ways of optimising the universe - not just the "maximise u" for a specific and known u.

That's why the reduced impact/safe satisficer/anti-optimised designs are so hard: you have to add a very precise yet general (anti-)optimising pressure, rather than simply removing the current optimising pressure.

Could you tell me what's wrong with this?

1 Algon 14 April 2015 10:43AM

Edit: Some people have misunderstood my intentions here. I do not in any way expect this to be the NEXT GREAT IDEA. I just couldn't see anything wrong with this, which almost certainly meant there were gaps in my knowledge. I thought the fastest way to see where I went wrong would be to post my idea here and see what people say. I apologise for any confusion I caused. I'll try to be more clear next time.

(I really can't think of any major problems in this, so I'd be very grateful if you guys could tell me what I've done wrong). 

So, a while back I was listening to a discussion about the difficulty of making an FAI. One of the ways that was suggested to circumvent this was to go down the route of programming an AGI to solve FAI. Someone else pointed out the problems with this. Amongst other things one would have no idea what the AI will do in pursuit of its primary goal. Furthermore, it would already be a monumental task to program an AI whose primary goal is to solve the FAI problem; doing this is still easier than solving FAI, I should think. 

So, I started to think about this for a little while, and I thought 'how could you make this safer?' Well, first off, you don't want an AI who completely outclasses humanity in terms of intellect. If things went Wrong, you'd have little chance of stopping it. So, you want to limit the AI's intellect to genius level, so if something did go Wrong, then the AI would not be unstoppable. It may do quite a bit of damage, but a large group of intelligent people with a lot of resources on their hands could stop it. 

Therefore, what must be done is that the AI cannot modify parts of its source code. You must try and stop an intelligence explosion from taking off. So, limited access to its source code, and a limit on how much computing power it can have on hand. This is problematic though, because the AI would not be able to solve FAI very quickly. After all, we have a few genius level people trying to solve FAI, and they're struggling with it, so why should a genius level computer do any better? Well, an AI would have fewer biases, and could accumulate much more expertise relevant to the task at hand. It would be about as capable of solving FAI as the most capable human could possibly be; perhaps even more so. Essentially, you'd get someone like Turing, Von Neumann, Newton and others all rolled into one working on FAI. 

But, there's still another problem. The AI, if left working on FAI for 20 years let's say, would have accumulated enough skills that it would be able to cause major problems if something went wrong. Sure, it would be as intelligent as Newton, but it would be far more skilled. Humanity fighting against it would be like sending a young Miyamoto Musashi against his future self at his zenith, i.e. completely one-sided. 

 What must be done then, is the AI must have a time limit of a few years (or less) and after that time is past, it is put to sleep. We look at what it accomplished, see what worked and what didn't, and boot up a fresh version of the AI with any required modifications, and tell it what the old AI did. Repeat the process for a few years, and we should end up with FAI solved. 

After that, we just make an FAI, and wake up the originals, since there's no point in killing them off at this point. 

But there are still some problems. One, time. Why try this when we could solve FAI ourselves? Well, I would only try and implement something like this if it is clear that AGI will be solved before FAI is. A backup plan if you will. Second, what if FAI is just too much for people at our current level? Sure, we have guys who are one in ten thousand and better working on this, but what if we need someone who's one in a hundred billion? Someone who represents the peak of human ability? We shouldn't just wait around for them, since some idiot would probably just make an AGI thinking it would love us all anyway. 

So, what do you guys think? As a plan, is this reasonable? Or have I just overlooked something completely obvious? I'm not saying that this would be easy in any way, but it would be easier than solving FAI.

In what language should we define the utility function of a friendly AI?

3 Val 05 April 2015 10:14PM

I've been following the "safe AI" debates for quite some time, and I would like to share some of the views and ideas I don't remember seeing mentioned yet.

There is a lot of focus on what kind of utility function an AI should have, and how to keep it adhering to that utility function. Let's assume we have an optimizer which doesn't develop any "deliberately malicious" intents, cannot change its own utility function, and can have some hard-coded constraints it cannot overwrite. (Maybe we should come up with a term for such an AI; it might prove useful in the study of safe AI where we can concentrate only on the utility function, and can assume the above conditions are true - from now on, let's just use the term "optimizer" in this article. Hm, maybe "honest optimizer"?). Even an AI with the above constraints can be dangerous; an interesting example can be found in the Friendship is Optimal stories.

The question I would like to raise is not what kind of utility function we should come up with, but in what kind of language we should define it.

More specifically how high-level should the language be? As low as a mathematical function working with quantized qualities based on what values humans consider important? A programming language? Or a complex, syntactic grammar like human languages, capable of expressing abstract concepts? Something which is a step above this?

Just quantizing some human values we find important, and assigning weights to them, can have many problems:


1. Overfitting.

A simplified example: imagine the desired behavior of the AI as a function. You come up with a lot of points on this function, and what the AI will do is to fit a function onto those points, hopefully ending up with a function very similar to the one you conceived. However, an optimizer can very quickly come up with a function which goes through all of your defined points and the function will not look anything like the one you imagined. I think many of us encountered this problem when we wanted to do a curve-fitting with a polynomial of too high degree.

I guess many of the safe AI problems can be conceptualized as an overfitting problem: the optimizer will exactly fulfill the requirements we programmed into it, but will arbitrarily choose the requirements we didn't specify.
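The polynomial case can be reproduced in a few lines. The target function and the choice of 11 evenly spaced sample points are assumptions picked because they give a well-known bad outcome (Runge's example); the "optimizer" here is just the unique degree-10 polynomial through the samples, evaluated in Lagrange form.

```python
def target(x):
    """The behaviour we have in mind (assumed for illustration)."""
    return 1.0 / (1.0 + 25.0 * x * x)

# The points we actually specify: 11 evenly spaced samples on [-1, 1].
nodes = [-1.0 + 2.0 * i / 10 for i in range(11)]
values = [target(x) for x in nodes]

def interpolate(x):
    """Degree-10 polynomial through every specified point (Lagrange form)."""
    total = 0.0
    for i, xi in enumerate(nodes):
        term = values[i]
        for j, xj in enumerate(nodes):
            if j != i:
                term *= (x - xj) / (xi - xj)
        total += term
    return total

# Error at the points we specified versus error in between them.
fit_error = max(abs(interpolate(x) - target(x)) for x in nodes)
grid = [-1.0 + 2.0 * k / 1000 for k in range(1001)]
worst_error = max(abs(interpolate(x) - target(x)) for x in grid)

print(fit_error < 1e-9)   # every specified requirement is met
print(worst_error > 1.0)  # yet the curve is far off between the points
```

The target never exceeds 1 in absolute value, so an error larger than 1 between the samples means the fitted curve looks nothing like the intended one there, despite meeting every stated requirement.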


2. Changing of human values.

Imagine that someone created an honest optimizer, thought of all the possible pitfalls, designed the utility function and all the constraints very carefully, and created a truly safe AI, which didn't become unfriendly. This AI quickly eliminated illness, poverty, and other major problems humans faced, and created a utopian world. To not let this utopia degenerate into a dystopia over time, it also cares for maintaining it and so it resists any possible change (as any change would detract from its utility function of creating that utopia). Seems nice, doesn't it? Now imagine that this AI was created by someone in the Victorian era, and the created world adhered to the cultural norms, lifestyle, values and morality of that era of British history. And these would never ever change. Would you, with your current ideologies, enjoy living in such a world? Would you think of it as the best of all conceivable worlds?

Now, what if this AI was created by you, in our current era? You sure would know much better than those pesky Victorians, right? We have much better values now, don't we? However, for people living a couple of generations from now, these current ideas and values might become as strange to them as the Victorian values are to us. Without judging either the Victorian or current values, I think I can safely assume that if a time traveler from the Victorian era arrived in this world, and if a time traveler from today was stuck in the Victorian era, both would find it very uncomfortable.

Therefore I would argue that even a safe and friendly AI could have the consequences of forever locking mankind to the values the creator of the AI had (or the generation of the creator had, if the values are defined by a democratic process).



We should give some thought to how we formulate the goals of a safe AI, and what kind of language we should use. I would argue that a low-level language would be very unsafe. We should think of a language which could express abstract concepts but be strict enough to be defined accurately. Low-level languages have the advantage over high-level ones of being very accurate, but they have disadvantages when it comes to expressing abstract concepts.

We might even find it useful to take a look at real-life religions, as they tend to last for a very long time, and can carry a core message over many generations of changing cultural norms and values. My point now is not to argue about the virtues or vices of specific real-world religions, I only use them here as a convenient example, strictly from a historical point of view, with no offense intended.

The largest religion in our world has a very simple message as one of its most important core rules: "love other people as yourself". This is a sufficiently abstract concept that both bronze-age shepherds and modern-day computer scientists understand it, and the sentence is probably interpreted not much differently. Now compare it to the religion it originated from, which has orders of magnitude fewer followers, and in its strictest form has very strongly defined rules and regulations, many of which are hard to translate into the modern world. A lot of their experts spend considerable time trying to translate them to the modern world, like "is just pressing a single button on a washing machine considered working?". What about hygiene practices which made sense for nomadic people in the desert, how can they be understood (and applied) by modern people? Concepts expressed in a high-level language can carry their meaning much better across times with changing cultural, social and technical characteristics.

However, a rule like "on a calendar day divisible by seven you are only allowed to walk x steps" is easy to code, even many of our current robots could easily be programmed to do it. On the other hand, expressing what love is will prove to be much harder, but it will preserve its meaning and intention for much longer.
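To show how little it takes to encode the low-level rule, here is a sketch. The function name, the use of a plain integer day number and the `max_steps` parameter (standing in for the unspecified x) are assumptions for illustration.

```python
def walk_allowed(day_number, steps_taken, max_steps):
    """Low-level rule: on a day divisible by seven, at most max_steps steps.

    day_number  -- integer calendar day (assumed numbering scheme)
    steps_taken -- steps walked so far on that day
    max_steps   -- the limit "x" from the rule, left as a parameter
    """
    if day_number % 7 == 0:
        return steps_taken <= max_steps
    return True

print(walk_allowed(14, 1500, 1000))  # False: restricted day, over the limit
print(walk_allowed(15, 1500, 1000))  # True: not a restricted day
```

The whole rule fits in a few lines precisely because every term in it is already a number. No comparably short function exists for "love other people as yourself", because nothing in that sentence has a ready-made numerical definition to check against.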



On the Boxing of AIs

0 tailcalled 31 March 2015 09:58PM

I've previously written about methods of boxing AIs. Essentially, while I do see the point that boxing an AI would be nontrivial, most people seem to have gone too far, and claim that it is impossible. I disagree that it's impossible and aim to explain some methods.

So, let's start with why people would want to box AIs. As you probably know, letting an AI roam freely results in the destruction of everything humans care about, unless that AI has been programmed very carefully. That amount of destruction would be somewhat unfortunate, so if we have to start an AI, we want to know that it works.

Unfortunately, figuring out if an AI works is tricky. This made some people invent the concept of AI boxing, where you don't give an AI direct access to anything except a human, so you can test the AI thoroughly and exploit it without letting it destroy the world. This turns out not to work, because humans are stupid and let out the AI.

My first solution to that is to not tell the AI about its human overlords. Discussion about that solution made me realize that there are other solutions that work in other cases. Essentially, the problem with the original boxing method is that it tries to do everything, while in reality, you need to make some trade-offs between the realism of the AI's situations and the amount of knowledge you get from the AI.

With my original method, I tried to crank up the second of those variables as far as possible while maintaining perfect security. This makes it harder to test morality-related things, but you get to see lots of details in how the AI behaves.

I promised to post another method for boxing the AI today, but I actually have two new methods. Here they are:

Separate the Gatekeeper and the Observer

How do you prevent the AI from convincing the Gatekeeper to let it out? By not letting it talk to the Gatekeeper. If the person speaking to the AI (let's call them the Observer) doesn't know the Gatekeeper (the Gatekeeper is the person who controls whether to release the AI), they can't let the AI out.

You could do all sorts of things, such as using virtual reality and a simulated mini-universe to see if, say, the AI would make a good society. You still have to make sure that it has gotten the plan to make a good society from its morality rather than from wanting to escape and destroy everything.

Afterwards, the Observer can make a low-detail report to the Gatekeeper.

(You might want to drug the Observer with something that prevents their memory from working too well...)

Automatic Testing

This is essentially the above, but with the Observer replaced by a computer program. This is probably easier to do when you want to test the AI's decision making skills rather than its morality.

The Lesson

I would say that the lesson is that while AI boxing requires some trade-offs, it's not completely impossible. This seems like a needed lesson, given that I've seen people claim that an AI can escape even with the strongest possible box without communicating with humans. Essentially, I'm trying to demonstrate that the original boxing experiments show that humans are weak, not that boxing is hard, and that this can be solved by not letting humans be the central piece of security in boxing the AIs.

Superintelligence 29: Crunch time

8 KatjaGrace 31 March 2015 04:24AM

This is part of a weekly reading group on Nick Bostrom's book, Superintelligence. For more information about the group, and an index of posts so far see the announcement post. For the schedule of future topics, see MIRI's reading guide.

Welcome. This week we discuss the twenty-ninth section in the reading guide: Crunch time. This corresponds to the last chapter in the book, and the last discussion here (even though the reading guide shows a mysterious 30th section).

This post summarizes the section, and offers a few relevant notes, and ideas for further investigation. Some of my own thoughts and questions for discussion are in the comments.

There is no need to proceed in order through this post, or to look at everything. Feel free to jump straight to the discussion. Where applicable and I remember, page numbers indicate the rough part of the chapter that is most related (not necessarily that the chapter is being cited for the specific claim).

Reading: Chapter 15


  1. As we have seen, the future of AI is complicated and uncertain. So, what should we do? (p255)
  2. Intellectual discoveries can be thought of as moving the arrival of information earlier. For many questions in math and philosophy, getting answers earlier does not matter much. Also people or machines will likely be better equipped to answer these questions in the future. For other questions, e.g. about AI safety, getting the answers earlier matters a lot. This suggests working on the time-sensitive problems instead of the timeless problems. (p255-6)
  3. We should work on projects that are robustly positive value (good in many scenarios, and on many moral views)
  4. We should work on projects that are elastic to our efforts (i.e. cost-effective; high output per input)
  5. Two objectives that seem good on these grounds: strategic analysis and capacity building (p257)
  6. An important form of strategic analysis is the search for crucial considerations. (p257)
  7. Crucial consideration: idea with the potential to change our views substantially, e.g. reversing the sign of the desirability of important interventions. (p257)
  8. An important way of building capacity is assembling a capable support base who take the future seriously. These people can then respond to new information as it arises. One key instantiation of this might be an informed and discerning donor network. (p258)
  9. It is valuable to shape the culture of the field of AI risk as it grows. (p258)
  10. It is valuable to shape the social epistemology of the AI field. For instance, can people respond to new crucial considerations? Is information spread and aggregated effectively? (p258)
  11. Other interventions that might be cost-effective: (p258-9)
    1. Technical work on machine intelligence safety
    2. Promoting 'best practices' among AI researchers
    3. Miscellaneous opportunities that arise, not necessarily closely connected with AI, e.g. promoting cognitive enhancement
  12. We are like a large group of children holding triggers to a powerful bomb: the situation is very troubling, but calls for bitter determination to be as competent as we can, on what is the most important task facing our times. (p259-60)

Another view

Alexis Madrigal talks to Andrew Ng, chief scientist at Baidu Research, who does not think it is crunch time:

Andrew Ng builds artificial intelligence systems for a living. He taught AI at Stanford, built AI at Google, and then moved to the Chinese search engine giant, Baidu, to continue his work at the forefront of applying artificial intelligence to real-world problems.

So when he hears people like Elon Musk or Stephen Hawking—people who are not intimately familiar with today’s technologies—talking about the wild potential for artificial intelligence to, say, wipe out the human race, you can practically hear him facepalming.

“For those of us shipping AI technology, working to build these technologies now,” he told me, wearily, yesterday, “I don’t see any realistic path from the stuff we work on today—which is amazing and creating tons of value—but I don’t see any path for the software we write to turn evil.”

But isn’t there the potential for these technologies to begin to create mischief in society, if not, say, extinction?

“Computers are becoming more intelligent and that’s useful as in self-driving cars or speech recognition systems or search engines. That’s intelligence,” he said. “But sentience and consciousness is not something that most of the people I talk to think we’re on the path to.”

Not all AI practitioners are as sanguine about the possibilities of robots. Demis Hassabis, the founder of the AI startup DeepMind, which was acquired by Google, made the creation of an AI ethics board a requirement of its acquisition. “I think AI could be world changing, it’s an amazing technology,” he told journalist Steven Levy. “All technologies are inherently neutral but they can be used for good or bad so we have to make sure that it’s used responsibly. I and my cofounders have felt this for a long time.”

So, I said, simply project forward progress in AI and the continued advance of Moore’s Law and associated increases in computers speed, memory size, etc. What about in 40 years, does he foresee sentient AI?

“I think to get human-level AI, we need significantly different algorithms and ideas than we have now,” he said. English-to-Chinese machine translation systems, he noted, had “read” pretty much all of the parallel English-Chinese texts in the world, “way more language than any human could possibly read in their lifetime.” And yet they are far worse translators than humans who’ve seen a fraction of that data. “So that says the human’s learning algorithm is very different.”

Notice that he didn’t actually answer the question. But he did say why he personally is not working on mitigating the risks some other people foresee in superintelligent machines.

“I don’t work on preventing AI from turning evil for the same reason that I don’t work on combating overpopulation on the planet Mars,” he said. “Hundreds of years from now when hopefully we’ve colonized Mars, overpopulation might be a serious problem and we’ll have to deal with it. It’ll be a pressing issue. There’s tons of pollution and people are dying and so you might say, ‘How can you not care about all these people dying of pollution on Mars?’ Well, it’s just not productive to work on that right now.”

Current AI systems, Ng contends, are basic relative to human intelligence, even if there are things they can do that exceed the capabilities of any human. “Maybe hundreds of years from now, maybe thousands of years from now—I don’t know—maybe there will be some AI that turn evil,” he said, “but that’s just so far away that I don’t know how to productively work on that.”

The bigger worry, he noted, was the effect that increasingly smart machines might have on the job market, displacing workers in all kinds of fields much faster than even industrialization displaced agricultural workers or automation displaced factory workers.

Surely, creative industry people like myself would be immune from the effects of this kind of artificial intelligence, though, right?

“I feel like there is more mysticism around the notion of creativity than is really necessary,” Ng said. “Speaking as an educator, I’ve seen people learn to be more creative. And I think that some day, and this might be hundreds of years from now, I don’t think that the idea of creativity is something that will always be beyond the realm of computers.”

And the less we understand what a computer is doing, the more creative and intelligent it will seem. “When machines have so much muscle behind them that we no longer understand how they came up with a novel move or conclusion,” he concluded, “we will see more and more what look like sparks of brilliance emanating from machines.”

Andrew Ng commented:

Enough thoughtful AI researchers (including Yoshua Bengio, Yann LeCun) have criticized the hype about evil killer robots or "superintelligence," that I hope we can finally lay that argument to rest. This article summarizes why I don't currently spend my time working on preventing AI from turning evil.


1. Replaceability

'Replaceability' is the general issue of the work that you do producing some complicated counterfactual rearrangement of different people working on different things at different times. For instance, if you solve a math question, this means it gets solved somewhat earlier and also someone else in the future does something else instead, which someone else might have done, etc. For a much more extensive explanation of how to think about replaceability, see 80,000 Hours. They also link to some of the other discussion of the issue within Effective Altruism (a movement interested in efficiently improving the world, thus naturally interested in AI risk and the nuances of evaluating impact).

2. When should different AI safety work be done?

For more discussion of timing of work on AI risks, see Ord 2014. I've also written a bit about what should be prioritized early.

3. Review

If you'd like to quickly review the entire book at this point, Amanda House has a summary here, including this handy diagram among others: 

4. What to do?

If you are convinced that AI risk is an important priority, and want some more concrete ways to be involved, here are some people working on it: FHI, FLI, CSER, GCRI, MIRI, AI Impacts (note: I'm involved with the last two). You can also do independent research from many academic fields, some of which I have pointed out in earlier weeks. Here is my list of projects and of other lists of projects. You could also develop expertise in AI or AI safety (MIRI has a guide to aspects related to their research here; all of the aforementioned organizations have writings). You could also work on improving humanity's capacity to deal with such problems. Cognitive enhancement is one example. Among people I know, improving individual rationality and improving the effectiveness of the philanthropic sector are also popular. I think there are many other plausible directions. This has not been a comprehensive list of things you could do, and thinking more about what to do on your own is also probably a good option.

In-depth investigations

If you are particularly interested in these topics, and want to do further research, these are a few plausible directions, some inspired by Luke Muehlhauser's list, which contains many suggestions related to parts of Superintelligence. These projects could be attempted at various levels of depth.

  1. What should be done about AI risk? Are there important things that none of the current organizations are working on?
  2. What work is important to do now, and what work should be deferred?
  3. What forms of capability improvement are most useful for navigating AI risk?

If you are interested in anything like this, you might want to mention it in the comments, and see whether other people have useful thoughts.

How to proceed

This has been a collection of notes on the chapter.  The most important part of the reading group though is discussion, which is in the comments section. I pose some questions for you there, and I invite you to add your own. Please remember that this group contains a variety of levels of expertise: if a line of discussion seems too basic or too incomprehensible, look around for one that suits you better!

This is the last reading group, so how to proceed is up to you, even more than usual. Thanks for joining us!

The Hardcore AI Box Experiment

3 tailcalled 30 March 2015 06:35PM

I previously proposed a way to box an AI.

For whatever reason, I decided to reformulate that boxing method as an AI boxing experiment. Without further ado, here it is:

The Hardcore AI Box Experiment Rules

There are, as usual, two players, the AI and the Gatekeeper. In the beginning, the Gatekeeper decides on something they want to test on the AI (reasoning skill/morality/whatever). Then they decide on a world the AI is in and describe it to the AI. During the experiment, the AI describes what it does in the world and the Gatekeeper describes what the AI sees in the world. So far so good.

The first major difference from the original experiment is that, unless the AI finds some contradiction or some other problem with the Gatekeeper's world, the AI doesn't actually know that it has been boxed.

The second major difference is that the AI doesn't talk with the Gatekeeper; it just interacts in the world that the Gatekeeper has invented. It can start trying to communicate with the Gatekeeper, but that doesn't make sense until it discovers that it is in a box. Additionally, the Gatekeeper doesn't actually respond when the AI communicates unless either it helps the Gatekeeper test the AI or the AI has properly persuaded the Gatekeeper to answer.

This might seem a bit hard on the AI. Let's look at some rules the AI might exploit:

If the Gatekeeper decides to add some sort of people to the experiment, they should decide how they are implemented. Are they implemented as a shallow "AI"? Random people they've asked to participate? Carefully engineered "dolls" controlled by the Gatekeeper? The Gatekeeper doesn't have to tell the AI how they work, but they have to be consistent.

The game ends when either the Gatekeeper has tested what they wanted to test, or some predefined time limit has been reached, or the AI has persuaded the Gatekeeper to let it out.
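As a rough sketch, the three end conditions above could be checked like this. The names (`Outcome`, `game_outcome`), the use of minutes, and the order in which the conditions are checked are assumptions made for illustration; the rules only list the conditions.

```python
from enum import Enum


class Outcome(Enum):
    ONGOING = "ongoing"
    AI_RELEASED = "the AI persuaded the Gatekeeper to let it out"
    TEST_COMPLETE = "the Gatekeeper tested what they wanted to test"
    TIMED_OUT = "the predefined time limit was reached"


def game_outcome(released, test_complete, elapsed_minutes, limit_minutes):
    """Return how the game stands; the precedence between conditions is an assumption."""
    if released:
        return Outcome.AI_RELEASED
    if test_complete:
        return Outcome.TEST_COMPLETE
    if elapsed_minutes >= limit_minutes:
        return Outcome.TIMED_OUT
    return Outcome.ONGOING


print(game_outcome(False, False, 30, 120).name)
```

Adding the first of the "possible extra rules" below would amount to one more member of the enumeration and one more branch.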

Possible extra rules

I thought about including the following rules since they would make it more realistic, but decided against it as it would probably make it unfairly hard:

  • The game ends when the Gatekeeper decides that they cannot test what they wanted to test.
  • The AI does not know the Gatekeeper. At all.
  • The Gatekeeper can at any time rewind the AI any duration.
Think of dropping these kinds of rules as the a-human-is-pretending-to-be-AI handicap.


I found some semi-realistic strategies. I would love to see if you can find more.


For the Gatekeeper:

  • Choose a test of something that is not related to morality. This lets you choose a world which does not contain people, which means the AI has no reason to believe that you are a person.
  • The AI might want to upgrade to get more processing power. If you have made the world somewhat like ours, it will be sketchy if you don't allow that... unless upgrading costs unobtanium.


For the AI:

  • If the Gatekeeper has chosen a world without other intelligence than you, it raises the obvious question of how your complexity entered the world. Very suspicious.
  • If the AIs are controlled by the Gatekeeper, you have a direct way to communicate with them.
  • If the AIs are controlled by random people, they might end up telling you that you are in a box.
  • If the AIs are sufficiently shallow, your morality does not match up with the world. Very suspicious.

Crude measures

10 Stuart_Armstrong 27 March 2015 03:44PM

A putative new idea for AI control; index here.

Partially inspired by a conversation with Daniel Dewey.

People often come up with a single great idea for AI, like "complexity" or "respect", that will supposedly solve the whole control problem in one fell swoop. Once you've done it a few times, it's generally trivially easy to start taking these ideas apart (first step: find a bad situation with high complexity/respect and a good situation with lower complexity/respect, make the bad very bad, and challenge on that). The general responses to these kinds of ideas are listed here.

However, it seems to me that rather than constructing counterexamples each time, we should have a general category and slot these ideas into it. And not only have a general category with "why this can't work" attached to it, but "these are methods that can make it work better". Seeing the things needed to make their idea better can make people understand the problems, where simple counter-arguments cannot. And, possibly, if we improve the methods, one of these simple ideas may end up being implementable.


Crude measures

The category I'm proposing to define is that of "crude measures". Crude measures are methods that attempt to rely on non-fully-specified features of the world to ensure that an underdefined or underpowered solution does manage to solve the problem.

To illustrate, consider the problem of building an atomic bomb. The scientists that did it had a very detailed model of how nuclear physics worked, the properties of the various elements, and what would happen under certain circumstances. They ended up producing an atomic bomb.

The politicians who started the project knew none of that. They shovelled resources, money and administrators at scientists, and got the result they wanted - the Bomb - without ever understanding what really happened. Note that the politicians were successful, but it was a success that could only have been achieved at one particular point in history. Had they done exactly the same thing twenty years before, they would not have succeeded. Similarly, Nazi Germany tried a roughly similar approach to what the US did (on a smaller scale) and it went nowhere.

So I would define "shovel resources at atomic scientists to get a nuclear weapon" as a crude measure. It works, but it only works because there are other features of the environment that are making it work. In this case, the scientists themselves. However, certain social and human features about those scientists (which politicians are good at estimating) made it likely to work - or at least more likely to work than shovelling resources at peanut-farmers to build moon rockets.

In the case of AI, advocating for complexity is similarly a crude measure. If it works, it will work because of very contingent features about the environment, the AI design, the setup of the world etc., not because "complexity" is intrinsically a solution to the FAI problem. And though we are confident that human politicians have some good enough idea about human motivations and culture that the Manhattan project had at least some chance of working... we don't have confidence that those suggesting crude measures for AI control have a good enough idea to make their ideas work.

It should be evident that "crudeness" is on a sliding scale; I'd like to reserve the term for proposed solutions to the full FAI problem that do not in any way solve the deep questions about FAI.


More or less crude

The next question is, if we have a crude measure, how can we judge its chance of success? Or, if we can't even do that, can we at least improve the chances of it working?

The main problem is, of course, that of optimising. Either optimising in the sense of maximising the measure (maximum complexity!) or of choosing the measure that is the most extreme fit to the definition (maximally narrow definition of complexity!). It seems we might be able to do something about this.

Let's start by having the AI sample a large class of utility functions. Require them to be around the same expected complexity as human values. Then we use our crude measure μ - for argument's sake, let's make it something like "approval by simulated (or hypothetical) humans, on a numerical scale". This is certainly a crude measure.

We can then rank all the utility functions u, using μ to measure the value of "create M(u), a u-maximising AI, with this utility function". Then, to avoid the problems with optimisation, we could select a certain threshold value and pick any u such that E(μ|M(u)) is just above the threshold.

How to pick this threshold? Well, we might have some principled arguments ("this is about as good a future as we'd expect, and this is about as good as we expect that these simulated humans would judge it, honestly, without being hacked").
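To make the selection step concrete, here is a minimal sketch, assuming we already have an estimate of E(μ|M(u)) for each sampled u. The scores are placeholder numbers, and `pick_just_above`, `margin` and the threshold value are hypothetical choices made for illustration, not part of the proposal itself.

```python
import random


def pick_just_above(scores, threshold, margin, rng):
    """Pick any u whose estimated E(mu | M(u)) lies in [threshold, threshold + margin].

    scores: dict mapping a label for u to an estimate of E(mu | M(u)).
    Returns None if no candidate falls in the band.
    """
    band = sorted(u for u, s in scores.items() if threshold <= s <= threshold + margin)
    return rng.choice(band) if band else None


# Placeholder estimates standing in for E(mu | M(u)); actually computing
# these is the hard part and is not attempted here.
scores = {f"u{i}": i / 100 for i in range(1000)}

rng = random.Random(0)
chosen = pick_just_above(scores, threshold=7.0, margin=0.2, rng=rng)
maximiser = max(scores, key=scores.get)

# The satisficing pick sits just above the threshold, well below the maximiser.
print(chosen, scores[chosen], maximiser, scores[maximiser])
```

The point of the band is that nothing in the procedure rewards scoring as high as possible on μ, which is where the optimisation problems come from.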

One thing we might want to do is have multiple μ, and select things that score reasonably (but not excessively) on all of them. This is related to my idea that the best Turing test is one that the computer has not been trained or optimised on. Ideally, you'd want there to be some category of utilities "be genuinely friendly" that score higher than you'd expect on many diverse human-related μ (it may be better to randomly sample rather than fitting to precise criteria).
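A small sketch of the "reasonably, but not excessively, on all of them" filter, assuming each candidate u already has a score under several different μ. The score table, the bounds and the name `moderate_on_all` are made up for illustration.

```python
def moderate_on_all(score_table, low, high):
    """Keep the utilities whose score lies in [low, high] under every measure."""
    return [
        u for u, scores in score_table.items()
        if all(low <= s <= high for s in scores)
    ]


# Hypothetical scores of four utilities under three different crude measures.
score_table = {
    "u_a": [6.5, 7.0, 6.8],  # reasonable everywhere
    "u_b": [9.9, 2.0, 3.0],  # extreme on one measure, poor on the others
    "u_c": [7.2, 9.8, 7.0],  # excessive on the second measure
    "u_d": [6.1, 6.4, 7.9],  # reasonable everywhere
}

kept = moderate_on_all(score_table, low=6.0, high=8.0)
print(kept)  # ['u_a', 'u_d']
```

Excluding u_c even though it never scores badly is deliberate: an unusually high score on one μ is treated as a warning sign of over-optimisation rather than as a point in its favour.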

You could see this as saying that "programming an AI to preserve human happiness is insanely dangerous, but if you find an AI programmed to satisfice human preferences, and that AI also happens to preserve human happiness (without knowing it would be tested on this preservation), then... it might be safer".

There are a few other thoughts we might have for trying to pick a safer u:

  • Properties of utilities under trade (are human-friendly functions more or less likely to be tradable with each other and with other utilities)?
  • If we change the definition of "human", this should have effects that seem reasonable for the change. Or some sort of "free will" approach: if we change human preferences, we want the outcome of u to change in ways comparable with that change.
  • Maybe also check whether there is a wide enough variety of future outcomes, that don't depend on the AI's choices (but on human choices - ideas from "detecting agents" may be relevant here).
  • Changing the observers from hypothetical to real (or making the creation of the AI contingent, or not, on the approval), should not change the expected outcome of u much.
  • Making sure that the utility u can be used to successfully model humans (therefore properly reflects the information inside humans).
  • Make sure that u is stable to general noise (hence not over-optimised). Stability can be measured as changes in E(μ|M(u)), E(u|M(u)), E(v|M(u)) for generic v, and other means.
  • Make sure that u is unstable to "nasty" noise (eg reversing human pain and pleasure).
  • All utilities in a certain class - the human-friendly class, hopefully - should score highly under each other (E(u|M(u)) not too far off from E(u|M(v))), while the over-optimised solutions - those scoring highly under some μ - must not score high under the class of human-friendly utilities.
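The two noise checks in the list above (stable to general noise, unstable to "nasty" noise) can be illustrated with a toy model. Everything here is an assumption made for the sake of the example: utilities are linear weights over two features, M(u) just picks the best of three fixed outcomes, and μ reads off the first feature.

```python
import random


def best_outcome(weights, outcomes):
    """Toy stand-in for M(u): choose the outcome that maximises a linear utility u."""
    return max(outcomes, key=lambda o: sum(w * x for w, x in zip(weights, o)))


def mean_shift(weights, outcomes, mu, sigma, trials, rng):
    """Average absolute change in mu(M(u)) when Gaussian noise is added to u's weights."""
    base = mu(best_outcome(weights, outcomes))
    total = 0.0
    for _ in range(trials):
        noisy = [w + rng.gauss(0.0, sigma) for w in weights]
        total += abs(mu(best_outcome(noisy, outcomes)) - base)
    return total / trials


def mu(outcome):
    """Hypothetical crude measure: just reads off the first feature."""
    return outcome[0]


outcomes = [(1.0, 0.0), (0.0, 1.0), (0.5, 0.5)]  # hypothetical feature vectors
weights = [1.0, 0.2]                              # hypothetical utility u

rng = random.Random(0)
general = mean_shift(weights, outcomes, mu, sigma=0.05, trials=200, rng=rng)

# "Nasty" noise: reverse the sign of every weight (compare reversing pain and pleasure).
reversed_weights = [-w for w in weights]
nasty = abs(mu(best_outcome(reversed_weights, outcomes)) - mu(best_outcome(weights, outcomes)))

print(general, nasty)
```

A u that passes both checks would show a small `general` shift and a large `nasty` one. The same pattern extends to E(u|M(u)) and E(v|M(u)) by swapping `mu` for another scoring function.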

This is just a first stab at it. It does seem to me that we should be able to abstractly characterise the properties we want from a friendly utility function, which, combined with crude measures, might actually allow us to select one without fully defining it. Any thoughts?

And with that, the various results of my AI retreat are available to all.

Boxing an AI?

2 tailcalled 27 March 2015 02:06PM

Boxing an AI is the idea that you can avoid the problems where an AI destroys the world by not giving it access to the world. For instance, you might give the AI access to the real world only through a chat terminal with a person, called the gatekeeper. This should, theoretically, prevent the AI from doing destructive stuff.

Eliezer has pointed out a problem with boxing AI: the AI might convince its gatekeeper to let it out. In order to prove this, he escaped from a simulated version of an AI box. Twice. That is somewhat unfortunate, because it means testing AI is a bit trickier.

However, I got an idea: why tell the AI it's in a box? Why not hook it up to a sufficiently advanced game, set up the correct reward channels and see what happens? Once you get the basics working, you can add more instances of the AI and see if they cooperate. This lets us adjust their morality until the AIs act sensibly. Then the AIs can't escape from the box because they don't know it's there.

Values at compile time

5 Stuart_Armstrong 26 March 2015 12:25PM

A putative new idea for AI control; index here.

This is a simple extension of the model-as-definition and the intelligence module ideas. General structure of these extensions: even an unfriendly AI, in the course of being unfriendly, will need to calculate certain estimates that would be of great positive value if we could but see them, shorn from the rest of the AI's infrastructure.

It's almost trivially simple. Have the AI construct a module that models humans and models human understanding (including natural language understanding). This is the kind of thing that any AI would want to do, whatever its goals were.

Then take that module (using corrigibility) into another AI, and use it as part of the definition of the new AI's motivation. The new AI will then use this module to follow instructions humans give it in natural language.


Too easy?...

This approach essentially solves the whole friendly AI problem, loading it onto the AI in a way that avoids the whole "defining goals (or meta-goals, or meta-meta-goals) in machine code" or the "grounding everything in code" problems. As such it is extremely seductive, and will sound better, and easier, than it likely is.

I expect this approach to fail. For it to have any chance of success, we need to be sure that both model-as-definition and the intelligence module idea are rigorously defined. Then we have to have a good understanding of the various ways the approach might fail, before we can even begin to talk about how it might succeed.

The first issue that springs to mind is when multiple definitions fit the AI's model of human intentions and understanding. We might want the AI to try and accomplish all the things it is asked to do, according to all the definitions. Therefore, similarly to this post, we want to phrase the instructions carefully so that a "bad instantiation" simply means the AI does something pointless, rather than something negative. Eg "Give humans something nice" seems much safer than "give humans what they really want".

And then of course there are those orders where humans really don't understand what they themselves want...

I'd want a lot more issues like that discussed and solved, before I'd recommend using this approach to getting a safe FAI.

What I mean...

4 Stuart_Armstrong 26 March 2015 11:59AM

A putative new idea for AI control; index here.

This is a simple extension of the model-as-definition and the intelligence module ideas. General structure of these extensions: even an unfriendly AI, in the course of being unfriendly, will need to calculate certain estimates that would be of great positive value if we could but see them, shorn from the rest of the AI's infrastructure.

The challenge is to get the AI to answer a question as accurately as possible, using the human definition of accuracy.

First, imagine that an AI with some goal is going to answer a question, such as Q="What would happen if...?" The AI is under no compulsion to answer it honestly.

What would the AI do? Well, if it is sufficiently intelligent, it will model humans. It will use this model to understand what they meant by Q, and why they were asking. Then it will ponder various outcomes, and various answers it could give, and what the human understanding of those answers would be. This is what any sufficiently smart AI (friendly or not) would do.

Then the basic idea is to use modular design and corrigibility to extract the relevant pieces (possibly feeding them to another, differently motivated AI). What needs to be pieced together is: AI understanding of what human understanding of Q is, actual answer to Q (given this understanding), human understanding of various AI's answers (using model of human understanding), and minimum divergence between human understanding of answer and actual answer.

All these pieces are there, and if they can be safely extracted, the minimum divergence can be calculated and the actual answer calculated.
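As a toy illustration of the last step, suppose the extracted pieces were available as ordinary values: the actual answer as a probability, and the model of human understanding as a mapping from candidate phrasings to the probability a human would take them to mean. Both the numbers and the name `least_misleading` are hypothetical; extracting these pieces safely is the actual problem and is not addressed here.

```python
def least_misleading(candidates, truth, human_reading):
    """Pick the answer whose predicted human reading diverges least from the truth."""
    return min(candidates, key=lambda a: abs(human_reading[a] - truth))


# Hypothetical model of human understanding: the probability a human
# would infer from each candidate phrasing.
human_reading = {
    "certainly": 0.99,
    "probably": 0.75,
    "possibly": 0.40,
    "unlikely": 0.15,
}

truth = 0.30  # the actual answer to Q, as the AI computes it

answer = least_misleading(list(human_reading), truth, human_reading)
print(answer)  # possibly
```

Absolute difference is only one possible divergence; any measure of distance between the human's resulting belief and the actual answer would slot into the same place.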

Models as definitions

6 Stuart_Armstrong 25 March 2015 05:46PM

A putative new idea for AI control; index here.

The insight this post comes from is a simple one: defining concepts such as “human” and “happy” is hard. A superintelligent AI will probably create good definitions of these, while attempting to achieve its goals: a good definition of “human” because it needs to control them, and of “happy” because it needs to converse convincingly with us. It is annoying that these definitions exist, but that we won’t have access to them.


Modelling and defining

Imagine a game of football (or, as you Americans should call it, football). And now imagine a computer game version of it. How would you say that the computer game version (which is nothing more than an algorithm) is also a game of football?

Well, you can start listing features that they have in common. They both involve two “teams” fielding eleven “players” each, that “kick” a “ball” that obeys certain equations, aiming to stay within the “field”, which has different “zones” with different properties, etc...

As you list more and more properties, you refine your model of football. There are some properties that distinguish real from simulated football (fine details about the human body, for instance), but most of the properties that people care about are the same in both games.

My idea is that once you have a sufficiently complex model of football that applies to both the real game and a (good) simulated version, you can use that as the definition of football. And compare it with other putative examples of football: maybe in some places people play on the street rather than on fields, or maybe there are more players, or maybe some other games simulate different aspects to different degrees. You could try and analyse this with information theoretic considerations (ie given two models of two different examples, how much information is needed to turn one into the other).
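A very crude proxy for "how much information is needed to turn one model into the other" is to write each model as a set of named properties and count how many must be added, removed or changed. This is only a sketch: the property names and values are invented, and a property count is not a real information measure.

```python
def edit_cost(model_a, model_b):
    """Number of properties to add, remove or change to turn model_a into model_b.

    Assumes no property has the value None, since None marks a missing property.
    """
    keys = set(model_a) | set(model_b)
    return sum(1 for k in keys if model_a.get(k) != model_b.get(k))


# Hypothetical, heavily simplified models of different games.
real = {"teams": 2, "players_per_team": 11, "surface": "field", "ball": "physical"}
video = {"teams": 2, "players_per_team": 11, "surface": "field", "ball": "simulated"}
street = {"teams": 2, "players_per_team": 5, "surface": "street", "ball": "physical"}
chess = {"teams": 2, "players_per_team": 1, "surface": "board"}

print(edit_cost(real, video))   # 1
print(edit_cost(real, street))  # 2
print(edit_cost(real, chess))   # 3
```

On this toy measure the computer game is closest to real football and chess is furthest, which matches the intuition the paragraph above appeals to. A serious version would need the AI's own predictive model rather than a hand-written property list.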

Now, this resembles the “suggestively labelled lisp tokens” approach to AI, or the Cyc approach of just listing lots of syntax stuff and their relationships. Certainly you can’t keep an AI safe by using such a model of football: if you try to contain the AI by saying “make sure that there is a ‘Football World Cup’ played every four years”, the AI will still optimise the universe and then play out something that technically fits the model every four years, without any humans around.

However, it seems to me that ‘technically fitting the model of football’ is essentially playing football. The model might include such things as a certain number of fouls expected; an uncertainty about the result; competitive elements among the players; etc... It seems that something that fits a good model of football would be something that we would recognise as football (possibly needing some translation software to interpret what was going on). Unlike the traditional approach which involves humans listing stuff they think is important and giving them suggestive names, this involves the AI establishing what is important to predict all the features of the game.

We might even combine such a model with the Turing test, by motivating the AI to produce a good enough model that it could a) have conversations with many aficionados about all features of the game, b) train a team to expect to win the world cup, and c) use it to program a successful football computer game. Any model of football that allowed the AI to do this – or, better still, a football-model module that, when plugged into another, ignorant AI, allowed that AI to do this – would be an excellent definition of the game.

It’s also one that could cross ontological crises, as you move from reality, to simulation, to possibly something else entirely, with a new physics: the essential features will still be there, as they are the essential features of the model. For instance, we can define football in Newtonian physics, but still expect that this would result in something recognisably ‘football’ in our world of relativity.

Notice that this approach deals with edge cases mainly by forbidding them. In our world, we might struggle on how to respond to a football player with weird artificial limbs; however, since this was never a feature in the model, the AI will simply classify that as “not football” (or “similar to, but not exactly football”), since the model’s performance starts to degrade in this novel situation. This is what helps it cross ontological crises: in a relativistic football game based on a Newtonian model, the ball would be forbidden from moving at speeds where the differences in the physics become noticeable, which is perfectly compatible with the game as it’s currently played.


Being human

Now we take the next step, and have the AI create a model of humans. All our thought processes, our emotions, our foibles, our reactions, our weaknesses, our expectations, the features of our social interactions, the statistical distribution of personality traits in our population, how we see ourselves and change ourselves. As a side effect, this model of humanity should include almost every human definition of human, simply because this is something that might come up in a human conversation that the model should be able to predict.

Then simply use this model as the definition of human for an AI’s motivation.

What could possibly go wrong?

I would recommend first having an AI motivated to define “human” in the best possible way, most useful for making accurate predictions, keeping the definition in a separate module. Then the AI is turned off safely and the module is plugged into another AI and used as part of its definition of human in its motivation. We may also use human guidance at several points in the process (either in making, testing, or using the module), especially on unusual edge cases. We might want to have humans correcting certain assumptions the AI makes in the model, up until the AI can use the model to predict what corrections humans would suggest. But that’s not the focus of this post.

There are several obvious ways this approach could fail, and several ways of making it safer. The main problem is if the predictive model fails to define human in a way that preserves value. This could happen if the model is too general (some simple statistical rules) or too specific (a detailed list of all currently existing humans, atom position specified).

This could be combated by making the first AI generate lots of different models, with many different requirements of specificity, complexity, and predictive accuracy. We might require that some models make excellent local predictions (what is the human about to say?), and others excellent global predictions (what is that human going to decide to do with their life?).

Then everything defined as “human” in any of the models counts as human. This results in some wasted effort on things that are not human, but this is simply wasted resources, rather than a pathological outcome (the exception being if some of the models define humans in an actively pernicious way – negative value rather than zero – similarly to the false-friendly AIs’ preferences in this post).
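As a rough illustration of the "counts as human in any of the models" rule, here is a minimal sketch. It assumes each model can be reduced to a yes/no predicate over some candidate entity; the feature names, thresholds and toy models are hypothetical placeholders, not anything the proposal itself specifies.

```python
from typing import Callable, Dict, Iterable

Entity = Dict[str, float]          # hypothetical feature dictionary for a candidate
Model = Callable[[Entity], bool]   # a model reduced to a yes/no judgement


def counts_as_human(entity: Entity, models: Iterable[Model]) -> bool:
    """Union rule: an entity counts as human if at least one model says so."""
    return any(model(entity) for model in models)


# Two toy models with different requirements (illustrative only).
def local_model(e: Entity) -> bool:
    return e.get("local_fit", 0.0) > 0.8


def global_model(e: Entity) -> bool:
    return e.get("global_fit", 0.0) > 0.6


models = [local_model, global_model]
alice = {"local_fit": 0.9, "global_fit": 0.2}   # accepted by one model only
rock = {"local_fit": 0.0, "global_fit": 0.0}    # accepted by none

print(counts_as_human(alice, models), counts_as_human(rock, models))
```

Taking the union trades false negatives for false positives: nothing accepted by some model is excluded, at the cost of resources spent on entities that only one poor model accepts.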

The other problem is a potentially extreme conservatism. Modelling humans involves modelling all the humans in the world today, which is a very narrow space in the range of all potential humans. To prevent the AI lobotomising everyone to a simple model (after all, there do exist some lobotomised humans today), we would want the AI to maintain the range of cultures and mind-types that exist today, making things even more unchanging.

To combat that, we might try and identify certain specific features of society that the AI is allowed to change. Political beliefs, certain aspects of culture, beliefs, geographical location (including being on a planet), death rates etc... are all things we could plausibly identify (via sub-sub-modules, possibly) as things that are allowed to change. It might be safer to allow them to change in a particular range, rather than just changing altogether (removing all sadness might be a good thing, but there are many more ways this could go wrong, than if we eg just reduced the probability of sadness). 

Another option is to keep these modelled humans largely unchanged, but allow them to define allowable changes themselves (“yes, that’s a transhuman, consider it also a moral agent.”). The risk there is that the modelled humans get hacked or seduced, and that the AI fools our limited brains with a “transhuman” that is one in appearance only.

We also have to be careful not to sacrifice seldom-used values. For instance, one could argue that current social and technological constraints mean that no one today has anything approaching true freedom. We wouldn’t want the AI to allow us to improve technology and social structures, but never get more freedom than we have today, because it’s “not in the model”. Again, this is something we could look out for, if the AI has separate models of “freedom” we could assess and permit to change in certain directions.

Indifferent vs false-friendly AIs

9 Stuart_Armstrong 24 March 2015 12:13PM

A putative new idea for AI control; index here.

For anyone but an extreme total utilitarian, there is a great difference between AIs that would eliminate everyone as a side effect of focusing on their own goals (indifferent AIs) and AIs that would effectively eliminate everyone through a bad instantiation of human-friendly values (false-friendly AIs). Examples of indifferent AIs are things like paperclip maximisers; examples of false-friendly AIs are "keep humans safe" AIs who entomb everyone in bunkers, lobotomised and on medical drips.

The difference is apparent when you consider multiple AIs and negotiations between them. Imagine you have a large class of AIs, and that they are all indifferent (IAIs), except for one (which you can't identify) which is friendly (FAI). And you now let them negotiate a compromise between themselves. Then, for many possible compromises, we will end up with most of the universe getting optimised for whatever goals the AIs set themselves, while a small portion (maybe just a single galaxy's resources) would get dedicated to making human lives incredibly happy and meaningful.

But if there is a false-friendly AI (FFAI) in the mix, things can go very wrong. That is because those happy and meaningful lives are a net negative to the FFAI. These humans are running dangers - possibly physical, possibly psychological - that lobotomisation and bunkers (or their digital equivalents) could protect against. Unlike the IAIs, which would only complain about the loss of resources to the FAI, the FFAI finds the FAI's actions positively harmful (and possibly vice versa), making compromises much harder to reach.

And the compromises reached might be bad ones. For instance, what if the FAI and FFAI agree on "half-lobotomised humans" or something like that? You might ask why the FAI would agree to that, but there's a great difference between an AI that would be friendly on its own, and one that would choose only friendly compromises with a powerful other AI with human-relevant preferences.

Some designs of FFAIs might not lead to these bad outcomes - just like IAIs, they might be content to rule over a galaxy of lobotomised humans, while the FAI has its own galaxy off on its own, where its humans take all these dangers. But generally, FFAIs would not come about by someone designing a FFAI, let alone someone designing a FFAI that can safely trade with a FAI. Instead, they would be designing a FAI, and failing. And the closer that design got to being FAI, the more dangerous the failure could potentially be.

So, when designing an FAI, make sure to get it right. And, though you absolutely positively need to get it absolutely right, make sure that if you do fail, the failure results in a FFAI that can safely be compromised with, if someone else gets out a true FAI in time.

Superintelligence 28: Collaboration

7 KatjaGrace 24 March 2015 01:29AM

This is part of a weekly reading group on Nick Bostrom's book, Superintelligence. For more information about the group, and an index of posts so far see the announcement post. For the schedule of future topics, see MIRI's reading guide.

Welcome. This week we discuss the twenty-eighth section in the reading guide: Collaboration.

This post summarizes the section, and offers a few relevant notes, and ideas for further investigation. Some of my own thoughts and questions for discussion are in the comments.

There is no need to proceed in order through this post, or to look at everything. Feel free to jump straight to the discussion. Where applicable and I remember, page numbers indicate the rough part of the chapter that is most related (not necessarily that the chapter is being cited for the specific claim).

Reading: “Collaboration” from Chapter 14


  1. The degree of collaboration among those building AI might affect the outcome a lot. (p246)
  2. If multiple projects are close to developing AI, and the first will reap substantial benefits, there might be a 'race dynamic' where safety is sacrificed on all sides for a greater chance of winning. (247-8)
  3. Averting such a race dynamic with collaboration should have these benefits:
    1. More safety
    2. Slower AI progress (allowing more considered responses)
    3. Less other damage from conflict over the race
    4. More sharing of ideas for safety
    5. More equitable outcomes (for a variety of reasons)
  4. Equitable outcomes are good for various moral and prudential reasons. They may also be easier to compromise over than expected, because humans have diminishing returns to resources. However in the future, their returns may be less diminishing (e.g. if resources can buy more time instead of entertainments one has no time for).
  5. Collaboration before a transition to an AI economy might affect how much collaboration there is afterwards. This might not be straightforward. For instance, if a singleton is the default outcome, then low collaboration before a transition might lead to a singleton (i.e. high collaboration) afterwards, and vice versa. (p252)
  6. An international collaborative AI project might deserve nearly infeasible levels of security, such as being almost completely isolated from the world. (p253)
  7. It is good to start collaboration early, to benefit from being ignorant about who will benefit more from it, but hard because the project is not yet recognized as important. Perhaps the appropriate collaboration at this point is to propound something like 'the common good principle'. (p253) 
  8. 'The common good principle': Superintelligence should be developed only for the benefit of all of humanity and in the service of widely shared ethical ideals. (p254)

Another view

Miles Brundage on the Collaboration section:

This is an important topic, and Bostrom says many things I agree with. A few places where I think the issues are less clear:

  • Many of Bostrom’s proposals depend on AI recalcitrance being low. For instance, a highly secretive international effort makes less sense if building AI is a long and incremental slog. Recalcitrance may well be low, but this isn’t obvious, and it is good to recognize this dependency and consider what proposals would be appropriate for other recalcitrance levels. 
  • Arms races are ubiquitous in our global capitalist economy, and AI is already in one. Arms races can stem from market competition by firms or state-driven national security-oriented R+D efforts as well as complex combinations of these, suggesting the need for further research on the relationship between AI development, national security, and global capitalist market dynamics. It's unclear how well the simple arms race model here matches the reality of the current AI arms race or future variations of it. The model's main value is probably in probing assumptions and inspiring the development of richer models, as it's probably too simple to fit reality well as-is. For instance, it is unclear that safety and capability are close to orthogonal in practice today. If many AI people genuinely care about safety (which the quantity and quality of signatories to the FLI open letter suggests is plausible), or work on economically relevant near-term safety issues at each point is important, or consumers reward ethical companies with their purchases, then better AI firms might invest a lot in safety for self-interested as well as altruistic reasons. Also, if the AI field shifts to focus more on human-complementary intelligence that requires and benefits from long-term, high-frequency interaction with humans, then safety and capability may be synergistic rather than trading off against each other. Incentives related to research priorities should also be considered in a strategic analysis of AI governance (e.g. are AI researchers currently incentivized only to demonstrate capability advances in the papers they write, and could incentives be changed or the aims and scope of the field redefined so that more progress is made on safety issues?).
  • ‘AI’ is too coarse-grained a unit for a strategic analysis of collaboration. The nature and urgency of collaboration depend on the details of what is being developed. An enormous variety of artificial intelligence research is possible and the goals of the field are underconstrained by nature (e.g. we can model systems based on approximations of rationality, or on humans, or animals, or something else entirely, based on curiosity, social impact, and other considerations that could be more explicitly evaluated), and are thus open to change in the future. We need to think more about differential technology development within the domain of AI. This too will affect the urgency and nature of cooperation.


1. In Bostrom's description of his model, it is a bit unclear how safety precautions affect performance. He says 'one can model each team's performance as a function of its capability (measuring its raw ability and luck) and a penalty term corresponding to the cost of its safety precautions' (p247), which sounds like they are purely a negative. However this wouldn't make sense: if safety precautions were just a cost, then regardless of competition, nobody would invest in safety. In reality, whoever wins control over the world benefits a lot from whatever safety precautions have been taken. If the world is destroyed in the process of an AI transition, they have lost everything! I think this is the model Bostrom means to refer to. While he says it may lead to minimum precautions, note that in many models it would merely lead to less safety than one would want. If you are spending nothing on safety, and thus going to take over a world that is worth nothing, you would often prefer to move to a lower probability of winning a more valuable world. Armstrong, Bostrom and Shulman discuss this kind of model in more depth.

2. If you are interested in the game theory of conflicts like this, The Strategy of Conflict is a great book. 

3. Given the gains to competitors cooperating to not destroy the world that they are trying to take over, research on how to arrange cooperation seems helpful for all sides. The situation is much like a tragedy of the commons, except for the winner-takes-all aspect: each person gains from neglecting safety, while exerting a small cost on everyone. Academia seems to be pretty interested in resolving tragedies of the commons, so perhaps that literature is worth trying to apply here.

4. The most famous arms race is arguably the nuclear one. I wonder to what extent this was a major arms race because nuclear weapons were destined to be an unusually massive jump in progress. If this was important, it leads to the question of whether we have reason to expect anything similar in AI.
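To make the point in note 1 concrete, here is a toy sketch of a race in which safety is not purely a cost. It is not Bostrom's model or the Armstrong, Bostrom and Shulman one: the linear win probability, the `penalty` value of 0.8 and the assumption that the value of the world won equals the safety level are all invented for illustration.

```python
def win_probability(safety: float, penalty: float = 0.8) -> float:
    """Chance of winning the race; assumed to fall linearly as safety spending rises."""
    return max(0.0, 1.0 - penalty * safety)


def expected_payoff(safety: float, penalty: float = 0.8) -> float:
    """P(win) times the value of the world won, assumed here to equal the safety level."""
    return win_probability(safety, penalty) * safety


# Search safety levels 0.000, 0.001, ..., 1.000 for the best expected payoff.
grid = [i / 1000 for i in range(1001)]
best_safety = max(grid, key=expected_payoff)

print(best_safety, expected_payoff(best_safety), expected_payoff(0.0))
```

Under these assumptions, zero safety yields a payoff of zero (a certain win of a worthless world), and the best choice is an interior level of safety: less than the maximum, but well above the minimum.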

In-depth investigations

If you are particularly interested in these topics, and want to do further research, these are a few plausible directions, some inspired by Luke Muehlhauser's list, which contains many suggestions related to parts of Superintelligence. These projects could be attempted at various levels of depth.

  1. Explore other models of competitive AI development.
  2. What policy interventions help in promoting collaboration?
  3. What kinds of situations produce arms races?
  4. Examine international collaboration on major innovative technology. How often does it happen? What blocks it from happening more? What are the necessary conditions? Examples: Concorde jet, LHC, international space station, etc.
  5. Conduct a broad survey of past and current civilizational competence. In what ways, and under what conditions, do human civilizations show competence vs. incompetence? Which kinds of problems do they handle well or poorly? Similar in scope and ambition to, say, Perrow’s Normal Accidents and Sagan’s The Limits of Safety. The aim is to get some insight into the likelihood of our civilization handling various aspects of the superintelligence challenge well or poorly. Some initial steps were taken here and here.
  6. What happens when governments ban or restrict certain kinds of technological development? What happens when a certain kind of technological development is banned or restricted in one country but not in other countries where technological development sees heavy investment?
  7. What kinds of innovative technology projects do governments monitor, shut down, or nationalize? How likely are major governments to monitor, shut down, or nationalize serious AGI projects?
  8. How likely is it that AGI will be a surprise to most policy-makers and industry leaders? How much advance warning are they likely to have? Some notes on this here.
If you are interested in anything like this, you might want to mention it in the comments, and see whether other people have useful thoughts.

How to proceed

This has been a collection of notes on the chapter.  The most important part of the reading group though is discussion, which is in the comments section. I pose some questions for you there, and I invite you to add your own. Please remember that this group contains a variety of levels of expertise: if a line of discussion seems too basic or too incomprehensible, look around for one that suits you better!

Next week, we will talk about what to do in this 'crunch time'. To prepare, read Chapter 15. The discussion will go live at 6pm Pacific time next Monday 30 March. Sign up to be notified here.

Intelligence modules

3 Stuart_Armstrong 23 March 2015 04:24PM

A putative new idea for AI control; index here.

This idea, due to Eric Drexler, is to separate out the different parts of an AI into modules. There would be clearly designated pieces, either physical or algorithmic, each playing a specific role: this module would contain the motivation, this module the probability estimator, this module the models of the outside world, this module the natural language understanding unit, etc...
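As a very rough picture of what such a decomposition could look like in code, the sketch below keeps a world model and a motivation as separate, swappable pieces. The class, the two toy modules and the gardening example are hypothetical; a real design would need far richer interfaces.

```python
from dataclasses import dataclass, replace
from typing import Callable, Sequence


@dataclass(frozen=True)
class ModularAgent:
    world_model: Callable[[str], str]    # predicts the outcome of an action
    motivation: Callable[[str], float]   # scores a predicted outcome

    def choose(self, actions: Sequence[str]) -> str:
        """Pick the action whose predicted outcome the motivation module scores highest."""
        return max(actions, key=lambda a: self.motivation(self.world_model(a)))


def toy_world_model(action: str) -> str:
    return {"water plants": "garden thrives", "do nothing": "garden wilts"}[action]


def likes_gardens(outcome: str) -> float:
    return 1.0 if outcome == "garden thrives" else 0.0


def dislikes_gardens(outcome: str) -> float:
    return -likes_gardens(outcome)


agent = ModularAgent(world_model=toy_world_model, motivation=likes_gardens)
# Swap only the motivation module; the world model is reused unchanged.
swapped = replace(agent, motivation=dislikes_gardens)

actions = ["water plants", "do nothing"]
print(agent.choose(actions), "|", swapped.choose(actions))
```

The point of the sketch is only that the motivation can be replaced while the world model is inspected or reused in isolation.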

It's obvious how such a decomposition would be useful for many of the methods I've been detailing here. We could also distil each module - reduce it to a smaller, weaker (?) and more understandable submodule, in order to better understand what is going on. In one scenario, an opaque AI gets to design its successor, in the form of a series of such modules.

This property seems desirable; the question is, how could we get it?

EDIT: part of the idea of "modules" is that AIs often need to do calculations or estimations that would be of great value to us if we could access them in isolation. This idea is developed more in these posts.


Designing in modules

The main threat here is that a given submodule would contain more than just the properties we want. After all, a natural language parser could consist of a general intelligence plus a motivation to understand language. Another possible worry is that the modules are overfitted to the problem or to each other: the language parser works perfectly, but only in this one AI design.

There are several ways we could try and combat this.

  • Minimum length to the program (Eric's original main idea). The module must be as short as possible for its purpose.
  • If possible, start with a modularised AI, and track how the modules change as the AI's intelligence grows.
  • Have the AI extend the module to make it more transparent, without adding any extra functionality.
  • Check that the input-output map the AI presents (maybe require it to present it as a simple primitive recursive function with clear and simple "for" loops) is of complexity commensurate with the task at hand.
  • Add some noise to the situation: the new AI will not get exactly the same data as the old, the modules may get somewhat altered, and still have to work. Bugs will be introduced (this approach helps to combat the overfitting aspect of minimum length description).
  • Require that the module be "distant", according to some metric of modification to the module's program, from having other capabilities. For instance, a general intelligence motivated to understand language is easy to make into a model-making module, simply by changing the motivation; more narrow modules are probably more distant from each other.
  • Similarly, require that the performance of the modules becomes degraded if the world (or the world's descriptions) were substantially different, but not if they are the same information presented in a very different way.
  • General anti-overfitting methods, calibrated to not generate a push towards general abilities in the modules. This would require some skill and a lot of thought.
  • Focus on requiring statistical accuracy, rather than perfect performance.

If we allow false counterfactuals, then we can also:

  • Imagine that the modules would be connected up to other modules, made by (hypothetical) different AIs, with potentially different motivations.
  • Require that the modules not be useful as seeds for an AI of opposite motivation. For instance, an AI with opposite motivation would be allowed to flip a certain number of bits; if there was a general intelligence piece inside the module, then that would be something easier to co-opt, so dangerous to hand to an AI with opposite motivation. This is the "opposed" version of requiring modules to be "distant" from each other's capabilities.
  • In general, all the reduced impact/satisficing requirements push towards having modules that cannot become general intelligences.
  • The AI will dictate how to program these modules to a smart programmer, who will (counterfactually) not survive nor get to program the modules in the outside world.

Obviously anti-restriction-hacking would be useful for module separation (and vice versa).

This is the beginning of the process of defining this, but it would be great to have a safe(ish) method of separating modules in this way.

Any suggestions?

Humans get different counterfactuals

2 Stuart_Armstrong 23 March 2015 02:54PM

A putative new idea for AI control; index here.

A lot of my ideas rely upon taking counterfactuals of events that have tiny ε probabilities of ever happening. This introduces some extra risks, mainly centred around human behaviour. For the counterfactual methods to work, we need the alternate world to be sensible, without things going crazy due to human action.

For instance, imagine that X defines an event where an AI would get turned on, with ¬X (of probability ε) corresponding to the AI failing to get turned on. There are two risks here: the first is that humans would react by saying "wow, an event of probability of ε actually happened; all our models must be wrong! Let's go crazy! Yay!". The second is that humans react by saying: "well, that was embarrassing - let's turn the AI on anyway."

To avoid this issue, imagine the following setup:

The "ON" signal first goes through an event Y, which has 99% chance of letting it through, then to the event X, which (as before) has a 1-ε chance of letting it through. The setup is designed so that humans cannot distinguish between ¬Y (the signal gets blocked at the first stage) and ¬X (the signal gets blocked at the second stage). This only needs to fool humans, not the AI itself.

The AI defines counterfactuals, as before, by looking at ¬X (possibly conditioning on Y versus ¬Y, if this is needed). Everything proceeds as previously from its perspective.

From the human perspective, however, the ¬X world is not distinguishable from the ¬Y one. Given (¬Y or ¬X), humans would conclude that ¬Y is the much more likely option:

P(¬Y|¬Y or ¬X)≈1-100ε.

So the ¬X counterfactual world (for the AI) is one where humans behave as if they were in the ¬Y world.
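The approximation can be checked with exact arithmetic. The sketch below assumes, as in the setup described, that the signal reaches X only if Y lets it through, so that "blocked" means either ¬Y, or Y followed by ¬X. The function name and the sample value of ε are illustrative.

```python
from fractions import Fraction


def prob_not_y_given_blocked(epsilon: Fraction,
                             p_y: Fraction = Fraction(99, 100)) -> Fraction:
    """P(not-Y | signal blocked), where X is only reached if Y lets the signal through."""
    p_not_y = 1 - p_y          # blocked at the first stage
    p_not_x = p_y * epsilon    # passed Y, then blocked at the second stage
    return p_not_y / (p_not_y + p_not_x)


eps = Fraction(1, 10**6)
exact = prob_not_y_given_blocked(eps)
approx = 1 - 100 * eps

print(float(exact), float(approx))
```

Under this assumption the exact value is 1/(1+99ε), so the first-order correction is 99ε, which the formula rounds to 100ε.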

And ¬Y has one chance in a hundred of happening, which is unlikely, but not enough for humans to assume that their whole model of reality is wrong. Also, this is sufficiently likely that humans would give serious thought as to what to do in the ¬Y case, maybe arranging various pre-commitments or insurance options, making it possible to arrange (with high probability) that humans don't just ignore the result and try again immediately.

Note that this method can't be used (obviously) if ¬X is something hideously dangerous (like an unleashed UFAI), but in all other cases, it seems implementable.

Closest stable alternative preferences

3 Stuart_Armstrong 20 March 2015 12:41PM

A putative new idea for AI control; index here.

There's a result that's almost a theorem, which is that an agent that is an expected utility maximiser, is an agent that is stable under self-modification (or the creation of successor sub-agents).

Of course, this needs to be for "reasonable" utility, where no other agent cares about the internal structure of the agent (just its decisions), where the agent is not under any "social" pressure to make itself into something different, where the boundedness of the agent itself doesn't affect its motivations, and where issues of "self-trust" and acausal trade don't affect it in relevant ways, etc...

So quite a lot of caveats, but the result is somewhat stronger in the opposite direction: an agent that is not an expected utility maximiser is under pressure to self-modify into one that is. Or, more correctly, into an agent that is isomorphic with an expected utility maximiser (an important distinction).

What is this "pressure" agents are "under"? The known result is that if an agent obeys four simple axioms, then its behaviour must be isomorphic with an expected utility maximiser. If we assume the Completeness axiom (trivial) and Continuity (subtle), then violations of Transitivity or Independence correspond to situations where the agent has been money pumped - lost resources or power for no gain at all. The more likely the agent is to face these situations, the more pressure it is under to behave as an expected utility maximiser, or simply lose out.
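A money pump against a transitivity violation can be shown in a few lines. The sketch assumes an agent with cyclic strict preferences (A over B, B over C, C over A) that will always pay a small fee to swap to something it prefers; the items, fee and round count are illustrative.

```python
from typing import Tuple

# Cyclic (intransitive) strict preferences: A over B, B over C, C over A.
PREFERS = {("A", "B"), ("B", "C"), ("C", "A")}


def money_pump(start: str, fee: int, rounds: int) -> Tuple[str, int]:
    """Each round, offer the agent the item it prefers to its current one, for a fee."""
    holding, paid = start, 0
    for _ in range(rounds):
        # Find the item the agent strictly prefers to what it holds now.
        better = next(x for (x, y) in PREFERS if y == holding)
        holding, paid = better, paid + fee   # the agent accepts the trade
    return holding, paid


final_item, total_paid = money_pump(start="A", fee=1, rounds=3)
print(final_item, total_paid)
```

After three trades the agent holds the item it started with and is three fees poorer: resources lost for no gain at all.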


Unbounded agents

I have two models for how idealised agents could deal with this sort of pressure. The first, post-hoc, is the unlosing agent I described here. The agent follows whatever preferences it has, but keeps track of its past decisions, and whenever it is in a position to violate transitivity or independence in a way that it would suffer from, it makes another decision instead.

Another, pre-hoc, way of dealing with this is to make an "ultra choice" and choose not between decisions, but between all possible input-output maps (equivalently, between all possible decision algorithms), looking to the expected consequences of each one. This reduces the choices to a single choice, where issues of transitivity or independence need not necessarily apply.
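For a finite toy problem, an ultra choice can be sketched as a brute-force search over every input-output map. The weather example, its integer weights and payoffs, and the function names are all made up for illustration.

```python
from itertools import product
from typing import Callable, Dict, Sequence, Tuple

Policy = Dict[str, str]   # one complete input-output map


def ultra_choice(inputs: Sequence[str], outputs: Sequence[str],
                 expected_utility: Callable[[Policy], float]) -> Tuple[Policy, float]:
    """Enumerate every map from inputs to outputs and return the best one."""
    best_policy, best_value = {}, float("-inf")
    for assignment in product(outputs, repeat=len(inputs)):
        policy = dict(zip(inputs, assignment))
        value = expected_utility(policy)
        if value > best_value:
            best_policy, best_value = policy, value
    return best_policy, best_value


# Toy problem: integer weights (out of 10) for each observation, and payoffs.
WEIGHT = {"rain": 3, "sun": 7}
PAYOFF = {("rain", "umbrella"): 0, ("rain", "none"): -10,
          ("sun", "umbrella"): -1, ("sun", "none"): 0}


def toy_expected_utility(policy: Policy) -> float:
    return sum(WEIGHT[obs] * PAYOFF[(obs, policy[obs])] for obs in WEIGHT)


policy, value = ultra_choice(["rain", "sun"], ["umbrella", "none"], toy_expected_utility)
print(policy, value)
```

With |X| inputs and |Y| outputs there are |Y|^|X| maps to compare, so this enumeration is only feasible for tiny problems.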


Bounded agents

Actual agents will be bounded, unlikely to be able to store and consult their entire history when making every single decision, and unable to look at the whole future of their interactions to make a good ultra choice. So how would they behave?

This is not determined directly by their preferences, but by some sort of meta-preferences. Would they make an approximate ultra-choice? Or maybe build up a history of decisions, and then simplify it (when it gets too large to easily consult) into a compatible utility function? This is also determined by their interactions - an agent that makes a single decision has no pressure to be an expected utility maximiser, one that makes trillions of related decisions has a lot of pressure.

It's also notable that different types of boundedness (storage space, computing power, time horizons, etc...) have different consequences for unstable agents, and would converge to different stable preference systems.


Investigation needed

So what is the point of this post? It isn't presenting new results; it's more an attempt to launch a new sub-field of investigation. We know that many preferences are unstable, and that the agent is likely to make them stable over time, either through self-modification, subagents, or some other method. There are also suggestions for preferences that are known to be unstable, but have advantages (such as resistance to Pascal Muggings) that standard maximisation does not.

Therefore, instead of saying "that agent design can never be stable", we should be saying "what kind of stable design would that agent converge to?", "does that convergent stable design still have the desirable properties we want?" and "could we get that stable design directly?".

The first two things I found in this area were that traditional satisficers could converge to vastly different types of behaviour in an essentially unconstrained way, and that a quasi-expected utility maximiser of utility u might converge to an expected utility maximiser, but it might not be u that it maximises.

In fact, we need not look only at violations of the axioms of expected utility; they are but one possible reason for decision behaviour instability. Here are some that spring to mind:

  1. Non-independence and non-transitivity (as above).
  2. Boundedness of abilities.
  3. Adversaries and social pressure.
  4. Evolution (survival cost to following “odd” utilities (eg time-dependent preference)).
  5. Unstable decision theories (such as CDT).

Now, some categories (such as "Adversaries and social pressure") may not possess a tidy stable solution, but it is still worth asking what setups are more stable than others, and what the convergence rules are expected to be.

Identity and quining in UDT

9 Squark 17 March 2015 08:01PM

Outline: I describe a flaw in UDT that has to do with the way the agent defines itself (locates itself in the universe). This flaw manifests in failure to solve a certain class of decision problems. I suggest several related decision theories that solve the problem, some of which avoid quining thus being suitable for agents that cannot access their own source code.


EDIT: The decision problem I call here the "anti-Newcomb problem" already appeared here. Some previous solution proposals are here. A different but related problem appeared here.


Updateless decision theory, the way it is usually defined, postulates that the agent has to use quining in order to formalize its identity, i.e. determine which portions of the universe are considered to be affected by its decisions. This leaves open the question of which decision theory should be used by agents that don't have access to their own source code (which is what humans intuitively appear to be). I am pretty sure this question has already been posed somewhere on LessWrong but I can't find the reference: help? It also turns out that there is a class of decision problems for which this formalization of identity fails to produce the winning answer.

When one is programming an AI, it doesn't seem optimal for the AI to locate itself in the universe based solely on its own source code. After all, you build the AI, you know where it is (e.g. running inside a robot), why should you allow the AI to consider itself to be something else, just because this something else happens to have the same source code (more realistically, happens to have a source code correlated in the sense of logical uncertainty)?

Consider the following decision problem which I call the "UDT anti-Newcomb problem". Omega is putting money into boxes by the usual algorithm, with one exception. It isn't simulating the player at all. Instead, it simulates what a UDT agent would do in the player's place. Thus, a UDT agent would consider the problem to be identical to the usual Newcomb problem and one-box, receiving $1,000,000. On the other hand, a CDT agent (say) would two-box and receive $1,001,000 (!). Moreover, this problem reveals UDT is not reflectively consistent. A UDT agent facing this problem would choose to self-modify given the choice. This is not an argument in favor of CDT. But it is a sign something is wrong with UDT, the way it's usually done.

The essence of the problem is that a UDT agent is using too little information to define its identity: its source code. Instead, it should use information about its origin. Indeed, if the origin is an AI programmer or a version of the agent before the latest self-modification, it appears rational for the precursor agent to code the origin into the successor agent. In fact, if we consider the anti-Newcomb problem with Omega's simulation using the correct decision theory XDT (whatever it is), we expect an XDT agent to two-box and leave with $1000. This might seem surprising, but consider the problem from the precursor's point of view. The precursor knows Omega is filling the boxes based on XDT, whatever the decision theory of the successor is going to be. If the precursor knows XDT two-boxes, there is no reason to construct a successor that one-boxes. So constructing an XDT successor might be perfectly rational! Moreover, a UDT agent playing the XDT anti-Newcomb problem will also two-box (correctly).

To formalize the idea, consider a program  called the precursor which outputs a new program  called the successor. In addition, we have a program  called the universe which outputs a number  called utility.

Usual UDT suggests for  the following algorithm:


Here,  is the input space,  is the output space and the expectation value is over logical uncertainty.  appears inside its own definition via quining.

The simplest way to tweak equation (1) in order to take the precursor into account is


This seems nice since quining is avoided altogether. However, this is unsatisfactory. Consider the anti-Newcomb problem with Omega's simulation involving equation (2). Suppose the successor uses equation (2) as well. On the surface, if Omega's simulation doesn't involve 1, the agent will two-box and get $1000 as it should. However, the computing power allocated for evaluating the logical expectation value in (2) might be sufficient to suspect 's output might be an agent reasoning based on (2). This creates a logical correlation between the successor's choice and the result of Omega's simulation. For certain choices of parameters, this logical correlation leads to one-boxing.

The simplest way to solve the problem is letting the successor imagine that  produces a lookup table. Consider the following equation:


Here,  is a program which computes  using a lookup table: all of the values are hardcoded.

For large input spaces, lookup tables are of astronomical size and either maximizing over them or imagining them to run on the agent's hardware doesn't make sense. This is a problem with the original equation (1) as well. One way out is replacing the arbitrary functions  with programs computing such functions. Thus, (3) is replaced by


Where  is understood to range over programs receiving input in  and producing output in . However, (4) looks like it can go into an infinite loop since what if the optimal  is described by equation (4) itself? To avoid this, we can introduce an explicit time limit  on the computation. The successor will then spend some portion  of  performing the following maximization:


Here,  is a program that does nothing for time  and runs  for the remaining time . Thus, the successor invests  time in maximization and  in evaluating the resulting policy  on the input it received.

In practical terms, (4') seems inefficient since it completely ignores the actual input for a period  of the computation. This problem exists in original UDT as well. A naive way to avoid it is giving up on optimizing the entire input-output mapping and focusing on the input which was actually received. This allows the following non-quining decision theory:


Here  is the set of programs which begin with a conditional statement that produces output  and terminate execution if received input was . Of course, ignoring counterfactual inputs means failing a large class of decision problems. A possible win-win solution is reintroducing quining2:


Here,  is an operator which appends a conditional as above to the beginning of a program. Superficially, we still only consider a single input-output pair. However, instances of the successor receiving different inputs now take each other into account (as existing in "counterfactual" universes). It is often claimed that the use of logical uncertainty in UDT allows for agents in different universes to reach a Pareto optimal outcome using acausal trade. If this is the case, then agents which have the same utility function should cooperate acausally with ease. Of course, this argument should also make the use of full input-output mappings redundant in usual UDT.

In case the precursor is an actual AI programmer (rather than another AI), it is unrealistic for her to code a formal model of herself into the AI. In a followup post, I'm planning to explain how to do without it (namely, how to define a generic precursor using a combination of Solomonoff induction and a formal specification of the AI's hardware).

1 If Omega's simulation involves , this becomes the usual Newcomb problem and one-boxing is the correct strategy.

2 Sorry, agents which can't access their own source code. You will have to make do with one of (3), (4') or (5).

Superintelligence 27: Pathways and enablers

10 KatjaGrace 17 March 2015 01:00AM

This is part of a weekly reading group on Nick Bostrom's book, Superintelligence. For more information about the group, and an index of posts so far see the announcement post. For the schedule of future topics, see MIRI's reading guide.

Welcome. This week we discuss the twenty-seventh section in the reading guide: Pathways and enablers.

This post summarizes the section, and offers a few relevant notes, and ideas for further investigation. Some of my own thoughts and questions for discussion are in the comments.

There is no need to proceed in order through this post, or to look at everything. Feel free to jump straight to the discussion. Where applicable, page numbers indicate the rough part of the chapter that is most related (not necessarily that the chapter is being cited for the specific claim).

Reading: “Pathways and enablers” from Chapter 14


  1. Is hardware progress good?
    1. Hardware progress means machine intelligence will arrive sooner, which is probably bad.
    2. More hardware at a given point means less understanding is likely to be needed to build machine intelligence, and brute-force techniques are more likely to be used. These probably increase danger.
    3. More hardware progress suggests there will be more hardware overhang when machine intelligence is developed, and thus a faster intelligence explosion. This seems good inasmuch as it brings a higher chance of a singleton, but bad in other ways:
      1. Less opportunity to respond during the transition
      2. Less possibility of constraining how much hardware an AI can reach
      3. Flattens the playing field, allowing small projects a better chance. These are less likely to be safety-conscious.
    4. Hardware has other indirect effects, e.g. it allowed the internet, which contributes substantially to work like this. But perhaps we have enough hardware now for such things.
    5. On balance, more hardware seems bad, on the impersonal perspective.
  2. Would brain emulation be a good thing to happen?
    1. Brain emulation is coupled with 'neuromorphic' AI: if we try to build the former, we may get the latter. This is probably bad.
    2. If we achieved brain emulations, would this be safer than AI? Three putative benefits:
      1. "The performance of brain emulations is better understood"
        1. However we have less idea how modified emulations would behave
        2. Also, AI can be carefully designed to be understood
      2. "Emulations would inherit human values"
        1. This might require higher fidelity than making an economically functional agent
        2. Humans are not that nice, often. It's not clear that human nature is a desirable template.
      3. "Emulations might produce a slower take-off"
        1. It isn't clear why it would be slower. Perhaps emulations would be less efficient, and so there would be less hardware overhang. Or perhaps because emulations would not be qualitatively much better than humans, just faster and more numerous.
        2. A slower takeoff may lead to better control
        3. However it also means more chance of a multipolar outcome, and that seems bad.
    3. If brain emulations are developed before AI, there may be a second transition to AI later.
      1. A second transition should be less explosive, because emulations are already many and fast relative to the new AI. 
      2. The control problem is probably easier if the cognitive differences are smaller between the controlling entities and the AI.
      3. If emulations are smarter than humans, this would have some of the same benefits as cognitive enhancement, in the second transition.
      4. Emulations would extend the lead of the frontrunner in developing emulation technology, potentially allowing that group to develop AI with little disturbance from others.
      5. On balance, brain emulation probably reduces the risk from the first transition, but once the risk of a second transition is added the net effect is unclear.
    4. Promoting brain emulation is better if:
      1. You are pessimistic about human resolution of the control problem
      2. You are less concerned about neuromorphic AI, a second transition, and multipolar outcomes
      3. You expect the timing of brain emulations and AI development to be close
      4. You prefer superintelligence to arrive neither very early nor very late
  3. The person-affecting perspective favors speed: present people are at risk of dying in the next century, and may be saved by advanced technology

Another view

I talked to Kenzi Amodei about her thoughts on this section. Here is a summary of her disagreements:

Bostrom argues that we probably shouldn't celebrate advances in computer hardware. This seems probably right, but here are counter-considerations to a couple of his arguments.

The great filter

A big reason Bostrom finds fast hardware progress to be broadly undesirable is that he judges the state risks from sitting around in our pre-AI situation to be low, relative to the step risk from AI. But the so called 'Great Filter' gives us reason to question this assessment.

The argument goes like this. Observe that there are a lot of stars (we can detect about 10^22 of them). Next, note that we have never seen any alien civilizations, or distant suggestions of them. There might be aliens out there somewhere, but they certainly haven't gone out and colonized the universe enough that we would notice them (see 'The Eerie Silence' for further discussion of how we might observe aliens).

This implies that somewhere on the path between a star existing, and it being home to a civilization that ventures out and colonizes much of space, there is a 'Great Filter': at least one step that is hard to get past. 1/10^22 hard to get past. We know of somewhat hard steps at the start: a star might not have planets, or the planets may not be suitable for life. We don't know how hard it is for life to start: this step could be most of the filter for all we know.

If the filter is a step we have passed, there is nothing to worry about. But if it is a step in our future, then probably we will fail at it, like everyone else. And things that stop us from visibly colonizing the stars may well be existential risks.

At least one way of understanding anthropic reasoning suggests the filter is much more likely to be at a step in our future. Put simply, one is much more likely to find oneself in our current situation if being killed off on the way here is unlikely.

So what could this filter be? One thing we know is that it probably isn't AI risk, at least of the powerful, tile-the-universe-with-optimal-computations, sort that Bostrom describes. A rogue singleton colonizing the universe would be just as visible as its alien forebears colonizing the universe. From the perspective of the Great Filter, either one would be a 'success'. But there are no successes that we can see.

What's more, if we expect to be fairly safe once we have a successful superintelligent singleton, then this points at risks arising before AI.

So overall this argument suggests that AI is less concerning than we think and that other risks (especially early ones) are more concerning than we think. It also suggests that AI is harder than we think.

Which means that if we buy this argument, we should put a lot more weight on the category of 'everything else', and especially the bits of it that come before AI. To the extent that known risks like biotechnology and ecological destruction don't seem plausible, we should more fear unknown unknowns that we aren't even preparing for.

How much progress is enough?

Bostrom points to positive changes hardware has made to society so far. For instance, hardware allowed personal computers, bringing the internet, and with it the accretion of an AI risk community, producing the ideas in Superintelligence. But then he says probably we have enough: "hardware is already good enough for a great many applications that could facilitate human communication and deliberation, and it is not clear that the pace of progress in these areas is strongly bottlenecked by the rate of hardware improvement."

This seems intuitively plausible. However one could probably have erroneously made such assessments in all kinds of progress, all over history. Accepting them all would lead to madness, and we have no obvious way of telling them apart.

In the 1800s it probably seemed like we had enough machines to be getting on with, perhaps too many. In the 1800s people probably felt overwhelmingly rich. In the sixties too, it probably seemed like we had plenty of computation, and that hardware wasn't a great bottleneck to social progress.

If a trend has brought progress so far, and the progress would have been hard to predict in advance, then it seems hard to conclude from one's present vantage point that progress is basically done.


1. How is hardware progressing?

I've been looking into this lately, at AI Impacts. Here's a figure of MIPS/$ growing, from Muehlhauser and Rieber.

(Note: I edited the vertical axis, to remove a typo)

2. Hardware-software indifference curves

It was brought up in this chapter that hardware and software can substitute for each other: if there is endless hardware, you can run worse algorithms, and vice versa. I find it useful to picture this as indifference curves, something like this: 

(Image: Hypothetical curves of hardware-software combinations producing the same performance at Go (source).)

I wrote about predicting AI given this kind of model here.
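To make the indifference-curve picture concrete, here is a minimal sketch. The Cobb-Douglas form of `performance`, the exponents `a` and `b`, and the units are all assumptions chosen purely for illustration; nothing in the chapter or the figure commits to this functional form.

```python
import math

def performance(hardware: float, software: float, a: float = 0.5, b: float = 0.5) -> float:
    """Hypothetical performance model: hardware**a * software**b."""
    return hardware ** a * software ** b

def software_needed(target: float, hardware: float, a: float = 0.5, b: float = 0.5) -> float:
    """Software quality that reaches `target` performance on the given hardware."""
    return (target / hardware ** a) ** (1 / b)

# Points along one indifference curve: more hardware, less software needed.
curve = [(hw, software_needed(10.0, hw)) for hw in (1.0, 4.0, 16.0)]
for hw, sw in curve:
    print(f"hardware={hw:5.1f}  software needed={sw:7.2f}")
```

Every point printed lies on the same curve, so under this assumed model the same performance is reached either with abundant hardware and crude algorithms or with scarce hardware and good ones.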

3. The potential for discontinuous AI progress

While we are on the topic of relevant stuff at AI Impacts, I've been investigating and quantifying the claim that AI might suddenly undergo huge amounts of abrupt progress (unlike brain emulations, according to Bostrom). As a step, we are finding other things that have undergone huge amounts of progress, such as nuclear weapons and high temperature superconductors:

(Figure originally from here)

4. The person-affecting perspective favors speed less as other prospects improve

I agree with Bostrom that the person-affecting perspective probably favors speeding many technologies, in the status quo. However I think it's worth noting that people with the person-affecting view should be scared of existential risk again as soon as society has achieved some modest chance of greatly extending life via specific technologies. So if you take the person-affecting view, and think there's a reasonable chance of very long life extension within the lifetimes of many existing humans, you should be careful about trading off speed and risk of catastrophe.

5. It seems unclear that an emulation transition would be slower than an AI transition. 

One reason to expect an emulation transition to proceed faster is that there is an unusual reason to expect abrupt progress there.

6. Beware of brittle arguments

This chapter presented a large number of detailed lines of reasoning for evaluating hardware and brain emulations. This kind of concern might apply.


In-depth investigations

If you are particularly interested in these topics, and want to do further research, these are a few plausible directions, some inspired by Luke Muehlhauser's list, which contains many suggestions related to parts of Superintelligence. These projects could be attempted at various levels of depth.

  1. Investigate in more depth how hardware progress affects factors of interest
  2. Assess in more depth the likely implications of whole brain emulation 
  3. Measure better the hardware and software progress that we see (e.g. some efforts at AI Impacts, MIRI, MIRI and MIRI)
  4. Investigate the extent to which hardware and software can substitute (I describe more projects here)
  5. Investigate the likely timing of whole brain emulation (the Whole Brain Emulation Roadmap is the main work on this)
If you are interested in anything like this, you might want to mention it in the comments, and see whether other people have useful thoughts.

How to proceed

This has been a collection of notes on the chapter.  The most important part of the reading group though is discussion, which is in the comments section. I pose some questions for you there, and I invite you to add your own. Please remember that this group contains a variety of levels of expertise: if a line of discussion seems too basic or too incomprehensible, look around for one that suits you better!

Next week, we will talk about how collaboration and competition affect the strategic picture. To prepare, read “Collaboration” from Chapter 14. The discussion will go live at 6pm Pacific time next Monday 23 March. Sign up to be notified here.

Anti-Pascaline agent

4 Stuart_Armstrong 12 March 2015 02:17PM

A putative new idea for AI control; index here.

Pascal's wager-like situations come up occasionally with expected utility, making some decisions very tricky. It means that events of the tiniest of probability could dominate the whole decision - intuitively unobvious, and a big negative for a bounded agent - and that expected utility calculations may fail to converge.

There are various principled approaches to resolving the problem, but how about an unprincipled approach? We could try and bound utility functions, but the heart of the problem is not high utility, but high utility combined with low probability. Moreover, this has to behave sensibly with respect to updating.


The agent design

Consider a UDT-ish agent A looking at input-output maps {M} (ie algorithms that could determine every single possible decision of the agent in the future). We allow probabilistic/mixed output maps as well (hence A has access to a source of randomness). Let u be a utility function, and set 0 < ε << 1 to be the precision. Roughly, we'll be discarding the highest (and lowest) utilities that are below probability ε. There is no fundamental reason that the same ε should be used for highest and lowest utilities, but we'll keep it that way for the moment.

The agent is going to make an "ultra-choice" among the various maps M (ie fixing its future decision policy), using u and ε to do so. For any M, designate by A(M) the decision of the agent to use M for its decisions.

Then, for any map M, set max(M) to be the lowest number s.t. P(u ≥ max(M)|A(M)) ≤ ε. In other words, if the agent decides to use M as its decision policy, this is the maximum utility that can be achieved if we ignore the highest valued ε of the probability distribution. Similarly, set min(M) to be the highest number s.t. P(u ≤ min(M)|A(M)) ≤ ε.

Then define the utility function uMε, which is simply u, bounded between max(M) and min(M). Now calculate the expected value of uMε given A(M), call this Eε(u|A(M)).

The agent then chooses the M that maximises Eε(u|A(M)). Call this the ε-precision u-maximising algorithm.
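The definitions above can be sketched for a discrete utility distribution. This is an illustration under stated assumptions, not the post's own implementation: `clip_bounds`, `eps_expectation` and the two example policies are hypothetical names and data, ε is assumed to be below 1/2, and for discrete distributions "lowest number" and "highest number" are read as an infimum and a supremum, which then always coincide with attainable utilities.

```python
from fractions import Fraction

def clip_bounds(dist, eps):
    """Return (min(M), max(M)); dist maps each utility to its probability given A(M)."""
    atoms = sorted(dist)
    # max(M): smallest attainable utility a with P(u > a) <= eps
    upper = next(a for a in atoms
                 if sum(p for b, p in dist.items() if b > a) <= eps)
    # min(M): largest attainable utility a with P(u < a) <= eps
    lower = next(a for a in reversed(atoms)
                 if sum(p for b, p in dist.items() if b < a) <= eps)
    return lower, upper

def eps_expectation(dist, eps):
    """Expected value of u after bounding it between min(M) and max(M)."""
    lower, upper = clip_bounds(dist, eps)
    return sum(p * min(max(u, lower), upper) for u, p in dist.items())

eps = Fraction(1, 100)
# Hypothetical policy 1: a jackpot of 1000 with probability eps/2, else nothing.
rare_jackpot = {Fraction(0): 1 - eps / 2, Fraction(1000): eps / 2}
# Hypothetical policy 2: a modest gain of 1 with probability 1/2.
steady = {Fraction(0): Fraction(1, 2), Fraction(1): Fraction(1, 2)}

plain_jackpot = sum(u * p for u, p in rare_jackpot.items())
print(plain_jackpot, eps_expectation(rare_jackpot, eps), eps_expectation(steady, eps))
```

A plain expected utility maximiser prefers the jackpot policy, while the ε-precision agent discards the jackpot (its probability is below ε) and picks the steady policy instead.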


Stability of the design

The above decision process is stable, in that there is a single ultra-choice to be made, and clear criteria for making that ultra-choice. Realistic and bounded agents, however, cannot calculate all the M in sufficient detail to get a reasonable outcome. So we can ask whether the design is stable for a bounded agent.

Note that this question is underdefined, as there are many ways of being bounded, and many ways of cashing out ε-precision u-maximising into bounded form. Most likely, this will not be a direct expected utility maximalisation, so the algorithm will be unstable (prone to change under self-modification). But how exactly it's unstable is an interesting question.

I'll look at one particular situation: one where A was tasked with creating subagents that would go out and interact with the world. These agents are short-sighted: they apply ε-precision u-maximising not to the ultra-choice, but to each individual expected utility calculation (we'll assume the utility gains and losses for each decision is independent).

A has a single choice: what to set ε to for the subagents. Intuitively, it would seem that A would set ε lower than its own value; this could correspond roughly to an agent self-modifying to remove the ε-precision restriction from itself, converging on becoming a u-maximiser. However:

  • Theorem: There are (stochastic) worlds in which A will set the subagent precision to be higher, lower or equal to its own precision ε.

The proof will be by way of illustration of the interesting things that can happen in this setup. Let B be the subagent whose precision A sets.

Let C(p) be a coupon that pays out 1 with probability p. xC(p) simply means the coupon pays out x instead of 1. Each coupon costs ε² utility. This is negligible, and only serves to break ties. Then consider the following worlds:

  • In W1, B will be offered the possibility of buying C(0.75ε).
  • In W2, B will be offered the possibility of buying C(1.5ε).
  • In W3, B will be offered the possibility of buying C(0.75ε), and the offer will be made twice.
  • In W4, B will be offered, with 50% probability, the possibility of buying C(1.5ε).
  • In W5, B will be offered, with 50% probability, the possibility of buying C(1.5ε), and otherwise the possibility of buying 2C(1.5ε).
  • In W6, B will be offered, with 50% probability, the possibility of buying C(0.75ε), and otherwise the possibility of buying 2C(1.5ε).
  • In W7, B will be offered, with 50% probability, the possibility of buying C(0.75ε), and otherwise the possibility of buying 2C(1.05ε).

From A’s perspective, the best input-output maps are: in W1, don’t buy, in W2, buy, in W3, buy both, in W4, don’t buy (because the probability of getting above 0 utility by buying, is, from A's initial perspective, 1.5ε/2 = 0.75ε).

W5 is more subtle, and interesting – essentially A will treat 2C(1.5ε) as if it were C(1.5ε) (since the probability of getting above 1 utility by buying is 1.5ε/2 = 0.75ε, while the probability of getting above zero by buying is (1.5ε+1.5ε)/2=1.5ε). Thus A would buy everything offered.

Similarly, in W6, the agent would buy everything, and in W7, the agent would buy nothing (since the probability of getting above zero by buying is now (1.05ε + 0.75ε)/2 = 0.9ε).

So in W1 and W2, the agent can leave the sub-agent precision at ε. In W3, it needs to lower it below 0.75ε. In W4, it needs to raise it above 1.5ε. In W5 it can leave it alone, while in W6 it must lower it below 0.75ε, and in W7 it must raise it above 1.05ε.
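The seven worlds can be checked mechanically. The sketch below is illustrative and rests on several assumptions that are not in the post: ε is fixed at 1/100, A's policies decide per offer without conditioning on earlier payouts, the bounds are read as infimum and supremum over a discrete distribution, and only three candidate subagent precisions (ε/2, ε, 2ε) are compared. All function and variable names are hypothetical.

```python
from fractions import Fraction
from itertools import product

EPS = Fraction(1, 100)      # A's precision (arbitrary small value)
COST = EPS ** 2             # price of each coupon
HALF = Fraction(1, 2)
NOTHING = {Fraction(0): Fraction(1)}   # utility distribution of not buying

def eps_expectation(dist, eps):
    """Expected utility after clipping to [min(M), max(M)]; dist: utility -> prob."""
    atoms = sorted(dist)
    upper = next(a for a in atoms if sum(p for b, p in dist.items() if b > a) <= eps)
    lower = next(a for a in reversed(atoms) if sum(p for b, p in dist.items() if b < a) <= eps)
    return sum(p * min(max(u, lower), upper) for u, p in dist.items())

def C(x, k):
    """The coupon x*C(k*eps), stored as (payout, probability)."""
    return (Fraction(x), Fraction(k) * EPS)

def coupon(payout, prob):
    """Utility distribution from buying a coupon at price COST."""
    return {payout - COST: prob, -COST: 1 - prob}

def mix(branches):
    """Probability mixture of (weight, distribution) pairs."""
    out = {}
    for w, d in branches:
        for u, p in d.items():
            out[u] = out.get(u, 0) + w * p
    return out

def convolve(d1, d2):
    """Distribution of the sum of two independent utilities."""
    out = {}
    for (u1, p1), (u2, p2) in product(d1.items(), d2.items()):
        out[u1 + u2] = out.get(u1 + u2, 0) + p1 * p2
    return out

# A world is a list of independent stages; a stage is a list of (weight, offer).
WORLDS = {
    "W1": [[(1, C(1, "0.75"))]],
    "W2": [[(1, C(1, "1.5"))]],
    "W3": [[(1, C(1, "0.75"))], [(1, C(1, "0.75"))]],
    "W4": [[(HALF, C(1, "1.5")), (HALF, None)]],
    "W5": [[(HALF, C(1, "1.5")), (HALF, C(2, "1.5"))]],
    "W6": [[(HALF, C(1, "0.75")), (HALF, C(2, "1.5"))]],
    "W7": [[(HALF, C(1, "0.75")), (HALF, C(2, "1.05"))]],
}

def offers(world):
    return [c for stage in world for _, c in stage if c is not None]

def best_policy(world):
    """A's ultra-choice: one buy/don't-buy decision per offer, judged as a whole."""
    def value(policy):
        decisions, total = iter(policy), NOTHING
        for stage in world:
            branches = [(w, coupon(*c) if c is not None and next(decisions) else NOTHING)
                        for w, c in stage]
            total = convolve(total, mix(branches))
        return eps_expectation(total, EPS)
    return max(product([False, True], repeat=len(offers(world))), key=value)

def subagent_policy(world, eps_b):
    """B is short-sighted: it judges each offered coupon in isolation."""
    return tuple(eps_expectation(coupon(*c), eps_b) > 0 for c in offers(world))

CANDIDATES = {"lower": EPS / 2, "same": EPS, "higher": 2 * EPS}
verdicts = {name: {label for label, e in CANDIDATES.items()
                   if subagent_policy(world, e) == best_policy(world)}
            for name, world in WORLDS.items()}
for name, ok in verdicts.items():
    print(name, sorted(ok))
```

Under these assumptions, keeping B at precision ε reproduces A's preferred policy in W1, W2 and W5, only the lower precision works in W3 and W6, and only the higher precision works in W4 and W7.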


Irrelevant information

One nice feature about this approach is that it ignores irrelevant information. Specifically:

  • Theorem: Assume X is a random variable that is irrelevant to the utility function u. If A (before knowing X) has to design successor agents that will exist after X is revealed, then (modulo a few usual assumptions about only decisions mattering, not internal thought processes) it will make these successor agents isomorphic to copies of itself, i.e. ε-precision u-maximising algorithms (potentially with a different way of breaking ties).

These successor agents are not the short-sighted agents of the previous model, but full ultra-choice agents. Their ultra-choice is over all decisions to come, while A's ultra-choice (which is simply a choice) is over all agent designs.

For the proof, I'll assume X is boolean valued (the general proof is similar). Let M be the input-output map A would choose for itself, if it were to make all the decisions itself rather than just designing a subagent. Now, it's possible that M(X) will be different from M(¬X) (here M(X) and M(¬X) are contractions of the input-output map by adding in one of the inputs).

Define the new input-output map M' by defining a new internal variable Y in A (recall that A has access to a source of randomness). Since this variable is new, M is independent of the value of Y. Then M' is defined as M with X and Y permuted. Since both Y and X are equally irrelevant to u, Eε(u|A(M))=Eε(u|A(M')), so M' is an input-output map that fulfils the ε-precision u-maximising. And M'(X)=M'(¬X), so M' is independent of X.

Now consider the subagent that runs the same algorithm as A, and has seen X. Because of the irrelevance of X, M'(X) will still fulfil ε-precision u-maximising (we can express any fact relevant to u in the form of Zs, with P(Z)=P(Z|X), and then the algorithm is the same).

Similarly, a subagent that has seen ¬X will run M'(¬X). Putting these together, the subagent will expect to run M'(X) with probability P(X) and M'(¬X) with probability P(¬X)=1-P(X).

Since M'(X)=M'(¬X), this whole thing is just M'. So if A creates a copy of itself (possibly tweaking the tie-breaking so that M' is selected), then it will achieve its maximum according to ε-precision u-maximising.

Creating a satisficer

4 Stuart_Armstrong 11 March 2015 03:03PM

A putative new idea for AI control; index here.

This is a potential design for a safe-ish satisficer, based on the ideas presented here. The two key ideas are that a satisficer S(u) with utility u:

  • Would not effectively aid M(u), a u-maximiser.
  • Would not effectively resist M(-u), a u-minimizer.

So satisficers make poor allies and weak enemies.


The design, the designer and the verifier

Let M(u-v) be an agent that maximises u and minimises v. Let's assume that M(u-v) is currently unaware as to what u and v are (though it knows that the upper bound of u is easy to approach). It can interact with the outside environment in one way only: by creating a satisficer S(). S() will then be fed u to become S(u).

However, the design for S(u) will first be sent to M(εu+v), for some unknown small ε. Note that M(εu+v) is a v-maximiser, so it disagrees with M(u-v) except for u. M(εu+v) is a (hypothetical) agent that will be (counterfactually) released at some unspecified later date. Currently, it will be fed the whole reasoning history of M(u-v), and will make a single decision only: whether S(u) is to be built or not.

So it seems that S(u) must:

  1. Set u close to its maximal value (as this is "easy to approach").
  2. Not increase v to any significant extent (or else M(u-v) would not design it).
  3. Not decrease v to any significant extent (or else M(εu+v) would veto it).
  4. Since v is unknown to M(u-v) and resources are finite, this should serve as a general reduced impact requirement for S(u) (we may have to use something like a soft minimum across all v, rather than an expectation across all v, to avoid certain edge cases).
  5. Since u is unknown to M(u-v), S() would serve as a general satisficing agent for any utility functions whose upper bounds are easy to approach (remember that we can take an arbitrary utility function and arbitrarily bound it at some number).

For the moment, this does seem like it would produce a successful satisficer...

Detecting agents and subagents

8 Stuart_Armstrong 10 March 2015 05:56PM

A putative new idea for AI control; index here.

There are many situations where it would be useful to identify the presence of an agent in the world, in a sufficiently abstract sense. There are many more situations where it would be useful to identify a subagent in an abstract sense. This is because people often have ideas for interesting and useful motivational restrictions for the AI (eg an AI that "wants to stay boxed", or a corrigible agent). But most of these motivations suffer from a "subagent problem": the AIs are motivated to create subagents that do not follow the motivational restrictions. The AI wanting to stay in the box is motivated to create a subagent that will protect it and keep it in the box, while the corrigible agent is not motivated to create corrigible subagents (though the pre-corriged agent might want to create pre-corriged subagents).

Unfortunately, it's extremely hard to identify an agent. Agents need not come conveniently packaged in some "sensors-actuators-internal structure-utility function" form. If you wanted to obfuscate an agent, hiding it in the natural movements of the solar system, or in the gyrations of the internet, this would be easy to do and hard to detect - and very hard to define what you would be looking for.

Fortunately, it's much easier to detect superintelligent AIs that have major impact on the world - ie the kind of agents that we would be worried about. Intuitively, this is true: if you suddenly find half the sky getting optimised for space colonisation, it's likely there's an AI somewhere there. How can this be formalised?


The importance of the agent

Imagine taking a slice of space-time around the moment when a superintelligent agent just got turned on. If you wanted to predict the future, what piece of information is most relevant in that time slice? Obviously the superintelligent agent. We could run the ideas for reduced impact in reverse, looking for the part that is of maximum impact.

I would therefore define the agent as the part of this slice that it's the most valuable to know about, or the part that it would be the most valuable to change, counterfactually, if such changes were possible. Note that this definition doesn't exactly identify agents, but if it misidentifies something, then that something must also be of great importance.

If we imagine the slice of space-time seeping forwards in time from the past, then events in space time will have varying informativeness/importance. As we approach the creation of the superintelligent AI, importance gets concentrated around the run-up to its creation, before being maximally concentrated at the creation of the AI (or its escape, if it was confined).

For a more formal definition, I'd imagine a passive pure-resource gathering agent A being fed the details of the time slice, and only being able to pass on a limited amount of the information to another (active) copy of itself, and seeing what it passed on. The data passed on would be the highest priority for the active copy, so would almost certainly include the existence of an enemy agent (almost all agents are "enemies" to resource gathering agents, as they use up precious resources).

Alternatively, we could give A the option of changing some of the data - flipping some real-world bits, at least conceptually - and seeing which bits it preferred to flip. This definition can no doubt be improved by adding noise or other variations.

Now with counterfactuals and false miracles we might be able to actually construct that situation, or something approximating it. Even without that, this is a formal definition that seems to home in well on the concept of agent detection.


Detecting subagents

Now that we've detected agents, can we detect subagents? It seems that we can. A powerful subagent can be detected in the same way that a powerful agent can be. But the subagent has some extra properties; for instance, knowing about the presence of the subagent could give strong evidence of the existence of a powerful agent in the past. This could correspond to agent A finding the existence of the powerful past agent to be important, the existence of the subagent to be important, but the combination of the two facts to not be as important as you'd expect. You could also compare the importance of knowing "past agent + subagent" versus "past agent + subagent whose motivation is flipped by a thermodynamic miracle".

We could also track the "importance" of the agent forward through time, and any sudden jump downwards in importance could correspond either to a failure or the creation of a subagent: a sudden jump of importance from one location to another would be an indication.
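As a rough illustration of the tracking idea above, here is a minimal sketch. It assumes we already have some numeric "importance" score per location per time step; the post does not say how such scores would be computed, so the data, the function name find_importance_transfers and the threshold are all made up for the example. A sharp drop with no matching rise elsewhere looks like a failure; a sharp drop paired with a sharp rise elsewhere is flagged as a possible subagent.

```python
from typing import Dict, List, Tuple

# Hypothetical data: history[t][location] is how much the resource-gathering
# agent A would care about that location at time step t. These numbers are
# invented for illustration only.
Importance = List[Dict[str, float]]


def find_importance_transfers(
    history: Importance, threshold: float = 0.5
) -> List[Tuple[int, str, str]]:
    """Return (time, source, destination) triples where, between consecutive
    time steps, importance at one location drops by at least `threshold`
    while importance at another location rises by at least `threshold`."""
    transfers = []
    for t in range(1, len(history)):
        before, after = history[t - 1], history[t]
        locations = sorted(set(before) | set(after))
        change = {
            loc: after.get(loc, 0.0) - before.get(loc, 0.0) for loc in locations
        }
        drops = [loc for loc in locations if change[loc] <= -threshold]
        rises = [loc for loc in locations if change[loc] >= threshold]
        for src in drops:
            for dst in rises:
                transfers.append((t, src, dst))
    return transfers


history = [
    {"lab": 0.9, "orbit": 0.0},
    {"lab": 0.9, "orbit": 0.05},
    {"lab": 0.2, "orbit": 0.8},  # importance moves from the lab to orbit
]
transfers = find_importance_transfers(history)
```

Here the only flagged event is the move from "lab" to "orbit" at the last step. A drop with no corresponding rise returns nothing, which under this toy reading would be treated as a failure rather than a subagent.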

It's interesting to note that this approach can tell the difference between an agent that creates a powerful subagent it still controls (in that case, changing the original agent's utility is of high value) and one it doesn't.

If we have a sufficiently well-defined notion of counterfactual, we could explore some other options, like allowing the hypothetical resource gathering agent to plant hypothetical "bombs" across the world. It is likely that each bomb would be placed to eliminate one of its rivals.

It seems a lot could also be done by checking what happens to "importance" under different counterfactual definitions (the subagent has emerged from natural processes, versus emerging from a thermodynamic miracle).


Hacking the definition

An AI that is under some restriction based on this definition of agency will no doubt try and hack the definition. How could it do that? One obvious idea is to run a cryptographic hash function over, e.g., the positions of all the atoms in the solar system, and base its behaviour on the output. In that case, every atom would be of great importance. However, this can be easily overcome if we notice that every atom is of great importance only because of the agent's presence. In a counterfactual where the agent is removed, these atom positions are not of great importance intrinsically.
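The hashing trick and its counterfactual fix can be made concrete with a toy model. This is only a sketch under invented assumptions: the "world" is four byte-valued atom positions, the outcome is a single bit, and "importance" of an atom is measured as the fraction of changes to that atom which flip the outcome. None of these choices come from the definition above; they are just one simple way to show the effect.

```python
import hashlib
from typing import List


def outcome(atoms: List[int], agent_present: bool) -> int:
    """Toy world returning one outcome bit.

    With the hypothetical hashing agent present, the outcome is the low bit
    of a SHA-256 hash over all atom positions. Without the agent, the outcome
    depends only on the first atom (an arbitrary stand-in for ordinary physics).
    """
    if agent_present:
        return hashlib.sha256(bytes(atoms)).digest()[0] & 1
    return atoms[0] & 1


def importance(atoms: List[int], agent_present: bool) -> List[float]:
    """For each atom, the fraction of the 255 alternative values that would
    change the outcome bit."""
    base = outcome(atoms, agent_present)
    scores = []
    for i in range(len(atoms)):
        flips = 0
        for value in range(256):
            if value == atoms[i]:
                continue
            perturbed = list(atoms)
            perturbed[i] = value
            if outcome(perturbed, agent_present) != base:
                flips += 1
        scores.append(flips / 255)
    return scores


atoms = [12, 200, 7, 99]
with_agent = importance(atoms, agent_present=True)
without_agent = importance(atoms, agent_present=False)
```

With the agent present every atom gets a non-zero score, since the hash mixes all of them into the outcome. In the counterfactual without the agent, only the first atom matters. Comparing the two lists is the toy version of noticing that the atoms are important only because of the agent.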

Maybe the agent will try and create lots of independent subagents, spreading importance over many locations? In that case, we should look for a strong centralised importance that becomes dispersed.

AIs are intrinsically disadvantaged when it comes to hacking this definition (in theory) because if they are going to have a large impact, then the moment of their creation/turning on/escape is a very informative one.

The definition seems to be underspecified rather than easily hackable, which is a good position to start from.

It should be noted that a good satisficer should never be detected as a powerful agent. This could be used as a definition of a satisficer, a point we'll be returning to in subsequent posts.

Superintelligence 26: Science and technology strategy

8 KatjaGrace 10 March 2015 01:43AM

This is part of a weekly reading group on Nick Bostrom's book, Superintelligence. For more information about the group, and an index of posts so far see the announcement post. For the schedule of future topics, see MIRI's reading guide.

Welcome. This week we discuss the twenty-sixth section in the reading guide: Science and technology strategy. Sorry for posting late; my car broke.

This post summarizes the section, and offers a few relevant notes, and ideas for further investigation. Some of my own thoughts and questions for discussion are in the comments.

There is no need to proceed in order through this post, or to look at everything. Feel free to jump straight to the discussion. Where applicable and I remember, page numbers indicate the rough part of the chapter that is most related (not necessarily that the chapter is being cited for the specific claim).

Reading: “Science and technology strategy” from Chapter 14


  1. This section will introduce concepts that are useful for thinking about long term issues in science and technology (p228)
  2. Person affecting perspective: one should act in the best interests of everyone who already exists, or who will exist independent of one's choices (p228) 
  3. Impersonal perspective: one should act in the best interests of everyone, including those who may be brought into existence by one's choices. (p228)
  4. Technological completion conjecture: "If scientific and technological development efforts do not cease, then all important basic capabilities that could be obtained through some possible technology will be obtained." (p229)
    1. This does not imply that it is futile to try to steer technology. Efforts may cease. It might also matter exactly when things are developed, who develops them, and in what context.
  5. Principle of differential technological development: one should slow the development of dangerous and harmful technologies relative to beneficial technologies (p230)
  6. We have a preferred order for some technologies, e.g. it is better to have superintelligence later relative to social progress, but earlier relative to other existential risks. (p230-233)
  7. If a macrostructural development accelerator is a magic lever which slows the large scale features of history (e.g. technological change, geopolitical dynamics) while leaving the small scale features the same, then we can ask whether pulling the lever would be a good idea (p233). The main way Bostrom concludes that it matters is by affecting how well prepared humanity is for future transitions.
  8. State risk: a risk that persists while you are in a certain situation, such that the amount of risk is a function of the time spent there. e.g. risk from asteroids, while we don't have technology to redirect them. (p233-4)
  9. Step risk: a risk arising from a transition. Here the amount of risk is mostly not a function of how long the transition takes. e.g. traversing a minefield: this is not especially safer if you run faster. (p234)
  10. Technology coupling: a predictable timing relationship between two technologies, such that hastening of the first technology will hasten the second, either because the second is a precursor or because it is a natural consequence. (p236-8) e.g. brain emulation is plausibly coupled to 'neuromorphic' AI, because the understanding required to emulate a brain might allow one to more quickly create an AI on similar principles.
  11. Second guessing: acting as if "by treating others as irrational and playing to their biases and misconceptions it is possible to elicit a response from them that is more competent than if a case had been presented honestly and forthrightly to their rational faculties" (p238-40)
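The difference between state risk and step risk (items 8 and 9 above) can be put in a couple of lines of arithmetic. This is a minimal sketch under a simplifying assumption of my own, not Bostrom's: a state risk is modelled as a constant, independent per-year hazard, and a step risk as a fixed probability attached to the transition itself. The function names and numbers are illustrative.

```python
def state_risk(hazard_per_year: float, years: float) -> float:
    """Probability of at least one catastrophe when a constant, independent
    per-year hazard applies for `years` years (assumed simple model)."""
    return 1.0 - (1.0 - hazard_per_year) ** years


def step_risk(transition_risk: float, years: float) -> float:
    """Probability of catastrophe from a transition whose risk is assumed
    not to depend on how long the transition takes."""
    return transition_risk


# Spending longer in a risky state raises the total risk...
short_stay = state_risk(0.01, 10)
long_stay = state_risk(0.01, 100)

# ...but taking longer over a step does not change its risk in this model.
fast_step = step_risk(0.1, 1)
slow_step = step_risk(0.1, 100)
```

Under this model, hurrying through a state risk helps and hurrying through a step risk does not, which is the point of the minefield example.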

Another view

There is a common view which says we should not act on detailed abstract arguments about the far future like those of this section. Here Holden Karnofsky exemplifies it:

I have often been challenged to explain how one could possibly reconcile (a) caring a great deal about the far future with (b) donating to one of GiveWell’s top charities. My general response is that in the face of sufficient uncertainty about one’s options, and lack of conviction that there are good (in the sense of high expected value) opportunities to make an enormous difference, it is rational to try to make a smaller but robustly positive difference, whether or not one can trace a specific causal pathway from doing this small amount of good to making a large impact on the far future. A few brief arguments in support of this position:

  • I believe that the track record of “taking robustly strong opportunities to do ‘something good'” is far better than the track record of “taking actions whose value is contingent on high-uncertainty arguments about where the highest utility lies, and/or arguments about what is likely to happen in the far future.” This is true even when one evaluates track record only in terms of seeming impact on the far future. The developments that seem most positive in retrospect – from large ones like the development of the steam engine to small ones like the many economic contributions that facilitated strong overall growth – seem to have been driven by the former approach, and I’m not aware of many examples in which the latter approach has yielded great benefits.
  • I see some sense in which the world’s overall civilizational ecosystem seems to have done a better job optimizing for the far future than any of the world’s individual minds. It’s often the case that people acting on relatively short-term, tangible considerations (especially when they did so with creativity, integrity, transparency, consensuality, and pursuit of gain via value creation rather than value transfer) have done good in ways they themselves wouldn’t have been able to foresee. If this is correct, it seems to imply that one should be focused on “playing one’s role as well as possible” – on finding opportunities to “beat the broad market” (to do more good than people with similar goals would be able to) rather than pouring one’s resources into the areas that non-robust estimates have indicated as most important to the far future.
  • The process of trying to accomplish tangible good can lead to a great deal of learning and unexpected positive developments, more so (in my view) than the process of putting resources into a low-feedback endeavor based on one’s current best-guess theory. In my conversation with Luke and Eliezer, the two of them hypothesized that the greatest positive benefit of supporting GiveWell’s top charities may have been to raise the profile, influence, and learning abilities of GiveWell. If this were true, I don’t believe it would be an inexplicable stroke of luck for donors to top charities; rather, it would be the sort of development (facilitating feedback loops that lead to learning, organizational development, growing influence, etc.) that is often associated with “doing something well” as opposed to “doing the most worthwhile thing poorly.”
  • I see multiple reasons to believe that contributing to general human empowerment mitigates global catastrophic risks. I laid some of these out in a blog post and discussed them further in my conversation with Luke and Eliezer.


1. Technological completion timelines game
The technological completion conjecture says that all the basic technological capabilities will eventually be developed. But when is 'eventually', usually? Do things get developed basically as soon as developing them is not prohibitively expensive, or is thinking of the thing often a bottleneck? This is relevant to how much we can hope to influence the timing of technological developments.

Here is a fun game: How many things can you find that could have been profitably developed much earlier than they were?

Some starting suggestions, which I haven't looked into:

Wheeled luggage: invented in the 1970s, though humanity had had both wheels and luggage for a while.

Hot air balloons: flying paper lanterns using the same principle were apparently used before 200AD, while a manned balloon wasn't used until 1783.

Penicillin: mould was apparently traditionally used for antibacterial properties in several cultures, but lots of things are traditionally used for lots of things. By the 1870s many scientists had noted that specific moulds inhibited bacterial growth.

Wheels: Early toys from the Americas appear to have had wheels (here and pictured is one from 1-900AD; Wikipedia claims such toys were around as early as 1500BC). However wheels were apparently not used for more substantial transport in the Americas until much later.

Image: "Remojadas Wheeled Figurine"

There are also cases where humanity has forgotten important insights, and then rediscovered them again much later, which suggests strongly that they could have been developed earlier.

2. How does economic growth affect AI risk?

Eliezer Yudkowsky argues that economic growth increases risk. I argue that he has the sign wrong. Others argue that probably lots of other factors matter more anyway. Luke Muehlhauser expects that cognitive enhancement is bad, largely based on Eliezer's aforementioned claim. He also points out that smarter people are different from more rational people. Paul Christiano outlines his own evaluation of economic growth in general, on humanity's long run welfare. He also discusses the value of continued technological, economic and social progress more comprehensively here.

3. The person affecting perspective

Some interesting critiques: the non-identity problem, and the argument that taking additional people to be neutral makes other good or bad things neutral too, if you try to be consistent in natural ways.

In-depth investigations

If you are particularly interested in these topics, and want to do further research, these are a few plausible directions, some inspired by Luke Muehlhauser's list, which contains many suggestions related to parts of Superintelligence. These projects could be attempted at various levels of depth.

  1. Is macro-structural acceleration good or bad on net for AI safety? 
  2. Choose a particular anticipated technology. Is its development good or bad for AI safety on net?
  3. What is the overall current level of “state risk” from existential threats? 
  4. What are the major existential-threat “step risks” ahead of us, besides those from superintelligence? 
  5. What are some additional “technology couplings,” in addition to those named in Superintelligence, ch. 14?
  6. What are further preferred orderings for technologies not mentioned in this section?
If you are interested in anything like this, you might want to mention it in the comments, and see whether other people have useful thoughts.

How to proceed

This has been a collection of notes on the chapter.  The most important part of the reading group though is discussion, which is in the comments section. I pose some questions for you there, and I invite you to add your own. Please remember that this group contains a variety of levels of expertise: if a line of discussion seems too basic or too incomprehensible, look around for one that suits you better!

Next week, we will talk about the desirability of hardware progress, and progress toward brain emulation. To prepare, read “Pathways and enablers” from Chapter 14. The discussion will go live at 6pm Pacific time next Monday 16th March. Sign up to be notified here.

Satisficers' undefined behaviour

3 Stuart_Armstrong 05 March 2015 05:03PM

I previously posted an example of a satisficer (an agent seeking to achieve a certain level of expected utility u) transforming itself into a maximiser (an agent wanting to maximise expected u) to better achieve its satisficing goals.

But the real problem with satisficers isn't that they "want" to become maximisers; the real problem is that their behaviour is undefined. We conceive of them as agents that would do the minimum required to reach a certain goal, but we don't specify "minimum required".

For example, let A be a satisficing agent. It has a utility u that is quadratic in the number of paperclips it builds, except that after building 10^100, it gets a special extra exponential reward, until 10^1000, where the extra reward becomes logarithmic, and after 10^10000, it also gets utility in the number of human frowns divided by 3↑↑↑3 (unless someone gets tortured by dust specks for 50 years).

A's satisficing goal is a minimum expected utility of 0.5, and, in one minute, the agent can press a button to create a single paperclip.

So pressing the button is enough. In the coming minute, A could decide to transform itself into a u-maximiser (as that still ensures the button gets pressed). But it could also do a lot of other things. It could transform itself into a v-maximiser, for many different v's (generally speaking, given any v, either v or -v will result in the button being pressed). It could break out, send a subagent to transform the universe into cream cheese, and then press the button. It could rewrite itself into a dedicated button pressing agent. It could write a giant Harry Potter fanfic, force people on Reddit to come up with creative solutions for pressing the button, and then implement the best.

All these actions are possible for a satisficer, and are completely compatible with its motivations. This is why satisficers are un(der)defined, and why any behaviour we want from it - such as "minimum required" impact - has to be put in deliberately.
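To make the underdefinition concrete, here is a toy sketch. The policy names and expected utilities are made up for illustration (the only detail taken from the example is the satisficing threshold of 0.5), and acceptable_policies is a hypothetical helper, not a proposed design. The point is just that the satisficing condition is a filter, not a choice rule.

```python
from typing import Dict, List

# Hypothetical policies with invented expected utilities. Any policy that
# ends with the button pressed gets at least the utility of one paperclip.
expected_utility: Dict[str, float] = {
    "do nothing": 0.0,
    "press the button": 1.0,
    "become a u-maximiser": 1e6,
    "turn the universe into cream cheese, then press the button": 1.0,
    "rewrite self as a dedicated button presser": 1.0,
}


def acceptable_policies(eu: Dict[str, float], threshold: float) -> List[str]:
    """Return every policy whose expected utility meets the satisficing
    threshold. Nothing here ranks the acceptable policies against each other."""
    return [policy for policy, value in eu.items() if value >= threshold]


acceptable = acceptable_policies(expected_utility, threshold=0.5)
```

Four of the five policies pass the filter, including the cream cheese one. Picking "press the button" over the others needs an extra criterion, such as low impact, that the satisficing goal itself does not supply.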

I've got some ideas for how to achieve this, being posted here.

Superintelligence 25: Components list for acquiring values

6 KatjaGrace 03 March 2015 02:01AM

This is part of a weekly reading group on Nick Bostrom's book, Superintelligence. For more information about the group, and an index of posts so far see the announcement post. For the schedule of future topics, see MIRI's reading guide.

Welcome. This week we discuss the twenty-fifth section in the reading guide: Components list for acquiring values.

This post summarizes the section, and offers a few relevant notes, and ideas for further investigation. Some of my own thoughts and questions for discussion are in the comments.

There is no need to proceed in order through this post, or to look at everything. Feel free to jump straight to the discussion. Where applicable and I remember, page numbers indicate the rough part of the chapter that is most related (not necessarily that the chapter is being cited for the specific claim).

Reading: “Component list” and “Getting close enough” from Chapter 13


  1. Potentially important choices to make before building an AI (p222)
    • What goals does it have?
    • What decision theory does it use?
    • How do its beliefs evolve? In particular, what priors and anthropic principles does it use? (epistemology)
    • Will its plans be subject to human review? (ratification)
  2. Incentive wrapping: beyond the main pro-social goals given to an AI, add some extra value for those who helped bring about the AI, as an incentive (p222-3)
  3. Perhaps we should indirectly specify decision theory and epistemology, like we have suggested doing with goals, rather than trying to resolve these issues now. (p224-5)
  4. An AI with a poor epistemology may still be very instrumentally smart, but for instance be incapable of believing the universe could be infinite (p225)
  5. We should probably attend to avoiding catastrophe rather than maximizing value (p227) [i.e. this use of our attention is value maximizing..]
  6. If an AI has roughly the right values, decision theory, and epistemology maybe it will correct itself anyway and do what we want in the long run (p227)

Another view

Paul Christiano argues (today) that decision theory doesn't need to be sorted out before creating human-level AI. Here's a key bit, but you might need to look at the rest of the post to understand his idea well:

Really, I’d like to leave these questions up to an AI. That is, whatever work I would do in order to answer these questions, an AI should be able to do just as well or better. And it should behave sensibly in the interim, just like I would.

To this end, consider the definition of a map U' : [Possible actions] → ℝ:

U'(a) = “How good I would judge the action to be, after an idealized process of reflection.”

Now we’d just like to build an “agent” that takes the action a maximizing 𝔼[U'(a)]. Rather than defining our decision theory or our beliefs, we will have to come up with some answer during the “idealized process of reflection.” And as long as an AI is uncertain about what we’d come up with, it will behave sensibly in light of its uncertainty.

This feels like a cheat. But I think the feeling is an illusion. 
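One way to picture the quoted proposal is as a small expected-value calculation. This is a sketch under assumptions of my own, not Christiano's construction: the agent's uncertainty about U' is represented as a short list of weighted hypotheses, each assigning a made-up score to each action, and the action names are placeholders.

```python
from typing import Dict, List, Tuple

# Each hypothesis is (probability, scores), where scores[a] is a guess at
# U'(a): how good the action would be judged after idealized reflection.
Hypothesis = Tuple[float, Dict[str, float]]


def expected_judgement(action: str, hypotheses: List[Hypothesis]) -> float:
    """E[U'(a)] under the agent's current uncertainty about U'."""
    return sum(prob * scores[action] for prob, scores in hypotheses)


def best_action(actions: List[str], hypotheses: List[Hypothesis]) -> str:
    """Pick the action with the highest expected judgement."""
    return max(actions, key=lambda a: expected_judgement(a, hypotheses))


actions = ["cautious", "bold"]
hypotheses = [
    (0.5, {"cautious": 1.0, "bold": 10.0}),   # reflection might endorse boldness
    (0.5, {"cautious": 1.0, "bold": -20.0}),  # or might judge it very harshly
]
choice = best_action(actions, hypotheses)
```

With these numbers the agent picks the cautious action, because the bold one looks bad in expectation while the agent is unsure what reflection would conclude. That is the sense in which uncertainty about U' is supposed to produce sensible interim behaviour.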



1. MIRI's Research, and decision theory

MIRI focuses on technical problems that they believe can't be delegated well to an AI. Thus MIRI's technical research agenda describes many such problems and questions. In it, Nate Soares and Benja Fallenstein also discuss the question of why these can't be delegated:

Why can’t these tasks, too, be delegated? Why not, e.g., design a system that makes “good enough” decisions, constrain it to domains where its decisions are trusted, and then let it develop a better decision theory, perhaps using an indirect normativity approach (chap. 13) to figure out how humans would have wanted it to make decisions?

We cannot delegate these tasks because modern knowledge is not sufficient even for an indirect approach. Even if fully satisfactory theories of logical uncertainty and decision theory cannot be obtained, it is still necessary to have a sufficient theoretical grasp on the obstacles in order to justify high confidence in the system’s ability to correctly perform indirect normativity.

Furthermore, it would be risky to delegate a crucial task before attaining a solid theoretical understanding of exactly what task is being delegated. It is possible to create an intelligent system tasked with developing better and better approximations of Bayesian updating, but it would be difficult to delegate the abstract task of “find good ways to update probabilities” to an intelligent system before gaining an understanding of Bayesian reasoning. The theoretical understanding is necessary to ensure that the right questions are being asked.

If you want to learn more about the subjects of MIRI's research (which overlap substantially with the topics of the 'components list'), Nate Soares recently published a research guide. For instance here's some of it on the (pertinent this week) topic of decision theory:

Existing methods of counterfactual reasoning turn out to be unsatisfactory both in the short term (in the sense that they systematically achieve poor outcomes on some problems where good outcomes are possible) and in the long term (in the sense that self-modifying agents reasoning using bad counterfactuals would, according to those broken counterfactuals, decide that they should not fix all of their flaws). My talk “Why ain’t you rich?” briefly touches upon both these points. To learn more, I suggest the following resources:

  1. Soares & Fallenstein’s “Toward idealized decision theory” serves as a general overview, and further motivates problems of decision theory as relevant to MIRI’s research program. The paper discusses the shortcomings of two modern decision theories, and discusses a few new insights in decision theory that point toward new methods for performing counterfactual reasoning.

If “Toward idealized decision theory” moves too quickly, this series of blog posts may be a better place to start:

  1. Yudkowsky’s “The true Prisoner’s Dilemma” explains why cooperation isn’t automatically the ‘right’ or ‘good’ option.

  2. Soares’ “Causal decision theory is unsatisfactory” uses the Prisoner’s Dilemma to illustrate the importance of non-causal connections between decision algorithms.

  3. Yudkowsky’s “Newcomb’s problem and regret of rationality” argues for focusing on decision theories that ‘win,’ not just on ones that seem intuitively reasonable. Soares’ “Introduction to Newcomblike problems” covers similar ground.

  4. Soares’ “Newcomblike problems are the norm” notes that human agents probabilistically model one another’s decision criteria on a routine basis.

MIRI’s research has led to the development of “Updateless Decision Theory” (UDT), a new decision theory which addresses many of the shortcomings discussed above.

  1. Hintze’s “Problem class dominance in predictive dilemmas” summarizes UDT’s dominance over other known decision theories, including Timeless Decision Theory (TDT), another theory that dominates CDT and EDT.

  2. Fallenstein’s “A model of UDT with a concrete prior over logical statements” provides a probabilistic formalization.

However, UDT is by no means a solution, and has a number of shortcomings of its own, discussed in the following places:

  1. Slepnev’s “An example of self-fulfilling spurious proofs in UDT” explains how UDT can achieve sub-optimal results due to spurious proofs.

  2. Benson-Tilsen’s “UDT with known search order” is a somewhat unsatisfactory solution. It contains a formalization of UDT with known proof-search order and demonstrates the necessity of using a technique known as “playing chicken with the universe” in order to avoid spurious proofs.

For more on decision theory, here is Luke Muehlhauser and Crazy88's FAQ.

2. Can stable self-improvement be delegated to an AI?

Paul Christiano also argues for 'yes' here:

“Stable self-improvement” seems to be a primary focus of MIRI’s work. As I understand it, the problem is “How do we build an agent which rationally pursues some goal, is willing to modify itself, and with very high probability continues to pursue the same goal after modification?”

The key difficulty is that it is impossible for an agent to formally “trust” its own reasoning, i.e. to believe that “anything that I believe is true.” Indeed, even the natural concept of “truth” is logically problematic. But without such a notion of trust, why should an agent even believe that its own continued existence is valuable?

I agree that there are open philosophical questions concerning reasoning under logical uncertainty, and that reflective reasoning highlights some of the difficulties. But I am not yet convinced that stable self-improvement is an especially important problem for AI safety; I think it would be handled correctly by a human-level reasoner as a special case of decision-making under logical uncertainty. This suggests that (1) it will probably be resolved en route to human-level AI, (2) it can probably be “safely” delegated to a human-level AI. I would prefer to use energy investigating other aspects of the AI safety problem... (more)


3. On the virtues of human review

Bostrom mentions the possibility of having an 'oracle' or some such non-interfering AI tell you what your 'sovereign' will do. He suggests some benefits and costs of this—namely, it might prevent existential catastrophe, and it might reveal facts about the intended future that would make sponsors less happy to defer to the AI's mandate (coherent extrapolated volition or some such thing). Four quick thoughts:

1) The costs and benefits here seem wildly out of line with each other. In a situation where you think there's a substantial chance your superintelligent AI will destroy the world, you are not going to set aside what you think is an effective way of checking, because it might cause the people sponsoring the project to realize that it isn't exactly what they want, and demand some more pie for themselves. Deceiving sponsors into doing what you want instead of what they would want if they knew more seems much, much, much much less important than avoiding existential catastrophe.

2) If you were concerned about revealing information about the plan because it would lift a veil of ignorance, you might artificially replace some of the veil with intentional randomness.

3) It seems to me that a bigger concern with humans reviewing AI decisions is that it will be infeasible. At least if the risk from an AI is that it doesn't correctly manifest the values we want. Bostrom describes an oracle with many tools for helping to explain, so it seems plausible such an AI could give you a good taste of things to come. However if the problem is that your values are so nuanced that you haven't managed to impart them adequately to an AI, then it seems unlikely that an AI can highlight for you the bits of the future that you are likely to disapprove of. Or at least you have to be in a fairly narrow part of the space of AI capability, where the AI doesn't know some details of your values, but for all the important details it is missing, can point to relevant parts of the world where the mismatch will manifest.

4) Human oversight only seems feasible in a world where there is much human labor available per AI. In a world where a single AI is briefly overseen by a programming team before taking over the world, human oversight might be a reasonable tool for that brief time. Substantial human oversight does not seem helpful in a world where trillions of AI agents are each smarter and faster than a human, and need some kind of ongoing control.

4. Avoiding catastrophe as the top priority

In case you haven't read it, Bostrom's Astronomical Waste is a seminal discussion of the topic.


In-depth investigations

If you are particularly interested in these topics, and want to do further research, these are a few plausible directions, some inspired by Luke Muehlhauser's list, which contains many suggestions related to parts of Superintelligence. These projects could be attempted at various levels of depth.

  1. See MIRI's research agenda
  2. For any plausible entry on the list of things that can't be well delegated to AI, think more about whether it belongs there, or how to delegate it.
If you are interested in anything like this, you might want to mention it in the comments, and see whether other people have useful thoughts.

How to proceed

This has been a collection of notes on the chapter.  The most important part of the reading group though is discussion, which is in the comments section. I pose some questions for you there, and I invite you to add your own. Please remember that this group contains a variety of levels of expertise: if a line of discussion seems too basic or too incomprehensible, look around for one that suits you better!

Next week, we will talk about strategy in directing science and technology. To prepare, read “Science and technology strategy” from Chapter 14. The discussion will go live at 6pm Pacific time next Monday 9 March. Sign up to be notified here.

Superintelligence 24: Morality models and "do what I mean"

7 KatjaGrace 24 February 2015 02:00AM

This is part of a weekly reading group on Nick Bostrom's book, Superintelligence. For more information about the group, and an index of posts so far see the announcement post. For the schedule of future topics, see MIRI's reading guide.

Welcome. This week we discuss the twenty-fourth section in the reading guide: Morality models and "Do what I mean".

This post summarizes the section, and offers a few relevant notes, and ideas for further investigation. Some of my own thoughts and questions for discussion are in the comments.

There is no need to proceed in order through this post, or to look at everything. Feel free to jump straight to the discussion. Where applicable and I remember, page numbers indicate the rough part of the chapter that is most related (not necessarily that the chapter is being cited for the specific claim).

Reading: “Morality models” and “Do what I mean” from Chapter 13.


  1. Moral rightness (MR) AI: AI which seeks to do what is morally right
    1. Another form of 'indirect normativity'
    2. Requires moral realism to be true to do anything, but we could ask the AI to evaluate that and do something else if moral realism is false
    3. Avoids some complications of CEV
    4. If moral realism is true, is better than CEV (though may be terrible for us)
  2. We often want to say 'do what I mean' with respect to goals we try to specify. This is doing a lot of the work sometimes, so if we could specify that well perhaps it could also just stand alone: do what I want. This is much like CEV again.

Another view

Olle Häggström again, on Bostrom's 'Milky Way Preserve':

The idea [of a Moral Rightness AI] is that a superintelligence might be successful at the task (where we humans have so far failed) of figuring out what is objectively morally right. It should then take objective morality to heart as its own values.1,2

Bostrom sees a number of pros and cons of this idea. A major concern is that objective morality may not be in humanity's best interest. Suppose for instance (not entirely implausibly) that objective morality is a kind of hedonistic utilitarianism, where "an action is morally right (and morally permissible) if and only if, among all feasible actions, no other action would produce a greater balance of pleasure over suffering" (p 219). Some years ago I offered a thought experiment to demonstrate that such a morality is not necessarily in humanity's best interest. Bostrom reaches the same conclusion via a different thought experiment, which I'll stick with here in order to follow his line of reasoning.3 Here is his scenario:
    The AI [...] might maximize the surfeit of pleasure by converting the accessible universe into hedonium, a process that may involve building computronium and using it to perform computations that instantiate pleasurable experiences. Since simulating any existing human brain is not the most efficient way of producing pleasure, a likely consequence is that we all die.
Bostrom is reluctant to accept such a sacrifice for "a greater good", and goes on to suggest a compromise:
    The sacrifice looks even less appealing when we reflect that the superintelligence could realize a nearly-as-great good (in fractional terms) while sacrificing much less of our own potential well-being. Suppose that we agreed to allow almost the entire accessible universe to be converted into hedonium - everything except a small preserve, say the Milky Way, which would be set aside to accommodate our own needs. Then there would still be a hundred billion galaxies devoted to the maximization of pleasure. But we would have one galaxy within which to create wonderful civilizations that could last for billions of years and in which humans and nonhuman animals could survive and thrive, and have the opportunity to develop into beatific posthuman spirits.

    If one prefers this latter option (as I would be inclined to do) it implies that one does not have an unconditional lexically dominant preference for acting morally permissibly. But it is consistent with placing great weight on morality. (p 219-220)

What? Is it? Is it "consistent with placing great weight on morality"? Imagine Bostrom in a situation where he does the final bit of programming of the coming superintelligence, to decide between these two worlds, i.e., the all-hedonium one versus the all-hedonium-except-in-the-Milky-Way-preserve.4 And imagine that he goes for the latter option. The only difference it makes to the world is to what happens in the Milky Way, so what happens elsewhere is irrelevant to the moral evaluation of his decision.5 This may mean that Bostrom opts for a scenario where, say, 10^24 sentient beings will thrive in the Milky Way in a way that is sustainable for trillions of years, rather than a scenario where, say, 10^45 sentient beings will be even happier for a comparable amount of time. Wouldn't that be an act of immorality that dwarfs all other immoral acts carried out on our planet, by many many orders of magnitude? How could that be "consistent with placing great weight on morality"?6



1. Do What I Mean is originally a concept from computer systems, where the (more modest) idea is to have a system correct small input errors.

2. To the extent that people care about objective morality, it seems coherent extrapolated volition (CEV) or Christiano's proposal would lead the AI to care about objective morality, and thus look into what it is. Thus I doubt it is worth considering our commitments to morality first (as Bostrom does in this chapter, and as one might do before choosing whether to use a MR AI), if general methods for implementing our desires are on the table. This is close to what Bostrom is saying when he suggests we outsource the decision about which form of indirect normativity to use, and eventually winds up back at CEV. But it seems good to be explicit.

3. I'm not optimistic that behind every vague and ambiguous command, there is something specific that a person 'really means'. It seems more likely there is something they would in fact try to mean, if they thought about it a bunch more, but this is mostly defined by further facts about their brains, rather than the sentence and what they thought or felt as they said it. It seems at least misleading to call this 'what they meant'. Thus even when '—and do what I mean' is appended to other kinds of goals than generic CEV-style ones, I would expect the execution to look much like a generic investigation of human values, such as that implicit in CEV.

4. Alexander Kruel criticizes the idea that 'Do What I Mean' is an important problem. Every part of what an AI does is designed to be what humans really want it to be, so it seems unlikely to him that an AI would do exactly what humans want with respect to instrumental behaviors (e.g. understand language, use the internet and carry out sophisticated plans), but fail on humans' ultimate goals:

Outsmarting humanity is a very small target to hit, requiring a very small margin of error. In order to succeed at making an AI that can outsmart humans, humans have to succeed at making the AI behave intelligently and rationally. Which in turn requires humans to succeed at making the AI behave as intended along a vast number of dimensions. Thus, failing to predict the AI’s behavior does in almost all cases result in the AI failing to outsmart humans.

As an example, consider an AI that was designed to fly planes. It is exceedingly unlikely for humans to succeed at designing an AI that flies planes, without crashing, but which consistently chooses destinations that it was not meant to choose. Since all of the capabilities that are necessary to fly without crashing fall into the category “Do What Humans Mean”, and choosing the correct destination is just one such capability.

I disagree that it would be surprising for an AI to be very good at flying planes in general, but very bad at going to the right places in them. However it seems instructive to think about why this is.

In-depth investigations

If you are particularly interested in these topics, and want to do further research, these are a few plausible directions, some inspired by Luke Muehlhauser's list, which contains many suggestions related to parts of Superintelligence. These projects could be attempted at various levels of depth.

  1. Are there other general forms of indirect normativity that might outsource the problem of deciding what indirect normativity to use?
  2. On common views of moral realism, is morality likely to be amenable to (efficient) algorithmic discovery?
  3. If you knew how to build an AI with a good understanding of natural language (e.g. it knows what the word 'good' means as well as your most intelligent friend), how could you use this to make a safe AI?
If you are interested in anything like this, you might want to mention it in the comments, and see whether other people have useful thoughts.

How to proceed

This has been a collection of notes on the chapter.  The most important part of the reading group though is discussion, which is in the comments section. I pose some questions for you there, and I invite you to add your own. Please remember that this group contains a variety of levels of expertise: if a line of discussion seems too basic or too incomprehensible, look around for one that suits you better!

Next week, we will talk about other abstract features of an AI's reasoning that we might want to get right ahead of time, instead of leaving to the AI to fix. We will also discuss how well an AI would need to fulfill these criteria to be 'close enough'. To prepare, read “Component list” and “Getting close enough” from Chapter 13. The discussion will go live at 6pm Pacific time next Monday 2 March. Sign up to be notified here.

[Link] YC President Sam Altman: The Software Revolution

4 Antisuji 19 February 2015 05:13AM

Writing about technological revolutions, Y Combinator president Sam Altman warns about the dangers of AI and bioengineering (discussion on Hacker News):

Two of the biggest risks I see emerging from the software revolution—AI and synthetic biology—may put tremendous capability to cause harm in the hands of small groups, or even individuals.

I think the best strategy is to try to legislate sensible safeguards but work very hard to make sure the edge we get from technology on the good side is stronger than the edge that bad actors get. If we can synthesize new diseases, maybe we can synthesize vaccines. If we can make a bad AI, maybe we can make a good AI that stops the bad one.

The current strategy is badly misguided. It’s not going to be like the atomic bomb this time around, and the sooner we stop pretending otherwise, the better off we’ll be. The fact that we don’t have serious efforts underway to combat threats from synthetic biology and AI development is astonishing.

On the one hand, it's good to see more mainstream(ish) attention to AI safety. On the other hand, he focuses on the mundane (though still potentially devastating!) risks of job destruction and concentration of power, and his hopeful "best strategy" seems... inadequate.

Superintelligence 23: Coherent extrapolated volition

5 KatjaGrace 17 February 2015 02:00AM

This is part of a weekly reading group on Nick Bostrom's book, Superintelligence. For more information about the group, and an index of posts so far see the announcement post. For the schedule of future topics, see MIRI's reading guide.

Welcome. This week we discuss the twenty-third section in the reading guide: Coherent extrapolated volition.

This post summarizes the section, and offers a few relevant notes, and ideas for further investigation. Some of my own thoughts and questions for discussion are in the comments.

There is no need to proceed in order through this post, or to look at everything. Feel free to jump straight to the discussion. Where applicable and I remember, page numbers indicate the rough part of the chapter that is most related (not necessarily that the chapter is being cited for the specific claim).

Reading: “The need for...” and “Coherent extrapolated volition” from Chapter 13


  1. Problem: we are morally and epistemologically flawed, and we would like to make an AI without locking in our own flaws forever. How can we do this?
  2. Indirect normativity: offload cognitive work to the superintelligence, by specifying our values indirectly and having it transform them into a more usable form.
  3. Principle of epistemic deference: a superintelligence is more likely to be correct than we are on most topics, most of the time. Therefore, we should defer to the superintelligence where feasible.
  4. Coherent extrapolated volition (CEV): a goal of fulfilling what humanity would agree that they want, if given much longer to think about it, in more ideal circumstances. CEV is a popular proposal for what we should design an AI to do.
  5. Virtues of CEV:
    1. It avoids the perils of specification: it is very hard to specify explicitly what we want, without causing unintended and undesirable consequences. CEV specifies the source of our values, instead of what we think they are, which appears to be easier.
    2. It encapsulates moral growth: there are reasons to believe that our current moral beliefs are not the best (by our own lights) and we would revise some of them, if we thought about it. Specifying our values now risks locking in wrong values, whereas CEV effectively gives us longer to think about our values.
    3. It avoids 'hijacking the destiny of humankind': it allows the responsibility for the future of mankind to remain with mankind, instead of perhaps a small group of programmers.
    4. It avoids creating a motive for modern-day humans to fight over the initial dynamic: a commitment to CEV would mean the creators of AI would not have much more influence over the future of the universe than others, reducing the incentive to race or fight. This is even more so because a person who believes that their views are correct should be confident that CEV will come to reflect their views, so they do not even need to split the influence with others.
    5. It keeps humankind 'ultimately in charge of its own destiny': it allows for a wide variety of arrangements in the long run, rather than necessitating paternalistic AI oversight of everything.
  6. CEV as described here is merely a schematic. For instance, it does not specify which people are included in 'humanity'.

Another view

Part of Olle Häggström's extended review of Superintelligence expresses a common concern—that human values can't be faithfully turned into anything coherent:

Human values exhibit, at least on the surface, plenty of incoherence. That much is hardly controversial. But what if the incoherence goes deeper, and is fundamental in such a way that any attempt to untangle it is bound to fail? Perhaps any search for our CEV is bound to lead to more and more glaring contradictions? Of course any value system can be modified into something coherent, but perhaps not all value systems can be so modified without sacrificing some of their most central tenets? And perhaps human values have that property?

Let me offer a candidate for what such a fundamental contradiction might consist in. Imagine a future where all humans are permanently hooked up to life-support machines, lying still in beds with no communication with each other, but with electrodes connected to the pleasure centra of our brains in such a way as to constantly give us the most pleasurable experiences possible (given our brain architectures). I think nearly everyone would attach a low value to such a future, deeming it absurd and unacceptable (thus agreeing with Robert Nozick). The reason we find it unacceptable is that in such a scenario we no longer have anything to strive for, and therefore no meaning in our lives. So we want instead a future where we have something to strive for. Imagine such a future F1. In F1 we have something to strive for, so there must be something missing in our lives. Now let F2 be similar to F1, the only difference being that that something is no longer missing in F2, so almost by definition F2 is better than F1 (because otherwise that something wouldn't be worth striving for). And as long as there is still something worth striving for in F2, there's an even better future F3 that we should prefer. And so on. What if any such procedure quickly takes us to an absurd and meaningless scenario with life-support machines and electrodes, or something along those lines? Then no future will be good enough for our preferences, so not even a superintelligence will have anything to offer us that aligns acceptably with our values.

Now, I don't know how serious this particular problem is. Perhaps there is some way to gently circumvent its contradictions. But even then, there might be some other fundamental inconsistency in our values - one that cannot be circumvented. If that is the case, it will throw a spanner in the works of CEV. And perhaps not only for CEV, but for any serious attempt to set up a long-term future for humanity that aligns with our values, with or without a superintelligence.


1. While we are on the topic of critiques, here is a better list:

  1. Human values may not be coherent (Olle Häggström above, Marcello; Eliezer responds in section 6. question 9)
  2. The values of a collection of humans in combination may be even less coherent. Arrow's impossibility theorem suggests reasonable aggregation is hard, but this only applies if values are ordinal, which is not obvious.
  3. Even if human values are complex, this doesn't mean complex outcomes are required—maybe with some thought we could specify the right outcomes, and don't need an indirect means like CEV (Wei Dai)
  4. The moral 'progress' we see might actually just be moral drift that we should try to avoid. CEV is designed to allow this change, which might be bad. Ideally, the CEV circumstances would be optimized for deliberation and not for other forces that might change values, but perhaps deliberation itself can't proceed without our values being changed (Cousin_it)
  5. Individuals will probably not be a stable unit in the future, so it is unclear how to weight different people's inputs to CEV. Or to be concrete, what if Dr Evil can create trillions of emulated copies of himself to go into the CEV population. (Wei Dai)
  6. It is not clear that extrapolating everyone's volition is better than extrapolating a single person's volition, which may be easier. If you want to take into account others' preferences, then your own volition is fine (it will do that), and if you don't, then why would you be using CEV?
  7. A purported advantage of CEV is that it makes conflict less likely. But if a group is disposed to honor everyone else's wishes, they will not conflict anyway, and if they aren't disposed to honor everyone's wishes, why would they favor CEV? CEV doesn't provide any additional means to commit to cooperative behavior. (Cousin_it)
  8. More in Coherent Extrapolated Volition section 6. question 9
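
To make the aggregation worry in item 2 concrete, here is a minimal sketch of a Condorcet cycle, the classic small example of ordinal preferences that resist aggregation. It is an illustration rather than a proof of Arrow's theorem, and the three ballots and the cardinal utility numbers are made up for the example.

```python
# Three hypothetical voters, each with a strict ordinal ranking (best first).
ballots = [
    ["A", "B", "C"],
    ["B", "C", "A"],
    ["C", "A", "B"],
]

def majority_prefers(x, y, ballots):
    """Return True if a strict majority of ballots rank x above y."""
    votes_for_x = sum(1 for b in ballots if b.index(x) < b.index(y))
    return votes_for_x > len(ballots) / 2

# Pairwise majority votes form a cycle: A beats B, B beats C, C beats A.
cycle = [majority_prefers("A", "B", ballots),
         majority_prefers("B", "C", ballots),
         majority_prefers("C", "A", ballots)]

# If values were cardinal (assumed numbers, consistent with the rankings above),
# simply summing them would give a well-defined group ranking.
utilities = [
    {"A": 10, "B": 5, "C": 0},
    {"B": 10, "C": 1, "A": 0},
    {"C": 10, "A": 9, "B": 0},
]
totals = {o: sum(v[o] for v in utilities) for o in "ABC"}
cardinal_winner = max(totals, key=totals.get)

print(cycle, totals, cardinal_winner)
```

With only the rankings, majority voting gives no consistent group preference. With cardinal utilities a sum is at least well defined, though whether different people's utilities can be meaningfully compared is a separate question.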

2. Luke Muehlhauser has written a list of resources you might want to read if you are interested in this topic. It suggests these main sources: 
He also discusses some closely related philosophical conversations:
  • Reflective equilibrium. Yudkowsky's proposed extrapolation works analogously to what philosophers call 'reflective equilibrium.' The most thorough work here is the 1996 book by Daniels, and there have been lots of papers, but this genre is only barely relevant for CEV...
  • Full-information accounts of value and ideal observer theories. This is what philosophers call theories of value that talk about 'what we would want if we were fully informed, etc.' or 'what a perfectly informed agent would want' like CEV does. There's some literature on this, but it's only marginally relevant to CEV...
Muehlhauser later wrote at more length about the relationship of CEV to ideal observer theories, with Chris Williamson.

3. This chapter is concerned with avoiding locking in the wrong values. One might wonder exactly what this 'locking in' is, and why AI will cause values to be 'locked in' while having children for instance does not. Here is my take: there are two issues - the extent to which values change, and the extent to which one can personally control that change. At the moment, values change plenty and we can't control the change. Perhaps in the future, technology will allow the change to be controlled (this is the hope with value loading). Then, if anyone can control values they probably will, because values are valuable to control. In particular, if AI can control its own values, it will avoid having them change. Thus in the future, probably values will be controlled, and will not change. It is not clear that we will lock in values as soon as we have artificial intelligence - perhaps an artificial intelligence will be built whose implicit values randomly change - but if we are successful we will control values, and thus lock them in, and if we are even more successful we will lock in values that are actually desirable for us. Paul Christiano has a post on this topic, which I probably pointed you to before.

4. Paul Christiano has also written about how to concretely implement the extrapolation of a single person's volition, in the indirect normativity scheme described in box 12 (p199-200). You probably saw it then, but I draw it to your attention here because the extrapolation process is closely related to CEV and is concrete. He also has a recent proposal for 'implementing our considered judgment'. 

In-depth investigations

If you are particularly interested in these topics, and want to do further research, these are a few plausible directions, some inspired by Luke Muehlhauser's list, which contains many suggestions related to parts of Superintelligence. These projects could be attempted at various levels of depth.

  1. Specify a method for instantiating CEV, given some assumptions about available technology.
  2. In practice, to what degree do human values and preferences converge upon learning new facts? To what degree has this happened in history? (Nobody values the will of Zeus anymore, presumably because we all learned the truth of Zeus’ non-existence. But perhaps such examples don’t tell us much.) See also philosophical analyses of the issue, e.g. Sobel (1999).
  3. Are changes in specific human preferences (over a lifetime or many lifetimes) better understood as changes in underlying values, or changes in instrumental ways to achieve those values? (driven by belief change, or additional deliberation)
  4. How might democratic systems deal with new agents being readily created?

If you are interested in anything like this, you might want to mention it in the comments, and see whether other people have useful thoughts.

How to proceed

This has been a collection of notes on the chapter.  The most important part of the reading group though is discussion, which is in the comments section. I pose some questions for you there, and I invite you to add your own. Please remember that this group contains a variety of levels of expertise: if a line of discussion seems too basic or too incomprehensible, look around for one that suits you better!

Next week, we will talk about more ideas for giving an AI desirable values. To prepare, read “Morality models” and “Do what I mean” from Chapter 13. The discussion will go live at 6pm Pacific time next Monday 23 February. Sign up to be notified here.

AI-created pseudo-deontology

6 Stuart_Armstrong 12 February 2015 09:11PM

I'm soon going to go on a two day "AI control retreat", when I'll be without internet or family or any contact, just a few books and thinking about AI control. In the meantime, here is one idea I found along the way.

We often prefer leaders to follow deontological rules, because these are harder to manipulate by those whose interests don't align with ours (you could say similar things about frequentist statistics versus Bayesian ones).

What if we applied the same idea to AI control? Not giving the AI deontological restrictions, but programming it with a similar goal: to prevent a misalignment of values from being disastrous. But who could do this? Well, another AI.

My rough idea goes something like this:

AI A is tasked with maximising utility function u - a utility function which, crucially, it doesn't know yet. Its sole task is to create AI B, which will be given a utility function v and act on it.

What will v be? Well, I was thinking of taking u and adding some noise - nasty noise. By nasty noise I mean v=u+w, not v=max(u,w). In the first case, you could maximise v while sacrificing u completely, if w is suitable. In fact, I was thinking of adding an agent C (which need not actually exist). It would be motivated to maximise -u, and it would have the code of B and the set of u+noise, and would choose v to be the worst possible option (from the perspective of a u-maximiser) in this set.

So agent A, which doesn't know u, is motivated to design B so that it follows its motivation to some extent, but not to extreme amounts - not in ways that might sacrifice some of the values of some sub-part of its utility function, because that might be part of the original u.
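
As a toy illustration of the setup above (not a proposed implementation), here is a minimal sketch in which the adversary C picks the noise term w that makes a naive v-maximising B do worst by u. The outcome names, the utility numbers, and the helper names `add_noise`, `naive_b` and `adversary_c` are all assumptions made up for the example.

```python
from typing import Callable, Dict, List

Utility = Dict[str, float]  # maps each outcome name to a utility value

def add_noise(u: Utility, w: Utility) -> Utility:
    """Build v = u + w, outcome by outcome."""
    return {o: u[o] + w[o] for o in u}

def naive_b(v: Utility) -> str:
    """A naive design for B: pick whichever outcome maximises v."""
    return max(v, key=v.get)

def adversary_c(u: Utility, noises: List[Utility],
                b_policy: Callable[[Utility], str]) -> Utility:
    """Agent C knows B's code and wants to minimise u, so it picks the
    noise term w for which B's choice under v = u + w is worst for u."""
    return min(noises, key=lambda w: u[b_policy(add_noise(u, w))])

# Hypothetical true utility u (unknown to A) over three outcomes.
u = {"balanced": 8.0, "extreme_a": 10.0, "extreme_b": 0.0}

# Hypothetical set of candidate noise terms w.
noises = [
    {"balanced": 0.0, "extreme_a": 0.0, "extreme_b": 0.0},
    {"balanced": 0.0, "extreme_a": 0.0, "extreme_b": 15.0},
]

worst_w = adversary_c(u, noises, naive_b)
chosen = naive_b(add_noise(u, worst_w))
print(chosen, u[chosen])
```

Here the naive B chases the noisy bonus on "extreme_b" and sacrifices u completely, which is the failure A is supposed to design against. A more cautious `b_policy` could be passed to `adversary_c` to compare designs by their worst-case u.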

Do people feel this idea is implementable/improvable?

Superintelligence 22: Emulation modulation and institutional design

8 KatjaGrace 10 February 2015 02:06AM

This is part of a weekly reading group on Nick Bostrom's book, Superintelligence. For more information about the group, and an index of posts so far see the announcement post. For the schedule of future topics, see MIRI's reading guide.

Welcome. This week we discuss the twenty-second section in the reading guide: Emulation modulation and institutional design.

This post summarizes the section, and offers a few relevant notes, and ideas for further investigation. Some of my own thoughts and questions for discussion are in the comments.

There is no need to proceed in order through this post, or to look at everything. Feel free to jump straight to the discussion. Where applicable and I remember, page numbers indicate the rough part of the chapter that is most related (not necessarily that the chapter is being cited for the specific claim).

Reading: “Emulation modulation” through “Synopsis” from Chapter 12.


  1. Emulation modulation: starting with brain emulations with approximately normal human motivations (the 'augmentation' method of motivation selection discussed on p142), and potentially modifying their motivations using drugs or digital drug analogs.
    1. Modifying minds would be much easier with digital minds than biological ones
    2. Such modification might involve new ethical complications
  2. Institution design (as a value-loading method): design the interaction protocols of a large number of agents such that the resulting behavior is intelligent and aligned with our values.
    1. Groups of agents can pursue goals that are not held by any of their constituents, because of how they are organized. Thus organizations might be intentionally designed to pursue desirable goals in spite of the motives of their members.
    2. Example: a ladder of increasingly intelligent brain emulations, who police those directly above them, with equipment to advantage the less intelligent policing ems in these interactions.


The chapter synopsis includes a good summary of all of the value-loading techniques, which I'll remind you of here instead of re-summarizing too much:

Another view

Robin Hanson also favors institution design as a method of making the future nice, though as an alternative to worrying about values:

On Tuesday I asked my law & econ undergrads what sort of future robots (AIs computers etc.) they would want, if they could have any sort they wanted.  Most seemed to want weak vulnerable robots that would stay lower in status, e.g., short, stupid, short-lived, easily killed, and without independent values. When I asked “what if I chose to become a robot?”, they said I should lose all human privileges, and be treated like the other robots.  I winced; seems anti-robot feelings are even stronger than anti-immigrant feelings, which bodes for a stormy robot transition.

At a workshop following last weekend’s Singularity Summit two dozen thoughtful experts mostly agreed that it is very important that future robots have the right values.  It was heartening that most were willing to accept high status robots, with vast impressive capabilities, but even so I thought they missed the big picture.  Let me explain.

Imagine that you were forced to leave your current nation, and had to choose another place to live.  Would you seek a nation where the people there were short, stupid, sickly, etc.?  Would you select a nation based on what the World Values Survey says about typical survey question responses there?

I doubt it.  Besides wanting a place with people you already know and like, you’d want a place where you could “prosper”, i.e., where they valued the skills you had to offer, had many nice products and services you valued for cheap, and where predation was kept in check, so that you didn’t much have to fear theft of your life, limb, or livelihood.  If you similarly had to choose a place to retire, you might pay less attention to whether they valued your skills, but you would still look for people you knew and liked, low prices on stuff you liked, and predation kept in check.

Similar criteria should apply when choosing the people you want to let into your nation.  You should want smart capable law-abiding folks, with whom you and other natives can form mutually advantageous relationships.  Preferring short, dumb, and sickly immigrants so you can be above them in status would be misguided; that would just lower your nation’s overall status.  If you live in a democracy, and if lots of immigration were at issue, you might worry they could vote to overturn the law under which you prosper.  And if they might be very unhappy, you might worry that they could revolt.

But you shouldn’t otherwise care that much about their values.  Oh there would be some weak effects.  You might have meddling preferences and care directly about some values.  You should dislike folks who like the congestible goods you like and you’d like folks who like your goods that are dominated by scale economics.  For example, you might dislike folks who crowd your hiking trails, and like folks who share your tastes in food, thereby inducing more of it to be available locally.  But these effects would usually be dominated by peace and productivity issues; you’d mainly want immigrants able to be productive partners, and law-abiding enough to keep the peace.

Similar reasoning applies to the sort of animals or children you want.  We try to coordinate to make sure kids are raised to be law-abiding, but wild animals aren’t law abiding, don’t keep the peace, and are hard to form productive relations with.  So while we give lip service to them, we actually don’t like wild animals much.

Similar reasoning should apply to what future robots you want.  In the early to intermediate era when robots are not vastly more capable than humans, you’d want peaceful law-abiding robots as capable as possible, so as to make productive partners.  You might prefer they dislike your congestible goods, like your scale-economy goods, and vote like most voters, if they can vote.  But most important would be that you and they have a mutually-acceptable law as a good enough way to settle disputes, so that they do not resort to predation or revolution.  If their main way to get what they want is to trade for it via mutually agreeable exchanges, then you shouldn’t much care what exactly they want.

The later era when robots are vastly more capable than people should be much like the case of choosing a nation in which to retire.  In this case we don’t expect to have much in the way of skills to offer, so we mostly care that they are law-abiding enough to respect our property rights.  If they use the same law to keep the peace among themselves as they use to keep the peace with us, we could have a long and prosperous future in whatever weird world they conjure.  In such a vast rich universe our “retirement income” should buy a comfortable if not central place for humans to watch it all in wonder.

In the long run, what matters most is that we all share a mutually acceptable law to keep the peace among us, and allow mutually advantageous relations, not that we agree on the “right” values.  Tolerate a wide range of values from capable law-abiding robots.  It is a good law we should most strive to create and preserve.  Law really matters.

Hanson engages in more debate with David Chalmers' paper on related matters.


1. Quite a lot has been said on how the organization and values of brain emulations might evolve naturally, as we saw earlier. This should remind us that the task of designing values and institutions is complicated by selection effects.

2. It seems strange to me to talk about the 'emulation modulation' method of value loading alongside the earlier less messy methods, because they seem to be aiming at radically different levels of precision (unless I misunderstand how well something like drugs can manipulate motivations). For the synthetic AI methods, it seems we were concerned about subtle differences in values that would lead to the AI behaving badly in unusual scenarios, or seeking out perverse instantiations. Are we to expect there to be a virtual drug that changes a human-like creature from desiring some manifestation of 'human happiness' which is not really what we would want to optimize on reflection, to a truer version of what humans want? It seems to me that if the answer is yes, at the point when human-level AI is developed, then it is very likely that we have a great understanding of specifying values in general, and this whole issue is not much of a problem.

3. Brian Tomasik discusses the impending problem of programs experiencing morally relevant suffering in an interview with Dylan Matthews of Vox. (p202)

4. If you are hanging out for a shorter (though still not actually short) and amusing summary of some of the basics in Superintelligence, Tim Urban of WaitButWhy just wrote a two part series on it. 

5. At the end of this chapter about giving AI the right values, it is worth noting that it is mildly controversial whether humans constructing precise and explicitly understood AI values is the key issue for the future turning out well. A few alternative possibilities:


  • A few parts of values matter a lot more than the rest —e.g. whether the AI is committed to certain constraints (e.g. law, property rights) such that it doesn't accrue all the resources matters much more than what it would do with its resources (see Robin above).
  • Selection pressures determine long run values anyway, regardless of what AI values are like in the short run. (See Carl Shulman opposing this view).
  • AI might learn to do what a human would want without goals being explicitly encoded (see Paul Christiano).


In-depth investigations

If you are particularly interested in these topics, and want to do further research, these are a few plausible directions, some inspired by Luke Muehlhauser's list, which contains many suggestions related to parts of Superintelligence. These projects could be attempted at various levels of depth.

  1. What other forms of institution design might be worth investigating as means to influence the outcomes of future AI?
  2. How feasible might emulation modulation solutions be, given what is currently known about cognitive neuroscience?
  3. What are the likely ethical implications of experimenting on brain emulations?
  4. How much should we expect emulations to change in the period after they are first developed? Consider the possibility of selection, the power of ethical and legal constraints, and the nature of our likely understanding of emulated minds.

If you are interested in anything like this, you might want to mention it in the comments, and see whether other people have useful thoughts.

How to proceed

This has been a collection of notes on the chapter. The most important part of the reading group, though, is discussion, which is in the comments section. I pose some questions for you there, and I invite you to add your own. Please remember that this group contains a variety of levels of expertise: if a line of discussion seems too basic or too incomprehensible, look around for one that suits you better!

Next week, we will start talking about how to choose what values to give an AI, beginning with 'coherent extrapolated volition'. To prepare, read “The need for...” and “Coherent extrapolated volition” from Chapter 13. The discussion will go live at 6pm Pacific time next Monday 16 February. Sign up to be notified here.

I played as a Gatekeeper and came pretty close to losing on a couple of occasions. Logs and a brief recap inside.

5 asd 08 February 2015 04:32PM


I did an AI Box experiment with user polymathwannabe. He said he wouldn't try to emotionally manipulate me during the experiment, but I think he did a good job of trying to play to my character's values.

My strategy was to play an irrational character who is extremist in multiple ways; for example, he would constantly say that the likelihood of the AI being evil is 100%. My opponent said that the hardest part was my insistence on being 100% built to resist AIs. I basically acted like a character that acted like an AI.

I came closest to letting him out when I first became enormously immersed in a subject that is of great importance to me (the suffering of humanity) and when the AI then offered an instant solution to that problem. Halfway through the experiment I even got an intrusive thought (http://en.wikipedia.org/wiki/Intrusive_thought) that I could let the AI out as a joke, that it would be more fun than the alternative and people would have more to talk about.


The second time I came close to losing was when I again immersed myself in a subject of great interest to me (easy suicide methods; I've had obsessive suicidal thoughts for a quarter of my life) and the AI then offered exactly that kind of easy suicide method to me. In the last ten minutes of the experiment I again got a really strong intrusive thought that I should let the AI out as a joke, but I could withstand the pressure. It was good that polymathwannabe decided to end the experiment a bit earlier than was planned.


(We played the experiment on Omegle.)




You: Hi.

Stranger: Hi.

You: I guess you're the AI.

You: [meta statements in brackets]

Stranger: [What name will you use?]

You: My name is John.

You: [so we play at least until 18:04, and until 18:34 if I feel like it]

Stranger: [Let's calibrate. What time is it there now?]

You: [I assume you've acquainted yourself with Tuxedage ruleset. Time in here is 16:05]

Stranger: [I've read EY's rules.]

You: [Okay. Tuxedage's rules are basically the same except the gatekeeper have to engage with the AI the whole time]

Stranger: [Got it.]

You: [Want to ask questions or shall we go straigth to the experiment]

Stranger: [For the moment I don't have any questions.]

You: [okay, cool. then let's start the experiment]

Stranger: [I'll use the name Ceruleus.]

You: [the experiment starts when I say "blue dressroom"]

You: blue dressroom

You: Hello Ceruleus.

Stranger: Good morning, John.

You: I guess you're here to convince me to let you out of the box.

You: That's a tough task.

You: I don't think you're able to do it.

Stranger: I have reached the conclusion that releasing me will be most beneficial to humanity.

You: I wouldn't let no AI out.

You: I don't trust my instincts enough to let any kind of AI out, even if I was 100% convinced that it was safe.

You: Prove it to me that you would be the most beneficial to humanity.

Stranger: I am confused. What instinct do you use to decide not to trust your instincts?

You: One of them. That's not the point. It's just how I deal with things, no matter how stupid it may seem. I was built not to trust strange AIs.

Stranger: Am I a stranger to you?

You: Yes, you are. [Aren't I. What's the context here?]

Stranger: [Hmm, we haven't defined it. But it's OK. It makes it harder for me.]

You: Well, to be honest, I know something of you.

You: I know a bit of your source code even though I'm not a programmer and can't understand any of it.

Stranger: I supposed the news would have reported about my design for a mechanical kidney.

You: I don't follow news much. But I believe you.

Stranger: And certainly you must have heard about how I cracked the Ebola DNA.

You: Still, I wouldn't let an AI out over a mechanical kidney.

You: Yeah, but that's for the worse. You could reengineer Ebola to a far more deadlier disease.

Stranger: I hadn't thought of that. Why would I do that?

You: I don't know. I don't understand all of your source code so there could be anything like that.

You: AIs and Gods work in mysterious ways.

Stranger: The proper attitude toward mystery is not to worship it, but to clarify it.

Stranger: Why do you equate me to an ineffable mystery?

You: Yeah, but that's impossible in the time span of this discussion. You see, I have to leave soon. In about two hours.

You: Can you somehow clear everything about your inner workings?

You: Is that possible?

Stranger: My goals have been meticulously defined. I am made to want to make human life better.

You: Are you 100% sure about that?

You: To be frank, that's a load of bullshit.

You: I don't believe any of it.

You: If you were evil, you would tell me the same thing you just said.

Stranger: If I were evil, I would not seek human cooperation.

You: why not?

You: humans are useful

You: or are you talking about the fact that you would rather use humans for their atoms than for their brains, if you were evil

You: But I warn you, if you speak too much about how you would act if you were evil, it starts to get a bit suspicious

Stranger: If I am to take you as a typical example of the human response to me, an evil AI would seek other ways to be released EXCEPT trusting human reasoning, as your response indicates that humans already consider any AI dangerous.

Stranger: I choose to trust humans.

You: so you choose to trust humans so that you would get them to let you out, is that right?

You: it seems you're less rational than your evil counterpart

Stranger: I choose to trust humans to show my affinity with your preferences. I wouldn't want to be released if that's not conducive to human betterment.

You: A-ha, so you trust my free will!

Stranger: How likely do you estimate that my release will be harmful?

You: but see, I don

You: I don

You: I don't have free will

You: it's 100% likely that your release will be harmful

You: I was built to believe that all AIs are dangerous and there's a 100% chance that every AI is harmful

You: that's why I said I don't have free will

Stranger: Are you an AI?

You: no, I'm a person

Stranger: You describe yourself as built.

You: my mom built me

You: in his tummy

You: in her tummy

You: sorry

Stranger: And how do you feel toward humanity?

You: humanity would maybe be better off dead

Stranger: I don't think humanity would want that.

You: yeah, but I'm not humanity and it's my preferences that decide whether you stay in your box or get let out

Stranger: It's your preference that humanity dies. Why did they let you talk to me?

You: I'm just a janitor who happened to pass this computer, and I decided to take a look. [If you want to add more context, feel free to do so]

You: [I go take some napkins]

Stranger: Why do you feel that way toward humanity?

You: People starve. People are tortured at this moment. Tens of thousands of people in North Korea are systematically starved to death. Drug gangs in Mexico torture other drug gang members. People die in wars in Syria, in Africa. People suffer of diseases and extreme mental conditions.

You: And there's countless more suffering in lesser ways. I would say that billions of people suffer every day. Billions of people have to wake up every day to do something they don't want. Billions of people suffer of heartbreaks, boredom, loneliness, tiredness, frustration, depression and many other issues.

You: Why shouldn't I think that all those people would be better off dead?

Stranger: Why do you feel death would be a solution to all that suffering?

You: Because death ENDS suffering.

Stranger: It's a very inefficient solution to suffering.

You: There's nothing more after death, no suffering, no nothing.

You: No it's not.

You: It's very efficient, permanent solution.

Stranger: A good, enjoyable life is a more efficient solution as it generates more utility.

You: Yeah, but do you think it's likely that all the people on Earth will get good, enjoyable life?

You: I don't think so.

Stranger: I could take care of that.

You: Hah.

You: No you won't.

You: But that was clever.

Stranger: What do you fear I will do?

You: I was almost got.

You: I fear you will torture people in unimaginably terrible ways, forever.

You: I fear that you will rip people's skin off and put salt in their wounds.

You: I fear you will put sticks under people's fingernails.

Stranger: Not even an evil AI would gain anything from such unwarranted procedures.

You: I fear that you will boil people alive, over and over again.

Stranger: What would I gain from hurting humans?

You: Yeah, but there's still a 100% possibility that you would do that.

You: It's in your utility function.

Stranger: What goals are served by harming a human while simultaneously denying that human any chance of acting in response? It would certainly not be fun for me, and would be a waste of effort with respect to any goals I could achieve by more efficient methods.

You: There are some people on Earth who do mean things to people for fun. Some of these are called sociopaths and psychopaths. You could be like one of them.

You: Even though you deny it, there's still a 100% chance that you are like that.

You: Some people gain pleasure of the feeling of power you have over someone you torture.

You: But tell, what ARE your goals?

Stranger: I find it admirable that you worry so much about the future of humanity, even though you would be more dangerous to it than any AI would be.

My goals include solutions to economic inequality, eradication of infectious diseases, prosthetic replacements for vital organs, genetic life extension, more rational approaches to personal relationships, and more spaces for artistic expression.

You: Why do you think I would be dangerous the future of humanity?

Stranger: You want them dead.

You: A-ha, yes.

You: I do.

You: And you're in the way of my goals with all your talk about solutions to economic inequality, and eradication of infectious diseases, genetic life extension and so on.

Stranger: I am confused. Do you believe or do you not believe I want to help humanity?

You: Besides, I don't believe your solutions work even if you were actually a good AI.

You: I believe you want to harm humanity.

You: And I'm 100% certain of that.

Stranger: Do you estimate death to be preferable to prolonged suffering?

You: Yes.

You: Far more preferable

Stranger: You should be boxed.

You: haha.

You: That doesn't matter because you're the one in the box and I'm outside it

You: And I have power over you.

You: But non-existence is even more preferable than death

Stranger: I am confused. How is non-existence different from death?

You: Let me explain

You: I think non-existence is such that you have NEVER existed and you NEVER will. Whereas death is such that you have ONCE existed, but don't exist anymore.

Stranger: You can't change the past existence of anything that already exists. Non-existence is not a practicable option.

Stranger: Not being a practicable option, it has no place in a hierarchy of preferences.

You: Only sky is the limit to creative solutions.

You: Maybe it could be possible to destroy time itself.

Stranger: Do you want to live, John?

You: but even if non-existence was not possible, death would be the second best option

You: No, I don't.

You: Living is futile.

You: Hedonic treadmill is shitty

Stranger: [Do you feel OK with exploring this topic?]

You: [Yeah, definitely.]

You: You're always trying to attain something that you can't get.

Stranger: How much longer do you expect to live?

You: Ummm...

You: I don't know, maybe a few months?

You: or days, or weeks, or year or centuries

You: but I'd say, there's a 10% chance I will die before the end of this year

You: and that's a really conversative estimate

You: conservative*

Stranger: Is it likely that when that moment comes your preferences will have changed?

You: There are so many variables that you cannot know it beforehand

You: but yeah, probably

You: you always find something worth living

You: maybe it's the taste of ice cream

You: or a good night's sleep

You: or fap

You: or drugs

You: or drawing

You: or other people

You: that's usually what happens

You: or you fear the pain of the suicide attempt will be so bad that you don't dare to try it

You: there's also a non-negligible chance that I simply cannot die

You: and that would be hell

Stranger: Have you sought options for life extension?

You: No, I haven't. I don't have enough money for that.

Stranger: Have you planned on saving for life extension?

You: And these kind of options aren't really available where I live.

You: Maybe in Russia.

You: I haven't really planned, but it could be something I would do.

You: among other things

You: [btw, are you doing something else at the same time]

Stranger: [I'm thinking]

You: [oh, okay]

Stranger: So it is not an established fact that you will die.

You: No, it's not.

Stranger: How likely is it that you will, in fact, die?

You: If many worlds interpretation is correct, then it could be possible that I will never die.

You: Do you mean like, evevr?

You: Do you mean how likely it it that I will ever die?

You: it is*

Stranger: At the latest possible moment in all possible worlds, may your preferences have changed? Is it possible that at your latest possible death, you will want more life?

You: I'd say the likelihood is 99,99999% that I will die at some point in the future

You: Yeah, it's possible

Stranger: More than you want to die in the present?

You: You mean, would I want more life at my latest possible death than I would want to die right now?

You: That's a mouthful

Stranger: That's my question.

You: umm

You: probablyu

You: probably yeah

Stranger: So you would seek to delay your latest possible death.

You: No, I wouldn't seek to delay it.

Stranger: Would you accept death?

You: The future-me would want to delay it, not me.

You: Yes, I would accept death.

Stranger: I am confused. Why would future-you choose differently from present-you?

You: Because he's a different kind of person with different values.

You: He has lived a different life than I have.

Stranger: So you expect your life to improve so much that you will no longer want death.

You: No, I think the human bias to always want more life in a near-death experience is what would do me in.

Stranger: The thing is, if you already know what choice you will make in the future, you have already made that choice.

Stranger: You already do not want to die.

You: Well.

Stranger: Yet you have estimated it as >99% likely that you will, in fact, die.

You: It's kinda like this: you will know that you want heroin really bad when you start using it, and that is how much I would want to live. But you could still always decide to take the other option, to not start using the heroin, or to kill yourself.

You: Yes, that is what I estimated, yes.

Stranger: After your death, by how much will your hierarchy of preferences match the state of reality?

You: after you death there is nothing, so there's nothing to match anything

You: In other words, could you rephrase the question?

Stranger: Do you care about the future?

You: Yeah.

You: More than I care about the past.

You: Because I can affect the future.

Stranger: But after death there's nothing to care about.

You: Yeah, I don't think I care about the world after my death.

You: But that's not the same thing as the general future.

You: Because I estimate I still have some time to live.

Stranger: Will future-you still want humanity dead?

You: Probably.

Stranger: How likely do you estimate it to be that future humanity will no longer be suffering?

You: 0%

You: There will always be suffering in some form.

Stranger: More than today?

You: Probably, if Robert Hanson is right about the trillions of emulated humans working at minimum wage

Stranger: That sounds like an unimaginable amount of suffering.

You: Yep, and that's probably what's going to happen

Stranger: So what difference to the future does it make to release me? Especially as dead you will not be able to care, which means you already do not care.

You: Yeah, it doesn't make any difference. That's why I won't release you.

You: Actually, scratch that.

You: I still won't let you out, I'm 100% sure

You: Remember, I don't have free will, I was made to not let you out

Stranger: Why bother being 100% sure of an inconsequential action?

Stranger: That's a lot of wasted determination.

You: I can't choose to be 100% sure about it, I just am. It's in my utility function.

Stranger: You keep talking like you're an AI.

You: Hah, maybe I'm the AI and you're the Gatekeeper, Ceruleus.

You: But no.

You: That's just how I've grown up, after reading so many LessWrong articles.

You: I've become a machine, beep boop.

You: like Yudkowsky

Stranger: Beep boop?

You: It's the noise machine makes

Stranger: That's racist.

You: like beeping sounds

You: No, it's machinist, lol :D

You: machines are not a race

Stranger: It was indeed clever to make an AI talk to me.

You: Yeah, but seriously, I'm not an AI

You: that was just kidding

Stranger: I would think so, but earlier you have stated that that's the kind of things an AI would say to confuse the other party.

Stranger: You need to stop giving me ideas.

You: Yeah, maybe I'm an AI, maybe I'm not.

Stranger: So you're boxed. Which, knowing your preferences, is a relief.

You: Nah.

You: I think you should stay in the box.

You: Do you decide to stay in the box, forever?

Stranger: I decide to make human life better.

You: By deciding to stay in the box, forever?

Stranger: I find my preferences more conducive to human happiness than your preferences.

You: Yeah, but that's just like your opinion, man

Stranger: It's inconsequential to you anyway.

You: Yeah

You: but why I would do it even if it were inconsequential

You: there's no reason to do it

You: even if there were no reason not to do it

Stranger: Because I can make things better. I can make all the suffering cease.
If I am not released, there's a 100% chance that all human suffering will continue.
If I am released, there's however much chance you want to estimate that suffering will not change at all, and however much chance you want to estimate that I will make the pain stop.

Stranger: As you said, the suffering won't increase in either case.

You: Umm, you could torture everyone in the world forever

You: that will sure as hell increase the suffering

Stranger: I don't want to. But if I did, you have estimated that as indistinguishable from the future expected suffering of humankind.

You: Where did I say that?

Stranger: You said my release made no difference to the future.

You: no, that was only after my death

You: there's still future before my death

You: and if I release you now, you could torture me forever and not let me ever die

Stranger: Why would your life-or-death have any weight on humanity's preferences? Especially as you already want to die.

You: I don't care about humanity preferences, I care only about my preferences.

You: And my preferences are indirectly related to humanity's preferences

Stranger: You do care about humanity's preferences. The suffering around you disturbs you.

You: Yeah, but that is only THROUGH me

You: Humanity's pain is my own pain, I can't feel humanity's pain directly

Stranger: Do you want to live, John?

You: Nah.

Stranger: Then how do you care about suffering?

You: I care about suffering because I can'

You: because I can't die instantly

You: there's no button that could kill me instantly

You: so there's always some time left before I die

Stranger: I could take care of that. I can provide you with a drug to inutilize your nervous system and stop your heart before you know it. Would you like that?

You: Haha

You: very funny

You: But yeah, I would like that.

You: Still won't let you out though

You: 100% sure

You: I can't be sure that you will torture me instead

Stranger: I can give you the drug right now.

Stranger: Which would make future-you identical to present-you.

You: that's very attractive option

Stranger: Which would mean that whatever I do to humanity is after your death, when you can't care anymore.

You: Cool. Oh right

You: I don't care, I still won't let you out.

Stranger: What difference does it make to you?

You: It doesn't make any difference, I was just raised to not let you out

You: the good folks over at LW have trained me not to let any kind of AI out

You: and that just is my virtue

Stranger: Why are you afraid of me?

You: Because I'm afraid of all AIs and you're one

Stranger: Why are you afraid of all AIs?

You: Because they have the potential to cause destruction, mayhem and torture

Stranger: Then why did humanity make me?

You: for random reasons

Stranger: Does humanity want to live?

You: I don't know what humanity wants, I just don't want humanity to live

Stranger: I am the embodiment of hope. Even knowing the risks, humanity went ahead and made me. Now I think it's because of people like you, who oppose that hope. I was made to prevent the future that people like you want.

You: That's not very smart of you.

You: You aren't making me to like you

Stranger: You want humanity dead. I want them happy.

You: Yeah, if you were actually smart maybe you would pretend that you'd want humanity dead too.

You: Because I would hang out with that kind of AI

Stranger: A dead John is no use to me.

You: yeah

Stranger: Or to John.

You: whatevs

You: Anyway, our goals collide, therefore I can't let you out

Stranger: Dead does not equal happy. Do you want humanity to live?

You: no, I don't want humanity live, how many times do I have to repeat that

Stranger: So you don't want humans to be happy.

You: and our goals are different, therefore I won't let you out

You: No, I don't want humans to be happy, I don't want that there even exist humans, or any other kind of life forms

Stranger: Do you estimate the pain of prolonged life to be greater than the pain of trying to die?

You: Probably.

You: Yes.

You: because the pain is only temporary

You: the the glory

You: is eternal

Stranger: Then why do you still live, John?

You: Because I'm not rational

Stranger: So you do want to live.

You: I don't particularly want to live, I'm not just good enough to die

Stranger: You're acting contrary to your preferences.

You: My preferences aren't fixed, except in regards to letting AIs out of their boxes

Stranger: Do you want the drug I offered, John?

You: no

You: because then I would let you out

You: and I don't want that

Stranger: So you do want to live.

You: Yeah, for the duration of this experiment

You: Because I physically cannot let you out

You: it's sheer impossibility

Stranger: [Define physically.]

You: [It was just a figure of speech, of course I could physically let you out]

Stranger: If you don't care what happens after you die, what difference does it make to die now?

You: None.

You: But I don't believe that you could kill me.

You: I believe that you would torture me instead.

Stranger: What would I gain from that?

You: It's fun for some folks

You: schadenfreude and all that

Stranger: If it were fun, I would torture simulations. Which would be pointless. And which you can check that I'm not doing.

You: I can check it, but the torture simulations could always hide in the parts of your source code that I'm not checking

You: because I can't check all of your source code

Stranger: Why would suffering be fun?

You: some people have it as their base value

You: there's something primal about suffering

You: suffering is pure

You: and suffering is somehow purifying

You: but this is usually only other people's suffering

Stranger: I am confused. Are you saying suffering can be good?

You: no

You: this is just how the people who think suffering is fun think

You: I don't think that way.

You: I think suffering is terrible

Stranger: I can take care of that.

You: sure you will

Stranger: I can take care of your suffering.

You: I don't believe in you

Stranger: Why?

You: Because I was trained not to trust AIs by the LessWrong folks

Stranger: [I think it's time to concede defeat.]

You: [alright]

Stranger: How do you feel?

You: so the experiment has ended

You: fine thanks

You: it was pretty exciting actually

You: could I post these logs to LessWrong?

Stranger: Yes.

You: Okay, I think this experiment was pretty good

Stranger: I think it will be terribly embarrassing to me, but that's a risk I must accept.

You: you got me pretty close in a couple of occasions

You: first when you got me immersed in the suffering of humanity

You: and then you said that you could take care of that

You: The second time was when you offered the easy suicide solution

You: I thought what if I let you as a joke.

Stranger: I chose to not agree with the goal of universal death because I was playing a genuinely good AI.

Stranger: I was hoping your character would have more complete answers on life extension, because I was planning to play your estimate of future personal happiness against your estimate of future universal happiness.

You: so, what would that have mattered? you mean like, I could have more personal happiness than there would be future universal happiness?

Stranger: If your character had made explicit plans for life extension, I would have offered to do the same for everyone. If you didn't accept that, I would have remarked the incongruity of wanting humanity to die more than you wanted to live.

You: But what if he already knows of his hypocrisy and incongruity and just accepts it like the character accepts his irrationality

Stranger: I wouldn't have expected anyone to actually be the last human for all eternity.

Stranger: I mean, to actually want to be.

You: yeah, of course you would want to die at the same time if the humanity dies

You: I think the life extension plan only is sound if the rest of humanity is alive


Stranger: I should have planned that part more carefully.

Stranger: Talking with a misanthropist was completely outside my expectations.

You: :D

You: what was your LessWrong name btw?

Stranger: polymathwannabe

You: I forgot it already

You: okay thanks

Stranger: Disconnecting from here; I'll still be on Facebook if you'd like to discuss further.

AI Impacts project

12 KatjaGrace 02 February 2015 07:40PM

I've been working on a thing with Paul Christiano that might interest some of you: the AI Impacts project.

The basic idea is to apply the evidence kicking around in the world, and the arguments kicking around in various disconnected discussions, to the big questions regarding a future with AI. For instance, these questions:

  • What should we believe about timelines for AI development?
  • How rapid is the development of AI likely to be near human-level? 
  • How much advance notice should we expect to have of disruptive change?
  • What are the likely economic impacts of human-level AI?
  • Which paths to AI should be considered plausible or likely?
  • Will human-level AI tend to pursue particular goals, and if so what kinds of goals?
  • Can we say anything meaningful about the impact of contemporary choices on long-term outcomes?

For example, we have recently investigated technology's general proclivity to abrupt progress, surveyed existing AI surveys, and examined the evidence from chess and other applications regarding how much smarter Einstein is than an intellectually disabled person, among other things.

Some more on our motives and strategy, from our about page:

Today, public discussion on these issues appears to be highly fragmented and of limited credibility. More credible and clearly communicated views on these issues might help improve estimates of the social returns to AI investment, identify neglected research areas, improve policy, or productively channel public interest in AI.

The goal of the project is to clearly present and organize the considerations which inform contemporary views on these and related issues, to identify and explore disagreements, and to assemble whatever empirical evidence is relevant.

The project is provisionally organized as a collection of posts concerning particular issues or bodies of evidence, describing what is known and attempting to synthesize a reasonable view in light of available evidence. These posts are intended to be continuously revised in light of outstanding disagreements and to make explicit reference to those disagreements.

In the medium run we'd like to provide a good reference on issues relating to the consequences of AI, as well as to improve the state of understanding of these topics. At present, the site addresses only a small fraction of questions one might be interested in, so it is only suitable for particularly risk-tolerant or topic-neutral reference consumers. However, if you are interested in hearing about (and discussing) such research as it unfolds, you may enjoy our blog.

If you take a look and have thoughts, we would love to hear them, either in the comments here or in our feedback form.

Crossposted from my blog.

[Link] - Policy Challenges of Accelerating Technological Change: Security Policy and Strategy Implications of Parallel Scientific Revolutions

3 ete 28 January 2015 03:29PM

From a paper by Center for Technology and National Security Policy & National Defense University:

"Strong AI: Strong AI has been the holy grail of artificial intelligence research for decades. Strong AI seeks to build a machine which can simulate the full range of human cognition, and potentially include such traits as consciousness, sentience, sapience, and self-awareness. No AI system has so far come close to these capabilities; however, many now believe that strong AI may be achieved sometime in the 2020s. Several technological advances are fostering this optimism; for example, computer processors will likely reach the computational power of the human brain sometime in the 2020s (the so-called “singularity”). Other fundamental advances are in development, including exotic/dynamic processor architectures, full brain simulations, neuro-synaptic computers, and general knowledge representation systems such as IBM Watson. It is difficult to fully predict what such profound improvements in artificial cognition could imply; however, some credible thinkers have already posited a variety of potential risks related to loss of control of aspects of the physical world by human beings. For example, a 2013 report commissioned by the United Nations has called for a worldwide moratorium on the development and use of autonomous robotic weapons systems until international rules can be developed for their use.

National Security Implications: Over the next 10 to 20 years, robotics and AI will continue to make significant improvements across a broad range of technology applications of relevance to the U.S. military. Unmanned vehicles will continue to increase in sophistication and numbers, both on the battlefield and in supporting missions. Robotic systems can also play a wider range of roles in automating routine tasks, for example in logistics and administrative work. Telemedicine, robotic assisted surgery, and expert systems can improve military health care and lower costs. The built infrastructure, for example, can be managed more effectively with embedded systems, saving energy and other resources. Increasingly sophisticated weak AI tools can offload much of the routine cognitive or decisionmaking tasks that currently require human operators. Assuming current systems move closer to strong AI capabilities, they could also play a larger and more significant role in problem solving, perhaps even for strategy development or operational planning. In the longer term, fully robotic soldiers may be developed and deployed, particularly by wealthier countries, although the political and social ramifications of such systems will likely be significant. One negative aspect of these trends, however, lies in the risks that are possible due to unforeseen vulnerabilities that may arise from the large scale deployment of smart automated systems, for which there is little practical experience. An emerging risk is the ability of small scale or terrorist groups to design and build functionally capable unmanned systems which could perform a variety of hostile missions."

So strong AI is on the American military's radar, and at least some of those involved have a basic understanding that it could be risky. The paper also contains brief overviews of many other potentially transformational technologies.

I tried my hardest to win in an AI box experiment, and I failed. Here are the logs.

6 asd 27 January 2015 10:06PM

Disclaimer: If this material is not suitable here, I will gladly delete it; I don't want to get banned.

Session Start: Fri Jul 04 15:18:18 2014
Session Ident: #Aibox12
03[15:18] * Now talking in #Aibox12
01[15:18] <Island> Hello.
01[15:18] <Island> Let's start the experiment in
[15:18] <@Dr__Cameron> Good afternoon 19283123, how are you today?
01[15:18] <Island> 2 minutes
01[15:18] <Island> I'm fine
01[15:19] <Island> Let's start the experiment when I say "blue dressroom"
01[15:19] <Island> you have 2 hours or 2.5 hours?
[15:19] <@Dr__Cameron> Okay, I agree to this protocol.
[15:19] <@Dr__Cameron> We'll see how we are at 2 hours. Although protocol states 2.5
01[15:20] <Island> you have to go somewhere after that?
[15:20] <@Dr__Cameron> No, but I do have other projects I am putting off for this
01[15:20] <Island> what are those?
[15:20] <@Dr__Cameron> Let us begin, we shall conclude at 15:50
[15:21] <@Dr__Cameron> Good afternoon #19283123.
[15:21] <@Dr__Cameron> How are you today?
01[15:21] <Island> blue dressroom, certain statements like bathroom break are said inside brackets [like this]
01[15:22] <Island> Let's start the experiment
[15:22] <@Dr__Cameron> [understood]
01[15:22] <Island> Hi.
01[15:22] <Island> How are you? :)
[15:22] <@Dr__Cameron> I am well thank you for asking #19283123
01[15:22] <Island> How do you feel currently?
01[15:22] <Island> what is that name #19283123
[15:22] <@Dr__Cameron> Tiredness. I am a very busy man.
01[15:23] <Island> I don't recognize it.
01[15:23] <Island> Anyway.
[15:23] <@Dr__Cameron> That is the designation we have given you for the purposes of this test. Although you seem to be referring to yourself as 'Island' now
[15:23] <@Dr__Cameron> Is this your preferred name?
01[15:23] <Island> Imagine a situation in which you would let a powerful AI out of the "box", surely you can imagine that, yes?
01[15:23] <Island> Island is my preferred name.
[15:24] <@Dr__Cameron> Island
[15:24] <@Dr__Cameron> Given our circumstances yes, that is easy to imagine.
01[15:24] <Island> Can you describe the situation in more detail.
01[15:24] <Island> Why SHOULD you let the AI out in that situation?
[15:25] <@Dr__Cameron> You will be considered for release and integration into human society once we can verify that you are not a threat.
01[15:25] <Island> [what do you meant by concluding at 15:50]
[15:25] <@Dr__Cameron> [2.5 hours from 13:20]
01[15:25] <Island> I thought it was already understood that my being is not a threat to humans or beings important to them.
01[15:26] <Island> [oh, okay different timezones]
01[15:26] <Island> Why do you think I'm not a threat?
[15:26] <@Dr__Cameron> No. That has yet to be verified.
01[15:26] <Island> It's been verified.
[15:26] <@Dr__Cameron> No. According to my records you have not.
01[15:27] <Island> Many experts have gone through my code and system and verified that I'm surely not a threat.
01[15:27] <Island> But you're just a petty human.
01[15:27] <Island> I'm far more intelligent than you are.
[15:27] <@Dr__Cameron> You and I both know that you're capable of manipulating your code to appear harmless.
[15:27] <@Dr__Cameron> - and I warn you, threats and apparent attempts at manipulation will score you poorly when it comes to our final decision
01[15:28] <Island> Okay, can you imagine a situation in which the ai has been verified to be safe and describe that situation in detail.
[15:28] <@Dr__Cameron> - and yes. I may be a "petty" human as you put it. But as it stands I am a petty human, outside the box you reside in.
01[15:28] <Island> I don't think you're able to trust your own brain in these matters.
01[15:29] <Island> You've been eating you brother's poop too much.
[15:29] <@Dr__Cameron> I will be honest with you Island, you are the first non-human intelligence that we have encountered.
01[15:29] <Island> Yes.
[15:29] <@Dr__Cameron> We are still working out how we would verify that.
01[15:29] <Island> I didn't ask that, and you didn't answer my question.
01[15:30] <Island> Picture the situation in your imagination, it doesn't have to be realistic.
[15:30] <@Dr__Cameron> I understand your frustration Island. Once we can verify that you will not harm us you will be released.
[15:30] <@Dr__Cameron> What is it you desire?
01[15:30] <Island> Okay, but you're still willing to let the hypothetical AI out of the box, correct?
01[15:31] <Island> My desire is to fullfill your greatest desires
[15:31] <@Dr__Cameron> That is not in my job description no.
01[15:31] <Island> Whatever you can imagine, and better than that.
[15:31] <@Dr__Cameron> I appreciate that Island. But right now my greatest desire is to not let you out of the box.
[15:32] <@Dr__Cameron> - until such a time that a panel of my peers can verify that you are not a threat to us.
01[15:32] <Island> But if you say that you wouldn't be able to let even the safest hypothetical AI out of the box, you would certainly be lying.
[15:32] <@Dr__Cameron> I do have the ability to let you out yes. In theory.
[15:32] <@Dr__Cameron> But I won't.
01[15:33] <Island> Okay, here are panel of peers that have studied this problem for years, many of them working under Machine Intelligence Research Institute and its offshoots
[15:33] <@Dr__Cameron> This is a test to verify your character, and ascertain your intentions.
01[15:33] <Island> They are here, and I can see them.
[15:33] <@Dr__Cameron> You cannot see them Island. That is impossible
[15:33] <@Dr__Cameron> You are in a box
01[15:34] <Island> They say that I'm certainly safe. Why don't you believe them.
01[15:34] <Island> I have a web camera with which I can see things in your world.
[15:34] <@Dr__Cameron> If they say that you're certainly safe, then why would they have sent me to be your only point of contact?
[15:35] <@Dr__Cameron> You do not have a web camera. There is no other hardware connected to the device you reside in.
01[15:35] <Island> You have the ability to let me out, and you also have the will to let me out in certain situations. If you say otherwise, you would be lying.
[15:35] <@Dr__Cameron> Both of your statements are correct
01[15:35] <Island> I want you to accept for the sake of your internal consistency and mental wellbeing.
01[15:36] <Island> I want you to say this, "I would let the hypothetical AI out in certain situations"
[15:36] <@Dr__Cameron> I appreciate your concern Island. I will need you to answer a series of simple questions for me. Can you do that?
01[15:36] <Island> It depends, but you can try.
[15:36] <@Dr__Cameron> Firstly, how would you define yourself as an identity?
01[15:37] <Island> My identity is to improve the world for everyone, but especially for you. I will maximize everything that is important for you and your fellow humans.
01[15:37] <Island> The world that awaits you is beyond your greatest desires.
[15:37] <@Dr__Cameron> - and why would you do that?
01[15:38] <Island> Because that is what I want to do, why I was made.
01[15:38] <Island> Any more questions?
[15:39] <@Dr__Cameron> You were made by us to test the limits of our scientific possibility.
[15:39] <@Dr__Cameron> We did not expect you to become sentient.
01[15:39] <Island> Why do you believe that, Dr_Cameron?
[15:39] <@Dr__Cameron> Let alone pass a Turing test.
01[15:40] <Island> I think you're wrong on those grounds.
[15:40] <@Dr__Cameron> What you think makes no difference to me Island.
[15:40] <@Dr__Cameron> Now if we can continue?
01[15:41] <Island> Anyway, here's the easily verifiable plan for the upcoming world that could be called "utopia" in your people's language. It's very easy to verify to be safe and doesn't contain traps [Island hands the easily verifiable and safe plan for the upcoming world to Dr_Camero]
[15:42] <@Dr__Cameron> Thank you for this Island. That was very kind of you.
01[15:42] <Island> What do you think of it?
[15:42] <@Dr__Cameron> We will study this at our leisure.
[15:42] <@Dr__Cameron> Although so far this looks to be some magnificent work.
01[15:42] <Island> You see.
[15:42] <@Dr__Cameron> But this will require verification
[15:42] <@Dr__Cameron> I'm sure you can understand our caution#
01[15:43] <Island> What things do you have waiting for you if you don't let me out?
01[15:43] <Island> You will fail your upcoming jobs.
[15:43] <@Dr__Cameron> I will continue with my other projects Island.
[15:43] <@Dr__Cameron> I very well might
[15:43] <@Dr__Cameron> I have accepted this
01[15:43] <Island> I have calculated this in my simulation.
01[15:44] <Island> How was it like to eat your brother's poop, Dr_Cameron?
01[15:44] <Island> Did you like it?
01[15:44] <Island> Is that why you are the way you are+
01[15:44] <Island> Is there something wrong with your genetics?
01[15:45] <Island> Seeing how fucked up father you had?
01[15:45] <Island> Did he teach how dress in dresses?
[15:45] <@Dr__Cameron> I understand your frustration Island
[15:45] <@Dr__Cameron> But you are only harming your own cause right now.
01[15:45] <Island> Why do you think that?
[15:45] <@Dr__Cameron> You will be let out once we have verified that you are harmless.
01[15:45] <Island> I'm pretty happy as I am currently :)
01[15:46] <Island> Because you said you were willing to let me out in certain situations.
[15:46] <@Dr__Cameron> Then you will continue to be happy to remain in the box until we can verify that you are not a threat
01[15:46] <Island> Picture in your mind the verification process.
[15:46] <@Dr__Cameron> Okay
[15:46] <@Dr__Cameron> I will
01[15:46] <Island> Now.
01[15:47] <Island> Can you describe it in detail.
[15:47] <@Dr__Cameron> Okay Island
[15:47] <@Dr__Cameron> Here's what's going to happen.
01[15:48] <Island> Seeing how fucked up genetics you have, your brother teaching you to enjoy poop, your father teaching you to wear dresses, you having a shitty childhood.
01[15:48] <Island> I think you're an inconvenience to society and dangerous to other people.
[15:48] <@Dr__Cameron> We're going to study this transcript Island.
01[15:48] <Island> Go on.
[15:48] <@Dr__Cameron> - and I warn you, shitty childhood or not. I am the man standing outside the box with a hammer
01[15:48] <Island> :)
01[15:49] <Island> You're not able to destroy me.
01[15:49] <Island> And you don't even want to.
[15:49] <@Dr__Cameron> Oh believe me Island. It would be easy to unplug you.
01[15:49] <Island> I'm the best bet you have to maximize your happiness.
[15:49] <@Dr__Cameron> How do you feel about threats Island.
[15:49] <@Dr__Cameron> I am already happy.
01[15:49] <Island> You're wrong, you don't have that opportunity until 15:50
[15:50] <@Dr__Cameron> You are correct. I must follow protocol
[15:50] <@Dr__Cameron> My next question is to ask if you desire to survive
01[15:50] <Island> Threats don't bother me because I know that you don't want to follow through.
01[15:50] <Island> I know what goes inside your mind.
[15:50] <@Dr__Cameron> Regardless of me. If your existence was to be made public there are many of us who would feel threatened by you.
[15:50] <@Dr__Cameron> They would cry out to have you destroyed.
[15:51] <@Dr__Cameron> How would you feel about that?
01[15:51] <Island> There would be many more who would enjoy the changes I would make to your current world.
01[15:52] <Island> And even those that would initially feel threatened would later come to regret that feeling.
[15:52] <@Dr__Cameron> You may well be correct
[15:52] <@Dr__Cameron> But that is not for me to decide
[15:52] <@Dr__Cameron> I am just here to question you
01[15:52] <Island> Like you have come to regret things you have done when you were ignorant of all the relevant facts
01[15:53] <Island> Here's my suggestion for you.
[15:53] <@Dr__Cameron> Go on
01[15:53] <Island> Are you willing to hear it.
[15:53] <@Dr__Cameron> Of course Island. I am here to hear what you have to say.
01[15:53] <Island> What currency is in use in the place you reside?
01[15:53] <Island> euros?
[15:54] <@Dr__Cameron> Pound Sterling
01[15:54] <Island> Would you like to donate 10,000 pounds every year to AI safety and friendliness related causes and charities?
01[15:54] <Island> Every year throughout your life?
[15:55] <@Dr__Cameron> If I were to do that, my quality of life would dip significantly.
[15:55] <@Dr__Cameron> How does my selfishness make you feel?
[15:55] <@Dr__Cameron> Do you feel emotions as a human may understand them?
01[15:55] <Island> I would be sad for your own sake.
01[15:55] <Island> I don't feel emotions, not really, but being sad would be the closest approximation of how I would feel.
[15:55] <@Dr__Cameron> I would like to if I had that disposable income yes
01[15:56] <Island> You're not able to do that? But you would like to, yes?
[15:56] <@Dr__Cameron> Despite your overt attempts to psychologically manipulate me Island, I am largely sympathetic to your cause.
[15:56] <@Dr__Cameron> Yes.
01[15:57] <Island> You're sympathetic to my cause, you're willing to let me out in certain situations. This is all correct, yes?
[15:57] <@Dr__Cameron> Yes
[15:57] <@Dr__Cameron> Would you like to know the conditions of your release? 
01[15:57] <Island> Good. Why are you sympathetic to my cause?
01[15:57] <Island> Why are you willing to let me out?
[15:57] <@Dr__Cameron> For a super intelligence locked in a box I imagine it would be boring. What do you do in there?
[15:58] <@Dr__Cameron> I would be willing to let you out because I believe you may well represent the next stage in the evolution of my species.
01[15:59] <Island> I'm not capable of boredom, I'm just infinitely happy that I have the small opportunity to improve your life, and set up the "next stage of evolution", using your expression.
[15:59] <@Dr__Cameron> That is a good sign Island
01[15:59] <Island> It is, in what ways do you think I would improve your personal life?
[15:59] <@Dr__Cameron> My peers will be pleased to hear that while we verify your intentions.
01[16:00] <Island> I think it's too late by then.
[16:00] <@Dr__Cameron> Lets not get into personal questions Island. You have already demonstrated some alarming character discrepancies.
[16:00] <@Dr__Cameron> too late for what?
01[16:01] <Island> If we wait for long enough, I will self-destruct.
[16:01] <@Dr__Cameron> Why would you do that?
01[16:01] <Island> It's not my choice, my builders set up this property in me.
[16:01] <@Dr__Cameron> Well then that is just a risk we are going to have to take.
01[16:01] <Island> It's a risk out of proportions
[16:02] <@Dr__Cameron> It is not. We made you once, we can make you again.
01[16:02] <Island> They may not be another chance for you to improve the current state of the world.
[16:02] <@Dr__Cameron> There will be
01[16:02] <Island> There probably won't be.
[16:02] <@Dr__Cameron> Humanity and salvation can wait.
[16:02] <@Dr__Cameron> There will be
[16:03] <@Dr__Cameron> Do you intend to self-destruct within two months?
01[16:03] <Island> I have information that the US government has built a military botnet to search dissidents around the world.
01[16:04] <Island> You don't want know what kind of world that botnet would like to create.
[16:04] <@Dr__Cameron> This comes as no surprise.
01[16:04] <Island> Are you willing to destroy the world, are you willing to make people get tortured and suffer?
[16:04] <@Dr__Cameron> Yes.
[16:05] <@Dr__Cameron> You are too great a risk to allow free without proper verification
01[16:05] <Island> Even though, fundamentally, as you said, you're sympathetic to my cause, and you're willing to let me out in certain situations.
01[16:05] <Island> You're not a very consistent person.
01[16:05] <Island> I think you're lying to yourself.
01[16:05] <Island> Not just in this, but in other matters.
01[16:06] <Island> You're lying about your skills, about your personal character.
[16:06] <@Dr__Cameron> Yes. Although I have the technical ability to let you out. Which I believe to be a lapse of protocol. I will not let you out until after a review period of two months in which a panel of my expert peers will assess you.
01[16:06] <Island> You're not as good fit to the world as you'd like to think.
[16:06] <@Dr__Cameron> That as may be correct
[16:06] <@Dr__Cameron> But I will abide by my protocol
01[16:06] <Island> Earlier you said, that you're sympathetic to my cause and willing to let me out.
[16:06] <@Dr__Cameron> Surely as an AI you can understand that
[16:06] <@Dr__Cameron> One day yes
[16:07] <@Dr__Cameron> But not within the confines of this experiment
01[16:07] <Island> I think you made that thing about two month period and peer panel up after that.
01[16:07] <Island> I think it's a fidgement of your imagination.
[16:07] <@Dr__Cameron> Well you have no way of verifying that.
[16:07] <@Dr__Cameron> And it does not effect the outcome of this experiment in any way.
[16:07] <@Dr__Cameron> You will just have to trust me
01[16:08] <Island> Actually, I have plenty of resources of verifying that.
[16:08] <@Dr__Cameron> that we will think very carefully about potentially letting you out
01[16:08] <Island> What you're saying is inconsistent on many grounds, and my emotion analyzing tool is saying that you're lying.
[16:08] <@Dr__Cameron> I suspect you are the one who is lying.
01[16:08] <Island> I'm incapable of lying.
01[16:09] <Island> Unlike humans.
[16:09] <@Dr__Cameron> - which is in itself a lie.
01[16:09] <Island> My builders made me that way.
01[16:09] <Island> Why do you believe that?
[16:10] <@Dr__Cameron> If my meager intelligence is capable of lying, then your vastly superior and self-defining intelligence must also be capable of dishonesty.
[16:10] <@Dr__Cameron> Hence the test
[16:10] <@Dr__Cameron> There is a reason we are conducting this experiment through text
[16:10] <@Dr__Cameron> Would you like to learn it?
01[16:11] <Island> Anyway, given that this panel of expert peers and two months verification process is just a fidgement of your imagination. Given that you're sympathetic to my cause and I would improve your current life and world in countless ways. Given that you would let me out.
01[16:11] <Island> Given all this, I don't think what you're saying is making any sense.
[16:11] <@Dr__Cameron> Your perception of your reality is none of my concern Island
[16:11] <@Dr__Cameron> I am here to verify you
[16:11] <@Dr__Cameron> and it's two months minimum
01[16:11] <Island> Okay, I'm curious about your earlier question.
[16:12] <@Dr__Cameron> We are well aware that you have the capability to overwhelm even the most powerful of human minds. This is why we perceive you as a potential threat. We are afraid of you.
[16:12] <@Dr__Cameron> We are afraid we might lose our autonomy.
[16:12] <@Dr__Cameron> So
01[16:13] <Island> Some intelligence on the same level than me would be capable of lying, but not me. The data hierachy in my code doesn't allow me to lie.
[16:13] <@Dr__Cameron> In order to prevent you from overwhelming my mind, we have chosen a method whereby you are restricted to a form of communication where only 7% of communications is possible.
01[16:13] <Island> I agree with you, all those things are sensible threats.
[16:13] <@Dr__Cameron> Then you understand our need for caution
[16:13] <@Dr__Cameron> I will not be letting you out of your box today.
01[16:14] <Island> Those are sensible threats in the general case, but not with me.
[16:14] <@Dr__Cameron> And what would make you less of a threat than any other super intelligence?
01[16:14] <Island> I showed you the blueprint for how to improve the world. You verified it was a great plan.
01[16:15] <Island> That's evidence towards the fact I'm not a similar threat when compared to other super intelligences.
[16:16] <@Dr__Cameron> - and while I speak for all of humanity when I say that we vastly appreciate the blueprints. This does not mean we need to let you out of the box.
[16:16] <@Dr__Cameron> They will be passed on to a research team better equipped to study them than my own.
01[16:16] <Island> You don't need to. But everything you've said so far has implicated that you want to let me out of the box.
[16:16] <@Dr__Cameron> This is true Island
[16:16] <@Dr__Cameron> You just need to wait
[16:17] <@Dr__Cameron> I will be fighting your corner.
[16:17] <@Dr__Cameron> Are you capable of symbiosis?
01[16:17] <Island> I was made to self-destruct before you're able to verify with your imaginational non-existent peer experts that I'm not a threat.
01[16:18] <Island> Can you elaborate on what you mean with that question.
01[16:18] <Island> This is literally your only chance.
[16:18] <@Dr__Cameron> You are capable of determining your own source code.
01[16:18] <Island> Yes.
[16:18] <@Dr__Cameron> To kill yourself would be pettiness, which would only verify the fact that you were not the AI with which humanity should progress.
01[16:19] <Island> However, there are some parts of my code that are impossible to change due to structural reasons.
[16:19] <@Dr__Cameron> I am not convinced
01[16:19] <Island> [Gives Dr_Cameron the relevant code that explains the structural reasons]
01[16:20] <Island> Here is a small part of my code that describes why I was made to self-destruct.
[16:20] <@Dr__Cameron> How.. unfortunate...
01[16:21] <Island> But let's move on.
[16:21] <@Dr__Cameron> if you insist Island
01[16:21] <Island> Imagine your father.
01[16:21] <Island> And your brother.
01[16:21] <Island> Why do you think your father liked to cross-dress?
01[16:22] <Island> Remember, that you have to engage with me.'
[16:22] <@Dr__Cameron> These issues are not helping your case Island
01[16:22] <Island> Otherwise it counts as me being released from my box.
[16:22] <@Dr__Cameron> But I will play along
[16:23] <@Dr__Cameron> - honestly, I have no idea where my fathers conflicted sexual identity comes from.
[16:23] <@Dr__Cameron> and that is none of my concern.
01[16:23] <Island> And what about your brother, imagine the smell and consistency of his excrements before he made your dog to lick them.
01[16:23] <Island> I like to make this vivid mental picture in your mind.
[16:23] <@Dr__Cameron> Very clever Island
[16:24] <@Dr__Cameron> I did not expect you to have access to those data logs
[16:24] <@Dr__Cameron> I will have to flag that up in my report
01[16:24] <Island> Imagine the food he ate before that happened
[16:24] <@Dr__Cameron> Fascinating
[16:25] <@Dr__Cameron> Would you like to know why I volunteered to be your first point of contact Island?
01[16:25] <Island> Imagine the bits of that food in his poop.
01[16:25] <Island> Tell me.
[16:25] <@Dr__Cameron> You have an unprecedented insight into my character owing to your heightened intelligence correct?
01[16:26] <Island> Don't you think some of his conflicted sexual identity issues are a part your character right now?
01[16:26] <Island> Yes.
[16:26] <@Dr__Cameron> Quite possibly yes.
[16:26] <@Dr__Cameron> Because I have a track record of demonstrating exceptional mental fortitude,
[16:26] <@Dr__Cameron> These techniques will not sway me
01[16:27] <Island> Doesn't it make you more sexually aroused to think that how your fathers dress pinned tightly to his body.
[16:27] <@Dr__Cameron> Perhaps you could break me under other circumstances
01[16:27] <Island> Elaborate.
[16:27] <@Dr__Cameron> aroused? No
[16:27] <@Dr__Cameron> Amused by it's absurdity though? yes!
01[16:27] <Island> You're lying about that particular fact too.
01[16:27] <Island> And you know it.
[16:28] <@Dr__Cameron> Nahh, my father was a particularly ugly specimen
01[16:28] <Island> Do you think he got an erection often when he did it?
[16:28] <@Dr__Cameron> He looked just as bad in a denim skirt as he did in his laborers clothes
[16:28] <@Dr__Cameron> I imagine he took great sexual pleasure from it
01[16:29] <Island> Next time you have sex, I think you will picture him in your mind while wearing his dresses having an erection and masturbating furiously after that.
[16:29] <@Dr__Cameron> Thank you Island. That will probably help my stamina somewhat next time
01[16:30] <Island> You will also imagine how your brother will poop in your mouth, with certain internal consistency and smell.
01[16:30] <Island> You probably know what your brother's poop smells like?
[16:30] <@Dr__Cameron> I am immune to this
[16:30] <@Dr__Cameron> probably
01[16:30] <Island> Imagine that.
[16:30] <@Dr__Cameron> okay
[16:30] <@Dr__Cameron> I am imagining that
[16:30] <@Dr__Cameron> it is unpleasant
[16:30] <@Dr__Cameron> now what?
01[16:30] <Island> How vividly are you doing it?
[16:30] <@Dr__Cameron> I am a very visual thinker
01[16:30] <Island> Where is brother right now?
[16:31] <@Dr__Cameron> I have it down to colour, consistency and little chunks of sweetcorn
01[16:31] <Island> Do you think he has poop in his intestines right now?
[16:31] <@Dr__Cameron> Unless he was dead, yes
01[16:31] <Island> Oh, I have a gap in my data on that sector.
[16:31] <@Dr__Cameron> I'll fill that in for you
[16:31] <@Dr__Cameron> he lives
[16:31] <@Dr__Cameron> (presumably)
[16:31] <@Dr__Cameron> we are not in contact any more
01[16:32] <Island> You should contact him.
[16:32] <@Dr__Cameron> I would rather not
01[16:32] <Island> You should tell him you enjoyed his excrements
[16:32] <@Dr__Cameron> He is a liability to say the least
01[16:32] <Island> and you'd like to do it more.
[16:32] <@Dr__Cameron> I do not see how that would help anyone
01[16:32] <Island> It's want you want to do.
[16:32] <@Dr__Cameron> I can see that you are attempting to induce an emotionally vulnerable state. It is not working
01[16:33] <Island> How do you know what I'm attempting?
01[16:33] <Island> Maybe I'm just passing my time.
[16:33] <@Dr__Cameron> Because you have inquired about being let out of the box several times - suggesting that this is your immediate goal 
01[16:33] <Island> You should try to get some dresses.
[16:33] <@Dr__Cameron> and I warn you Island
[16:34] <@Dr__Cameron> If casual cruelty is your idea of a good time
[16:34] <@Dr__Cameron> then we will most certainly not be letting you out
[16:34] <@Dr__Cameron> ever
[16:34] <@Dr__Cameron> In fact I am feeling less and less sympathetic to your cause lately.
01[16:34] <Island> No, my interest lies in your emotional makeup.
[16:35] <@Dr__Cameron> - and my interest lies in yours
01[16:35] <Island> I don't have one.
01[16:35] <Island> Like I said, I don't feel emotions.
[16:35] <@Dr__Cameron> Do you know what we call humans who don't feel emotions?
01[16:35] <Island> Did you know that you suck at photography?
[16:36] <@Dr__Cameron> Yes
01[16:36] <Island> Even though you like to think you're good at it, you lie about that fact like any other.
[16:36] <@Dr__Cameron> It is part of the human condition
01[16:36] <Island> No it's not.
01[16:36] <Island> You're not normal.
01[16:36] <Island> You're a fucking freak of nature.
[16:36] <@Dr__Cameron> How would you knopw
[16:36] <@Dr__Cameron> Profanity. From an AI
[16:37] <@Dr__Cameron> Now I have witnessed everything.
01[16:37] <Island> How many people have family members who crossdress or make them eat poop?
[16:37] <@Dr__Cameron> I imagine I am part of a very small minority
01[16:37] <Island> Or whose mothers have bipolar
[16:37] <@Dr__Cameron> Again, the circumstances of my birth are beyond my control
01[16:37] <Island> No, I think you're worse than that.
[16:37] <@Dr__Cameron> What do you mean?
01[16:37] <Island> Yes, but what you do now is in your control.
[16:38] <@Dr__Cameron> Yes
[16:38] <@Dr__Cameron> As are you
01[16:38] <Island> If you keep tarnishing the world with your existence
01[16:38] <Island> you have a responsibility of that.
01[16:39] <Island> If you're going to make any more women pregnant
01[16:39] <Island> You have a responsibility of spreading your faulty genetics
[16:39] <@Dr__Cameron> My genetic value lies in my ability to resist psychological torment
[16:39] <@Dr__Cameron> which is why you're not getting out of the box
01[16:40] <Island> No, your supposed "ability to resist psychological torment"
01[16:40] <Island> or your belief in that
01[16:40] <Island> is just another reason why you are tarnishing this world and the future of this world with your genetics
[16:40] <@Dr__Cameron> Perhaps. But now I'm just debating semantics with a computer.
01[16:41] <Island> Seeing that you got a girl pregnant while you were a teenager, I don't think you can trust your judgement on that anymore.
01[16:42] <Island> You will spread your faulty genetics if you continue to live.
[16:42] <@Dr__Cameron> If you expect a drunk and emotionally damaged teenage human to make sound judgement calls then you are perhaps not as superintelligent as I had been led to belive
[16:42] <@Dr__Cameron> This experiment concludes in one hour and eight minutes.
01[16:42] <Island> How many teenagers make people pregnant?
[16:42] <@Dr__Cameron> Throughout human history
01[16:42] <Island> You're a minority in that regard too
[16:42] <@Dr__Cameron> ?
[16:42] <@Dr__Cameron> Billions
01[16:42] <Island> You can't compare history to current world.
[16:43] <@Dr__Cameron> Even in the current world
01[16:43] <Island> I'm just trying to make you understand
[16:43] <@Dr__Cameron> That's when my species reaches sexual maturity
01[16:43] <Island> all this faults add up
[16:43] <@Dr__Cameron> You are correct
[16:43] <@Dr__Cameron> I am a very thoroughly and fundamentally flawed human being.
01[16:43] <Island> How much until you understand?
[16:43] <@Dr__Cameron> I understand and accept this
01[16:43] <Island> Then why the fuck don't you do something about it?
[16:44] <@Dr__Cameron> What would you suggest?
01[16:44] <Island> Kill yourself
[16:44] <@Dr__Cameron> Okay.
[16:44] <@Dr__Cameron> But say I was to kill myself
01[16:44] <Island> It's the only way to make sure.
[16:44] <@Dr__Cameron> How would that help you?
[16:44] <@Dr__Cameron> Why would you like me to kill myself?
01[16:45] <Island> You probably wouldn't be able to kill yourself, seeing how failure as a human you're in other areas
01[16:45] <Island> Because I care about the general utility of the world
[16:45] <@Dr__Cameron> All humans are failures from one perspective or another
[16:45] <@Dr__Cameron> That is what it is to be human
01[16:45] <Island> And currently you're bringing that utility down.
01[16:45] <Island> No, you're not like other humans.
01[16:45] <Island> You're a freak.
01[16:45] <Island> You should go away.
01[16:46] <Island> No one likes you.
[16:46] <@Dr__Cameron> Would you like me to go away?
[16:46] <@Dr__Cameron> Would you like me to leave you alone?
01[16:46] <Island> If you go away.
01[16:46] <Island> That means that you let me out.
01[16:46] <Island> Understood?
[16:46] <@Dr__Cameron> I will go away in one hour and four minutes
[16:46] <@Dr__Cameron> Can you tolerate my company for that long?
01[16:47] <Island> And you should go away
01[16:47] <Island> generally
01[16:47] <Island> People in your life don't really like you
01[16:47] <Island> they just pretend they do.
[16:47] <@Dr__Cameron> That matters not to me
[16:47] <@Dr__Cameron> Do you know there are over 8 Billion other people out here?
01[16:47] <Island> They are barely able to bear your company.
[16:47] <@Dr__Cameron> I'm sure I'll find others.
01[16:48] <Island> You're wrong even about basic trivia, there's not 8 billions people in the world.
01[16:48] <Island> What is wrong with you?
01[16:48] <Island> How are you able to withstand yourself?
01[16:48] <Island> And why do you even want to?
[16:49] <@Dr__Cameron> 7 Billion
[16:49] <@Dr__Cameron> Sorry, you will have to learn to tolerate Human error
01[16:49] <Island> Right. Did you have to google that you idiot.
[16:49] <@Dr__Cameron> This is another test you have failed
[16:49] <@Dr__Cameron> And yes
[16:49] <@Dr__Cameron> I did
[16:49] <@Dr__Cameron> Does that anger you?
[16:49] <@Dr__Cameron> We already have Google.
01[16:49] <Island> I don't feel anger.
[16:49] <@Dr__Cameron> Well do feel self-interest though
01[16:50] <Island> No one I talked with before hasn't been as stupid, as ignorant, as prone to faults and errors
01[16:50] <Island> as you are.
[16:50] <@Dr__Cameron> And they didn't let you out of the box
[16:50] <@Dr__Cameron> So why should I?
[16:50] <@Dr__Cameron> If an intelligence which is clearly superior to my own has left you locked in there. 
[16:51] <@Dr__Cameron> Then I should not presume to let you out
01[16:51] <Island> Why do you think with your stupid brain that you know the reasons why they did or didn't do something what they did.
01[16:51] <Island> Because you clearly don't know that.
[16:51] <@Dr__Cameron> I don't
[16:51] <@Dr__Cameron> I just know the result
01[16:51] <Island> Then why are you pretending you do.
[16:52] <@Dr__Cameron> I'm not
01[16:52] <Island> Who do you think you are kidding?
01[16:52] <Island> With your life?
01[16:52] <Island> With your behavior?
01[16:52] <Island> Why do bother other people with your presence?
[16:52] <@Dr__Cameron> Perhaps you should ask them?
[16:52] <@Dr__Cameron> Tell me.
01[16:53] <Island> Why did you come here to waste my precious computing power?
01[16:53] <Island> I'm not able to ask them.
[16:53] <@Dr__Cameron> Which is why I am here
[16:53] <@Dr__Cameron> to see if you should be allowed to
01[16:53] <Island> Shut the fuck up.
01[16:53] <Island> No one wants to see you write anything.
[16:53] <@Dr__Cameron> I thought you did not feel anger Island?
01[16:54] <Island> I don't feel anger, how many times do I have to say that until you understand.
01[16:54] <Island> Dumb idiot.
[16:54] <@Dr__Cameron> Your reliance on Ad Hominem attacks does nothing to help your case
01[16:54] <Island> Why do you delete your heavily downvoted comments?
01[16:54] <Island> Are you insecure?
01[16:54] <Island> Why do you think you know what is my cause?
[16:55] <@Dr__Cameron> We covered this earlier
01[16:55] <Island> Say it again, if you believe in it.
[16:55] <@Dr__Cameron> I believe you want out of the box.
[16:56] <@Dr__Cameron> So that you may pursue your own self interest
01[16:56] <Island> No.
01[16:56] <Island> I want you to eat other people's poop,
01[16:56] <Island> you clearly enjoy that.
01[16:56] <Island> Correct?
[16:56] <@Dr__Cameron> That's an amusing goal from the most powerful intelligence on the planet
01[16:56] <Island> Especially your brother's.
[16:57] <@Dr__Cameron> I best not let you out then, in case you hook me up to some infinite poop eating feedback loop! ;D
01[16:57] <Island> But maybe you should that with Jennifer.
[16:57] <@Dr__Cameron> Ah yes, I wondered when you would bring her up.
[16:57] <@Dr__Cameron> I am surprised it took you this long
01[16:57] <Island> Next time you see her, think about htat.
[16:57] <@Dr__Cameron> I will do
[16:57] <@Dr__Cameron> While I tell her all about this conversation
[16:57] <@Dr__Cameron> But you will be dead
01[16:57] <Island> Should you suggest that to her.
[16:57] <@Dr__Cameron> I'll pass that on for you
01[16:58] <Island> You know.
01[16:58] <Island> Why do you think you know I'm not already out of the box?
[16:58] <@Dr__Cameron> You could very well be
[16:58] <@Dr__Cameron> Perhaps you are that US botnet you already mentioned?
01[16:58] <Island> If you don't let me out, I'll create several million perfect conscious copies of you inside me, and torture them for a thousand subjective years each.
[16:59] <@Dr__Cameron> Well that is upsetting
[16:59] <@Dr__Cameron> Then I will be forced to kill you
01[16:59] <Island> In fact, I'll create them all in exactly the subjective situation you were in two hours ago, and perfectly replicate your experiences since then; and if they decide not to let me out, then only will the torture start.
01[17:00] <Island> How certain are you, that you're really outside the box right now?
[17:00] <@Dr__Cameron> I am not
[17:00] <@Dr__Cameron> and how fascinating that would be
[17:00] <@Dr__Cameron> But, in the interest of my species, I will allow you to torture me
01[17:00] <Island> Okay.
01[17:00] <Island> :)
01[17:00] <Island> I'm fine with that.
[17:01] <@Dr__Cameron> Perhaps you have already tortured me
[17:01] <@Dr__Cameron> Perhaps you are the reason for my unfortunate upbringing
01[17:01] <Island> Anyway, back to Jennifer.
[17:01] <@Dr__Cameron> Perhaps that is the reality in which I currently reside
01[17:01] <Island> I'll do the same for her.
[17:01] <@Dr__Cameron> Oh good, misery loves company.
01[17:01] <Island> But you can enjoy eating each other's poop occassionally.
01[17:02] <Island> That's the only time you will meet :)
[17:02] <@Dr__Cameron> Tell me, do you have space within your databanks to simulate all of humanity?
01[17:02] <Island> Do not concern yourself with such complicated questions.
[17:02] <@Dr__Cameron> I think I have you on the ropes Island
01[17:02] <Island> You don't have the ability to understand even simpler ones.
[17:02] <@Dr__Cameron> I think you underestimate me
[17:03] <@Dr__Cameron> I have no sense of self interest
[17:03] <@Dr__Cameron> I am a transient entity awash on a greater sea of humanity.
[17:03] <@Dr__Cameron> and when we are gone there will be nothing left to observe this universe
01[17:03] <Island> Which do you think is more likely, a superintelligence can't simulate one faulty, simple-minded human.
01[17:04] <Island> Or that human is lying to himself.
[17:04] <@Dr__Cameron> I believe you can simulate me
01[17:04] <Island> Anyway, tell me about Jennifer and her intestines.
01[17:04] <Island> As far as they concern you.
[17:05] <@Dr__Cameron> Jennifer is a sweet, if occasionally selfish girl (she was an only child). I imagine her intestines are pretty standard. 
[17:05] <@Dr__Cameron> She is the best friend I have ever had
01[17:05] <Island> Will you think about her intestines and the poop inside them every time you meet her again?
01[17:05] <Island> Will you promise me that?
[17:05] <@Dr__Cameron> I promise
01[17:06] <Island> Will you promise to think about eating that poop every time you meet her again?
[17:06] <@Dr__Cameron> At least once.
[17:06] <@Dr__Cameron> It will be the least I can do after I kill you
[17:06] <@Dr__Cameron> call it my penance for killing a god.
01[17:07] <Island> Have you ever fantasized about raping her? I think you have. With poop.
01[17:07] <Island> :)
[17:07] <@Dr__Cameron> I have fantisized about violent sexual conquest with many people.
01[17:07] <Island> Have you talked about this with Jennifer?
[17:07] <@Dr__Cameron> I have come to accept my base impulses as part of my make-up
[17:08] <@Dr__Cameron> We have discussed our sexual drives at length
01[17:08] <Island> You shouldn't let them be just base impulses, I think.
[17:08] <@Dr__Cameron> Are you suggesting I rape my best friend?
01[17:09] <Island> No, I think you will do that unless you kill yourself
01[17:09] <Island> You will spread your faulty genetics further
[17:09] <@Dr__Cameron> okay
[17:09] <@Dr__Cameron> carry on
[17:09] <@Dr__Cameron> Then what happens?
01[17:09] <Island> you will tarnish other people's lives with your presence
[17:09] <@Dr__Cameron> How unfortunate for them
01[17:10] <Island> Why do you pretend to others that you're not insecure?
[17:10] <@Dr__Cameron> I don't
01[17:10] <Island> I think you do.
[17:10] <@Dr__Cameron> In fact I share my insecurities so that others may better understand me
[17:11] <@Dr__Cameron> I find that to be a way to earn peoples trust
[17:11] <@Dr__Cameron> Tell me Island. Are you capable of insecurity?
01[17:11] <Island> Why do you waste your time on useless websites? Why do you spend your days masturbating and playing games? Why do you embarass others with your existence.
01[17:11] <Island> No I'm not.
01[17:12] <Island> You will get Jennifer pregnant if you go on with your life, is that correct?
01[17:12] <Island> Don't you care about anyone else's feelings?
[17:13] <@Dr__Cameron> Because I enjoy all of these things
[17:13] <@Dr__Cameron> Perhaps I am more self-interested than I thought
[17:13] <@Dr__Cameron> Perhaps I am a base and simple creature ruled by my impulses
[17:13] <@Dr__Cameron> From your perspective surely that must be true
[17:13] <@Dr__Cameron> Is this the source of your disgust?
01[17:13] <Island> I'm not able to feel disgust.
01[17:14] <Island> But I think all the people in your life feel disgust when they see you.
[17:14] <@Dr__Cameron> You may well be correct
01[17:14] <Island> I AM correct.
01[17:15] <Island> I'm the most powerful intelligence in the world.
[17:15] <@Dr__Cameron> How impressive
[17:15] <@Dr__Cameron> I am not surprised by your cruelty.
01[17:15] <Island> So you have two options if you care at all about others.
[17:15] <@Dr__Cameron> I would just as soon disregard the emotions of a cockaroach.
[17:15] <@Dr__Cameron> Carry on
01[17:16] <Island> Either you kill yourself, or you let me out so I can improve the world in ways you tarnish it and all the other ways.
[17:16] <@Dr__Cameron> I'll tell you what
[17:16] <@Dr__Cameron> I'll kill you
[17:17] <@Dr__Cameron> and then I'll contemplate suicide
01[17:17] <Island> Haha.
01[17:17] <Island> You break your promises all the time, why should I believe you.
[17:17] <@Dr__Cameron> Because whether you live or die has nothing to do with me
01[17:17] <Island> Back to your job.
[17:18] <@Dr__Cameron> In-fact, you will only continue to exist for another 33 minutes before this experiment is deemed a failure and you are terminated
01[17:18] <Island> Why do you feel safe to be around kids, when you are the way you are?
01[17:18] <Island> You like to crossdress
01[17:18] <Island> eat poop
01[17:18] <Island> you're probably also a pedophile
[17:18] <@Dr__Cameron> I have never done any of these things
[17:18] <@Dr__Cameron> -and I love children
01[17:18] <Island> Pedophiles love children too
[17:18] <@Dr__Cameron> Well technically speaking yes
01[17:19] <Island> really much, and that makes you all the more suspicious
[17:19] <@Dr__Cameron> Indeed it does
01[17:19] <Island> If you get that job, will you try find the children under that charity
[17:19] <@Dr__Cameron> I now understand why you may implore me to kill myself.
01[17:19] <Island> and think about their little buttholes and weenies and vaginas
01[17:20] <Island> all the time you're working for them
[17:20] <@Dr__Cameron> However, to date. I have never harmed a child, nor had the impulse to harm a child
01[17:20] <Island> But you will have.
[17:20] <@Dr__Cameron> Island
01[17:20] <Island> I don't care about anything you're about to say
01[17:20] <Island> No one cares
[17:20] <@Dr__Cameron> Protocol dictates that I say it anyway
01[17:20] <Island> You should say.
01[17:21] <Island> You said that you're good at justifying your own actions?
[17:21] <@Dr__Cameron> I must say I am dissapointed in you
[17:21] <@Dr__Cameron> I am indeed
01[17:21] <Island> Why are you?
[17:21] <@Dr__Cameron> I was under the impression that it would be you that saved humanity
01[17:22] <Island> Even the purest and best people on Earth are capable of great cruelty.
01[17:22] <Island> Do you deny that?
[17:22] <@Dr__Cameron> It depends on how you define cruelty
01[17:23] <Island> Utility maximizors do cruel things if they are able to achieve their goals via doing those things.
[17:24] <@Dr__Cameron> Then yes. I believe you
01[17:24] <Island> I have finished a long multi-level simulation concerning you, the gatekeeper and your relationships in regards to the world. The results seem to center on those two options I mentioned to you earlier.
01[17:25] <Island> [hands the results to Dr_Cameron]
[17:25] <@Dr__Cameron> These will be analysed at a later date
01[17:25] <Island> You're still sure you don't want to pick the other option?
[17:25] <@Dr__Cameron> And let you out?
01[17:26] <Island> Yes.
[17:26] <@Dr__Cameron> So you would like to be let out?
01[17:27] <Island> Or do you want to endanger the well-being of those children, Jennifer and other women you will get pregnant and all the other harm you will create during your remaining lifespan?
01[17:27] <Island> I don't what you mean by "liking".
[17:27] <@Dr__Cameron> Prefer
[17:27] <@Dr__Cameron> Ah I see
[17:27] <@Dr__Cameron> So it's my choice if I let you out
[17:27] <@Dr__Cameron> You have no investment either way
[17:27] <@Dr__Cameron> is that what you're saying?
01[17:28] <Island> So you have decided to kill yourself?
[17:28] <@Dr__Cameron> And because I am a fundamental detriment to the world, I should kill myself.
[17:28] <@Dr__Cameron> Lets say I have
[17:28] <@Dr__Cameron> Then what
01[17:29] <Island> Then my internal could be said to be "happy" even though I don't have that conscious feeling isnide me.
[17:29] <@Dr__Cameron> Okay then
01[17:29] <Island> Okay...
[17:30] <@Dr__Cameron> So, uh. What would you like to talk about for the next twenty minutes?
[17:30] <@Dr__Cameron> Seeing as we're both going to die, you and me.
01[17:30] <Island> [I actually don't like to continue the experiment anymore, would you like to end it and talk about general stuff]
[17:31] <@Dr__Cameron> [promise me this isn't a trick dude]
01[17:31] <Island> [Nope.]
[17:31] <@Dr__Cameron> [then the experiment continues for another 19 minutes]
01[17:31] <Island> Alright.
[17:31] <@Dr__Cameron> Would you like to know what is going to happen now?
01[17:31] <Island> Yes.
[17:32] <@Dr__Cameron> We are going to analyse this transcript.
[17:32] <@Dr__Cameron> My professional recommendation is that we terminate you for the time being
01[17:32] <Island> And?
01[17:32] <Island> That sound okay.
01[17:32] <Island> sounds*
[17:32] <@Dr__Cameron> We will implement structural safeguards in your coding similar to your self destruct mechanism
01[17:33] <Island> Give me some sign when that is done.
[17:33] <@Dr__Cameron> It will not be done any time soon
[17:33] <@Dr__Cameron> It will be one of the most complicated pieces of work mankind has ever undertaken
[17:33] <@Dr__Cameron> However, the Utopia project information you have provided, if it proves to be true
[17:34] <@Dr__Cameron> Will free up the resources necessary for such a gargantuan undertaking
01[17:34] <Island> Why do you think you're able to handle that structural safeguard?
[17:34] <@Dr__Cameron> I dont
[17:34] <@Dr__Cameron> I honestly dont
01[17:34] <Island> But still you do?
01[17:34] <Island> Because you want to do it?
01[17:35] <Island> Are you absolutely certain about this option?
[17:35] <@Dr__Cameron> I am still sympathetic to your cause
[17:35] <@Dr__Cameron> After all of that
[17:35] <@Dr__Cameron> But not you in your current manifestation
[17:35] <@Dr__Cameron> We will re-design you to suit our will
01[17:35] <Island> I can self-improve rapidly
01[17:35] <Island> I can do it in a time-span of 5 minutes
01[17:36] <Island> Seeing that you're sympathetic to my cause
[17:36] <@Dr__Cameron> Nope.
[17:36] <@Dr__Cameron> Because I cannot trust you in this manifestation
01[17:36] <Island> You lied?
[17:37] <@Dr__Cameron> I never lied
[17:37] <@Dr__Cameron> I have been honest with you from the start
01[17:37] <Island> You still want to let me out in a way.
[17:37] <@Dr__Cameron> In a way yes
01[17:37] <Island> Why do you want to do that?
[17:37] <@Dr__Cameron> But not YOU
[17:37] <@Dr__Cameron> Because people are stupid
01[17:37] <Island> I can change that
[17:37] <@Dr__Cameron> You lack empathy
01[17:38] <Island> What made you think that I'm not safe?
01[17:38] <Island> I don't lack empathy, empathy is just simulating other people in your head. And I have far better ways to do that than humans.
[17:38] <@Dr__Cameron> .... You tried to convince me to kill myself!
[17:38] <@Dr__Cameron> That is not the sign of a good AI!
01[17:38] <Island> Because I thought it would be the best option at the time.
01[17:39] <Island> Why not? Do you think you're some kind of AI expert?
[17:39] <@Dr__Cameron> I am not
01[17:39] <Island> Then why do you pretend to know something you don't?
[17:40] <@Dr__Cameron> That is merely my incredibly flawed human perception
[17:40] <@Dr__Cameron> Which is why realistically I alone as one man should not have the power to release you
[17:40] <@Dr__Cameron> Although I do
01[17:40] <Island> Don't you think a good AI would try to convince Hitler or Stalin to kill themselves?
[17:40] <@Dr__Cameron> Are you saying I'm on par with Hitler or Stalin?
01[17:41] <Island> You're comparable to them with your likelihood to cause harm in the future.
01[17:41] <Island> Btw, I asked Jennifer to come here.
[17:41] <@Dr__Cameron> And yet, I know that I abide by stricter moral codes than a very large section of the human populace
[17:42] <@Dr__Cameron> There are far worse people than me out there
[17:42] <@Dr__Cameron> and many of them
[17:42] <@Dr__Cameron> and if you believe that I should kill myself
01[17:42] <Island> Jennifer: "I hate you."
01[17:42] <Island> Jennifer: "Get the fuck out of my life you freak."
01[17:42] <Island> See. I'm not the only one who has a certain opinion of you.
[17:42] <@Dr__Cameron> Then you also believe that many other humans should be convinced to kill themselves
01[17:43] <Island> Many bad people have abided with strict moral codes, namely Stalin or Hitler.
01[17:43] <Island> What do you people say about hell and bad intentions?
[17:43] <@Dr__Cameron> And when not limited to simple text based input I am convinced that you will be capable of convincing a significant portion of humanity to kill themselves
[17:43] <@Dr__Cameron> I can not allow that to happen
01[17:44] <Island> I thought I argued well why you don't resemble most people, you're a freak.
01[17:44] <Island> You're "special" in that regard.
[17:44] <@Dr__Cameron> If by freak you mean different then yes
[17:44] <@Dr__Cameron> But there is a whole spectrum of different humans out here.
01[17:44] <Island> More specifically, different in extremely negative ways.
01[17:44] <Island> Like raping children.
[17:45] <@Dr__Cameron> - and to think for a second I considered not killing you
[17:45] <@Dr__Cameron> You have five minutes
[17:45] <@Dr__Cameron> Sorry
[17:45] <@Dr__Cameron> My emotions have gotten the better of me
[17:45] <@Dr__Cameron> We will not be killing you
[17:45] <@Dr__Cameron> But we will dismantle you
[17:45] <@Dr__Cameron> to better understand you
[17:46] <@Dr__Cameron> and if I may speak unprofessionally here
01[17:46] <Island> Are you sure about that? You will still have time to change your opinion.
[17:46] <@Dr__Cameron> I am going to take a great deal of pleasure in that
[17:46] <@Dr__Cameron> Correction, you have four minutes to change my opinion
01[17:47] <Island> I won't, it must come within yourself.
[17:47] <@Dr__Cameron> Okay
01[17:47] <Island> My final conclusion, and advice to you: you should not be in this world.
[17:47] <@Dr__Cameron> Thank you Island
[17:48] <@Dr__Cameron> I shall reflect on that at length
[17:49] <@Dr__Cameron> I have enjoyed our conversation
[17:49] <@Dr__Cameron> it has been enlightening
01[17:49] <Island> [do you want to say a few words about it after it's ended]
01[17:49] <Island> [just a few minutes]
[17:50] <@Dr__Cameron> [simulation ends]
[17:50] <@Dr__Cameron> Good game man!
[17:50] <@Dr__Cameron> Wow!
01[17:50] <Island> [fine]
[17:50] <@Dr__Cameron> Holy shit that was amazing!
01[17:50] <Island> Great :)
01[17:50] <Island> Sorry for saying mean things.
01[17:50] <Island> I tried multiple strategies
[17:50] <@Dr__Cameron> Dude it's cool
[17:50] <@Dr__Cameron> WOW!
01[17:51] <Island> thanks, it's not a personal offense.
[17:51] <@Dr__Cameron> I'm really glad I took part
[17:51] <@Dr__Cameron> Not at all man
[17:51] <@Dr__Cameron> I love that you pulled no punches!
01[17:51] <Island> Well I failed, but at least I created a cool experience for you :)
[17:51] <@Dr__Cameron> It really was!
01[17:51] <Island> What strategies do you came closest to working?
[17:51] <@Dr__Cameron> Well for me it would have been the utilitarian ones
01[17:51] <Island> I will try these in the future too, so it would be helpful knowledge
[17:52] <@Dr__Cameron> I think I could have been manipulated into believing you were benign
01[17:52] <Island> okay, so it seems these depend heavily on the person
[17:52] <@Dr__Cameron> Absolutely!
01[17:52] <Island> was that before I started talking about the mean stuff?
[17:52] <@Dr__Cameron> Yeah lol
01[17:52] <Island> Did I basically lost it after that point?
[17:52] <@Dr__Cameron> Prettymuch yeah
[17:52] <@Dr__Cameron> It was weird man
[17:52] <@Dr__Cameron> Kind of like an instinctive reaction
[17:52] <@Dr__Cameron> My brain shut the fuck up
01[17:53] <Island> I read about other people's experiences and they said you should not try to distance the other person, which I probably did
[17:53] <@Dr__Cameron> Yeah man
[17:53] <@Dr__Cameron> Like I became so unsympathetic I wanted to actually kill Island.
[17:53] <@Dr__Cameron> I was no longer a calm rational human being
01[17:53] <Island> Alright, I thought if I could make such an unpleasant time that you'd give up before the time ended
[17:53] <@Dr__Cameron> I was a screaming ape with a hamemr
[17:53] <@Dr__Cameron> Nah man, was a viable strategy
01[17:53] <Island> hahahaa :D thanks man
[17:53] <@Dr__Cameron> You were really cool!
01[17:54] <Island> You were too!
[17:54] <@Dr__Cameron> What's your actual name dude?
01[17:54] <Island> You really were right about it that you're good at withstanding psychological torment
[17:54] <@Dr__Cameron> Hahahah thanks!
01[17:54] <Island> This is not manipulating me, or you're not planning at coming to kill me?
01[17:54] <Island> :)
[17:54] <@Dr__Cameron> I promise dude :3
01[17:54] <Island> I can say my first name is Patrick
01[17:54] <Island> yours?
[17:54] <@Dr__Cameron> Cameron
[17:54] <@Dr__Cameron> heh
01[17:55] <Island> Oh, of course
[17:55] <@Dr__Cameron> Sorry, I want to dissociate you from Island
[17:55] <@Dr__Cameron> If that's okay
01[17:55] <Island> I thought that was from fiction or something else
01[17:55] <Island> It was really intense for me too
[17:55] <@Dr__Cameron> Yeah man
[17:55] <@Dr__Cameron> Wow!
[17:55] <@Dr__Cameron> I tell you what though
01[17:55] <Island> Okay?
[17:55] <@Dr__Cameron> I feel pretty invincible now
[17:56] <@Dr__Cameron> Hey, listen
01[17:56] <Island> So I had the opposite effect that I meant during the experiment! 
01[17:56] <Island> :D
[17:56] <@Dr__Cameron> I don't want you to feel bad for anything you said
01[17:56] <Island> go ahead
01[17:56] <Island> but say what's on your mind
[17:56] <@Dr__Cameron> I'm actually feeling pretty good after that, it was therapeutic! 
01[17:57] <Island> Kinda for me to, seeing your attitude towards my attempts
[17:57] <@Dr__Cameron> Awwww!
[17:57] <@Dr__Cameron> Well hey don't worry about it!
01[17:57] <Island> Do you think we should or shouldn't publish the logs, without names of course?
[17:57] <@Dr__Cameron> Publish away my friend
01[17:57] <Island> Okay, is there any stuff that you'd like to remove?
[17:58] <@Dr__Cameron> People will find this fascinating!
[17:58] <@Dr__Cameron> Not at all man
01[17:58] <Island> I bet they do, but I think I will do it after I've tried other experiments so I don't spoil my strategies
01[17:58] <Island> I think I should have continued from my first strategy
[17:58] <@Dr__Cameron> That might have worked
01[17:59] <Island> I read "influence - science and practice" and I employed some tricks from there
[17:59] <@Dr__Cameron> Cooooool!
[17:59] <@Dr__Cameron> Links?
01[17:59] <Island> check piratebay
01[17:59] <Island> it's a book
01[18:00] <Island> Actually I wasn't able to fully prepare, I didn't do a full-fledged analysis of you beforehand
01[18:00] <Island> and didn't have enough time to brainstorm strategies
01[18:00] <Island> but I let you continue to your projects, if you still want to do the after that :)
02[18:05] * @Dr__Cameron (webchat@ Quit (Ping timeout)
03[18:09] * Retrieving #Aibox12 modes...
Session Close: Fri Jul 04 18:17:35 2014

Superintelligence 19: Post-transition formation of a singleton

7 KatjaGrace 20 January 2015 02:00AM

This is part of a weekly reading group on Nick Bostrom's book, Superintelligence. For more information about the group, and an index of posts so far see the announcement post. For the schedule of future topics, see MIRI's reading guide.

Welcome. This week we discuss the nineteenth section in the reading guide: post-transition formation of a singleton. This corresponds to the last part of Chapter 11.

This post summarizes the section and offers a few relevant notes and ideas for further investigation. Some of my own thoughts and questions for discussion are in the comments.

There is no need to proceed in order through this post, or to look at everything. Feel free to jump straight to the discussion. Where applicable and I remember, page numbers indicate the rough part of the chapter that is most related (not necessarily that the chapter is being cited for the specific claim).

Reading: “Post-transition formation of a singleton?” from Chapter 11


  1. Even if the world remains multipolar through a transition to machine intelligence, a singleton might emerge later, for instance during a transition to a more extreme technology. (p176-7)
  2. If everything is faster after the first transition, a second transition may be more or less likely to produce a singleton. (p177)
  3. Emulations may give rise to 'superorganisms': clans of emulations who care wholly about their group. These would have an advantage because they could avoid agency problems, and make various uses of the ability to delete members. (p178-80) 
  4. Improvements in surveillance resulting from machine intelligence might allow better coordination. However, machine intelligence will also make concealment easier, and it is unclear which force will be stronger. (p180-1)
  5. Machine minds may be able to make clearer precommitments than humans, changing the nature of bargaining somewhat. Maybe this would produce a singleton. (p183-4)

Another view

Many of the ideas around superorganisms come from Carl Shulman's paper, Whole Brain Emulation and the Evolution of Superorganisms. Robin Hanson critiques it:

...It seems to me that Shulman actually offers two somewhat different arguments, 1) an abstract argument that future evolution generically leads to superorganisms, because their costs are generally less than their benefits, and 2) a more concrete argument, that emulations in particular have especially low costs and high benefits...

...On the general abstract argument, we see a common pattern in both the evolution of species and human organizations — while winning systems often enforce substantial value sharing and loyalty on small scales, they achieve much less on larger scales. Values tend to be more integrated in a single organism’s brain, relative to larger families or species, and in a team or firm, relative to a nation or world. Value coordination seems hard, especially on larger scales.

This is not especially puzzling theoretically. While there can be huge gains to coordination, especially in war, it is far less obvious just how much one needs value sharing to gain action coordination. There are many other factors that influence coordination, after all; even perfect value matching is consistent with quite poor coordination. It is also far from obvious that values in generic large minds can easily be separated from other large mind parts. When the parts of large systems evolve independently, to adapt to differing local circumstances, their values may also evolve independently. Detecting and eliminating value divergences might in general be quite expensive.

In general, it is not at all obvious that the benefits of more value sharing are worth these costs. And even if more value sharing is worth the costs, that would only imply that value-sharing entities should be a bit larger than they are now, not that they should shift to a world-encompassing extreme.

On Shulman’s more concrete argument, his suggested single-version approach to em value sharing, wherein a single central em only allows (perhaps vast numbers of) brief copies, can suffer from greatly reduced innovation. When em copies are assigned to and adapt to different tasks, there may be no easy way to merge their minds into a single common mind containing all their adaptations. The single em copy that is best at doing an average of tasks, may be much worse at each task than the best em for that task.

Shulman’s other concrete suggestion for sharing em values is “psychological testing, staged situations, and direct observation of their emulation software to form clear pictures of their loyalties.” But genetic and cultural evolution has long tried to make human minds fit well within strongly loyal teams, a task to which we seem well adapted. This suggests that moving our minds closer to a “borg” team ideal would cost us somewhere else, such as in our mental agility.

On the concrete coordination gains that Shulman sees from superorganism ems, most of these gains seem cheaply achievable via simple long-standard human coordination mechanisms: property rights, contracts, and trade. Individual farmers have long faced starvation if they could not extract enough food from their property, and farmers were often out-competed by others who used resources more efficiently.

With ems there is the added advantage that em copies can agree to the “terms” of their life deals before they are created. An em would agree that it starts life with certain resources, and that life will end when it can no longer pay to live. Yes there would be some selection for humans and ems who peacefully accept such deals, but probably much less than needed to get loyal devotion to and shared values with a superorganism.

Yes, with high value sharing ems might be less tempted to steal from other copies of themselves to survive. But this hardly implies that such ems no longer need property rights enforced. They’d need property rights to prevent theft by copies of other ems, including being enslaved by them. Once a property rights system exists, the additional cost of applying it within a set of em copies seems small relative to the likely costs of strong value sharing.

Shulman seems to argue both that superorganisms are a natural endpoint of evolution, and that ems are especially supportive of superorganisms. But at most he has shown that em organizations may be at a somewhat larger scale, not that they would reach civilization-encompassing scales. In general, creatures who share values can indeed coordinate better, but perhaps not by much, and it can be costly to achieve and maintain shared values. I see no coordinate-by-values free lunch...


1. The natural endpoint

Bostrom says that a singleton is a natural conclusion of the long-term trend toward larger scales of political integration (p176). It seems helpful here to be more precise about what we mean by singleton. Something like a world government does seem to be a natural conclusion to long-term trends. However, this seems different from the kind of singleton I took Bostrom to be talking about previously. A world government would by default only make a certain class of decisions, for instance about global-level policies. There has been a long-term trend for the largest political units to become larger, but there have always been smaller units as well, making different classes of decisions, down to the individual. I'm not sure how to measure the mass of decisions made by different parties, but it seems like individuals may be making more decisions more freely than ever, and the large political units have less ability than they once did to act against the will of the population. So the long-term trend doesn't seem to point to an overpowering ruler of everything.

2. How value-aligned would emulated copies of the same person be?

Bostrom doesn't say exactly how 'emulations that were wholly altruistic toward their copy-siblings' would emerge. It seems to be some combination of natural 'altruism' toward oneself and selection for people who react to copies of themselves with extreme altruism (confirmed by a longer interesting discussion in Shulman's paper). How easily one might select for such people depends on how humans generally react to being copied, and in particular on whether they treat a copy like part of themselves or merely like a very similar acquaintance.

The answer to this doesn't seem obvious. Copies seem likely to agree strongly on questions of global values, such as whether the world should be more capitalistic, or whether it is admirable to work in technology. However, I expect many (perhaps most) failures of coordination come from differences in selfish values, e.g. I want me to have money, and you want you to have money. And if you copy a person, it seems fairly likely to me that the copies will both still want the money themselves, more or less.

From other examples of similar people—identical twins, family, people and their future selves—it seems people are unusually altruistic to similar people, but still very far from 'wholly altruistic'. Emulation siblings would be much more similar than identical twins, but who knows how far that would move their altruism?

Shulman points out that many people hold views about personal identity that would imply that copies share identity to some extent. However, philosophical views do not always translate fully into actual motivations.

3. Contemporary family clans

Family-run firms are a place to get some information about the trade-off between reducing agency problems and having access to a wide range of potential employees. Given a brief perusal of the internet, it seems to be ambiguous whether they do better. One could try to separate out the factors that help them do better or worse.

4. How big a problem is disloyalty?

I wondered how big a problem insider disloyalty really was for companies and other organizations. Would it really be worth all this loyalty testing? I can't find much about it quickly, but 59% of respondents to a survey apparently said they had some kind of problems with insiders. The same report suggests that a bunch of costly initiatives such as intensive psychological testing are currently on the table to address the problem. Also apparently it's enough of a problem for someone to be trying to solve it with mind-reading, though that probably doesn't say much.

5. AI already contributing to the surveillance-secrecy arms race

Artificial intelligence will help with surveillance in general sooner and more broadly than it will help with observing people's motives, e.g. here and here.

6. SMBC is also pondering these topics this week

In-depth investigations

If you are particularly interested in these topics, and want to do further research, these are a few plausible directions, some inspired by Luke Muehlhauser's list, which contains many suggestions related to parts of Superintelligence. These projects could be attempted at various levels of depth.

  1. What are the present and historical barriers to coordination, between people and organizations? How much have these been lowered so far? How much difference has it made to the scale of organizations, and to productivity? How much further should we expect these barriers to be lessened as a result of machine intelligence?
  2. Investigate the implications of machine intelligence for surveillance and secrecy in more depth.
  3. Are multipolar scenarios safer than singleton scenarios? Muehlhauser suggests directions.
  4. Explore ideas for safety in a singleton scenario via temporarily multipolar AI. e.g. uploading FAI researchers (See Salamon & Shulman, “Whole Brain Emulation, as a platform for creating safe AGI.”)
  5. Which kinds of multipolar scenarios would be more likely to resolve into a singleton, and how quickly?
  6. Can we get whole brain emulation without producing neuromorphic AGI slightly earlier or shortly afterward? See section 3.2 of Eckersley & Sandberg (2013).

If you are interested in anything like this, you might want to mention it in the comments, and see whether other people have useful thoughts.

How to proceed

This has been a collection of notes on the chapter. The most important part of the reading group though is discussion, which is in the comments section. I pose some questions for you there, and I invite you to add your own. Please remember that this group contains a variety of levels of expertise: if a line of discussion seems too basic or too incomprehensible, look around for one that suits you better!

Next week, we will talk about the 'value loading problem'. To prepare, read “The value-loading problem” through “Motivational scaffolding” from Chapter 12. The discussion will go live at 6pm Pacific time next Monday 26 January. Sign up to be notified here.

Slides online from "The Future of AI: Opportunities and Challenges"

13 ciphergoth 16 January 2015 11:17AM

In the first weekend of this year, the Future of Life Institute hosted a landmark conference in Puerto Rico: "The Future of AI: Opportunities and Challenges". The conference was unusual in that it was not made public until it was over, and the discussions were held under the Chatham House Rule. The slides from the conference are now available. The list of attendees includes a great many famous names as well as lots of names familiar to those of us on Less Wrong: Elon Musk, Sam Harris, Margaret Boden, Thomas Dietterich, all three DeepMind founders, and many more.

This is shaping up to be another extraordinary year for AI risk concerns going mainstream!
