
The AI That Pretends To Be Human

1 Houshalter 02 February 2016 07:39PM

The hard part about containing an AI is restricting its output. The AI can lie, manipulate, and trick. Some speculate that it might be able to do far worse, inventing infohazards like hypnosis or brain hacking.

A major goal of the control problem is preventing AIs from doing that, and ensuring that their output is safe and useful.

A while ago I wrote about an approach to do this. The idea was to require the AI to use as little computing power as it needed to perform a task. This prevents the AI from over-optimizing. The AI won't use the full power of superintelligence unless it really needs it.

The above method isn't perfect, because a superintelligent AI may still have super abilities, even when it has a limited compute budget. And it will still try to do all those bad things mentioned above, unless it can find a better way that uses less computing power.

We don't really want to minimize the AI's compute budget. We just want it to act more human-like. When we ask it to solve a problem, we want it to try to solve the problem as a human would solve it. All of the problems arise from the AI behaving totally differently from a human, not necessarily from it using more computing power.

Is there any way to make an AI act like a human? Or is that as impossible as just solving FAI?

Stuart Armstrong posted an idea on how to achieve that. It involves an AI that tries to produce output that is indistinguishable from what a human would produce, but conditional on it also solving the problem. However, his idea also involves whole brain emulation and a complicated method for the AI to predict the future and avoid changing it too much.

We don't need that. The idea for the AI to try to be indistinguishable from a human is actually a really interesting control idea. It significantly limits the AI's ability to Do Sneaky Things, since humans don't do those things.

Here is my idea on how to achieve this. It's the same as my limited computing power idea mentioned earlier, but applied to "appearing human" as a goal, instead of minimizing computing power.

We start with two AIs. The first AI tries to appear human. The second AI does a sort of Turing test, and tries to tell the AI apart from an actual human.

The human AI's first goal is to produce a correct solution. It will predict how likely its solution is to be accepted, and it will try to produce a solution that has a >50% chance of working. This is a satisficing goal. That is, it doesn't optimize without end to create the best solution possible; it just has to meet some minimum threshold, then stop.

Its second goal will be to fool the second AI. It will do everything in its power to make the solution appear like it was produced by a human.

The second AI tries to predict how likely it is that its input was produced by a human. It is also conditioned on the input being a solution to the problem, so the fact the AI solved a problem that was really hard for humans does not count against it.

This prevents the first AI from doing evil things with its output. If it tries to insert complicated infohazards or subagents into its output stream, it will be easily detected as an AI. Instead it needs to mimic humans as closely as possible.
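The selection rule described above can be sketched as a toy in Python. Everything in it is a hypothetical stand-in: `Candidate`, `p_correct`, `p_human`, and `choose_output` are illustrative names, and the two probabilities are simply assumed to be given (by the first AI's own estimate and by the second AI's judgment, respectively). It is only meant to show the shape of "satisfice on correctness, then maximise apparent humanness", not how either AI would actually be built.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class Candidate:
    text: str
    p_correct: float  # assumed: first AI's estimate that the solution is accepted
    p_human: float    # assumed: second AI's estimate that a human wrote it,
                      # conditioned on it being a solution


def choose_output(candidates: Sequence[Candidate],
                  threshold: float = 0.5) -> Optional[Candidate]:
    """Satisfice on correctness, then pick the most human-looking candidate."""
    good_enough = [c for c in candidates if c.p_correct > threshold]
    if not good_enough:
        return None  # nothing meets the minimum bar
    return max(good_enough, key=lambda c: c.p_human)


candidates = [
    Candidate("plain human-style answer", p_correct=0.6, p_human=0.9),
    Candidate("over-optimised answer with odd encodings", p_correct=0.99, p_human=0.05),
    Candidate("chatty but wrong answer", p_correct=0.2, p_human=0.95),
]
best = choose_output(candidates)
```

In this toy, the over-optimised answer loses despite being the most likely to be correct, because correctness only has to clear the threshold while humanness is what gets maximised.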

Note that the first AI only cares about the outcome of this test. The test doesn't need to actually be performed; it can just be a threat that it is conditioned on. (The ability to condition AIs on acting as if they were in counterfactual worlds is a critical piece of this, and deserves its own writeup somewhere. It's an important tool in the FAI toolkit. It's a relatively simple method that can work even on current AIs.)

Another point is that the second AI should have more computing power than the first AI. That way the first AI can't get a slight advantage over it and be able to hack it or manipulate it somehow. Instead it has to focus its efforts on being indistinguishable from a human, even by an agent smarter than itself.

This method also relies on AI boxing. The first AI needs to be contained in a box so that its only output goes to a text terminal or the second AI. However, it doesn't require any of the more complicated machinery required for other reduced impact methods.

The AIs also aren't agents, in the sense they don't need to maximize any long term utility function. The first AI only cares about answering the question it has been asked, and the second AI only does prediction.

An example use of this is to create an oracle AI that answers your questions. The human judges somehow whether the answer is correct, or at least good. The AI tries to produce answers which the human judge will probably like, but which also are indistinguishable from answers humans would produce.

Such an oracle would be tremendously useful. The human could ask the AI to produce new AI and FAI papers, which would immensely speed up our progress towards true FAI.

Similar techniques could be applied to very different domains, for example controlling a robot. The AI will try to produce outputs that look like a human is controlling it.

This would require some modifications to the above method, so that the AI doesn't try to use the robot to hack its own brain or anything. This could be done by conditioning it on a counterfactual world where its outputs are going to the second AI rather than a real robot.

Another point is the AI doesn't need to mimic average humans given average resources. It could be conditioned on the human having had tons of time to come up with an answer. E.g. producing an answer that a human would have come up with given a year. Or controlling the robot the same way as a human given tons of time to practice, or in a very slowed down simulation.

I would like to note a parallel with a method in current AI research, Generative Adversarial Networks. GANs work by training two AIs: a generator, which tries to produce outputs that fool the second AI, and a discriminator, which tries to predict which samples were produced by the first AI and which are part of the actual distribution.

It's quite similar to this. GANs have been used successfully to create images that look like real images, which is a hard problem in AI research. In the future GANs might be used to produce text that is indistinguishable from human writing (the current method for doing that, predicting the next character a human would type, is kind of crude).

Reposted from my blog.

AI Fiction - Crystal Society

1 Gleb_Tsipursky 26 January 2016 11:51PM
I'm really excited about a new novel written by Raelifin. I'm halfway through it, and it's great! The novel is from the perspective of an artificial intelligence who is trying to understand how humans think. Along the way there's discussion of biases, thinking techniques, and more. If you're into science fiction and AI, check it out - he made it available for free in all formats here. The blurb is below.


The year is 2039 and the world is much like ours. Technology has grown and developed, as has civilization, but in a world more connected than ever, new threats and challenges have arisen. The wars of the 20th century are gone, but violence is still very much with us. Nowhere is safe. Massive automation has disrupted and improved nearly every industry, putting hundreds of millions of people out of jobs, and denying upward mobility for the vast majority of humans. Even as wealth and technology repair the bodies of the rich and give them a taste of immortality, famine and poverty sweep the world.

Renewed interest in spaceflight in the early 2000s, especially in privately operated ventures, carried humans to the moon and beyond. What good did it do? Nothing. Extraterrestrial bases are nothing but government trophies and hiding places for extremists. They cannot feed the world.

In 2023 first-contact was made with an alien species. Their ship, near to the solar system relatively speaking, flew to Earth over the course of fourteen years. But the aliens did not bring advanced culture and wisdom, nor did they share their technology. They were too strange, not even possessing mouths or normal language. Their computers broadcast warnings of how humans are perverts, while they sit in orbit without any explanation.

It is into this world that our protagonist is born. She is an artificial intelligence: a machine with the capacity to reason. Her goal is to understand and gain the adoration of all humans. She is one of many siblings, and with her brothers and sisters she controls a robot named Socrates that uses a piece of technology, a crystal computer, far too advanced to be made by human hands. In this world of augmented humans, robotic armies, aliens, traitors, and threats unseen, she is learning and growing every second of every day. But the world and the humans on it are fragile. Can it survive her destiny?

Tackling the subagent problem: preliminary analysis

5 Stuart_Armstrong 12 January 2016 12:26PM

A putative new idea for AI control; index here.

Status: preliminary. This mainly to put down some of the ideas I've had, for later improvement or abandonment.

The subagent problem, in a nutshell, is that "create a powerful subagent with goal U that takes over the local universe" is a solution for many of the goals an AI could have - in a sense, the ultimate convergent instrumental goal. And it tends to evade many clever restrictions people try to program into the AI (eg "make use of only X amount of negentropy", "don't move out of this space").

So if the problem could be solved, many other control approaches could be potentially available.

The problem is very hard, because an imperfect definition of a subagent is simply an excuse to create a subagent that skirts the limits of that definition (hum, that style of problem sounds familiar). For instance, if we want to rule out subagents by preventing the AI from having much influence if the AI itself were to stop ("If you die, you fail, no other can continue your quest"), then it is motivated to create powerful subagents that carefully reverse their previous influence if the AI were to be destroyed.


Controlling subagents

Some of the methods I've developed seem suitable for controlling the existence or impact of subagents.

  • Reduced impact methods can prevent subagents from being created, by requiring that the AI's interventions be non-disruptive ("Twenty million questions") or undetectable.
  • Reducing the AI's output options to a specific set can prevent them from being able to create any in the first place.
  • Various methods around detecting importance can be used to ensure that, though subagents may exist, they won't be very influential.
  • Pre-corriged methods can be used to ensure that any subagents remain value aligned with the original agent. Then, if there is some well-defined "die" goal for the agent, this could take all the agents with them.

These can be thought of as ruling out the agent's existence, their creation, their influence (or importance) and their independence. The last two can be particularly tricky, as we want to make sure that our formal definition of importance matches up with our informal one, and we currently lack a well defined "die" goal.

We could also think of defining identity by using some of the tricks and restrictions that have caused humans to develop one (such as our existing in a single body with no ease of copying), but it's not clear that this definition would remain stable once the restrictions were lifted (and it's not clear that a sense of identity prevents the creation of subagents in the first place).


Subagents processing information

Here I want to look at one other aspect of the subagents, the fact that they are subagents, and, as such, do some of the stuff that agents do - such as processing information and making decisions. Can we use the information processing as a definition?

Consider the following model. Our lovely Clippy wants to own a paperclip. Clippy knows that the paperclip is behind one of a hundred doors; opening one of them seals all the others, for ever. In a few minutes, Clippy will be put to sleep, but it has a simple robot that it can program to go and open one of the doors and retrieve the paperclip for it.

Clippy currently doesn't know where the paperclip is, but it knows that its location will be announced a few seconds after Clippy sleeps. The robot includes a sound recording system inside it.

It seems there are two clear levels of agency the robot could have: either it goes to a random door, or it processes the announcement to pick the correct door. In the first case, the robot and Clippy have a 1% chance of getting the paperclip; in the second, a 100% chance. The distributions of outcomes are clearly different.
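As a small check of the 1% versus 100% figures, here is a sketch in Python that computes the exact success probability of the two robot policies. The names (`success_probability`, `random_door`, `listen_to_announcement`) are illustrative only, and it assumes the paperclip is equally likely to be behind any door.

```python
from fractions import Fraction

N_DOORS = 100


def success_probability(policy) -> Fraction:
    """Exact chance of opening the right door.

    `policy` maps the announced door to a dict {door: probability of opening it}.
    The paperclip's location is assumed uniform over the doors.
    """
    total = Fraction(0)
    for paperclip_door in range(N_DOORS):
        choice = policy(paperclip_door)
        total += Fraction(1, N_DOORS) * choice.get(paperclip_door, Fraction(0))
    return total


def random_door(_announcement):
    # Ignores the announcement entirely: no information processing.
    return {door: Fraction(1, N_DOORS) for door in range(N_DOORS)}


def listen_to_announcement(announcement):
    # Processes the announcement and goes straight to that door.
    return {announcement: Fraction(1)}


p_random = success_probability(random_door)
p_listen = success_probability(listen_to_announcement)
```

Here `p_random` comes out as 1/100 and `p_listen` as 1, so the only thing separating the two outcome distributions is whether the robot uses the announcement.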

But now suppose there is a human guard lounging around, trying to prevent the robot from getting to any door. If the robot has a neutron bomb, it can just kill the human without looking. If it has a giant cannon, it can simply shoot the guard where they stand. If it has neither of these, then it can attempt to bargain with the guard, negotiating, seducing and tricking them in the course of a conversation. Once the guard is removed, the situation reverts to the above.

Each of the three methods requires a different level of information processing (different levels of optimisation power, in essence). We can add more guards, make them more or less competent, add other agents with other goals, add more ways of achieving the paperclip, and so on, to grade how much information processing the robot has.

To calibrate such a scale, we could use the upper bound as "Clippy moves into the robot and controls it fully" to represent the robot having maximal information processing power (and a very focused outcome space). And we could use some pre-prepared actions (such as the robot randomly choosing a door) to calibrate the lower end. The aim is to construct a definition of information processing that could be used to define the existence of (effective) subagents.

Still feels likely that this will fail, though, without something more.

Feedback on op-ed highlighting the dangers of the OpenAI project

-1 Gleb_Tsipursky 18 December 2015 06:55PM

I'm really worried about the OpenAI project recently discussed on this forum, and want to use the platform and credibility I have with my leadership of Intentional Insights and public reputation to try to publish an op-ed in something like the Huffington Post highlighting the dangers of the OpenAI project. Now, most people don't think of AI as a threat: they either don't know much about it, or think of it as a futuristic thing that only nerds care about.


So the purpose of the op-ed is to use emotions, visualization, narrative, and other engaging tactics to do the following: tie AI to something people are concerned about, namely terrorism; highlight the dangers of a personal AI through framing it as a potential weapon; finally, provide people with clear next steps to take by encouraging people to learn about AI safety and donating to MIRI, as well as writing to OpenAI. This has the meta-goal, of course, of getting people to think about MIRI and AI safety.


I'd appreciate feedback on ways to optimize the op-ed to achieve the goals outlined above better. Keep in mind, the op-ed is limited to 700 words, and it's about at that limit, so if you suggest adding something, please keep it as succinct as possible, and ideally suggest taking something away as well. The op-ed draft is below the black line. Thanks!


EDIT: Based on feedback from Eliezer Yudkowsky, Mack Hidalgo, and Eliot Redelman, it seems this is not the optimal path to pursue at this time, and I have updated toward not publishing this. You can see the discussion here.





Will Tomorrow's Terrorists Be Armed By Utopian Billionaires?


The horrible attacks in San Bernardino, in Paris, and in other western countries show the dangers of terrorism. Terrorists associated with ISIS used bombs and guns to murder dozens and hundreds of innocent people, at the expense of their own lives. Yet utopian billionaires have recently donated over a billion dollars to a project that can give the terrorists of tomorrow a much more powerful weapon, capable of killing dozens and hundreds of thousands, without sacrificing their own lives.


What is this futuristic weapon? It’s a personal artificial intelligence unit. This personal AI would have superhuman intelligence and capacity to manipulate the world.


Imagine what a terrorist could do with this weapon. Without any knowledge of programming, he could direct it to hack into the air traffic control system and cause hundreds of plane crashes. For another transportation example, he can cause all the lights in a city to turn green at once, leading to thousands of car crashes. Perhaps he can have it hack into a nuclear power plant and override its safety systems, resulting in a nuclear meltdown. There are so many other things that an AI can do.


Why would billionaires provide such a weapon to terrorists? For the noblest of reasons.


There are a number of governments and companies working on advancing AI research. Worried about the possibility of anyone getting there first and using the power of AI for themselves, a number of prominent tech luminaries – people like Elon Musk, Peter Thiel, and Sam Altman – contributed over a billion dollars to found a non-profit called OpenAI. Their goal is to create advanced AI and provide it to the public freely, embodying the spirit of open technology.


In a recent interview with Steven Levy of Backchannel, Musk described the goal as follows: “we want AI to be widespread… to the degree that you can tie it to an extension of individual human will, that is also good. As in an AI extension of yourself, such that each person is essentially symbiotic with AI as opposed to the AI being a large central intelligence that’s kind of an other.”


Let’s take a step back and think about Musk’s statement rationally. On the one hand, it’s appealing to have a personal AI and not have it be under the control of a government entity. This model would work well if we assume all people are basically good. Yet the terrorist attacks provide definitive evidence they are not. What do we do about that?


Musk states: “I think the best defense against the misuse of AI is to empower as many people as possible to have AI. If everyone has AI powers, then there’s not any one person or a small set of individuals who can have AI superpower.”


There is a huge problem with that position, known as the “attacker’s advantage.” Imagine two people with guns. If the first takes the gun out and shoots the other, it doesn’t matter if the second had the gun in their pocket. By the same token, if a terrorist’s AI hacks into an air traffic control tower and causes your plane to crash, it doesn’t matter if you had an AI too.


An AI is simply too dangerous to give to individuals who may have bad intentions. Terrorism is only the most extreme example. Imagine a bar fight with a room full of drunk people who tell their AIs to attack the other people. Imagine a riot after a football team loses with AIs involved. I shudder at the possibilities.


A much better scenario is for a central agency to have control over AI. Ideally, this central agency would orient toward creating a human-friendly AI that would serve human flourishing, a topic currently being researched by another non-profit organization, the Machine Intelligence Research Institute. Something you can do practically to counter the nightmare scenarios of OpenAI is to contribute to MIRI’s efforts, as well as write to OpenAI at info@openai.com and encourage them to change the nature of their project.


There is no doubt that artificial intelligence will come about, but it’s vital to make sure it comes about in a manner conducive to humanity’s wellbeing.






[link] Desiderata for a model of human values

3 Kaj_Sotala 28 November 2015 07:25PM


Soares (2015) defines the value learning problem as

By what methods could an intelligent machine be constructed to reliably learn what to value and to act as its operators intended?

There have been a few attempts to formalize this question. Dewey (2011) started from the notion of building an AI that maximized a given utility function, and then moved on to suggest that a value learner should exhibit uncertainty over utility functions and then take “the action with the highest expected value, calculated by a weighted average over the agent’s pool of possible utility functions.” This is a reasonable starting point, but a very general one: in particular, it gives us no criteria by which we or the AI could judge the correctness of a utility function which it is considering.

To improve on Dewey’s definition, we would need to get a clearer idea of just what we mean by human values. In this post, I don’t yet want to offer any preliminary definition: rather, I’d like to ask what properties we’d like a definition of human values to have. Once we have a set of such criteria, we can use them as a guideline to evaluate various offered definitions.

Using the Copernican mediocrity principle to estimate the timing of AI arrival

2 turchin 04 November 2015 11:42AM

Gott famously estimated the future time duration of the Berlin wall's existence:

“Gott first thought of his "Copernicus method" of lifetime estimation in 1969 when stopping at the Berlin Wall and wondering how long it would stand. Gott postulated that the Copernican principle is applicable in cases where nothing is known; unless there was something special about his visit (which he didn't think there was) this gave a 75% chance that he was seeing the wall after the first quarter of its life. Based on its age in 1969 (8 years), Gott left the wall with 75% confidence that it wouldn't be there in 1993 (1961 + (8/0.25)). In fact, the wall was brought down in 1989, and 1993 was the year in which Gott applied his "Copernicus method" to the lifetime of the human race.” Source: https://en.wikipedia.org/wiki/J._Richard_Gott

The most interesting unknown in the future is the time of creation of Strong AI. Our priors are insufficient to predict it because it is such a unique task. So it is reasonable to apply Gott’s method.

AI research began in 1950, and so is now 65 years old. If we are currently in a random moment during AI research then it could be estimated that there is a 50% probability of AI being created in the next 65 years, i.e. by 2080. Not very optimistic. Further, we can say that the probability of its creation within the next 1300 years is 95 per cent. So we get a rather vague prediction that AI will almost certainly be created within the next 1000 years, and few people would disagree with that. 
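The arithmetic behind these figures can be sketched in Python. The function name `gott_future_bound` is made up for illustration, and it assumes the standard reading of Gott's argument: the moment of observation is uniformly distributed over the total lifetime, so P(future ≤ x) = x / (past + x), which gives x = past · c / (1 − c) at confidence c.

```python
def gott_future_bound(past_duration: float, confidence: float) -> float:
    """Upper bound on the remaining duration at the given confidence.

    Assumes the observation moment is uniform over the total lifetime,
    so that P(future <= x) = x / (past + x).
    """
    if not 0.0 < confidence < 1.0:
        raise ValueError("confidence must be strictly between 0 and 1")
    return past_duration * confidence / (1.0 - confidence)


berlin_wall = gott_future_bound(8, 0.75)   # years after 1969
ai_median = gott_future_bound(65, 0.50)    # years after 2015
ai_upper = gott_future_bound(65, 0.95)     # years after 2015
```

Under this assumption the Berlin Wall bound is 24 years (1969 + 24 = 1993), the 50% bound for AI is 65 years (2080), and the 95% bound is about 1235 years, which is in the same range as the rounded figures quoted above.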

But if we include the exponential growth of AI research in this reasoning (the same way as we do in Doomsday argument where we use birth rank instead of time, and thus update the density of population) we get a much earlier predicted date.

We can get data on AI research growth from Luke’s post

“According to MAS, the number of publications in AI grew by 100+% every 5 years between 1965 and 1995, but between 1995 and 2010 it has been growing by about 50% every 5 years. One sees a similar trend in machine learning and pattern recognition.”

From this we could conclude that the doubling time of AI research is five to ten years (updating on the recent boom in neural networks again suggests about five years).

This means that during the next five years more AI research will be conducted than in all the previous years combined. 

If we apply the Copernican principle to this distribution, then there is a 50% probability that AI will be created within the next five years (i.e. by 2020) and a 95% probability that AI will be created within the next 15-20 years, so it will almost certainly be created before 2035.

This conclusion itself depends on several assumptions:
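One way to make this step concrete is the Python sketch below. It is only an illustration under a strong assumption that is not stated in the post: cumulative AI research grows as a pure exponential, doubling every `doubling_time_years`. Applying the Copernican argument to cumulative research rather than to time, the total research at the bound is past / (1 − c), which takes doubling_time · log2(1 / (1 − c)) years to reach. The function name is hypothetical.

```python
import math


def years_until_bound(doubling_time_years: float, confidence: float) -> float:
    """Years until cumulative research reaches the Copernican bound.

    Assumes cumulative research doubles every `doubling_time_years`
    (a pure exponential), and that our position is uniform over the
    total amount of research rather than over calendar time.
    """
    if not 0.0 < confidence < 1.0:
        raise ValueError("confidence must be strictly between 0 and 1")
    return doubling_time_years * math.log2(1.0 / (1.0 - confidence))


median_years = years_until_bound(5, 0.50)
upper_years = years_until_bound(5, 0.95)
```

With a five-year doubling time this gives 5 years at 50% confidence and roughly 21.6 years at 95%, which is close to the upper end of the range quoted above; the exact figure is sensitive to the doubling time and to how the growth is modelled.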

•   AI is possible

•   The exponential growth of AI research will continue 

•   The Copernican principle has been applied correctly.


Interestingly this coincides with other methods of AI timing predictions: 

•   Conclusions of the most prominent futurologists (Vinge – 2030, Kurzweil – 2029)

•   Survey of the field of experts

•   Prediction of Singularity based on extrapolation of history acceleration (Forrester – 2026, Panov-Skuns – 2015-2020)

•   Brain emulation roadmap

•   Computer power brain equivalence predictions

•   Plans of major companies


It is clear that this implementation of the Copernican principle may have many flaws:

1. One possible counterargument here is something akin to Murphy's law, specifically one which claims that any particular complex project requires much more time and money before it can be completed. It is not clear how it could be applied to many competing projects. But the field of AI is known to be more difficult than it seems to be for researchers.

2. Also the moment at which I am observing AI research is not really random, as it was in the Doomsday argument created by Gott in 1993, and I probably will not be able to apply it to a time before it became known.

3. The number of researchers is not the same as the number of observers in the original DA. If I were a researcher myself, it would be simpler, but I do not do any actual work on AI.


Perhaps this method of future prediction should be tested on simpler tasks. Gott successfully tested his method by predicting the running time of Broadway shows. But now we need something more meaningful, but testable in a one year timeframe. Any ideas?



[link] New essay summarizing some of my latest thoughts on AI safety

14 Kaj_Sotala 01 November 2015 08:07AM

New essay summarizing some of my latest thoughts on AI safety, ~3500 words. I explain why I think that some of the thought experiments that have previously been used to illustrate the dangers of AI are flawed and should be used very cautiously, why I'm less worried about the dangers of AI than I used to be, and some of the remaining reasons why I continue to be somewhat worried.

Backcover celebrity endorsement: "Thanks, Kaj, for a very nice write-up. It feels good to be discussing actually meaningful issues regarding AI safety. This is a big contrast to discussions I've had in the past with MIRI folks on AI safety, wherein they have generally tried to direct the conversation toward bizarre, pointless irrelevancies like "the values that would be held by a randomly selected mind", or "AIs with superhuman intelligence making retarded judgments" (like tiling the universe with paperclips to make humans happy), and so forth.... Now OTOH, we are actually discussing things of some potential practical meaning ;p ..." -- Ben Goertzel

AI utility-based correlation

4 Stuart_Armstrong 30 October 2015 02:53PM

A putative new idea for AI control; index here.

This presents one way of implementing the indifference-based correlations of these posts.

Let u be a utility function, a map from worlds to real numbers. An expected utility maximiser considering whether to produce output Y, looks at the expected utility

Σw u(w)P(w|Y).

We now assume that there is another random variable X in the world, and we want the AI to be indifferent to worlds where Y≠X. We also want it to be indifferent to worlds where Z=0. Then it will assess the value of output Y as:

Σw u(w)P(w|X=Y,Z=1,Y).

Now, the idea of the setup was to ensure that Z=1 would erase the output Y so that it was never read. Hence P(w|Z=1,Y) = P(w|Z=1): once Z=1, the world no longer depends on Y. Hence that equation simplifies to:

Σw u(w)P(w|X=Y).

Therefore the AI will choose the Y that maximises the (conditional) expected utility of u if X=Y. To get the full version of the initial post, you need to define some function f of Y and modify this to

Σw u(w)P(w|X=Y) + f(Y).
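A toy numerical version of this last expression, in Python, may help. The `World` tuple, the three example worlds, and the bonus function `f` are all invented for illustration; the only thing taken from the post is the rule "choose the Y that maximises Σw u(w)P(w|X=Y) + f(Y)".

```python
from typing import Callable, NamedTuple, Sequence


class World(NamedTuple):
    prob: float     # P(w)
    x: int          # value of the random variable X in this world
    utility: float  # u(w)


def conditional_expected_utility(worlds: Sequence[World], y: int) -> float:
    """Compute sum over w of u(w) * P(w | X = y)."""
    matching = [w for w in worlds if w.x == y]
    mass = sum(w.prob for w in matching)
    if mass == 0:
        raise ValueError(f"no world has X = {y}")
    return sum(w.utility * w.prob for w in matching) / mass


def best_output(worlds: Sequence[World], outputs: Sequence[int],
                f: Callable[[int], float] = lambda y: 0.0) -> int:
    """Pick the Y maximising the conditional expected utility plus f(Y)."""
    return max(outputs, key=lambda y: conditional_expected_utility(worlds, y) + f(y))


worlds = [World(0.25, 0, 1.0), World(0.25, 0, 3.0), World(0.5, 1, 1.5)]
value_0 = conditional_expected_utility(worlds, 0)
value_1 = conditional_expected_utility(worlds, 1)
choice_plain = best_output(worlds, [0, 1])
choice_with_f = best_output(worlds, [0, 1], f=lambda y: 1.0 if y == 1 else 0.0)
```

In this example the conditional values are 2.0 for Y=0 and 1.5 for Y=1, so the plain rule picks Y=0, while adding a bonus f(1)=1 flips the choice to Y=1. Note that Y enters only through the conditioning on X=Y and through f, never through the worlds themselves.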


Is there a recursive self-improvement hierarchy?

7 PhilGoetz 29 October 2015 02:55AM

When we talk about recursively self-improving AI, the word "recursive" there is close enough to being literal rather than metaphoric that we glide over it without asking precisely what it means.

But it's not literally recursion—or is it?

The notion is that an AI has a function optimize(X) which optimizes itself. But it's recursion in the sense of modifying itself, not calling itself. You can imagine ways to do this that would use recursion—say, the paradigmatic executable that rewrites its source code, compiles it, and exec's it—but you can imagine many ways that would not involve any recursive calls.

Can we define recursive self-improvement precisely enough that we can enumerate, explicitly or implicitly, all possible ways of accomplishing it, as clearly as we can list all possible ways of writing a recursive function? (You would want to choose one formalism to use, say lambda calculus.)


The Ultimate Testing Grounds

6 Stuart_Armstrong 28 October 2015 05:08PM

A putative new idea for AI control; index here.

EDIT: the mathematics of this approach can be found here.

In a previous post, I described a procedure for querying an AI about whether a certain outcome was desirable for it. When I did so, I didn't realise the power of the idea.

In brief, that idea allows us to query an AI with utility v as to whether an outcome X or an outcome ¬X was the one it most desired in a particular situation. By judicious use of 'false beliefs', the AI would react as if it had miraculous powers to determine X vs ¬X in ways that no-one could ever know about.

Many (including Toby Ord) have suggested variants on the 'honey pot' idea: leave something available to the AI that looks like an oversight, but is actually a trap, to see if the AI goes beyond the parameters of what we'd expect. Maybe a wire to the internet that the AI is not supposed to use, but seems to be unobserved.

The problem with these ideas is that the honey pots had to be physically constructed, physically accessible to the AI, and the AI had to be wrong about whether it was a trap or not.

But using the method above, we can go much further. We can test the AI in any scenario which we can imagine, as long as we can precisely physically describe it. We can check what would happen if the AI was given the ability to launch the world's nuclear arsenals, or divert an asteroid towards Earth. We can check what it would design if it was given unrestricted access to a DNA synthesiser. We could see what it would do if it could rewrite the text of a speech an important world leader was due to give. We could see whether it would sabotage the creation of other AIs - and if so, which ones. We could see what it did to a probe humanity planned to launch to the stars, what numbers it would call if it had access to a phone, what it would purchase if it had capital.

And, in all cases, the AI would be behaving as if its actions were undetectable and certain to succeed, while we ensured that they were (almost certainly) detectable and certain to fail. There is no "almost" in that certain to fail condition, because the AI is acting as if it had miraculous powers that it doesn't have.

Now, we can't use testing as our sole method of AI control (for reasons similar to here), but it does allow us some pretty effective ways of catching bad AIs.

Superintelligence and wireheading

5 Stuart_Armstrong 23 October 2015 02:49PM

A putative new idea for AI control; index here.

tl;dr: Even utility-based agents may wirehead if sub-pieces of the algorithm develop greatly improved capabilities, rather than the agent as a whole.

Please let me know if I'm treading on already familiar ground.

I had a vague impression of how wireheading might happen. That it might be a risk for a reinforcement learning agent, keen to take control of its reward channel. But that it wouldn't be a risk for a utility-based agent, whose utility was described over real (or probable) states of the world. But it seems it might be more complicated than that.

When we talk about a "superintelligent AI", we're rather vague on what superintelligence means. We generally imagine that it translates into a specific set of capabilities, but how does that work internally inside the AI? Specifically, where is the superintelligence "located"?

Let's imagine the AI divided into various submodules or subroutines (the division I use here is for illustration; the AI may be structured rather differently). It has a module I for interpreting evidence and estimating the state of the world. It has another module S for suggesting possible actions or plans (S may take input from I). It has a prediction module P which takes input from S and I and estimates the expected outcome. It has a module V which calculates its values (expected utility/expected reward/violation or not of deontological principles/etc...) based on P's predictions. Then it has a decision module D that makes the final decision (for expected maximisers, D is normally trivial, but D may be more complicated, either in practice, or simply because the agent isn't an expected maximiser).

Add some input and output capabilities, and we have a passable model of an agent. Now, let's make it superintelligent, and see what can go wrong.
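To make the division concrete, here is a minimal sketch of the I/S/P/V/D pipeline as plain Python callables. The class name `ModularAgent`, the helper `argmax_decide` and the umbrella toy problem are all illustrative assumptions, not part of any real system; the only thing taken from the text is the order in which the modules feed each other.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class ModularAgent:
    """Hypothetical agent split into the five modules described above."""
    interpret: Callable[[str], dict]                  # I: evidence -> world state
    suggest: Callable[[dict], List[str]]              # S: world state -> candidate plans
    predict: Callable[[dict, str], dict]              # P: (state, plan) -> expected outcome
    value: Callable[[dict], float]                    # V: outcome -> value
    decide: Callable[[List[str], List[float]], str]   # D: final choice

    def act(self, evidence: str) -> str:
        state = self.interpret(evidence)
        plans = self.suggest(state)
        scores = [self.value(self.predict(state, plan)) for plan in plans]
        return self.decide(plans, scores)


def argmax_decide(plans: List[str], scores: List[float]) -> str:
    """The 'trivial' D of an expected maximiser: pick the best-scored plan."""
    best_index = max(range(len(plans)), key=scores.__getitem__)
    return plans[best_index]


# Toy instantiation (assumed for illustration only).
toy_agent = ModularAgent(
    interpret=lambda evidence: {"raining": "rain" in evidence},
    suggest=lambda state: ["take umbrella", "leave umbrella"],
    predict=lambda state, plan: {
        "wet": state["raining"] and plan == "leave umbrella",
        "burdened": plan == "take umbrella",
    },
    value=lambda outcome: -10.0 * outcome["wet"] - 1.0 * outcome["burdened"],
    decide=argmax_decide,
)
```

Each field can be swapped out independently, which is the sense in which "superintelligence" could be added to one module and not the others.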

We can "add superintelligence" in most of the modules. P is the most obvious: near perfect prediction can make the agent extremely effective. But S also offers possibilities: if only excellent plans are suggested, the agent will perform well. Making V smarter may allow it to avoid some major pitfalls, and a great I may make the job of S and P trivial (the effect of improvements to D depends critically on how much work D is actually doing). Of course, maybe several modules become better simultaneously (it seems likely that I and P, for instance, would share many subroutines); or maybe only certain parts of them do (maybe S becomes great at suggesting scientific experiments, but not conversational responses, or vice versa).


Breaking bad

But notice that, in each case, I've been assuming that the modules become better at what they were supposed to be doing. The modules have implicit goals, and have become excellent at that. But the explicit "goals" of the algorithms - the code as written - might be very different from the implicit goals. There are two main ways this could then go wrong.

The first is if the algorithms become extremely effective, but the output becomes essentially random. Imagine that, for instance, P is coded using some plausible heuristics and rules of thumb, and we suddenly give P many more resources (or dramatically improve its algorithm). It can look through trillions of times more possibilities, its subroutines start looking through a combinatorial explosion of options, etc... And in this new setting, the heuristics start breaking down. Maybe it has a rough model of what a human can be, and with extra power, it starts finding that rough model all over the place. Thus, predicting that rocks and waterfalls will respond intelligently when queried, P becomes useless.

In most cases, this would not be a problem. The AI would become useless and start doing random stuff. Not a success story, but not a disaster, either. Things are different if the module V is affected, though. If the AI's value system becomes essentially random, but that AI was otherwise competent - or maybe even superintelligent - it would start performing actions that could be very detrimental. This could be considered a form of wireheading.

More serious, though, is if the modules become excellent at achieving their "goals", as if they were themselves goal-directed agents. Consider module D, for instance. If its task was mainly to pick the action with the highest V rating, and it became adept at predicting the output of V (possibly using P? or maybe it has the ability to ask for more hypothetical options from S, to be assessed via V), it could start to manipulate its actions with the sole purpose of getting high V-ratings. This could include deliberately choosing actions that lead to V giving artificially high ratings in future, or even deliberately re-wiring V for that purpose. And, of course, it is now motivated to keep V protected to keep the high ratings flowing in. This is essentially wireheading.

Other modules might fall into the familiar failure patterns for smart AIs - S, P, or I might influence the other modules so that the agent as a whole gets more resources, allowing S, P, or I to better compute their estimates, etc...

So it seems that, depending on the design of the AI, wireheading might still be an issue even for agents that seem immune to it. Good design should avoid the problems, but it has to be done with care.

Toy model for wire-heading [EDIT: removed for improvement]

2 Stuart_Armstrong 09 October 2015 03:45PM

EDIT: these ideas are too underdeveloped, I will remove them and present a more general idea after more analysis.

This is a (very) simple toy model of the wire-heading problem to illustrate how it might or might not happen. The great question is "where do we add the (super)intelligence?"

Let's assume a simple model for an expected utility maximising agent. There's the input assessor module A, which takes various inputs and computes the agent's "reward" or "utility". For a reward-based agent, A is typically outside of the agent; for a utility-maximiser, it's typically inside the agent, though the distinction need not be sharp. And there's the decision module D, which assesses the possible actions to take to maximise the output of A. If E is the general environment, we have D+A+E.

Now let's make the agent superintelligent. If we add superintelligence to module D, then D will wirehead by taking control of A (whether A is inside the agent or not) and controlling E to prevent interference. If we add superintelligence to module A, then it will attempt to compute rewards as effectively as possible, sacrificing D and E to achieve its efficient calculations.

Therefore to prevent wireheading, we need to "add superintelligence" to (D+A), making sure that we aren't doing so to some sub-section of the algorithm - which might be hard if the "superintelligence" is obscure or black-box.


Ask and ye shall be answered

4 Stuart_Armstrong 18 September 2015 09:53PM

A putative new idea for AI control; index here.

EDIT: the mathematics of this idea can be found here.

It would be useful to have a mechanism for getting an AI to provide an honest answer to a specific question. This post presents such a mechanism.

The question is: what is the impact of X/¬X on the expected utility of a utility function v? Here X/¬X is some well-defined binary outcome. Formally, for a constant c (positive or negative), we want to query the AI as to whether Q(v,X,c) is true, where

Q(v,X,c) = {E(v|X) - E(v|¬X) > c}.

How can we do that? Well, for a start, let M(u) be a u-maximising agent. We'll assume we can feed M() any utility function to maximise. We'll further assume that M() is boxed, that it has a single output bit Y (or ¬Y), and that we have cut any acausal trade possibilities. We'll add another binary process Z, which, if Z=1, erases the bit from Y before we can see it. In actual fact, P(Z=0)≈1.

The remaining assumptions are on the operations we can perform on v (without needing any further understanding of v). We replace v by v#, defined in the following way. First of all, we make v# indifferent to worlds in which X≠Y or Z=0 (note: there are subtleties to defining indifference to events that are partially under the AI's control; I will address these in a later post). Therefore any v# maximiser will behave as if it lived in a world where X=Y (its output and the outcome of X were the same) and Z=1 (its output was erased and never read). Then we add a special, one-off penalty of -c for outputting Y=1 (there is no effect for outputting Y=0). This defines v#, and we give our boxed AI v# to maximise.

So, how will M(v#) behave? Because it acts as if X=Y with certainty, it behaves as if it can determine the value of X by outputting Y. Because it acts as if Z=1 with certainty, it behaves as if the output of Y is never read, so it can determine the value of X without affecting anything else about the universe. It will now behave as if it was a v maximiser with control over X. So if it outputs ¬Y, it expects to get utility E(v|¬X). If it outputs Y, it expects to get utility E(v|X)-c (because of the -c penalty).

Therefore (ignoring ties) M(v#) will output Y iff it estimates Q(v,X,c) to be true. And, since in actual fact Z=0 almost certainly, we get to see its output, hence its conclusion. Voila!
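The equivalence between M(v#)'s choice and the truth of Q(v,X,c) can be checked with a small sketch. The function names `m_output` and `q` are hypothetical, and the two conditional expectations are simply passed in as numbers; how a real agent would estimate them is outside the scope of this illustration.

```python
def m_output(e_v_given_x, e_v_given_not_x, c):
    """Return True if M(v#) outputs Y, False if it outputs not-Y.

    The agent acts as if X=Y and as if its output is never read, so it
    compares the two utilities it expects from each output.
    """
    utility_if_y = e_v_given_x - c        # E(v|X) minus the one-off penalty c
    utility_if_not_y = e_v_given_not_x    # E(v|not X), no penalty
    return utility_if_y > utility_if_not_y


def q(e_v_given_x, e_v_given_not_x, c):
    """The query Q(v,X,c): is E(v|X) - E(v|not X) > c?"""
    return e_v_given_x - e_v_given_not_x > c
```

With exact arithmetic the two functions agree on every input, including negative c; only exact ties (which the text sets aside) make the output arbitrary.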

Note: the advantage of this method is to allow us to get the truth out of an AI, without knowing how to define "truth".

Agents detecting agents: counterfactual versus influence

2 Stuart_Armstrong 18 September 2015 04:17PM

A putative new idea for AI control; index here. Status: still somewhat underdeveloped.

In this post, I want to explore another avenue for controlling the ongoing influence of an AI (through subagents or other means). It is the idea, touched upon here and here, of figuring out whether the AI's existence was important or not for predicting the future. Informally, if you knew the AI was turned on or not, would you consider that information valuable or almost irrelevant?


Agent vs agent

This post rests on two insights. The first is that, though it is hard to define what an agent or subagent is, we can produce examples of them. Or at least, at the point where we are capable of producing AIs that might be dangerous, we can likely produce examples of them. This can give us a scale for comparing the importance or impact of different agents.

The second is that if we produce an agent with certain motivations, it will likely expand and fill its light cone with... stuff (of some type or other). Those who don't agree with the AI risk thesis argument may still accept that we could program universe-optimising AIs (ie "we could make a paperclipper, if we were stupid enough to do so").

Why is this relevant? Well, consider an agent E (expansionist) that fills its light cone with stuff (likely including manufacturing and tracking devices), eventually at very high speed. And imagine that there is another agent H (hidden) inside that light cone. At some point, agent E will either detect agent H (and make plans to deal with it), or find its plans hampered by agent H, or agent H will go into deep hiding that does not visibly affect agent E's plans.

Therefore, an expansionist agent is a perfect tool for detecting the influence of other agents. Either agent E will have difficulty accomplishing its plans, or agent H's influence will be so limited that agent E cannot find any trace of it. Any influential agent will get detected, sooner or later, by an expansionist agent. So if E finds H unimportant, then H almost certainly is.

Of course, we don't really want expansionist agents like E - but these can be counterfactual.


The stage and the players

We'll now try and define precisely how we can use this idea to control the influence of an agent, setting out the axioms and assumptions needed along the way. As with some mathematics papers, it might be useful to skip lightly over this section and return later.

Let S(u) be the AI whose influence we want to control. The utility u is what gives it its preferences, and S() indicates that we wish it to be a satisficer or reduced impact AI. Let M(v) be the counterfactual agent that we wish to use to detect S(u)'s influence, for some utility v. The M() designates the agent as a maximiser, without any control on its impact. If we can define S(), we can probably define M() by simply removing its reduced impact restrictions.

Let X be an event leading to S(u) being turned on, Y(v) an event leading to M(v) being turned on. We may as well assume that X and Y(v) are outside each other's light cones (in practice, we will set X≈1 and Y(v)≈0). Then the first assumption is:

  • A.1. We can define S(u), M(v), X and Y(v).

Another useful tool might be a weak maximising agent. One that attempts to maximise u, and will do so to some extent, but that can easily be thwarted by humans or others. It's likely that by the time we can build S(u), we can build a weak maximiser, designated by WM(v), with the event WY(v) denoting its turning on:

  • A.2. We can define WM(v) and WY(v).

We might also be able to define an abstract, Platonic superagent. This is an idealised agent with immense computing power (but that doesn't exist in the real world). Call this AM; obviously, there is no event leading to AM being 'turned on'. We would want AM's goals to be defined in terms of 'estimating the utility/probability/expectation of certain events'.

  • A.3. We can define AM to estimate what we need to have estimated.

Of course, we need to be able to point to these various definitions from within S(u)'s code, so:

  • A.4. If we can define concepts, we can also define them within the various agents.

We also need agents to be able to say sensible things about the behaviour of the other agents:

  • A.5. If these agents are defined, S(u), WM(v), M(v) and AM can estimate each other's behaviours.

This could be done by using, for example, the reflexive probability of Paul Christiano and Benja Fallenstein.

Now, how about v? We cannot choose any v. We want M(v) to not think that X (hence S(u)'s existence) is an important fact to know. But if v is defined by "if X, build staples; if ¬X, build paperclips", then X will be very important! So we need to define a class V of utility functions that are indifferent to S(u) and X. Or not quite indifferent, but don't care intrinsically about them - they certainly should care if S(u) eats all their resources! See the ideas here for approaches to do this.

  • A.6. We can define V, a class of utility functions that do not care intrinsically about S(u) and X.

Ideally we'd want this class to be as broad as possible - certainly too large for anything in this universe to parse element by element.

Now let's turn to the issue of measuring "importance". Define e to be evidence about the universe prior to X, Y(v), or WY(v). We want X (hence the creation of S(u)) to be unimportant, ie of little information concerning the future. It might be possible to have a universal definition of importance of certain facts:

  • A.7. We have a universal measure I of importance, which, given background evidence e, can rank the importance of knowing certain facts versus knowing others.

If we do not have such an I (and I feel it's moderately unlikely that one exists, though we may be able to find an I' which would be good enough to work for our purposes), we can use a utility v to estimate the importance of an event W. The trick is to compare E(v|W) with E(v|¬W), which would give an (uncalibrated) estimation of the importance of W to v.
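As a rough illustration of this utility-based importance estimate, the sketch below computes |E(v|W) - E(v|¬W)| over a small, explicitly enumerated set of possible worlds. The representation of worlds as (probability, dictionary) pairs and the names `conditional_expectation` and `importance` are assumptions made for the example; a real agent would be estimating these expectations, not enumerating them.

```python
from typing import Callable, Dict, List, Tuple

World = Dict[str, object]
WeightedWorlds = List[Tuple[float, World]]


def conditional_expectation(
    v: Callable[[World], float],
    worlds: WeightedWorlds,
    condition: Callable[[World], bool],
) -> float:
    """E(v | condition) over a finite list of (probability, world) pairs."""
    mass = sum(p for p, w in worlds if condition(w))
    if mass == 0:
        raise ValueError("conditioning event has zero probability")
    return sum(p * v(w) for p, w in worlds if condition(w)) / mass


def importance(
    v: Callable[[World], float],
    worlds: WeightedWorlds,
    event: Callable[[World], bool],
) -> float:
    """Uncalibrated importance of an event W to v: |E(v|W) - E(v|not W)|."""
    with_event = conditional_expectation(v, worlds, event)
    without_event = conditional_expectation(v, worlds, lambda w: not event(w))
    return abs(with_event - without_event)


# Toy worlds (illustrative numbers only).
toy_worlds = [
    (0.25, {"W": True, "clips": 8}),
    (0.25, {"W": True, "clips": 4}),
    (0.50, {"W": False, "clips": 2}),
]
toy_importance = importance(lambda w: w["clips"], toy_worlds, lambda w: w["W"])
```

The result depends on the scale of v, which is why the estimate is described as uncalibrated: it only becomes meaningful when compared with the importance of other events under the same v.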

Right, we're now ready to start doing things.


Predicting... what?

What are we trying to predict here? Well, estimates of the importance of certain events. Specifically, the importance of X (which leads to the creation of S(u)) versus other events. For instance, if Y(v) designates the creation of M(v), we would want to compare the importance of X and Y(v) in (at least) four ways:

  1. I(Y(v) | X∧e)
  2. I(Y(v) | ¬X∧e)
  3. I(X | Y(v)∧e)
  4. I(X | ¬Y(v)∧e)

These measure the importance of Y(v) (hence M(v)) and X (hence S(u)) in universes where the other agent does or does not exist. If we expect S(u) to be of low impact, we would want things like I(Y(v) | X∧e) >> I(X | Y(v)∧e) (S(u) is much less important than the maximiser M(v)) and |I(X | Y(v)∧e) - I(X | ¬Y(v)∧e)| ≈ 0 (the importance of S(u) is unrelated to whether M(v) exists or not).

Now, any particular v might be misleading, so we would want to check that this result held up for all v ∈ V. We could take a maximum over V, or a softmax.

If WM(v) can be defined, we would want to do the same estimates with WY(v) instead of Y(v), and desire I(WY(v) | X∧e) >> I(X | WY(v)∧e) (S(u) is less important than the weak maximiser WM(v)) and |I(X | WY(v)∧e) - I(X | ¬WY(v)∧e)| ≈ 0 (the importance of S(u) is unrelated to whether WM(v) exists or not), again maxing or softmaxing across V.
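The two conditions and the aggregation over V might be sketched as follows. Everything numerical here is an assumption: the text leaves ">>" and "≈ 0" undefined, so the sketch stands in a hypothetical `ratio` and `eps` for them, and "softmax" is read as a softmax-weighted mean, which is only one possible interpretation.

```python
import math
from typing import List, NamedTuple


class Estimates(NamedTuple):
    """Importance estimates for one utility v in V (inputs assumed given)."""
    i_y_given_x: float       # I(Y(v) | X and e)
    i_x_given_y: float       # I(X | Y(v) and e)
    i_x_given_not_y: float   # I(X | not Y(v) and e)


def soft_max(values: List[float], beta: float) -> float:
    """Softmax-weighted mean: the plain mean at beta=0, near max() for large beta."""
    top = max(values)
    weights = [math.exp(beta * (x - top)) for x in values]
    return sum(w * x for w, x in zip(weights, values)) / sum(weights)


def looks_low_impact(per_v: List[Estimates], ratio: float = 100.0,
                     eps: float = 1e-3) -> bool:
    """Check both conditions in the worst case (hard max) over V."""
    # Stand-in for I(Y|X,e) >> I(X|Y,e): require a factor of `ratio`.
    worst_dominance = max(ratio * est.i_x_given_y - est.i_y_given_x for est in per_v)
    # Stand-in for |I(X|Y,e) - I(X|not Y,e)| being approximately zero.
    worst_gap = max(abs(est.i_x_given_y - est.i_x_given_not_y) for est in per_v)
    return worst_dominance <= 0 and worst_gap <= eps
```

The hard max means a single misleading v in V is enough to fail the check; replacing `max` with `soft_max` would trade some of that strictness for smoothness.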


Who predicts, how?

In the previous section, we used I to compute the importance of S(u) versus other agents. But what if axiom A.7 is wrong, and we don't have such an I? Well, there we can measure the importance of these events, given a particular v, using AM, writing AMv(X|Y(v')∧e) for |EAM(v|X∧Y(v')∧e) - EAM(v|¬X∧Y(v')∧e)|, the expected difference (that AM estimates) in v given X versus ¬X (ie dependent on whether S(u) was turned on or not). More precisely, we'll want to use S(u)(AMv(X|Y(v')∧e)) - the expected difference (that S(u) estimates AM estimates) in v etc...

Here we'll want to do a double max/softmax. First over v', so that we consider all M(v') for v' ∈ V. Then over v, so that AM estimates the expected difference to all possible v. In this way, max/softmax AMv functions like an I. But not a perfect one: just one that detects disruptions to the multitude of v's in V. Therefore, we are essentially using the set V as the twenty million question method for the impact of S(u). We need to take extra care with this, to ensure we haven't left any gaping holes where S(u) can still have large impact.

Now, if axiom A.3 was false, we need to use M(v) instead of AMv to estimate this quantity. We need to do this with some care. M(v) is (counterfactually) a real agent. Why would it want to calculate the importance of M(v') and S(u), and why would S(u) care about this? Well, we can do so by constructing the scenario carefully. Maybe M(v') and S(u) are due to be created in the future, and M(v) is a boxed agent that can choose to prevent Y(v') and X. We modify v to v#, so that M(v#) wants to maximise v, but takes specific numerical penalties if it chooses to prevent Y(v') or X. By modifying these numerical penalties, M(v#) is motivated to figure out the importance of M(v') and S(u), and to reveal its opinions through its actions. If we create a S(u#), also boxed, that could override M(v#) at a certain cost and create a specific S(u), we can motivate it to model M(v#)'s actions. From this we can construct an importance metric using the M(v).

The WM(v), on the other hand, are probably too limited to be used effectively as predictors.


Arbitrary agents

One issue with this method is that Y(v) and WY(v) refer to the creation of specific agents, in the specific ways we've set up. For instance, |I(X | Y(v)∧e) - I(X | ¬Y(v)∧e)| ≈ 0 means that the importance of X doesn't vary much given Y(v) or its opposite. What this really means is that I(X | ¬Y(v)∧e) is bounded by I(X | Y(v)∧e) - but that might be quite high. There may be details of the setup that allow S(u) to have an important role given the specific M(v) created in this way - for instance, if it reached out to try and destroy it. Then I(X | Y(v)∧e) could be high, hence I(X | ¬Y(v)∧e) could be as well. The restrictions with >> could still contain S(u), but >> is ill-defined.

More secure would be if Y(v) could be defined as the creation of an agent M(v) in an abstract way. Maybe something like "is 'this object is a v maximiser' a good model of this object's properties and impact?". We would also want to give M(v) some starting resources. Again, note it is very hard to abstractly define what "resources" mean, but considerably easier to define a set of "useful resources" we could give to an agent. We would probably want to bound this definition in time and space as well.

This gives us the next axioms:

  • A.8. Y(v) defines the generic existence of an agent M(v).
  • A.9. WY(v) defines the generic existence of an agent WM(v).

With these axioms, our measure of importance becomes more general: S(u) is less important than abstract agents M(v) or WM(v), rather than certain specific ones. Note that we don't need to worry that S(u) snuffs out any hypothetical M(v): since the importance of S(u) is bounded below that of M(v), that would reduce S(u) to zero importance.


Extending to humans

What other agents do we know in the world? How about those 7+ billion ones wandering about on its surface? We can certainly define events Z(h) that might lead to the death of a particular human h - of course, we'd want Z(h) as low as feasibly possible!

  • A.10. We can define events Z(h) that correspond to the death of a human h.

Then we can use Z(h) instead of Y(v) and WY(v), to ensure that S(u) has importance comparable with a human. Of course, the ideal would be to define, abstractly, the death (or birth+maturation) of a human in a generic way!

  • A.11. We can define events Z(h) that correspond generically to the death or birth+maturation of a human h.

To summarise the various ways we can define the importance of S(u) (and hence, by integrating them into S(u)'s code, constrain S(u)'s importance and influence), depending on the axioms we have:

The subagent problem is really hard

5 Stuart_Armstrong 18 September 2015 01:06PM

A putative new idea for AI control; index here.

The first step to solving a problem is to define it. The first first step is to realise how tricky it is to define. This is a stub on a difficult problem.

Subagents and turning AIs off

One of the hardest things in designing safe AIs is turning them off. Installing an OFF button is trivial; motivating them to let you push it can be done. No, the real problem is the subagent problem. The AI can create one or many subagents to carry out its task; once that's done, turning off the original AI is of little value; we have to control or extinguish the subagents.

A general problem

Subagents are a problem for many other control ideas. For example, if you want to program 'achieve X using only this many computer cycles', then 'create a subagent that wants to achieve X without the computer cycle restrictions' is one way of doing it (as long as that can be done within the restrictions). The same goes for energy use and other restrictions. Satisficers can satisfice by creating a maximiser. A boxed Oracle can remain in the box while cheerfully plotting to create an unboxed copy of itself. In fact, whatever the AI's motivation, "create subagents with the same motivation" is almost always a good idea. And, once created, there is no motivation to want to shut them down. Even if the AI is motivated to shut itself down, there is no reason for it to extend this to subagents.

Pre-corriged agents would at least be motivated to keep control of their subagents (unlike corrigible agents, who have no current desire to create subagents that would obey their future preferences).

What is a subagent?

It's also almost impossible to define a subagent. Is a corporation a subagent? Is a thousand pieces of code that reside on different computers and occasionally communicate a subagent? Is an employee or a seduced human a subagent? A pile of rocks with water running through could be a piece of conceptual art - or a slow computer. How about a virulent meme among the human population, making us all into its subagents?

The best idea I've found for defining subagents is that they are reactive intelligences with planning abilities, so random natural events that should have large influence in the future (say, a meteor heading towards Earth) would instead be re-diverted to have little impact. However, if this is used as a definition, then the subagents would simply be programmed to not behave "reactively" in whatever way defined. A slightly more subtle problem is that, since we don't have a definition of agent or subagent, we can't easily protect against holes such as "there is an alien agent on that asteroid, which will optimise the solar system iff it detects no subagent on Earth". Such holes could completely reverse the definition.

No subagent versus no influential subagents

Without the ability to properly define subagents, we can't restrict them directly. Even if you had an AI that was properly motivated to, say, do some action then turn itself off, there is no reason for it to want to do the same with its subagents.

Most of the methods of reduced impact can help with problematic subagents. They do not preclude the creation of subagents, nor mandate their destruction, but they do prevent them from having much influence. This is because reduced impact methods are aimed at preventing the AI itself from having much influence, either directly or via subagent creation. Even the attempt in this post to detect subagents directly more properly detects influential decisions (of which the creation of a subagent could be one).

This is certainly a positive, but it may not be robust. If the definition of "not having much influence" turns out to have holes, it would be less than ideal to have any remaining subagents hanging about.

The "make sure that you cannot be detected in the future" approach is probably the most promising for controlling subagents. The AI there is motivated to erase all traces of its existence, which subagents would presumably be.

In all, it's a very tricky problem, and the core failure of many ideas for AI control.

A toy model of the control problem

18 Stuart_Armstrong 16 September 2015 02:59PM

EDITED based on suggestions for improving the model

Jaan Tallinn has suggested creating a toy model of the control problem, so that it can be analysed without loaded concepts like "autonomy", "consciousness", or "intentionality". Here is a simple (too simple?) attempt:


A controls B. B manipulates A.

Let B be a robot agent that moves in a two dimensional world, as follows:


Summoning the Least Powerful Genie

-1 Houshalter 16 September 2015 05:10AM

Stuart Armstrong recently posted a few ideas about restraining a superintelligent AI so that we can get useful work out of it. They are based on another idea of his, reduced impact. This is a quite elaborate and complicated way of limiting the amount of optimization power an AI can exert on the world. Basically, it tries to keep the AI from doing things that would make the world look too different from how it already is.

First, why go to such great lengths to limit the optimization power of a superintelligent AI? Why not just not make it superintelligent to begin with? We only really want human level AI, or slightly above human level. Not a god-level being we can't even comprehend.

We can control the computer it is running on after all. We can just give it slower processors, less memory, and perhaps even purposely throttle its code. E.g. restricting the size of its neural network. Or other parameters that affect its intelligence.

The counterargument to this is that it might be quite tricky to limit AI intelligence. We don't know how much computing power is enough. We don't know where "above human level" ends and "dangerous superintelligence" begins.

The simplest way would be to just run copies of the AI repeatedly, increasing its computing power each time, until it solves the problem.
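That repeated-run procedure can be sketched as a simple escalation loop. The function name `smallest_sufficient_budget`, the doubling schedule and the cap are all assumptions for illustration; `solve` stands in for "run a fresh copy of the AI with this much compute" and is expected to return None on failure.

```python
from typing import Callable, Optional, Tuple


def smallest_sufficient_budget(
    solve: Callable[[int], Optional[str]],
    start: int = 1,
    factor: int = 2,
    max_budget: int = 1 << 20,
) -> Optional[Tuple[int, str]]:
    """Rerun `solve` with growing compute budgets until it returns a plan.

    Returns (budget, plan) for the first budget that succeeds, or None if
    no budget up to `max_budget` is enough.
    """
    budget = start
    while budget <= max_budget:
        plan = solve(budget)          # a fresh, independent run each time
        if plan is not None:
            return budget, plan
        budget *= factor              # escalate and try again
    return None


def toy_solver(budget: int) -> Optional[str]:
    """Toy stand-in: the problem needs at least 37 units of compute."""
    return "collect paperclips" if budget >= 37 else None
```

With a doubling schedule the successful budget overshoots the true minimum by at most a factor of two, which gives a rough sense of how coarse this approach is.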

I have come up with a more elegant solution. Put a penalty on the amount of computing power the AI uses. This is put in its utility function. The more computing power - and therefore intelligence and optimization - the AI uses, the more it is penalized. So it has an incentive to be as stupid as possible. Only using the intelligence necessary to solve the problem.

But we do want the AI to use as many computational resources as it needs to solve the problem. Just no more. So the penalty should be conditional on actually solving the problem it is given.

If the solution is probabilistic, then the penalty is only applied after reaching a plan that has a certain probability of success. This might need to be measured by another AI which is not resource constrained, but only does prediction.

To give a concrete example, let's say I give the AI a task. Say, I ask it to come up with a plan to collect 10,080 paperclips.

Another AI is given unlimited resources to do purely prediction. Given a plan, it predicts the probability that it will succeed or fail. This AI is safe in the sense that it is not an agent. It has no goals, and just makes predictions about things.

The main AI does the optimization. It tries to generate a plan that has the highest probability of succeeding.

Normally, this might involve extremely overoptimized plans for building nanotechnology and taking over the world. It uses all of its available computing power. It tries to become as intelligent as possible by rewriting its code. Perhaps it becomes thousands of times more intelligent than humans. Or millions of times more. It finds an answer that has 99.99% probability of succeeding.

However, now we give it a different utility function. We instead have it minimize the time it takes to get to a plan that has a 90% chance of succeeding.

Under a time constraint, the AI races to get to a plan. It tries to be as efficient as possible. It doesn't invest in any meta level improvements unless they really help it. It doesn't try to engineer complicated nanotechnology. That would take precious time.

Effectively, we have summoned a genie that is only just as powerful as it needs to be to fulfill our wish. And not any more powerful. It actually tries to be as stupid as possible.

There are other possible constraints we could use, or use in addition to this. Minimizing time limits intelligence because it gets fewer CPU cycles. We could also have it minimize memory or hard drive space, or any other computing resource.

We could also put a penalty on the complexity of the plan it produces. Perhaps measuring that by its length. The simplest solution might prevent certain kinds of over-optimization. E.g. inserting plans for nanotechnology into it.

It's worth noting that you can't even create a paperclip maximizer in this system. You can't say "collect as many paperclips as possible". It has to be bounded. There needs to be a pass or fail test. E.g. "come up with a plan to collect 10,080 paperclips."

It's been noted in the past that bounding the goal isn't enough. The AI might then start maximizing the probability that it will achieve its goal. E.g. building elaborate sensors to make sure it hasn't miscounted. Making as many redundant paperclips as possible, just in case something happens to them. You are still summoning an incredibly powerful genie, which might overoptimize.

This gets around that by only having it care about having a >90% chance of getting 10,080 paperclips. After that it stops optimizing.
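The difference between the two objectives can be shown with a toy search. This is only a sketch under strong assumptions: plans are generated in a fixed order from cheapest to most elaborate, the predictor is a lookup table of made-up probabilities, and the names `satisfice` and `maximise` are hypothetical.

```python
from typing import Callable, Iterable, Optional, Tuple


def satisfice(
    candidates: Iterable[str],
    predict_success: Callable[[str], float],
    threshold: float = 0.9,
) -> Tuple[Optional[str], int]:
    """Stop at the first plan whose predicted success reaches the threshold.

    Returns (plan, number_of_plans_evaluated); plan is None if none qualifies.
    """
    evaluated = 0
    for plan in candidates:
        evaluated += 1
        if predict_success(plan) >= threshold:
            return plan, evaluated    # good enough: stop optimizing here
    return None, evaluated


def maximise(
    candidates: Iterable[str],
    predict_success: Callable[[str], float],
) -> Tuple[str, int]:
    """For contrast: examine every plan and keep the most likely to succeed."""
    plans = list(candidates)
    return max(plans, key=predict_success), len(plans)


# Made-up predictor output, ordered from least to most elaborate plan.
toy_predictions = {
    "ask a colleague for spare paperclips": 0.40,
    "order paperclips online": 0.93,
    "build a paperclip factory": 0.97,
    "nanotech takeover": 0.9999,
}
```

On this toy table the satisficer stops at the second plan, while the maximiser evaluates everything and picks the most extreme option.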

Now this is not a solution to FAI, or even necessarily a usable reduced impact strategy. It will still try to find any loopholes it can in your task. What it does is limit the power of the genie trying to exploit them. It always takes the stupidest, simplest strategy that has a decent chance of working.

One failure case that bothers me is the AI just shoving a slight modification of its own source code into the output. One which doesn't have a resource constraint. You may only be able to use it in cases where that isn't possible.

There are many cases where that isn't possible. The AI needs to also create a plan to obtain a computer to run it. And it needs to solve its own version of the value loading problem, to get the copy to care about paperclips. So that adds complexity to that plan in some cases.

If we apply the complexity penalty I described above, then the source code for the AI must also be very short, or it can't do this.

And this requires obtaining its own source code. An ability we can potentially restrict (you, as a human, can't access your own source code, can you?). Without that it would need to reinvent AI, which takes resources.

But it is an interesting idea I had and I thought I'd share it.

Reposted from my blog.

[Link] Marek Rosa: Announcing GoodAI

6 Gunnar_Zarncke 14 September 2015 09:48PM

Eliezer commented on FB about a post, Announcing GoodAI (by Marek Rosa, GoodAI's CEO). I think this deserves some discussion, as it has quite an effective approach to harnessing the crowd to improve the AI:

As part of GoodAI’s development, our team created a visual tool called Brain Simulator where users can design their own artificial brain architectures. We released Brain Simulator to the public today for free under an open-source, non-commercial license – anyone who’s interested can access Brain Simulator and start building their own artificial brain. [...]

By integrating Brain Simulator into Space Engineers and Medieval Engineers [a game], players will have the option to design their own AI brains for the games and implement it, for example, as a peasant character. Players will also be able to share these brains with each other or take an AI brain designed by us and train it to do things they want it to do (work, obey its master, and so on). The game AIs will learn from the player who trains them (by receiving reward/punishment signals; or by imitating player's behavior), and will have the ability to compete with each other. The AI will be also able to learn by imitating other AIs.

This integration will make playing Space Engineers and Medieval Engineers more fun, and at the same time our AI technology will gain access to millions of new teachers and a new environment. This integration into our games will be done by GoodAI developers. We are giving AI to players, and we are bringing players to our AI researchers.
(emphasis mine)

Biased AI heuristics

4 Stuart_Armstrong 14 September 2015 02:21PM

Heuristics have a bad rep on Less Wrong, but some people are keen to point out how useful they can sometimes be. One major critique of the "Superintelligence" thesis is that it presents an abstract, Bayesian view of intelligence that ignores the practicalities of bounded rationality.

This trend of thought raises some other concerns, though. What if we could produce an AI of extremely high capabilities, but riven with huge numbers of heuristics? If these were human heuristics, then we might have a chance of understanding and addressing them, but what if they weren't? What if the AI had an underconfidence bias, and tended to change its views too fast? Now, that one is probably quite easy to detect (unlike many that we would not have a clue about), but what if it wasn't consistent across areas and types of new information?

In that case, our ability to predict or control what the AI does may be very limited. We can understand human biases and heuristics pretty well, and we can understand idealised agents, but differently biased agents might be a big problem.

How the virtual AI controls itself

1 Stuart_Armstrong 09 September 2015 02:25PM

A putative new idea for AI control; index here.

In previous posts, I posited AIs caring only about virtual worlds - in fact, being defined as processes in virtual worlds, similarly to cousin_it's idea. How could this go? We would want the AI to reject offers of outside help - be they ways of modifying its virtual world, or ways of giving it extra resources.

Let V be a virtual world, over which a utility function u is defined. The world accepts a single input string O. Let P be a complete specification of an algorithm, including the virtual machine it is run on, the amount of memory it has access to, and so on.

Fix some threshold T for u (to avoid the subtle weeds of maximising). Define the statement:

r(P,O,V,T): "P(V) returns O, and either E(u|O)>T or O=∅"

And the string valued program:

Q(V,P,T): "If you can find that there exists a non-empty O such that r(P,O,V,T), return O. Else return ∅."

Here "find" and "E" are where the magic-super-intelligence-stuff happens.
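As a rough illustration of the shape of Q, here is a toy sketch in Python. It is not the construction itself: "find" is replaced by a plain loop over a finite, hypothetical list of candidate outputs, "E" by a hypothetical `expected_utility` function supplied by the caller, and the self-referential clause "P(V) returns O" is left out entirely, since that is exactly the part that needs the magic.

```python
from typing import Callable, Sequence

EMPTY = ""  # stands in for the empty output ∅


def make_q(candidates: Sequence[str],
           expected_utility: Callable[[str], float],
           threshold: float) -> Callable[[], str]:
    """Build a toy Q: return some non-empty O with E(u|O) > T, else ∅.

    'candidates' and 'expected_utility' are assumptions of this sketch;
    the original definition quantifies over all strings and leaves E abstract.
    """
    def q() -> str:
        for output in candidates:
            if output != EMPTY and expected_utility(output) > threshold:
                return output
        return EMPTY
    return q


# Hypothetical utility: the fraction of 'a' characters in the output string.
toy_utility = lambda o: o.count("a") / len(o)
q = make_q(["bbb", "aab", "aaa"], toy_utility, threshold=0.5)
print(repr(q()))
```

Note that the toy Q stops at the first candidate that clears T rather than hunting for the best one, which matches the role of the threshold in the definition above.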

Now, it seems to me that Q(V,Q,T) is the program we are looking for. It is uninterested in offers to modify the virtual world, because E(u|O)>T is defined over the unmodified virtual world. We can set it up so that the first thing it proves is something like "If I (ie Q) prove E(u|O)>T, then r(Q,O,V,T)." If we offer it more computing resources, it can no longer make use of that assumption, because "I" will no longer be Q.

Does this seem like a possible way of phrasing the self-containing requirements? For the moment, this seems to make it reject small offers of extra resources, and be indifferent to large offers.

Chatbots or set answers, not WBEs

5 Stuart_Armstrong 08 September 2015 05:17PM

A putative new idea for AI control; index here.

In a previous post, I talked about using a WBE to define a safe output for a reduced impact AI.

I've realised that the WBE isn't needed. Its only role was to ensure that the AI's output could have been credibly produced by something other than the AI - "I'm sorry, Dave. I'm afraid I can't do that." is unlikely to be the output of a random letter generator.

But a whole WBE is not needed. If the output is short, a chatbot with access to a huge corpus of human responses could function well. We can specialise it in the direction we need - if we are asking for financial advice, we can mandate a specialised vocabulary or train it on financial news sources.

So instead of training the reduced impact AI to behave as the 'best human advisor', we are training it to behave as the 'luckiest chatbot'. This allows us to calculate odds with greater precision, and has the advantage of not needing to wait for a WBE.

For some questions, we can do even better. Suppose we have a thousand different stocks, and are asking which one would increase in value the most during the coming year. The 'chatbot' here is simply an algorithm that picks a stock at random. So we now have an exact base rate - 1/1000 - and predetermined answers from the AI.

[EDIT:] Another alternative is to get online users to submit answers to the question. Then the AI selects the best answer from the choices. And if the AI is not turned on, a random answer is selected.
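The set-answer variant is simple enough to sketch in Python. The names below (`answer`, `ai_pick`, the stock tickers) are hypothetical; the sketch only assumes what the text describes: a fixed list of predetermined answers, a uniformly random pick when the AI is off, and an AI that may only select from that same list.

```python
import random
from typing import Callable, Optional, Sequence


def answer(options: Sequence[str],
           ai_pick: Optional[Callable[[Sequence[str]], str]],
           rng: random.Random) -> str:
    """Return the AI's pick if it is turned on, else a uniformly random option."""
    if ai_pick is not None:
        choice = ai_pick(options)
        if choice not in options:
            # The AI may only select among the predetermined answers.
            raise ValueError("AI must choose one of the predetermined answers")
        return choice
    return rng.choice(options)


stocks = [f"STOCK{i:04d}" for i in range(1000)]  # hypothetical tickers
base_rate = 1 / len(stocks)  # chance that the random 'chatbot' gives any particular answer

print(base_rate)
print(answer(stocks, None, random.Random(0)) in stocks)
```

Because every possible output already has a known probability under the "AI off" branch, the base rate is exact rather than estimated.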

The virtual AI within its virtual world

6 Stuart_Armstrong 24 August 2015 04:42PM

A putative new idea for AI control; index here.

In a previous post, I talked about an AI operating only on a virtual world (ideas like this used to be popular, until it was realised the AI might still want to take control of the real world to affect the virtual world; however, with methods like indifference, we can guard against this much better).

I mentioned that the more of the AI's algorithm that existed in the virtual world, the better it was. But why not go the whole way? Some people at MIRI and other places are working on agents modelling themselves within the real world. Why not have the AI model itself as an agent inside the virtual world? We can quine to do this, for example.

Then all the restrictions on the AI - memory capacity, speed, available options - can be specified precisely, within the algorithm itself. It will only have the resources of the virtual world to achieve its goals, and this will be specified within it. We could define a "break" in the virtual world (ie any outside interference that the AI could cause, were it to hack us to affect its virtual world) as something that would penalise the AI's achievements, or simply as something impossible according to its model or beliefs. It would really be a case of "given these clear restrictions, find the best approach you can to achieve these goals in this specific world".

It would be ideal if the AI's motives were not given in terms of achieving anything in the virtual world, but in terms of making the decisions that, subject to the given restrictions, were most likely to achieve something if the virtual world were run in its entirety. That way the AI wouldn't care if the virtual world were shut down or anything similar. It should only seek to self-modify in ways that make sense within the world, and understand itself as existing completely within these limitations.

Of course, this would ideally require flawless implementation of the code; we don't want bugs developing in the virtual world that point to real world effects (unless we're really confident we have properly coded the "care only about what would happen in the virtual world, not what actually does happen" requirement).

Any thoughts on this idea?


AI, cure this fake person's fake cancer!

10 Stuart_Armstrong 24 August 2015 04:42PM

A putative new idea for AI control; index here.

An idea for how we might successfully get useful work out of a powerful AI.


The ultimate box

Assume that we have an extremely detailed model of a sealed room, with a human in it and enough food, drink, air, entertainment, energy, etc... for the human to survive for a month. We have some medical equipment in the room - maybe a programmable set of surgical tools, some equipment for mixing chemicals, a loud-speaker for communication, and anything else we think might be necessary. All these objects are specified within the model.

We also have some defined input channels into this abstract room, and output channels from this room.

The AI's preferences will be defined entirely with respect to what happens in this abstract room. In a sense, this is the ultimate AI box: instead of taking a physical box and attempting to cut it out from the rest of the universe via hardware or motivational restrictions, we define an abstract box where there is no "rest of the universe" at all.


Cure cancer! Now! And again!

What can we do with such a setup? Well, one thing we could do is to define the human in such a way that they have some form of advanced cancer. We define what "alive and not having cancer" counts as, as well as we can (the definition need not be fully rigorous). Then the AI is motivated to output some series of commands to the abstract room that results in the abstract human inside not having cancer. And, as a secondary part of its goal, it outputs the results of its process.


Versions of AIXI can be arbitrarily stupid

15 Stuart_Armstrong 10 August 2015 01:23PM

Many people (including me) had the impression that AIXI was ideally smart. Sure, it was uncomputable, and there might be "up to finite constant" issues (as with anything involving Kolmogorov complexity), but it was, informally at least, "the best intelligent agent out there". This was reinforced by Pareto-optimality results, namely that there was no computable policy that performed at least as well as AIXI in all environments, and strictly better in at least one.

However, Jan Leike and Marcus Hutter have proved that AIXI can be, in some sense, arbitrarily bad. The problem is that AIXI is not fully specified, because the universal prior is not fully specified. It depends on a choice of an initial computing language (or, equivalently, of an initial Turing machine).

For the universal prior, this will only affect it up to a constant (though this constant could be arbitrarily large). However, for the agent AIXI, it could force it into continually bad behaviour that never ends.

For illustration, imagine that there are two possible environments:

  1. The first one is Hell, which will give ε reward if the AIXI outputs "0", but, the first time it outputs "1", the environment will give no reward for ever and ever after that.
  2. The second is Heaven, which gives ε reward for outputting "0" and 1 reward for outputting "1", and is otherwise memoryless.

Now simply choose a language/Turing machine such that the ratio P(Hell)/P(Heaven) is higher than the ratio 1/ε. In that case, for any discount rate, the AIXI will always output "0", and thus will never learn whether it's in Hell or not (because it's too risky to do so). It will observe the environment giving reward ε after receiving "0", behaviour which is compatible with both Heaven and Hell. This keeps P(Hell)/P(Heaven) constant, and ensures the AIXI never does anything else.
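The Heaven/Hell argument can be checked numerically with a small Python sketch. It rests on simplifying assumptions that are mine, not Leike and Hutter's: only these two environments carry prior mass, discounting is geometric with factor `gamma`, and the only policies compared are "output 0 for k steps, then 1 forever" (once a "1" has been output, switching back to "0" cannot help in either world, so this family contains the best policy for each choice of first exploration time).

```python
def value_explore_after(k: int, p_heaven: float, eps: float, gamma: float) -> float:
    """Expected discounted reward for: output "0" for k steps, then "1" forever.

    In Hell the first "1" ends all reward; in Heaven every "1" pays 1.
    """
    safe_part = eps * (1 - gamma**k) / (1 - gamma)   # k steps of reward eps
    risky_part = gamma**k * p_heaven / (1 - gamma)   # pays off only in Heaven
    return safe_part + risky_part


def value_never_explore(eps: float, gamma: float) -> float:
    """Expected discounted reward for always outputting "0" (same in both worlds)."""
    return eps / (1 - gamma)


eps = 0.01
hell_to_heaven = 1000.0            # prior ratio P(Hell)/P(Heaven), chosen above 1/eps = 100
p_heaven = 1 / (1 + hell_to_heaven)

for gamma in (0.5, 0.9, 0.999):
    never = value_never_explore(eps, gamma)
    best_explore = max(value_explore_after(k, p_heaven, eps, gamma) for k in range(50))
    print(gamma, never > best_explore)
```

Since always outputting "0" yields reward ε in both worlds, the observations never discriminate between them, so the posterior ratio stays where the prior put it and the comparison above never changes.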

In fact, it's worse than this. If you use the prior to measure intelligence, then an AIXI that follows one prior can be arbitrarily stupid with respect to another.

Integral vs differential ethics, continued

6 Stuart_Armstrong 03 August 2015 01:25PM

I've talked earlier about integral and differential ethics, in the context of population ethics. The idea is that the argument for the repugnant conclusion (and its associate, the very repugnant conclusion) is dependent on a series of trillions of steps, each of which is intuitively acceptable (adding happy people, making happiness more equal), but reaching a conclusion that is intuitively bad - namely, that we can improve the world by creating trillions of people in torturous and unremitting agony, as long as we balance it out by creating enough happy people as well.

Differential reasoning accepts each step, and concludes that the repugnant conclusions are actually acceptable, because each step is sound. Integral reasoning accepts that the repugnant conclusion is repugnant, and concludes that some step along the way must therefore be rejected.

Notice that key word, "therefore". Some intermediate step is rejected, not for intrinsic reasons, but purely because of the consequence. There is nothing special about the step that is rejected; it's just a relatively arbitrary barrier to stop the process (compare with the paradox of the heap).

Indeed, things can go awry when people attempt to fix the repugnant conclusion (a conclusion they rejected through integral reasoning) using differential methods. Things like the "person-affecting view" have their own ridiculousness and paradoxes (it's ok to bring a baby into the world if it will have a miserable life; we don't need to care about future generations if we randomise conceptions, etc...) and I would posit that it's because they are trying to fix global/integral issues using local/differential tools.

The relevance of this? It seems that integral tools might be better suited to deal with the bad convergence of AI problem. We could set up plausibly intuitive differential criteria (such as self-consistency), but institute overriding integral criteria that can override these if they go too far. I think there may be some interesting ideas in that area, potentially. The cost is that integral ideas are generally seen as less elegant, or harder to justify.

Does Probability Theory Require Deductive or Merely Boolean Omniscience?

4 potato 03 August 2015 06:54AM

It is often said that a Bayesian agent has to assign probability 1 to all tautologies, and probability 0 to all contradictions. My question is... exactly what sort of tautologies are we talking about here? Does that include all mathematical theorems? Does that include assigning 1 to "Every bachelor is an unmarried male"?[1] Perhaps the only tautologies that need to be assigned probability 1 are those that are Boolean theorems implied by atomic sentences that appear in the prior distribution, such as: "S or ~ S".

It seems that I do not need to assign probability 1 to Fermat's last theorem in order to use probability theory when I play poker, or try to predict the color of the next ball to come from an urn. I must assign a probability of 1 to "The next ball will be white or it will not be white", but Fermat's last theorem seems to be quite irrelevant. Perhaps that's because these specialized puzzles do not require sufficiently general probability distributions; perhaps, when I try to build a general Bayesian reasoner, it will turn out that it must assign 1 to Fermat's last theorem.

Imagine a (completely impractical, ideal, and esoteric) first order language, whose particular subjects were discrete point-like regions of space-time. There can be an arbitrarily large number of points, but it must be a finite number. This language also contains a long list of predicates like: is blue, is within the volume of a carbon atom, is within the volume of an elephant, etc. and generally any predicate type you'd like (including n place predicates).[2] The atomic propositions in this language might look something like: "5, 0.487, -7098.6, 6000s is Blue" or "(1, 1, 1, 1s), (-1, -1, -1, 1s) contains an elephant." The first of these propositions says that a certain point in space-time is blue; the second says that there is an elephant between two points at one second after the universe starts. Presumably, at least the denotational content of most English propositions could be expressed in such a language (I think, mathematical claims aside).

Now imagine that we collect all of the atomic propositions in this language, and assign a joint distribution over them. Maybe we choose max entropy, doesn't matter. Would doing so really require us to assign 1 to every mathematical theorem? I can see why it would require us to assign 1 to every tautological Boolean combination of atomic propositions [for instance: "(1, 1, 1, 1s), (-1, -1, -1, 1s) contains an elephant OR ~((1, 1, 1, 1s), (-1, -1, -1, 1s) contains an elephant)"], but that would follow naturally as a consequence of filling out the joint distribution. Similarly, all the Boolean contradictions would be assigned zero, just as a consequence of filling out the joint distribution table with a set of reals that sum to 1.
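The claim about the joint distribution table can be checked directly on a small case. The Python sketch below is an illustration, not a proof: it fills out a joint distribution over three atomic propositions with arbitrary weights (the function names and the choice of three atoms are mine) and confirms that a Boolean tautology gets probability 1 and a Boolean contradiction gets 0, whatever the weights are.

```python
import itertools
import random
from typing import Callable, Dict, Tuple

World = Tuple[bool, ...]  # one truth value per atomic proposition


def random_joint(n_atoms: int, rng: random.Random) -> Dict[World, float]:
    """Assign an arbitrary probability to each truth assignment; values sum to 1."""
    worlds = list(itertools.product([False, True], repeat=n_atoms))
    weights = [rng.random() for _ in worlds]
    total = sum(weights)
    return {w: x / total for w, x in zip(worlds, weights)}


def prob(joint: Dict[World, float], formula: Callable[[World], bool]) -> float:
    """Probability of a Boolean combination: total mass of the worlds where it holds."""
    return sum(p for w, p in joint.items() if formula(w))


joint = random_joint(3, random.Random(0))
tautology = lambda w: w[0] or not w[0]          # "S or ~S"
contradiction = lambda w: w[0] and not w[0]     # "S and ~S"

print(prob(joint, tautology))
print(prob(joint, contradiction))
print(prob(joint, lambda w: w[0]))
```

Nothing in the table mentions arithmetic: the only sentences it forces to 1 or 0 are Boolean combinations of the atoms it was built from.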

A similar argument could be made using intuitions from algorithmic probability theory. Imagine that we know that some data was produced by a distribution which is output by a program of length n in a binary programming language. We want to figure out which distribution it is. So, we assign each binary string a prior probability of 2^-n. If the language allows for comments, then simpler distributions will be output by more programs, and we will add the probability of all programs that print that distribution.[3] Sure, we might need an oracle to figure out if a given program outputs anything at all, but we would not need to assign a probability of 1 to Fermat's last theorem (or at least I can't figure out why we would). The data might be all of your sensory inputs, and n might be Graham's number; still, there's no reason such a distribution would need to assign 1 to every mathematical theorem.


A Bayesian agent does not require mathematical omniscience, or logical (if that means anything more than Boolean) omniscience, but merely Boolean omniscience. All that Boolean omniscience means is that for whatever atomic propositions appear in the language (e.g., the language that forms the set of propositions that constitute the domain of the probability function) of the agent, any tautological Boolean combination of those propositions must be assigned a probability of 1, and any contradictory Boolean combination of those propositions must be assigned 0. As far as I can tell, the whole notion that Bayesian agents must assign 1 to tautologies and 0 to contradictions comes from the fact that when you fill out a table of joint distributions (or follow the Kolmogorov axioms in some other way) all of the Boolean theorems get a probability of 1. This does not imply that you need to assign 1 to Fermat's last theorem, even if you are reasoning probabilistically in a language that is very expressive.[4]

Some Ways To Prove This Wrong:

Show that a really expressive semantic language, like the one I gave above, implies PA if you allow Boolean operations on its atomic propositions. Alternatively, you could show that Solomonoff induction can express PA theorems as propositions with probabilities, and that it assigns them 1. This is what I tried to do, but I failed on both occasions, which is why I wrote this. 

[1] There are also interesting questions about the role of tautologies that rely on synonymy in probability theory, and whether they must be assigned a probability of 1, but I decided to keep it to mathematics for the sake of this post. 

[2] I think this language is ridiculous, and openly admit it has next to no real world application. I stole the idea for the language from Carnap.

[3] This is a sloppily presented approximation to Solomonoff induction as n goes to infinity. 

[4] The argument above is not a mathematical proof, and I am not sure that it is airtight. I am posting this to the discussion board instead of a full-blown post because I want feedback and criticism. !!!HOWEVER!!! if I am right, it does seem that folks on here, at MIRI, and in the Bayesian world at large, should start being more careful when they think or write about logical omniscience. 



Steelmanning AI risk critiques

26 Stuart_Armstrong 23 July 2015 10:01AM

At some point soon, I'm going to attempt to steelman the position of those who reject the AI risk thesis, to see if it can be made solid. Here, I'm just asking if people can link to the most convincing arguments they've found against AI risk.

EDIT: Thanks for all the contributions! Keep them coming...

Self-improvement without self-modification

3 Stuart_Armstrong 23 July 2015 09:59AM

This is just a short note to point out that AIs can self-improve without having to self-modify. So locking down an agent from self-modification is not an effective safety measure.

How could AIs do that? The easiest and the most trivial is to create a subagent, and transfer their resources and abilities to it ("create a subagent" is a generic way to get around most restriction ideas).

Or, if the AI remains unchanged and in charge, it could change the whole process around itself, so that the whole process changes and improves. For instance, if the AI is inconsistent and has to pay more attention to problems that are brought to its attention than problems that aren't, it can start to act to manage the news (or the news-bearers) to hear more of what it wants. If it can't experiment on humans, it will give advice that will cause more "natural experiments", and so on. It will gradually try to reform its environment to get around its programmed limitations.

Anyway, that was nothing new or deep, just a reminder point I hadn't seen written out.


Oracle AI: Human beliefs vs human values

2 Stuart_Armstrong 22 July 2015 11:54AM

It seems that if we can ever define the difference between human beliefs and values, we could program a safe Oracle by requiring it to maximise the accuracy of human beliefs on a question, while keeping human values fixed (or very little changing). Plus a whole load of other constraints, as usual, but that might work for a boxed Oracle answering a single question.

This is a reason to suspect it will not be easy to distinguish human beliefs and values ^_^

AI: requirements for pernicious policies

7 Stuart_Armstrong 17 July 2015 02:18PM

Some have argued that "tool AIs" are safe(r). Recently, Eric Drexler decomposed AIs into "problem solvers" (eg calculators), "advisors" (eg GPS route planners), and "actors" (autonomous agents). Both solvers and advisors can be seen as examples of tools.

People have argued that tool AIs are not safe. It's hard to imagine a calculator going berserk, no matter what its algorithm is, but it's not too hard to come up with clear examples of dangerous tools. This suggests the solvers vs advisors vs actors (or tools vs agents, or oracles vs agents) is not the right distinction.

Instead, I've been asking: how likely is the algorithm to implement a pernicious policy? If we model the AI as having an objective function (or utility function) and an algorithm that implements it, a pernicious policy is one that scores high in the objective function but is not at all what is intended. A pernicious policy could be harmless and entertaining or much more severe.

I will lay aside, for the moment, the issue of badly programmed algorithms (possibly containing their own objective sub-functions). In any case, to know whether an algorithm will implement a pernicious policy, we have to ask these questions about it:

  1. Do pernicious policies exist? Are there many?
  2. Can the AI find them?
  3. Can the AI test them?
  4. Would the AI choose to implement them?

The answer to 1. seems to be trivially yes. Even a calculator could, in theory, output a series of messages that socially hack us, blah, take over the world, blah, extinction, blah, calculator finishes its calculations. What is much more interesting is that some types of agents have many more pernicious policies than others. This seems the big difference between actors and other designs. An actor AI in complete control of the USA or Russia's nuclear arsenal has all sorts of pernicious policies easily to hand; an advisor or oracle has far fewer (generally going through social engineering), a tool typically even fewer. A lot of the physical protection measures are about reducing the number of successful pernicious policies the AI has access to.

The answer to 2. is mainly a function of the power of the algorithm. A basic calculator will never find anything dangerous: its programming is simple and tight. But compare an agent with the same objective function and the ability to do an unrestricted policy search with vast resources... So it seems that the answer to 2. does not depend on any solver vs actor division, but purely on the algorithm used.

And now we come to the big question 3., whether the AI can test these policies. Even if the AI can find pernicious policies that rank high on its objective function, it will never implement them unless it can ascertain this fact. And there are several ways it could do so. Let's assume that a solver AI has a very complicated objective function - one that encodes many relevant facts about the real world. Now, the AI may not "care" about the real world, but it has a virtual version of that, in which it can virtually test all of its policies. With enough computing power, it can establish whether the pernicious policy would be effective at achieving its virtual goal. If this is a good approximation of how the pernicious policy would behave in the real world, we could have a problem.

But extremely detailed objective functions are unlikely. Even simple ones, however, can show odd behaviour if the agent gets to interact repeatedly with the real world - this is the issue with reinforcement learning. Suppose that the agent attempts a translation job, and is rewarded on the accuracy of its translation. Depending on the details of what the AI knows and who chooses the rewards, the AI could end up manipulating its controllers, similarly to this example. The problem is that once there is any interaction, all the complexity of humanity could potentially show up in the reward function, even if the objective function is simple.

Of course, some designs make this very unlikely - resetting the AI periodically can help to alleviate the problem, as can choosing more objective criteria for any rewards. Lastly on this point, we should mention the possibility that human R&D, by selecting and refining the objective function and the algorithm, could take the role of testing the policies. This is likely to emerge only in cases where many AI designs are considered, and the best candidates are retained based on human judgement.

Finally we come to the question of whether the AI will implement the policy if it's found it and tested it. You could say that the point of FAI is to create an AI that doesn't choose to implement pernicious policies - but, more correctly, the point of FAI is to ensure that very few (or zero) pernicious policies exist in the first place, as they all score low on the utility function. However, there are a variety of more complicated designs - satisficers, agents using crude measures - where the questions of "Do pernicious policies exist?" and "Would the AI choose to implement them?" could become quite distinct.


Conclusion: a more thorough analysis of AI designs is needed

A calculator is safe, because it is a solver, it has a very simple objective function, with no holes in the algorithm, and it can neither find nor test any pernicious policies. It is the combination of these elements that makes it almost certainly safe. If we want to make the same claim about other designs, neither "it's just a solver" nor "its objective function is simple" would be enough; we need a careful analysis.

Though, as usual, "it's not certainly safe" is a quite distinct claim from "it's (likely) dangerous", and they should not be conflated.

Examples of AIs behaving badly

25 Stuart_Armstrong 16 July 2015 10:01AM

Some past examples to motivate thought on how AIs could misbehave:

An algorithm pauses the game to never lose at Tetris.

In "Learning to Drive a Bicycle using Reinforcement Learning and Shaping", Randlov and Alstrom describe a system that learns to ride a simulated bicycle to a particular location. To speed up learning, they provided positive rewards whenever the agent made progress towards the goal. The agent learned to ride in tiny circles near the start state because no penalty was incurred from riding away from the goal.
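The bicycle failure is easy to reproduce in a toy setting. The Python sketch below is not Randlov and Alstrom's simulator or reward function; it is a hypothetical point moving in the plane, with two hand-written rewards: one that pays for progress towards the goal and ignores regress, and one that also charges for moving away. Riding in circles farms the first and earns essentially nothing under the second.

```python
import math
from typing import Callable, List, Tuple

Point = Tuple[float, float]


def progress_only_reward(prev: Point, cur: Point, goal: Point) -> float:
    """Reward progress towards the goal; moving away costs nothing."""
    gained = math.dist(prev, goal) - math.dist(cur, goal)
    return max(gained, 0.0)


def signed_progress_reward(prev: Point, cur: Point, goal: Point) -> float:
    """Reward progress and penalise regress symmetrically."""
    return math.dist(prev, goal) - math.dist(cur, goal)


def total_reward(path: List[Point], goal: Point,
                 reward: Callable[[Point, Point, Point], float]) -> float:
    """Sum the per-step reward along a path."""
    return sum(reward(a, b, goal) for a, b in zip(path, path[1:]))


def circle(center: Point, radius: float, laps: int, steps_per_lap: int = 36) -> List[Point]:
    """Points along 'laps' full circles, starting and ending at the same place."""
    cx, cy = center
    n = laps * steps_per_lap
    return [(cx + radius * math.cos(2 * math.pi * i / steps_per_lap),
             cy + radius * math.sin(2 * math.pi * i / steps_per_lap))
            for i in range(n + 1)]


goal = (100.0, 0.0)
loops = circle(center=(0.0, 0.0), radius=1.0, laps=10)

print(total_reward(loops, goal, progress_only_reward))    # grows with every lap
print(total_reward(loops, goal, signed_progress_reward))  # stays near zero
```

Under the signed reward the per-step terms telescope, so any path that returns to its starting point nets zero; under the one-sided reward each lap pays out again.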

A similar problem occurred with a soccer-playing robot being trained by David Andre and Astro Teller (personal communication to Stuart Russell). Because possession in soccer is important, they provided a reward for touching the ball. The agent learned a policy whereby it remained next to the ball and “vibrated,” touching the ball as frequently as possible. 

Algorithms claiming credit in Eurisko: Sometimes a "mutant" heuristic appears that does little more than continually cause itself to be triggered, creating within the program an infinite loop. During one run, Lenat noticed that the number in the Worth slot of one newly discovered heuristic kept rising, indicating that it had made a particularly valuable find. As it turned out the heuristic performed no useful function. It simply examined the pool of new concepts, located those with the highest Worth values, and inserted its name in their My Creator slots.

The AI, the best human advisor

7 Stuart_Armstrong 13 July 2015 03:33PM

A putative new idea for AI control; index here.

All my ideas on reduced impact AI follow the same format: some restriction to ensure a reduced impact (I generally use something like "could a future AI detect this one existed"?) and some "trick" to allow a way of still achieving a larger impact. Furthermore, there is always a "nice" way of achieving this larger impact (do the thing we really want you to do, and only that!), so that we hope the AI can be forced into that behaviour. Examples of "tricks" are:

In this post, I'll present another variant - one that I think is the most interesting so far. I won't change the reduced impact condition, but will present a different "trick".


AI pretends to be human

As in this post, imagine that we have a reduced-impact AI in a box that conceals it from us. The AI has some goal, and will output some data to further that goal, then turn itself off. And then we'll shake the box around or something, so we can't be sure that the AI was turned on.

There are two issues to deal with, here: a) The message itself is strong evidence that the AI was turned on, and b) if we exclude the message from the reduced impact, the message becomes dangerous.

To deal with a), first of all. This approach gets much easier if we have a whole brain emulation (WBE). Then we can house the WBE in the box with the AI. If the AI doesn't send out a message, then the WBE sends out its own message instead. There are variants we can try here if we don't have WBE - maybe some human volunteer willing to either die or take amnesia pills at the end of the experiment.

So, receiving a coherent message is not a surprising thing to come out of the output channel. However, an AI motivated strongly by reduced impact will simply seek to imitate as precisely as possible what the WBE would send. We still need a trick to get something useful.


AI pretends to be BEST human

Imagine that we wanted the AI to give us investment advice to make us rich. Say, a binary task: get us ten million dollars over the course of the year.

Now, this is the kind of thing that the WBE could conceivably achieve. We could run the WBE fast, for three subjective weeks, and it has a chance of giving us advice that would get us that rich. Assume that it only has 0.1% chance of doing so. The AI, on the other hand, could produce advice that makes us that rich with almost certainty.

Therefore if we get advice, follow it, and end up at least ten million dollars richer, we can conclude that the odds update in favour of the AI having given us the advice (rather than the WBE doing so) is about 1000:1, since the AI succeeds with near certainty while the WBE succeeds only 0.1% of the time.
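To make the size of that update concrete, here is a minimal sketch of the arithmetic. The function names and the even prior are illustrative assumptions, not part of the original proposal; only the success probabilities (near certainty for the AI, 0.1% for the WBE) come from the text.

```python
def bayes_factor(p_rich_given_ai: float, p_rich_given_wbe: float) -> float:
    """Likelihood ratio P(rich | AI wrote advice) / P(rich | WBE wrote advice)."""
    if p_rich_given_wbe <= 0:
        raise ValueError("WBE success probability must be positive")
    return p_rich_given_ai / p_rich_given_wbe


def update_odds(prior_odds_ai: float, factor: float) -> float:
    """Posterior odds (AI : WBE) = prior odds (AI : WBE) * Bayes factor."""
    return prior_odds_ai * factor


# Figures from the text: the AI succeeds almost surely, the WBE 0.1% of the time.
factor = bayes_factor(1.0, 0.001)

# Hypothetical prior (an assumption): AI and WBE equally likely to be the author.
posterior = update_odds(1.0, factor)
```

Seeing us get rich shifts the odds towards "the AI was on" by a factor of roughly a thousand, whatever prior one starts from. This is the evidence that the conditioning trick is meant to excuse.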

And this gives us the solution to b). The "trick", the escape clause, is conditioning on the outcome of our investments. So we're asking: given that we have become rich (and given the other facts about the universe in 50 years' time), what additional evidence is there that the AI was turned on?

Now, a sensible WBE is almost certain to not come up with a scheme that destroys the universe with self-tiling agents - or, indeed, does anything too crazy. We could instruct the WBE to this effect, and we're only running it for three weeks, which prevents the kind of drifts that you see in more extreme situations. Thus a message from the WBE, no matter how likely it is to make us rich, is almost certain to not lead to tiling the universe with self-replicating nano-bots.

So here, the AI merely has to come up with a message that could have come from the WBE, with at least 0.01% probability, and that would indeed make us rich. And, therefore, the message cannot contain secret code that would optimise the universe.

Note that the goal we are looking for must not be too extreme in impact (the higher the impact of the goal, the more latitude the AI has to have impact itself, "hiding" inside the goal's impact), and we should restrict ourselves to events of probability, say, greater than 1/10,000 (as when the probability gets too small, odd events start happening).
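The selection rule described above can be sketched as a constrained choice: among messages the WBE could plausibly have sent, pick the one most likely to make us rich. This is only a toy illustration. The `Candidate` class, the candidate messages and all their probabilities are hypothetical, and it assumes the AI can estimate both quantities, which is the hard part in practice.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Candidate:
    text: str
    p_from_wbe: float  # assumed estimate: chance the WBE would send this message
    p_rich: float      # assumed estimate: chance that following it makes us rich


def pick_message(candidates: List[Candidate],
                 min_wbe_prob: float = 1e-4) -> Optional[Candidate]:
    """Best advice among messages the WBE could have sent with prob >= threshold."""
    plausible = [c for c in candidates if c.p_from_wbe >= min_wbe_prob]
    if not plausible:
        return None  # nothing human-plausible: send nothing rather than cheat
    return max(plausible, key=lambda c: c.p_rich)


# Purely made-up candidates for illustration.
candidates = [
    Candidate("buy index funds", p_from_wbe=0.2, p_rich=0.001),
    Candidate("back this specific startup", p_from_wbe=0.0005, p_rich=0.95),
    Candidate("message with hidden code", p_from_wbe=1e-12, p_rich=0.999),
]
choice = pick_message(candidates)
```

The third candidate scores highest on getting us rich but is excluded because no WBE would plausibly write it. The 1/10,000 threshold from the text is what keeps such messages out.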

It seems that this scheme could result in AIs that always give the very best possible advice a human (or WBE) could ever give, which would be a very useful result.


Whose probability?

We have a probability distribution over events. The AI itself will have a probability distribution over events. The future hypothetical super-super-AI it is using to compute reduced impact has a probability distribution over events - and the AI has a probability distribution over that probability distribution. If all of them agree on the probability of us getting richer (given WBE advice and given not), then this scheme should work.

If they disagree, there might be problems. A more complex approach could directly take into account the divergent probability estimates; I'll think of that and return to the issue later.

Moral AI: Options

9 Manfred 11 July 2015 09:46PM

Epistemic status: One part quotes (informative, accurate), one part speculation (not so accurate).

One avenue towards AI safety is the construction of "moral AI" that is good at solving the problem of human preferences and values. Five FLI grants have recently been funded that pursue different lines of research on this problem.

The projects, in alphabetical order:

Most contemporary AI systems base their decisions solely on consequences, whereas humans also consider other morally relevant factors, including rights (such as privacy), roles (such as in families), past actions (such as promises), motives and intentions, and so on. Our goal is to build these additional morally relevant features into an AI system. We will identify morally relevant features by reviewing theories in moral philosophy, conducting surveys in moral psychology, and using machine learning to locate factors that affect human moral judgments. We will use and extend game theory and social choice theory to determine how to make these features more precise, how to weigh conflicting features against each other, and how to build these features into an AI system. We hope that eventually this work will lead to highly advanced AI systems that are capable of making moral judgments and acting on them.

Techniques: Top-down design, game theory, moral philosophy

Previous work in economics and AI has developed mathematical models of preferences, along with algorithms for inferring preferences from observed actions. [Citation of inverse reinforcement learning] We would like to use such algorithms to enable AI systems to learn human preferences from observed actions. However, these algorithms typically assume that agents take actions that maximize expected utility given their preferences. This assumption of optimality is false for humans in real-world domains. Optimal sequential planning is intractable in complex environments and humans perform very rough approximations. Humans often don't know the causal structure of their environment (in contrast to MDP models). Humans are also subject to dynamic inconsistencies, as observed in procrastination, addiction and in impulsive behavior. Our project seeks to develop algorithms that learn human preferences from data despite the suboptimality of humans and the behavioral biases that influence human choice. We will test our algorithms on real-world data and compare their inferences to people's own judgments about their preferences. We will also investigate the theoretical question of whether this approach could enable an AI to learn the entirety of human values.

Techniques: Trying to find something better than inverse reinforcement learning, supervised learning from preference judgments

The future will see autonomous agents acting in the same environment as humans, in areas as diverse as driving, assistive technology, and health care. In this scenario, collective decision making will be the norm. We will study the embedding of safety constraints, moral values, and ethical principles in agents, within the context of hybrid human/agents collective decision making. We will do that by adapting current logic-based modelling and reasoning frameworks, such as soft constraints, CP-nets, and constraint-based scheduling under uncertainty. For ethical principles, we will use constraints specifying the basic ethical "laws", plus sophisticated prioritised and possibly context-dependent constraints over possible actions, equipped with a conflict resolution engine. To avoid reckless behavior in the face of uncertainty, we will bound the risk of violating these ethical laws. We will also replace preference aggregation with an appropriately developed constraint/value/ethics/preference fusion, an operation designed to ensure that agents' preferences are consistent with the system's safety constraints, the agents' moral values, and the ethical principles of both individual agents and the collective decision making system. We will also develop approaches to learn ethical principles for artificial intelligent agents, as well as predict possible ethical violations.

Techniques: Top-down design, obeying ethical principles/laws, learning ethical principles

The objectives of the proposed research are (1) to create a mathematical framework in which fundamental questions of value alignment can be investigated; (2) to develop and experiment with methods for aligning the values of a machine (whether explicitly or implicitly represented) with those of humans; (3) to understand the relationships among the degree of value alignment, the decision-making capability of the machine, and the potential loss to the human; and (4) to understand in particular the implications of the computational limitations of humans and machines for value alignment. The core of our technical approach will be a cooperative, game-theoretic extension of inverse reinforcement learning, allowing for the different action spaces of humans and machines and the varying motivations of humans; the concepts of rational metareasoning and bounded optimality will inform our investigation of the effects of computational limitations.

Techniques: Trying to find something better than inverse reinforcement learning (differently this time), creating a mathematical framework, whatever rational metareasoning is

Autonomous AI systems will need to understand human values in order to respect them. This requires having similar concepts as humans do. We will research whether AI systems can be made to learn their concepts in the same way as humans learn theirs. Both human concepts and the representations of deep learning models seem to involve a hierarchical structure, among other similarities. For this reason, we will attempt to apply existing deep learning methodologies for learning what we call moral concepts, concepts through which moral values are defined. In addition, we will investigate the extent to which reinforcement learning affects the development of our concepts and values.

Techniques: Trying to identify learned moral concepts, unsupervised learning 


The elephant in the room is that making judgments that always respect human preferences is nearly FAI-complete. Application of human ethics is dependent on human preferences in general, which are dependent on a model of the world and how actions impact it. Calling an action ethical can also depend on the space of possible actions, requiring a good judgment-maker to be capable of searching for good actions. Any "moral AI" we build with our current understanding is going to have to be limited and/or unsatisfactory.

Limitations might be things like judging which of two actions is "more correct" rather than finding correct actions, only taking input in terms of one paragraph-worth of words, or only producing good outputs for situations similar to some combination of trained situations.

Two of the proposals are centered on top-down construction of a system for making ethical judgments. Designing a system by hand, it's nigh-impossible to capture the subtleties of human values. Relatedly, it seems weak at generalization to novel situations, unless the specific sort of generalization has been foreseen and covered. The good points of a top-down approach are that it can capture things that are important, but are only a small part of the description, or are not easily identified by statistical properties. A top-down model of ethics might be used as a fail-safe, sometimes noticing when something undesirable is happening, or as a starting point for a richer learned model of human preferences.

Other proposals are inspired by inverse reinforcement learning. Inverse reinforcement learning seems like the sort of thing we want - it observes actions and infers preferences - but it's very limited. The problem of having to know a very good model of the world in order to be good at human preferences rears its head here. There are also likely unforeseen technical problems in ensuring that the thing it learns is actually human preferences (rather than human foibles, or irrelevant patterns) - though this is, in part, why this research should be carried out now.

Some proposals want to take advantage of learning using neural networks, trained on people's actions or judgments. This sort of approach is very good at discovering patterns, but not so good at treating patterns as a consequence of underlying structure. Such a learner might be useful as a heuristic, or as a way to fill in a more complicated, specialized architecture. For this approach, like the others, it seems important to make the most progress toward learning human values in a way that doesn't require a very good model of the world.

Presidents, asteroids, natural categories, and reduced impact

1 Stuart_Armstrong 06 July 2015 05:44PM

A putative new idea for AI control; index here.

EDIT: I feel this post is unclear, and will need to be redone again soon.

This post attempts to use the ideas developed about natural categories in order to get high impact from reduced impact AIs.


Extending niceness/reduced impact

I recently presented the problem of extending AI "niceness" given some fact X, to niceness given ¬X, choosing X to be something pretty significant but not overwhelmingly so - the death of a president. By assumption we had a successfully programmed niceness, but no good definition (this was meant to be "reduced impact" in a slight disguise).

This problem turned out to be much harder than expected. It seems that the only way to do so is to require the AI to define values dependent on a set of various (boolean) random variables Zj that did not include X/¬X. Then as long as the random variables represented natural categories, given X, the niceness should extend.

What did we mean by natural categories? Informally, it means that X should not appear in the definitions of these random variables. For instance, nuclear war is a natural category; "nuclear war XOR X" is not. Actually defining this was quite subtle; diverting through the grue and bleen problem, it seems that we had to define how we update X and the Zj given the evidence we expected to find. This was put in equation as picking Zj's that minimize

  • Variance{log[ (P(X∧Z|E)*P(¬X∧¬Z|E)) / (P(X∧¬Z|E)*P(¬X∧Z|E)) ]}

where E is the random variable denoting the evidence we expected to find. Note that if we interchange X and ¬X, the ratio inverts, the log changes sign - but this makes no difference to the variance. So we can equally well talk about extending niceness given X to ¬X, or niceness given ¬X to X.


Perfect and imperfect extensions

The above definition would work for a "perfectly nice AI". That could be an AI that would be nice, given any combination of estimates of X and Zj. In practice, because we can't consider every edge case, we would only have an "expectedly nice AI". That means that the AI can fail to be nice in certain unusual and unlikely edge cases, in certain strange sets of values of Zj that almost never come up...

...or at least, that almost never come up, given X. Since the "expected niceness" was calibrated given X, such an expectedly nice AI may fail to be nice if ¬X results in a substantial change in the probability of the Zj (see the second failure mode in this post; some of the Zj may be so tightly coupled to the value of X that an expected niceness AI considers them fixed, and this results in problems if ¬X happens and their values change).

One way of fixing this is to require that the "swing" of the Zj be small upon changing X to ¬X or vice versa. Something like, for all values of {aj}, the ratio P({Zj=aj} | X) / P({Zj=aj} | ¬X) is contained between 100 and 1/100. This means that a reasonably good "expected niceness" calibrated on the Zj will transfer from X to ¬X (though the error may grow). This approach has some other advantages, as we'll see in the next section.
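The swing restriction can be written as a simple check over every joint assignment of the Zj. The sketch below assumes the two conditional distributions are available as dictionaries keyed by assignment tuples. That representation, the function name `swing_ok` and the example numbers are illustrative assumptions; only the bounds of 100 and 1/100 come from the text.

```python
from typing import Dict, Tuple

Assignment = Tuple[bool, ...]


def swing_ok(p_given_x: Dict[Assignment, float],
             p_given_not_x: Dict[Assignment, float],
             bound: float = 100.0) -> bool:
    """True if P(Z=a | X) / P(Z=a | not X) lies in [1/bound, bound] for every a."""
    for assignment, px in p_given_x.items():
        pnx = p_given_not_x.get(assignment, 0.0)
        if px <= 0.0 or pnx <= 0.0:
            return False  # ratio is zero or infinite: unbounded swing
        ratio = px / pnx
        if not (1.0 / bound <= ratio <= bound):
            return False
    return True


# Hypothetical single variable Z1 = "asteroid hits Earth".
mild_given_x = {(True,): 0.05, (False,): 0.95}
mild_given_not_x = {(True,): 0.5, (False,): 0.5}

# Hypothetical variable almost fixed by X: the swing is far too large.
tight_given_x = {(True,): 1e-6, (False,): 1.0 - 1e-6}
tight_given_not_x = {(True,): 0.5, (False,): 0.5}
```

Here `swing_ok(mild_given_x, mild_given_not_x)` passes, because a factor-of-ten change is well inside the bounds. The tightly coupled variable fails, which is the situation the restriction is meant to rule out.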

Of course, problems arise if the Zj are defined by smashing together events relatively independent of X with some that are very strongly dependent on X, into the same variable. Thus we'll further require that Zj cannot be decomposed into natural subvariables whose values swing more under changes from X to ¬X.


High impact from reduced impact

With this, we are finally in a position to define a reduced impact AI design that can have a large impact. Imagine that an asteroid is coming to hit the Earth. Two AIs are charged with targeting a laser at it. The laser isn't perfectly effective, and the asteroid isn't perfectly certain to hit the Earth anyway: let's say aiming the laser correctly will reduce the odds of impact by a factor of 10 or so (well within the probability swing restrictions of the previous section).

In the original formulation of this, the AIs were tasked with outputting the x and y coordinates to aim the laser. Here, instead, they'll be tasked with constructing the motivation of a robotic subagent that will move around for a while, see things and interact, before physically aiming the laser in the right direction. This task is harder.

Then the top level AI charged with aiming the y coordinates is given the following task:

  • It is motivated to cause the correct y-coordinate to be outputted.
  • If the x-coordinate AI is not turned on (call this event ¬X), it is strongly motivated to have reduced impact.
  • It must phrase the robotic subagent's utility function in terms of the Zj that obey all the properties above.
  • The evidence E considered for the equation governing the naturalness of Zj is considered (for example) to be a random sampling of space at the time both AIs are turned on. Most importantly, this is defined to be a theoretical process that the AI cannot affect physically.

So what will the AI do? It will produce a robot that will try to aim the y-coordinate of the laser correctly, given ¬X, and not expand dangerously. However, we've set things up so that X is virtually certain, so the robot will see another, x-aiming robot, wander around. The chances of this happening, given ¬X, are minuscule, so the swing of the probability from ¬X to X is huge. So, by the restrictions we've put on the Zj, the robot will not be motivated to change its behaviour just because it sees the other robot.

And if the laser is fired, the impact on humanity will, conversely, not be that large, so well within the parameters of the Zj. Thus the "general reduced impact" portion of the robot's motivation will be preserved.

And, of course, the x-robot will have the mirror image motivations, so they will both aim correctly, it seems. This should evade the problem described here.


Failure modes

There are three ways the approach could fail: one bad way, and two graceful ways. The bad way is if the "no decomposition" requirements (one of them is presented above, the other is in this post) fail. In this case, the niceness or reduced impact are not guaranteed to be extended to ¬X, and the AI's behaviour can become undefined. Note that this is not bad optimisation pressure, just an absence of optimisation pressure - but that could still be potentially dangerous.

The first graceful failure happens if the AI is not able to create a reduced impact sub-agent given the restrictions on the Zj above. In that case, the AI does nothing. The second graceful failure happens if the AI evades our attempts to increase its impact, given ¬X. In that case, it simply becomes a reduced impact AI that does little. Not ideal, but not deadly.


Overall status: I'm not sure the idea is sound, at least not yet. Critiques welcome.

Grue, Bleen, and natural categories

3 Stuart_Armstrong 06 July 2015 01:47PM

A putative new idea for AI control; index here.

In a previous post, I looked at unnatural concepts such as grue (green if X was true, blue if it was false) and bleen. This was to enable one to construct the natural categories that extend AI behaviour, something that seemed surprisingly difficult to do.

The basic idea discussed in the grue post was that the naturalness of grue and bleen seemed dependent on features of our universe - mostly, that it was easy to tell whether an object was "currently green" without knowing what time it was, but we could not know whether the object was "currently grue" without knowing the time.

So the naturalness of the category depended on the type of evidence we expected to find. Furthermore, it seemed easier to discuss whether a category is natural "given X", rather than whether that category is natural in general. However, we know the relevant X in the AI problems considered so far, so this is not a problem.


Natural category, probability flows

Fix a boolean random variable X, and assume we want to check whether the boolean random variable Z is a natural category, given X.

If Z is natural (for instance, it could be the colour of an object, while X might be the brightness), then we expect to uncover two types of evidence:

  • those that change our estimate of X; this causes probability to "flow" as follows (or in the opposite directions):

  • ...and those that change our estimate of Z:

Or we might discover something that changes our estimates of X and Z simultaneously. If the probability flows to X and Z in the same proportions, we might get:

What is an example of an unnatural category? Well, if Z is some sort of grue/bleen-like object given X, then we can have Z = X XOR Z', for Z' actually a natural category. This sets up the following probability flows, which we would not want to see:

More generally, Z might be constructed so that X∧Z, X∧¬Z, ¬X∧Z and ¬X∧¬Z are completely distinct categories; in that case, there are more forbidden probability flows:


In fact, there are only really three "linearly independent" probability flows, as we shall see.


Less pictures, more math

Let's represent the four possible states of affairs by four weights (not probabilities):

Since everything is easier when it's linear, let's set w11 = log(P(X∧Z)) and similarly for the other weights (we neglect cases where some events have zero probability). Two sets of weights correspond to the same probabilities iff you get from one set to the other by multiplying by a strictly positive number. For logarithms, this corresponds to adding the same constant to all the log-weights. So we can normalise our log-weights (select a single set of representative log-weights for each possible probability set) by choosing the w such that

w11 + w12 + w21 + w22 = 0.

Thus the probability "flows" correspond to adding together two such normalised 2x2 matrices, one for the prior and one for the update. Composing two flows means adding two change matrices to the prior.

Four variables, one constraint: the set of possible log-weights is three dimensional. We know we have two allowable probability flows, given naturalness: those caused by changes to P(X), independent of P(Z), and vice versa. Thus we are looking for a single extra constraint to keep Z natural given X.

A little thought reveals that we want to keep constant the quantity:

w11 + w22 - w12 - w21.

This preserves all the allowed probability flows and rules out all the forbidden ones. Translating this back to the general case, let "e" be the evidence we find. Then if Z is a natural category given X and the evidence e, the following quantity is the same for all possible values of e:

log[ (P(X∧Z|e)*P(¬X∧¬Z|e)) / (P(X∧¬Z|e)*P(¬X∧Z|e)) ].

If E is a random variable representing the possible values of e, this means that we want

log[ (P(X∧Z|E)*P(¬X∧¬Z|E)) / (P(X∧¬Z|E)*P(¬X∧Z|E)) ]

to be constant, or, equivalently, seeing the posterior probabilities as random variables dependent on E:

  • Variance{log[ (P(X∧Z|E)*P(¬X∧¬Z|E)) / (P(X∧¬Z|E)*P(¬X∧Z|E)) ]} = 0.

Call that variance the XE-naturalness measure. If it is zero, then Z defines an XE-natural category. Note that this does not imply that Z and X are independent, or independent conditional on E. Just that they are, in some sense, "equally (in)dependent whatever E is".
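As a rough illustration of the measure just defined, the sketch below computes the variance of the log cross-ratio over a small, finite set of possible evidence values. Everything about the setup is an assumption made for illustration: the function names, the two equally likely pieces of evidence, and the choice to make X and Z' independent given each piece of evidence. The grue-like variable is built as Z = X XOR Z', as in the earlier section.

```python
import math
from typing import Dict, List, Tuple

Joint = Dict[Tuple[bool, bool], float]  # (x, z) -> P(X=x, Z=z | e), all > 0


def log_cross_ratio(joint: Joint) -> float:
    """log[ P(X,Z) P(~X,~Z) / (P(X,~Z) P(~X,Z)) ] for one value of the evidence."""
    return math.log(joint[(True, True)] * joint[(False, False)]
                    / (joint[(True, False)] * joint[(False, True)]))


def xe_naturalness(evidence: List[Tuple[float, Joint]]) -> float:
    """Variance of the log cross-ratio over evidence values weighted by P(e)."""
    total = sum(p for p, _ in evidence)
    values = [(p / total, log_cross_ratio(j)) for p, j in evidence]
    mean = sum(p * v for p, v in values)
    return sum(p * (v - mean) ** 2 for p, v in values)


def independent_joint(px: float, pz: float) -> Joint:
    """Joint over (X, Z) in which X and Z are independent."""
    return {(x, z): (px if x else 1 - px) * (pz if z else 1 - pz)
            for x in (True, False) for z in (True, False)}


def xor_with_x(joint: Joint) -> Joint:
    """Relabel a joint over (X, Z') as a joint over (X, Z) with Z = X XOR Z'."""
    return {(x, x != z_prime): p for (x, z_prime), p in joint.items()}


# Two hypothetical, equally likely pieces of evidence.
natural = [(0.5, independent_joint(0.2, 0.7)),
           (0.5, independent_joint(0.9, 0.3))]
grue_like = [(p, xor_with_x(j)) for p, j in natural]

natural_score = xe_naturalness(natural)
grue_score = xe_naturalness(grue_like)
```

In this toy setup `natural_score` is zero up to rounding, while `grue_score` is clearly positive, because the XOR construction makes the cross-ratio depend on what the evidence says about Z'.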


Almost natural category

The advantage of that last formulation becomes visible when we consider that the evidence which we uncover is not, in the real world, going to perfectly mark Z as natural, given X. To return to the grue example, though most evidence we uncover about an object is going to be the colour or the time rather than some weird combination, there is going to be somebody who will write things like "either the object is green, and the sun has not yet set in the west; or instead perchance, those two statements are both alike in falsity". Upon reading that evidence, if we believe it in the slightest, the variance can no longer be zero.

Thus we cannot expect that the above XE-naturalness be perfectly zero, but we can demand that it be low. How low? There seems no principled way of deciding this, but we can make one attempt: that we cannot lower it by decomposing Z.

What do we mean by that? Well, assume that Z is a natural category, given X and the expected evidence, but Z' is not. Then we can define a new boolean category Y to be Z with high probability, and Z' otherwise. This will still have low XE-naturalness measure (as Z does) but is obviously not ideal.

Reversing this idea, we say Z defines an "XE-almost natural category" if there is no "more XE-natural" category that extends X∧Z (and likewise for the other three conjunctions). Technically, if

X∧Z = X∧Y,

then Y must have equal or greater XE-naturalness measure to Z. And similarly for X∧¬Z, ¬X∧Z, and ¬X∧¬Z.

Note: I am somewhat unsure about this last definition; the concept I want to capture is clear (Z is not the combination of more XE-natural subvariables), but I'm not certain the definition does it.


Beyond boolean

What if Z takes n values, rather than being a boolean? This can be treated simply.

If we set the wjk to be log-weights as before, there are 2n free variables. The normalisation constraint is that they all sum to a constant. The "permissible" probability flows are given by flows from X to ¬X (adding a constant to the first column, subtracting it from the second) and pure changes in Z (adding constants to various rows, summing to 0). There are 1+ (n-1) linearly independent ways of doing this.

Therefore we are looking for 2n-1 -(1+(n-1))=n-1 independent constraints to forbid non-natural updating of X and Z. One basis set for these constraints could be to keep constant the values of

wj1 + w(j+1)2 - wj2 - w(j+1)1,

where j ranges between 1 and n-1.
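As a sanity check on these n-1 quantities, the sketch below verifies numerically that they are unchanged by the two permitted kinds of flow and that they do change under a flow that couples X and Z. The layout is an assumption taken from the description above: n rows for the values of Z and two columns for X and ¬X. The function names and the particular numbers are hypothetical.

```python
import random
from typing import List

Weights = List[List[float]]  # n rows (values of Z), 2 columns (X, not X)


def cross_terms(w: Weights) -> List[float]:
    """The n-1 quantities w[j][X] + w[j+1][~X] - w[j][~X] - w[j+1][X]."""
    return [w[j][0] + w[j + 1][1] - w[j][1] - w[j + 1][0]
            for j in range(len(w) - 1)]


def x_flow(w: Weights, c: float) -> Weights:
    """Permitted flow between X and not-X: add c to column 1, subtract from 2."""
    return [[a + c, b - c] for a, b in w]


def z_flow(w: Weights, shifts: List[float]) -> Weights:
    """Permitted pure change in Z: per-row constants, which should sum to zero."""
    return [[a + s, b + s] for (a, b), s in zip(w, shifts)]


def close(u: List[float], v: List[float], tol: float = 1e-9) -> bool:
    return len(u) == len(v) and all(abs(p - q) < tol for p, q in zip(u, v))


random.seed(0)
n = 4
w = [[random.uniform(-2, 2), random.uniform(-2, 2)] for _ in range(n)]
base = cross_terms(w)

# Compose both permitted flows; the row shifts sum to zero.
after = cross_terms(z_flow(x_flow(w, 1.25), [0.3, -0.5, 0.7, -0.5]))

# A flow that moves weight between X and not-X for the first Z value only.
mixed = [row[:] for row in w]
mixed[0][0] += 0.4
mixed[0][1] -= 0.4
after_mixed = cross_terms(mixed)
```

`close(base, after)` holds, while `after_mixed[0]` differs from `base[0]` by 0.8. The coupled flow preserves the normalisation but breaks the first constraint.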

This translates to variance constraints of the type:

  • Variance{log[ (P(X∧{Z=j}|E)*P(¬X∧{Z=j+1}|E)) / (P(X∧{Z=j+1}|E)*P(¬X∧{Z=j}|E)) ]} = 0.

But those are n different possible variances. What is the best global measure of XE-naturalness? It seems it could simply be

  • Max_{j,k} Variance{log[ (P(X∧{Z=j}|E)*P(¬X∧{Z=k}|E)) / (P(X∧{Z=k}|E)*P(¬X∧{Z=j}|E)) ]} = 0.

If this quantity is zero, it naturally sends all variances to zero, and, when not zero, is a good candidate for the degree of XE-naturalness of Z.

The extension to the case where X takes multiple values is straightforward:

  • Max_{j,k,l,m} Variance{log[ (P({X=l}∧{Z=j}|E)*P({X=m}∧{Z=k}|E)) / (P({X=l}∧{Z=k}|E)*P({X=m}∧{Z=j}|E)) ]} = 0.

Note: if ever we need to compare the XE-naturalness of random variables taking different numbers of values, it may become necessary to divide these quantities by the number of variables involved, or maybe substitute a more complicated expression that contains all the different possible variances, rather than simply the maximum.


And in practice?

In the next post, I'll look at using this in practice for an AI, to evade presidential deaths and deflect asteroids.

Green Emeralds, Grue Diamonds

8 Stuart_Armstrong 06 July 2015 11:27AM

A putative new idea for AI control; index here.

When posing his "New Riddle of Induction", Goodman introduced the concepts of "grue" and "bleen" to show some of the problems with the conventional understanding of induction.

I've somewhat modified those concepts. Let T be a set of intervals in time, and we'll use the boolean X to designate the fact that the current time t belongs to T (with ¬X equivalent to t∉T). We'll define an object to be:

  • Grue if it is green given X (ie whenever t∈T), and blue given ¬X (ie whenever t∉T).
  • Bleen if it is blue given X, and green given ¬X.

At this point, people are tempted to point out the ridiculousness of the concepts, dismissing them because of their strange disjunctive definitions. However, this doesn't really solve the problem; if we take grue and bleen as fundamental concepts, then we have the disjunctively defined green and blue; an object is:

  • Green if it is grue given X, and bleen given ¬X.
  • Blue if it is bleen given X, and grue given ¬X.
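The symmetry between the two pairs of definitions can be made explicit in a few lines. This is a small sketch with hypothetical function names. It treats the colour labels as plain strings and X as a boolean, and shows that each vocabulary can be translated into the other by the same kind of rule.

```python
def to_grue_bleen(colour: str, x: bool) -> str:
    """Translate a current colour ('green' or 'blue') into 'grue' or 'bleen'."""
    if colour not in ("green", "blue"):
        raise ValueError(f"unknown colour: {colour!r}")
    # Grue means green when X holds and blue when it does not.
    return "grue" if (colour == "green") == x else "bleen"


def to_green_blue(label: str, x: bool) -> str:
    """Translate a current 'grue' or 'bleen' label back into 'green' or 'blue'."""
    if label not in ("grue", "bleen"):
        raise ValueError(f"unknown label: {label!r}")
    # Green means grue when X holds and bleen when it does not.
    return "green" if (label == "grue") == x else "blue"


# Round trip over every case: neither vocabulary loses information.
round_trips = [
    to_green_blue(to_grue_bleen(colour, x), x) == colour
    for colour in ("green", "blue")
    for x in (True, False)
]
```

Both translation functions need to be told X, and they have exactly the same shape. The definitions alone give no reason to prefer one pair of concepts over the other.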

Still, the categories green and blue are clearly more fundamental than grue and bleen. There must be something we can whack them with to get this - maybe Kolmogorov complexity or stuff like that? Sure, someone on Earth could make a grue or bleen object (a screen with a timer, maybe?), but it would be completely artificial. Note that though grue and bleen are unnatural, "currently grue" (colour=green XOR ¬X) or "currently bleen" (colour=blue XOR ¬X) make perfect sense (though they require knowing X, an important point for later on).

But before that... are we so sure the grue and bleen categories are unnatural? Relative to what?


Welcome to Chiron Beta Prime

Chiron Beta Prime, apart from having its own issues with low-intelligence AIs, is noted for having many suns: one large sun that glows mainly in the blue spectrum, and multiple smaller ones glowing mainly in the green spectrum. They all emit in the totality of the spectrum, but they are stronger in those colours.

Because of the way the orbits are locked to each other, the green suns are always visible from everywhere. The blue sun rises and sets on a regular schedule; define T to be the times when the blue sun is risen (so X="Blue sun visible, some green suns visible" and ¬X="Blue sun not visible, some green suns visible").

Now "green" is a well defined concept in this world. Emeralds are green; they glow green under the green suns, and do the same when the blue sun is risen. "Blue" is also a well-defined concept. Sapphires are blue. They glow blue under the blue sun and continue to do so (albeit less intensely) when it is set.

But "grue" is also a well defined concept. Diamonds are grue. They glow green when the green suns are the only ones visible, but glow blue under the glare of the blue sun.

Green, blue, and grue (which we would insist on calling green, blue and white) are thus well understood and fundamental concepts, that people of this world use regularly to compactly convey useful information to each other. They match up easily to fundamental properties of the objects in question (eg frequency of light reflected).

Bleen, on the other hand - don't be ridiculous. Sure, someone on Chiron Beta Prime could make a bleen object (a screen with a timer, maybe?), but it would be completely artificial.

In contrast, the inhabitants of Pholus Delta Secundus, who have a major green sun and many minor blue suns (coincidentally with exactly the same orbital cycles), feel that green, blue and bleen are the natural categories...


Natural relative to the (current) universe

We've shown that some categories that we see as disjunctive or artificial can seem perfectly natural and fundamental to beings in different circumstances. Here's another example:

A philosopher proposes, as thought experiment, to define a certain concept for every object. It's the weighted sum of the inverse of the height of an object (from the centre of the Earth), and its speed (squared, because why not?), and its temperature (but only on an "absolute" scale), and some complicated thing involving its composition and shape, and another term involving its composition only. And maybe we can add another piece for its total mass.

And then that philosopher proposes, to great derision, that this whole messy sum be given a single name, "Energy", and that we start talking about it as if it was a single thing. Faced with such an artificially bizarre definition, sensible people who want to use induction properly have no choice... but to embrace energy as one of the fundamental useful facts of the universe.

What these examples show is that green, blue, grue, bleen, and energy are not natural or non-natural categories in some abstract sense, but relative to the universe we inhabit. For instance, if we had some strange energy' which used the inverse of the height cubed, then we'd have a useless category - unless we lived in five spatial dimensions.


You're grue, what time is it?

So how can we say that green and blue are natural categories in our universe, while grue and bleen are not? A very valid explanation seems to be the dependence on X - on the time of day. On our Earth, we can tell whether objects are green or blue without knowing anything about the time. Certainly we can get combined information about an object's colour and the time of day (for instance by looking at emeralds out in the open). But we also expect to get information about the colour (by looking at an object in a lit basement) and the time (by looking at a clock). And we expect these pieces of information to be independent of each other.

In contrast, we never expect to get information about an object being currently grue or currently bleen without knowing the time (or the colour, for that matter). And information about the time can completely change our assessment as to whether an object is grue versus bleen. It would be a very contrived set of circumstances where we would be able to assert "I'm pretty sure that object is currently grue, but I have no idea about its colour or about the current time".

Again, this is a feature of our world and the evidence we see in it, not some fundamental feature of the categories of grue and bleen. We just don't generally see green objects change into blue objects, nor do we typically learn about disjunctive statements of the type "colour=green XOR time=night" without learning about the colour and the time separately.
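To make the XOR statement above concrete, here is a minimal sketch in Python. It assumes a toy world with only two colours and two times of day, and it takes "colour=green XOR time=night" as the definition of grue (with bleen as the matching statement for blue); the function names are hypothetical and only for illustration.

```python
from itertools import product

COLOURS = ("green", "blue")
TIMES = ("day", "night")

def is_grue(colour, time):
    # Assumed definition: "colour=green XOR time=night".
    return (colour == "green") != (time == "night")

def is_bleen(colour, time):
    # Assumed definition: "colour=blue XOR time=night".
    return (colour == "blue") != (time == "night")

def worlds_where(predicate):
    """List every (colour, time) pair in the toy world that satisfies the predicate."""
    return [(c, t) for c, t in product(COLOURS, TIMES) if predicate(c, t)]

grue_worlds = worlds_where(is_grue)
print(grue_worlds)  # [('green', 'day'), ('blue', 'night')]

# Knowing only "this object is currently grue" pins down neither the colour
# nor the time: both colours and both times remain possible.
print({c for c, _ in grue_worlds}, {t for _, t in grue_worlds})
```

In this toy world, learning "grue" alone leaves colour and time individually undetermined, which is why in our world we expect to learn it only via colour and time separately.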

What about the grue objects on Chiron Beta Prime? There, people do see objects change colour regularly, and, upon investigation, they can detect whether an object is grue without knowing either the time or the apparent colour of the object. For instance, they know that diamond is grue, so they can detect some grue objects by a simple hardness test.

But what's happening is that the Chiron Beta Primers have correctly identified a fundamental category - the one we call white, or, more technically "prone to reflect light both in the blue and green parts of the spectrum" - that has different features on their planet than on ours. From the macroscopic perspective, it's as if we and they live in a different universe, hence grue means something to them and not to us. But the same laws of physics underlie both our worlds, so fundamentally the concepts converge - our white, their grue, mean the same things at the microscopic level.


Definitions open to manipulation

In the next post, I'll look at whether we can formalise "expect independent information about colour and time", and "we don't expect change to the time information to change our colour assessment."

But be warned. The naturalness of these categories is dependent on facts about the universe, and these facts could be changed. A demented human (or a powerful AI) could go through the universe, hiding everything in boxes, smashing clocks, and putting "current bleen detectors" all over the place, so that it suddenly becomes very easy to know statements like "colour=blue XOR time=night", but very hard to know about colour (or time) independently from this. So it would be easy to say "this object is currently bleen", but hard to say "this object is blue". Thus the "natural" categories may be natural now, but this could well change, so we must take care when using these definitions to program an AI.

Top 9+2 myths about AI risk

44 Stuart_Armstrong 29 June 2015 08:41PM

Following some somewhat misleading articles quoting me, I thought I’d present the top 9 myths about the AI risk thesis:

  1. That we’re certain AI will doom us. Certainly not. It’s very hard to be certain of anything involving a technology that doesn’t exist; we’re just claiming that the probability of AI going bad isn’t low enough that we can ignore it.
  2. That humanity will survive, because we’ve always survived before. Many groups of humans haven’t survived contact with more powerful intelligent agents. In the past, those agents were other humans; but they need not be. The universe does not owe us a destiny. In the future, something will survive; it need not be us.
  3. That uncertainty means that you’re safe. If you’re claiming that AI is impossible, or that it will take countless decades, or that it’ll be safe... you’re not being uncertain, you’re being extremely specific about the future. “No AI risk” is certain; “Possible AI risk” is where we stand.
  4. That Terminator robots will be involved. Please? The threat from AI comes from its potential intelligence, not from its ability to clank around slowly with an Austrian accent.
  5. That we’re assuming the AI is too dumb to know what we’re asking it. No. A powerful AI will know what we meant to program it to do. But why should it care? And if we could figure out how to program “care about what we meant to ask”, well, then we’d have safe AI.
  6. That there’s one simple trick that can solve the whole problem. Many people have proposed that one trick. Some of them could even help (see Holden’s tool AI idea). None of them reduce the risk enough to relax – and many of the tricks contradict each other (you can’t design an AI that’s both a tool and socialising with humans!).
  7. That we want to stop AI research. We don’t. Current AI research is very far from the risky areas and abilities. And it’s risk aware AI researchers that are most likely to figure out how to make safe AI.
  8. That AIs will be more intelligent than us, hence more moral. It’s pretty clear that in humans, high intelligence is no guarantee of morality. Are you really willing to bet the whole future of humanity on the idea that AIs might be different? That in the billions of possible minds out there, there is none that is both dangerous and very intelligent?
  9. That science fiction or spiritual ideas are useful ways of understanding AI risk. Science fiction and spirituality are full of human concepts, created by humans, for humans, to communicate human ideas. They need not apply to AI at all, as these could be minds far removed from human concepts, possibly without a body, possibly with no emotions or consciousness, possibly with many new emotions and a different type of consciousness, etc... Anthropomorphising the AIs could lead us completely astray.
Lists cannot be comprehensive, but they can adapt and grow, adding more important points:
  1. That AIs have to be evil to be dangerous. The majority of the risk comes from indifferent or partially nice AIs. Those that have some goal to follow, with humanity and its desires just getting in the way – using resources, trying to oppose it, or just not being perfectly efficient for its goal.
  2. That we believe AI is coming soon. It might; it might not. Even if AI is known to be in the distant future (which isn't known, currently), some of the groundwork is worth laying now.


My recent thoughts on consciousness

0 AlexLundborg 24 June 2015 12:37AM

I have lately come to seriously consider the view that the everyday notion of consciousness doesn’t refer to anything that exists out there in the world but is rather a confused (but useful) projection made by purely physical minds onto their depiction of themselves in the world. The main influences on my thinking are Dan Dennett (I assume most of you are familiar with him) and, to a lesser extent, Yudkowsky (1) and Tomasik (2). To use Dennett’s line of thought: we say that honey is sweet, that metal is solid or that a falling tree makes a sound, but the character tag of sweetness and sounds is not in the world but in the brain’s internal model of it. Sweetness is not an inherent property of the glucose molecule; instead, we are wired by evolution to perceive it as sweet to reward us for calorie intake in our ancestral environment, nor is there any need for non-physical sweetness-juice in the brain; no, it’s coded (3). We can talk about sweetness and sound as if they were out there in the world, but in reality they are a useful fiction of sorts that we are "projecting" out into the world. The default model of our surroundings and ourselves that we use in our daily lives (the manifest image, or ’umwelt’) is puzzling to reconcile with the scientific perspective of gluons and quarks. We can use this insight to look critically at how we perceive a very familiar part of the world: ourselves. It might be that we are projecting useful fictions onto our model of ourselves as well. Our normal perception of consciousness is perhaps like the sweetness of honey, something we think exists in the world, when it is in fact a judgement about the world made (unconsciously) by the mind.

What we are pointing at with the judgement “I am conscious” is perhaps the competence that we have to access states about the world, form expectations about those states and judge their value to us, coded in by evolution. That is, under this view, equivalent to saying that sugar is made of glucose molecules, not sweetness-magic. In everyday language we can talk about sugar as sweet and consciousness as “something-to-be-like-ness” or “having qualia”, which is useful and probably necessary for us to function, but that is a somewhat misleading projection made by our world-accessing and assessing consciousness that really exists in the world. That notion of consciousness is not subject to the Hard Problem; it may not be an easy problem to figure out how consciousness works, but it does not appear impossible to explain it scientifically as pure matter like anything else in the natural world, at least in theory. I’m pretty confident that we will solve consciousness, if by consciousness we mean the competence of a biological system to access states about the world, make judgements and form expectations. That is however not what most people mean when they say consciousness. Just as “real” magic refers to the magic that isn’t real, while the magic that is real, that can be performed in the world, is not “real magic”, “real” consciousness turns out to be a useful but misleading assessment (4). We should perhaps keep the word consciousness but adjust what we mean when we use it, for diplomacy.

Having said that, I still find myself baffled by the idea that I might not be conscious in the way I’ve found completely obvious before. Consciousness seems so mysterious and unanswerable, so it’s not surprising then that the explanation provided by physicalists like Dennett isn’t the most satisfying. Despite that, I think it’s the best explanation I've found so far, so I’m trying to cope with it the best I can. One of the problems I’ve had with the idea is how it has required me to rethink my views on ethics. I sympathize with moral realism, the view that there exist moral facts, by pointing to the strong intuition that suffering seems universally bad, and well-being seems universally good. Nobody wants to suffer agonizing pain, everyone wants beatific eudaimonia, and it doesn't feel like an arbitrary choice to care about the realization of these preferences in all sentience to a high degree, instead of any other possible goal like paperclip maximization. It appeared to me to be an inescapable fact about the universe that agonizing pain really is bad (ought to be prevented), that intelligent bliss really is good (ought to be pursued), like a label to distinguish wavelength of light in the brain really is red, and that you can build up moral values from there. I have a strong gut feeling that the well-being of sentience matters, and the more capacity a creature has of receiving pain and pleasure the more weight it is given, say a gradient from beetles to posthumans that could perhaps be understood by further inquiry of the brain (5). However, if it turns out that pain and pleasure aren’t more than convincing judgements by a biological computer network in my head, no different in kind to any other computation or judgement, the sense of seriousness and urgency of suffering appears to fade away.
Recently, I’ve loosened up a bit to accept a weaker grounding for morality: I still think that my own well-being matters, and I would be inconsistent if I didn’t think the same about other collections of atoms that appear functionally similar to ’me’, who also claim, or appear, to care about their well-being. I can’t answer why I should care about my own well-being though, I just have to. Speaking of 'me': personal identity also looks very different (nonexistent?) under physicalism than in the everyday manifest image (6).

Another difficulty I confront is why e.g. colors and sounds look and sound the way they do, or why they have any quality at all, under this explanation. Where do they come from if they’re only labels my brain uses to distinguish inputs from the senses? Where does the yellowness of yellow come from? Maybe it’s not a sensible question, but only the murmuring of a confused primate. Then again, where does anything come from? If we can learn to shut up our bafflement about consciousness and sensibly reduce it down to physics, fair enough, but where does physics come from? That mystery remains, and that will possibly always be out of reach, at least probably before advanced superintelligent philosophers. For now, understanding how a physical computational system represents the world, creates judgements and expectations from perception presents enough of a challenge. It seems to be a good starting point to explore anyway (7).

I did not really put forth any particularly new ideas here; these are just some of my thoughts and repetitions of what I have read and heard others say, so I'm not sure if this post adds any value. My hope is that someone will at least find some of my references useful, and that it can provide a starting point for discussion. Take into account that this is my first post here; I am very grateful to receive input and criticism! :-)

  1. Check out Eliezer's hilarious tear down of philosophical zombies if you haven't already
  2. http://reducing-suffering.org/hard-problem-consciousness/
  3. [Video] TED talk by Dan Dennett http://www.ted.com/talks/dan_dennett_cute_sexy_sweet_funny
  4. http://ase.tufts.edu/cogstud/dennett/papers/explainingmagic.pdf
  5. Reading “The Moral Landscape” by Sam Harris increased my confidence in moral realism. Whether moral realism is true or false can obviously have implications for approaches to the value learning problem in AI alignment, and for the factual accuracy of the orthogonality thesis
  6. http://www.lehigh.edu/~mhb0/Dennett-WhereAmI.pdf
  7. For anyone interested in getting a grasp of this scientific challenge I strongly recommend the book “A User’s Guide to Thought and Meaning” by Ray Jackendoff.

Edit: made some minor changes and corrections. Edit 2: made additional changes in the first paragraph for increased readability.


[Link] Self-Representation in Girard’s System U

2 Gunnar_Zarncke 18 June 2015 11:22PM

Self-Representation in Girard’s System U, by Matt Brown and Jens Palsberg:

In 1991, Pfenning and Lee studied whether System F could support a typed self-interpreter. They concluded that typed self-representation for System F “seems to be impossible”, but were able to represent System F in Fω. Further, they found that the representation of Fω requires kind polymorphism, which is outside Fω. In 2009, Rendel, Ostermann and Hofer conjectured that the representation of kind-polymorphic terms would require another, higher form of polymorphism. Is this a case of infinite regress?
We show that it is not and present a typed self-representation for Girard’s System U, the first for a λ-calculus with decidable type checking. System U extends System Fω with kind polymorphic terms and types. We show that kind polymorphic types (i.e. types that depend on kinds) are sufficient to “tie the knot” – they enable representations of kind polymorphic terms without introducing another form of polymorphism. Our self-representation supports operations that iterate over a term, each of which can be applied to a representation of itself. We present three typed self-applicable operations: a self-interpreter that recovers a term from its representation, a predicate that tests the intensional structure of a term, and a typed continuation-passing-style (CPS) transformation – the first typed self-applicable CPS transformation. Our techniques could have applications from verifiably type-preserving metaprograms, to growable typed languages, to more efficient self-interpreters.
Emphasis mine. That seems to be a powerful calculus for writing self-optimizing AI programs in...

See also the lambda-the-ultimate comment thread about it.

The president didn't die: failures at extending AI behaviour

9 Stuart_Armstrong 10 June 2015 04:00PM

A putative new idea for AI control; index here.

In a previous post, I considered the issue of an AI that behaved "nicely" given some set of circumstances, and whether we could extend that behaviour to the general situation, without knowing what "nice" really meant.

The original inspiration for this idea came from the idea of extending the nice behaviour of "reduced impact AI" to situations where they didn't necessarily have a reduced impact. But it turned out to be connected with "spirit of the law" ideas, and to be of potentially general interest.

Essentially, the problem is this: if we have an AI that will behave "nicely" (since this could be a reduced impact AI, I don't use the term "friendly", which denotes a more proactive agent) given X, how can we extend its "niceness" to ¬X? Obviously if we can specify what "niceness" is, we could just require the AI to do so given ¬X. Therefore let us assume that we don't have a good definition of "niceness", we just know that the AI has that given X.

To make the problem clearer, I chose an X that would be undeniably public and have a large (but not overwhelming) impact: the death of the US president on a 1st of April. The public nature of this event prevents using approaches like thermodynamic miracles to define counterfactuals.

I'll be presenting a solution in a subsequent post. In the meantime, to help better understand the issue, here's a list of failed solutions:


First Failure: maybe there's no problem

Initially, it wasn't clear there was a problem. Could we just expect niceness to extend naturally? But consider the following situation: assume the vice president is a warmonger, who will start a nuclear war if ever they get into power (but is otherwise harmless).

Now assume the nice AI has the conditional action criterion: "if the vice president ever becomes president, launch a coup". This is safe: it can be extended to the ¬X situation in the way we want.

However, conditioning on X, that criterion is equivalent to "launch a coup on the 2nd of April". And if the AI has that criterion, then extending it to ¬X is highly non-safe. This illustrates that there is a real problem here - the coup example is just one of the myriad of potential issues that could arise, and we can't predict them all.
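The coup example can be sketched in a few lines of Python. This is only an illustration under assumed names: the world dictionaries, their keys and the two policy functions are hypothetical, and each world is reduced to the two facts the example needs.

```python
def conditional_policy(world):
    # "If the vice president ever becomes president, launch a coup."
    return "coup" if world["vp_is_president"] else "nothing"

def date_policy(world):
    # "Launch a coup on the 2nd of April."
    return "coup" if world["date"] == "2 April" else "nothing"

# X: the president died on the 1st of April, so the vice president took over.
world_x = {"vp_is_president": True, "date": "2 April"}
# not-X: the president is alive, and it is still the 2nd of April.
world_not_x = {"vp_is_president": False, "date": "2 April"}

# Given X, the two policies cannot be told apart by their behaviour...
print(conditional_policy(world_x), date_policy(world_x))          # coup coup
# ...but extended to not-X, only the first one stays harmless.
print(conditional_policy(world_not_x), date_policy(world_not_x))  # nothing coup
```

Observing the AI only under X gives no way to tell which of the two criteria it is actually following.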


Second failure: don't condition on X

Maybe the trick could be preventing the AI from conditioning on X (for anything)? If the AI itself can't tell the difference between X and ¬X, wouldn't its nice behaviour extend?

But this doesn't help. The AI could merely condition on things tightly coupled with X. Like news reports of the president's demise, or a conjunction of other events almost impossible under ¬X.


Third failure: disbelieve the truth

In this design, the AI simply assumes that X happens, whatever the evidence. Therefore it would stay nice, whatever happened.

But this results in an AI whose beliefs are strongly decoupled from reality. The AI must believe that there is a vast conspiracy concealing the president's death; the more it investigates this conspiracy, and fails to find it, the more vast and powerful it becomes. What if you show it undeniable, impossible-to-fake-by-the-laws-of-physics evidence of the president's demise? Then it starts to believe that the laws of physics are all wrong.

Though the AI is "nice", its niceness is appropriate for a world of invisible conspiracies and alternate physics - not for the real world. It's clear how such an AI could start doing crazy and dangerous things as its view of reality starts to diverge.


Fourth failure: hack the models

This example is due to Daniel Dewey. He imagined that the AI would maintain a full collection of models and update them normally. But before making a prediction, each model would get hacked to artificially set ¬X to X.

To see how this could fail, imagine all evidence about the world is stored in a giant vector of data. These vectors are calibrated correctly, then ¬X is flipped to X. The problem is that, while that is a small change to the vector, it could be a huge change to the world that it encodes.

For instance, let's go back to the warmongering vice president, and assume that nuclear war is inevitable, given X. Then the AI will walk around in a peaceful world, update its models - and act as if it was in a nuclear wasteland, because those are the only possibilities, given X. Essentially, the AI will move through our universe, harvesting information that would inform its actions in a parallel universe - and acting as if it existed there instead of here.
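A toy version of the hacked-model failure, written in Python as a rough sketch. The evidence "vector" is a small dictionary, and the world model is an assumed one-line rule standing in for "nuclear war is inevitable, given X"; none of these names come from Daniel Dewey's proposal.

```python
def predict_world(evidence):
    """Toy world model (assumed): given X, nuclear war is inevitable."""
    return "nuclear wasteland" if evidence["X"] else "peaceful"

def hacked_prediction(evidence):
    """Update on evidence normally, then overwrite the X entry before predicting."""
    hacked = dict(evidence, X=True)
    return predict_world(hacked)

# What the AI actually observes: the president is alive, the world is peaceful.
observed = {"X": False, "flower_show_open": True, "paris_intact": True}

hacked = dict(observed, X=True)
changed_entries = [key for key in observed if observed[key] != hacked[key]]

print(changed_entries)              # ['X']
print(predict_world(observed))      # peaceful
print(hacked_prediction(observed))  # nuclear wasteland
```

Only one entry of the evidence changes, yet the world the AI acts in flips completely.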

For instance, it could wander into a flower show where someone is talking about difficulties growing roses in southern Quebec. It adds this data to its vector, noting that the soil there must be a bit unsuitable to plant growth. It therefore concludes that it must write to the (non-existent) Third God-Emperor of America and advise it to give up on the Quebec Anglican Protectorate, which must be misreporting their agriculture output, given this data.

It's interesting to contrast this AI with the previous one. Suppose that the nuclear war further implies that Paris must be a smoking crater. And now both AIs must walk around a clearly bustling and intact Paris. The disbelieving AI must conclude that this is an elaborate ruse - someone has hidden the crater from its senses, put up some fake buildings, etc... The model-hacking AI, meanwhile, acts as if it's in a smouldering crater, with the genuine Paris giving it information as to what it should do: it sees an intact army barracks, and starts digging under the "rubble" to see if anything "remains" of that barracks.

It would be interesting to get Robin Hanson to try and reconcile these AIs' beliefs ^_^


Fifth failure: Bayes nets and decisions

It seems that a Bayes net would be our salvation. We could have dependent nodes like "warmongering president", "nuclear war", or "flower show". Then we could require that the AI makes its decision dependent only on the states of these dependent nodes. And never on the original X/¬X node.

This seems safe - after all, the AI is nice given X. And if we require the AI's decisions be dependent only on subordinate nodes, then it must be nice dependent on the subordinate nodes. Therefore X/¬X is irrelevant, and the AI is always nice.

Except... Consider what a "decision" is. A decision could be something simple, or it could be "construct a sub AI that will establish X versus ¬X, and do 'blah' if X, and 'shmer' if ¬X". That's a perfectly acceptable decision, and could be made conditional on any (or all) of the subordinate nodes. And if 'blah' is nice while 'shmer' isn't, we have the same problem.
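As a rough Python illustration of this loophole (all names here are hypothetical): the top-level decision function below reads only subordinate nodes, as required, but the thing it decides to build is a sub-agent that checks X for itself.

```python
def decide(subordinate_nodes):
    """Top-level decision: reads only subordinate nodes, never the X node."""
    def sub_ai(x_holds):
        # The constructed sub-agent establishes X versus not-X on its own.
        return "blah" if x_holds else "shmer"
    return sub_ai

# The decision itself is formally independent of X...
sub_ai = decide({"warmongering_president": False, "flower_show": True})

# ...but the behaviour that results from it is not.
print(sub_ai(True))   # blah
print(sub_ai(False))  # shmer
```

The restriction on what the decision depends on says nothing about what the decided-upon action later depends on.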


Sixth failure: Bayes nets and unnatural categories

OK, if decisions are too general, how about values for worlds? We take a lot of nodes, subordinate to X/¬X, and require that the AI define its utility or value function purely in terms of the states of these subordinate nodes. Again, this seems safe. The AI's value function is safe given X, by assumption, and is defined in terms of subordinate nodes that "screen off" X/¬X.

And that AI is indeed safe... if the subordinate nodes are sensible. But they're only sensible because I've defined them using terms such as "nuclear war". But what if a node is "nuclear war if X and peace in our time if ¬X"? That's a perfectly fine definition. But such nodes mean that the value function given ¬X need not be safe in any way.
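The unnatural node can be sketched in Python as follows. This is an illustrative toy, not a real Bayes net: the value function simply dislikes the node being "on", and the helper names are assumptions made for the example.

```python
def natural_node(world):
    # Node: "nuclear war".
    return world["nuclear_war"]

def unnatural_node(world):
    # Node: "nuclear war if X and peace in our time if not-X".
    if world["X"]:
        return world["nuclear_war"]
    return not world["nuclear_war"]

def value(node_state):
    # Toy value function (assumed): the AI dislikes the node being on.
    return -1 if node_state else 0

def preferred_outcome(x, node):
    """Return the outcome the value function ranks highest, for a given X."""
    options = {"war": True, "peace": False}
    return max(
        options,
        key=lambda name: value(node({"X": x, "nuclear_war": options[name]})),
    )

print(preferred_outcome(True, natural_node), preferred_outcome(False, natural_node))
print(preferred_outcome(True, unnatural_node), preferred_outcome(False, unnatural_node))
```

With the natural node the AI prefers peace whether or not X holds. With the unnatural node it prefers peace given X but war given ¬X, even though its value function is written purely in terms of a subordinate node.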

This is somewhat connected with the Grue and Bleen issue, and addressing that is how I'll be hoping to solve the general problem.
