[Link] SSC: It's Bayes All The Way Up

2 Houshalter 28 September 2016 06:06PM

The AI That Pretends To Be Human

1 Houshalter 02 February 2016 07:39PM

The hard part about containing an AI is restricting its output. The AI can lie, manipulate, and trick. Some speculate that it might be able to do far worse, inventing infohazards like hypnosis or brain hacking.

A major goal of the control problem is preventing AIs from doing that, and ensuring that their output is safe and useful.

A while ago I wrote about an approach to do this. The idea was to require the AI to use as little computing power as it needed to perform a task. This prevents the AI from over-optimizing. The AI won't use the full power of superintelligence unless it really needs it.

The above method isn't perfect, because a superintelligent AI may still have super abilities, even when it has a limited compute budget. And it will still try to do all those bad things mentioned above, unless it can find a better way that uses less computing power.

We don't really want to minimize the AI's compute budget. We just want it to act more human-like. When we ask it to solve a problem, we want it to try to solve the problem as a human would solve it. All of the problems arise from the AI behaving totally differently from a human, not necessarily from it using more computing power.

Is there any way to make an AI act like a human? Or is that as impossible as just solving FAI?

Stuart Armstrong posted an idea on how to achieve that. It involves an AI that tries to mimic output that is indistinguishable from what a human would produce, but conditional on it also solving the problem. However his idea also involves whole brain emulation and a complicated method for the AI to predict the future and avoid changing it too much.

We don't need that. The idea for the AI to try to be indistinguishable from a human is actually a really interesting control idea. It significantly limits the AI's ability to Do Sneaky Things, since humans don't do those things.

Here is my idea on how to achieve this. It's the same as my limited computing power idea mentioned earlier, but applied to "appearing human" as a goal, instead of minimizing computing power.

We start with two AIs. The first AI tries to appear human. The second AI does a sort of Turing test, and tries to tell the AI apart from an actual human.

The human AI's first goal is to produce a correct solution. It will predict how likely its solution is to be accepted. And it will try to produce a solution that has a >50% chance of working. This is a satisficing goal. That is, it doesn't optimize without end to create the best solution possible; it just has to meet some minimum threshold, then stop.

Its second goal will be to fool the second AI. It will do everything in its power to make the solution appear like it was produced by a human.

The second AI tries to predict how likely it is that its input was produced by a human. It is also conditioned on the input being a solution to the problem, so the fact that the AI solved a problem that was really hard for humans does not count against it.

This prevents the first AI from doing evil things with its output. If it tries to insert complicated infohazards or subagents into its output stream, it will be easily detected as an AI. Instead it needs to mimic humans as closely as possible.
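As a rough illustration of the two goals, here is a minimal sketch of the selection rule. Everything in it is hypothetical: `Candidate`, `p_correct`, `p_human`, and `pick_output` are made-up names, and the probabilities are hard-coded stand-ins for what the two AIs would actually have to predict.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Candidate:
    text: str
    p_correct: float  # assumed: first AI's estimate that the solution is accepted
    p_human: float    # assumed: second AI's estimate that a human wrote it,
                      # conditioned on it being a solution


def pick_output(candidates: List[Candidate], threshold: float = 0.5) -> Optional[Candidate]:
    # Goal 1 (satisficing): keep only solutions that clear the threshold.
    good_enough = [c for c in candidates if c.p_correct > threshold]
    if not good_enough:
        return None
    # Goal 2: among those, pick the one the judge finds most human-like.
    return max(good_enough, key=lambda c: c.p_human)


candidates = [
    Candidate("plain human-style answer", p_correct=0.7, p_human=0.9),
    Candidate("near-perfect answer with odd embedded payload", p_correct=0.99, p_human=0.01),
    Candidate("very human-looking but wrong answer", p_correct=0.2, p_human=0.95),
]
chosen = pick_output(candidates)
print(chosen.text)
```

Note that extra correctness beyond the threshold buys nothing here, so the over-optimized candidate loses to the ordinary-looking one.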

Note that the first AI only cares about the outcome of this test. The test doesn't need to actually be performed; it can just be a threat that it is conditioned on. (The ability to condition AIs on acting as if they were in counterfactual worlds is a critical piece of this, and deserves its own writeup somewhere. It's an important tool in the FAI toolkit. It's a relatively simple method that can work even on current AIs.)

Another point is that the second AI should have more computing power than the first AI. That way the first AI can't get a slight advantage over it and be able to hack it or manipulate it somehow. Instead it has to focus its efforts on being indistinguishable from a human, even by an agent smarter than itself.

This method also relies on AI boxing. The first AI needs to be contained in a box so that its only output goes to a text terminal or the second AI. However, it doesn't require any of the more complicated machinery required for other reduced impact methods.

The AIs also aren't agents, in the sense they don't need to maximize any long term utility function. The first AI only cares about answering the question it has been asked, and the second AI only does prediction.


An example use of this is to create an oracle AI that answers your questions. The human judges somehow whether the answer is correct, or at least good. The AI tries to produce answers which the human judge will probably like, but which also are indistinguishable from answers humans would produce.

Such an oracle would be tremendously useful. The human could ask the AI to produce new AI and FAI papers, which would help immensely speed up our progress towards true FAI.

Similar techniques could be applied to very different domains. For example controlling a robot. The AI will try to produce outputs that look like a human is controlling it.

This would require some modifications to the above method, so that the AI doesn't try to use the robot to hack its own brain or anything. This could be done by conditioning it on a counterfactual world where its outputs are going to the second AI rather than a real robot.

Another point is the AI doesn't need to mimic average humans given average resources. It could be conditioned on the human having had tons of time to come up with an answer. E.g. producing an answer that a human would have come up with given a year. Or controlling the robot the same way as a human given tons of time to practice, or in a very slowed down simulation.


I would like to note a parallel with a method in current AI research, Generative Adversarial Networks. GANs work by training two AIs: one tries to produce output that fools the second AI, and the other tries to predict which samples were produced by the first AI and which are part of the actual distribution.

It's quite similar to this. GANs have been used successfully to create images that look like real images, which is a hard problem in AI research. In the future GANs might be used to produce text that is indistinguishable from human writing (the current method for doing that, predicting the next character a human would type, is kind of crude).

Reposted from my blog.

Against Expected Utility

-3 Houshalter 23 September 2015 09:21PM

Expected utility is optimal as the number of bets you take approaches infinity. You will lose bets on some days, and win bets on other days. But as you take more and more bets, the day to day randomness cancels out.

Say you want to save as many lives as possible. You can plug "number of lives saved" into an expected utility maximizer. And as the number of bets it takes increases, it will start to save more lives than any other method.

But the real world obviously doesn't have an infinite number of bets. And following this algorithm in practice will get you worse results. It is not optimal.

In fact, as Pascal's Mugging shows, this could get arbitrarily terrible. An agent following expected utility would just continuously make bets with muggers and worship various religions, until it runs out of resources. Or worse, the expected utility calculations don't even converge, and the agent doesn't make any decisions.

So how do we fix it? Well, we could just go back to the original line of reasoning that led us to expected utility, and fix it for finite cases. Instead of caring what method does the best on infinite bets, we might say we want the one that most often does the best in finite cases. That would get you median utility.

For most things, median utility will approximate expected utility. But for very very small risks, it will ignore them. It only cares that it does the best in most possible worlds. It won't ever trade away utility from the majority of your possible worlds to very very unlikely ones.
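To make the difference concrete, here is a small sketch comparing the two rules on a mugging-style bet. The numbers are invented for illustration, a lottery is represented as a list of hypothetical `(probability, utility)` pairs, and each choice is evaluated in isolation rather than as part of a policy.

```python
def expected_utility(lottery):
    # Probability-weighted average over all outcomes.
    return sum(p * u for p, u in lottery)


def median_utility(lottery):
    # Smallest utility u such that P(outcome <= u) >= 0.5.
    cumulative = 0.0
    for p, u in sorted(lottery, key=lambda pu: pu[1]):
        cumulative += p
        if cumulative >= 0.5:
            return u
    raise ValueError("probabilities must sum to at least 0.5")


# Made-up mugging: pay 5 units for a one-in-a-billion shot at a huge reward.
pay_mugger = [(1e-9, 1e12), (1 - 1e-9, -5.0)]
refuse = [(1.0, 0.0)]

# An ordinary bet, where both rules make the same choice.
ordinary_bet = [(0.6, 10.0), (0.4, 0.0)]
safe_option = [(1.0, 3.0)]

print("EU pays the mugger:    ", expected_utility(pay_mugger) > expected_utility(refuse))
print("Median pays the mugger:", median_utility(pay_mugger) > median_utility(refuse))
print("EU takes ordinary bet: ", expected_utility(ordinary_bet) > expected_utility(safe_option))
print("Median takes it too:   ", median_utility(ordinary_bet) > median_utility(safe_option))
```

The tiny-probability branch dominates the average but never reaches the 50% mark, so the median rule simply ignores it.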

A naive implementation of median utility isn't actually viable, because at different points in time, the agent might make inconsistent decisions. To fix this, it needs to decide on policies instead of individual decisions. It will pick a decision policy which it believes will lead to the highest median outcome.

This does complicate making a real implementation of this procedure. But that's what you get when you generalize results, and try to make things work on the messy real world instead of idealized infinite worlds. The same issue occurs in the multi-armed bandit problem, where the optimal infinite solution is simple, but finite solutions are incredibly complicated (or simple but require brute force).

But if you do this, you don't need the independence axiom. You can be consistent and avoid money pumping without it. By not making decisions in isolation, but considering the entire probability space of decisions you will ever make. And choosing the best policies to navigate them.

It's interesting to note this actually solves some other problems. Such an agent would pick a policy that one-boxes on Newcomb's problems, simply because that is the optimal policy. Whereas a straightforward implementation of expected utility doesn't care.


But what if you really like the other mathematical properties of expected utility? What if we can just keep it and change something else? Like the probability function or the utility function?

Well, the probability function is sacred IMO. Events should have the same probability of happening (given your prior knowledge), regardless of what utility function you have, or what you are trying to optimize. And modifying it is probably inconsistent too. An agent could exploit you by giving you bets in the areas where your beliefs are forced to be different from reality.

The utility function is not necessarily sacred though. It is inherently subjective, with the goal of just producing the behavior we want. Maybe there is some modification to it that could fix these problems.

It seems really inelegant to do this. We had a nice beautiful system where you could just count the number of lives saved, and maximize that. But assume we give up on that. How can we change the utility function to make it work?

Well you could bound utility to get out of mugging situations. After a certain level, your utility function just stops. It can't get any higher.

But then you are stuck with a bound. If you ever reach it, then you suddenly stop caring about saving any more lives. Now it's possible that your true utility function really is bounded. But it's not a fully general solution for all utility functions. And I don't believe that human utility is actually bounded, but that will have to be a different post.

You could transform the utility function so it is asymptotic. But this is just a continuous bound. It doesn't solve much. It still makes you care less and less about obtaining more utility, the closer you get to it.

Say you set your asymptote around 1,000. It can be much larger, but I need an example that is manageable. Now, what happens if you find yourself to exist in a world where all utilities are multiplied by a large number? Say 1,000. E.g. you save 1,000 lives in situations where before, you would have saved only 1.

[Figure: an example asymptoting function that is capped at 1,000. Notice how 2,000 is only slightly higher than 1,000, and everything after that is basically flat.]

Now the utility of each additional life is diminishing very quickly. Saving 2,000 lives might have only 0.001% more utility than 1,000 lives.

This means that you would not take a 1% risk of losing 1,000 people, for a 99% chance at saving 2,000.
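A worked version of this bet, as a sketch under stated assumptions: the curve `saturating` (an exponential approach to the cap) and its `SCALE` constant are arbitrary choices, not a specific function from this post, and "a 1% risk of losing 1,000 people" is read as a 1% chance of saving nobody instead of saving 1,000 for certain.

```python
import math

CAP = 1000.0   # the asymptote from the example
SCALE = 100.0  # assumed: how quickly the curve flattens out


def saturating(lives):
    # Approaches CAP as lives grows; nearly flat beyond a few multiples of SCALE.
    return CAP * (1.0 - math.exp(-lives / SCALE))


def linear(lives):
    # "Just count the lives saved."
    return float(lives)


def expected(utility, lottery):
    return sum(p * utility(lives) for p, lives in lottery)


sure_thing = [(1.0, 1000)]          # save 1,000 for certain
gamble = [(0.99, 2000), (0.01, 0)]  # 99%: save 2,000; 1%: save nobody

relative_gain = (saturating(2000) - saturating(1000)) / saturating(1000)
print(f"extra utility of 2,000 over 1,000 lives: {relative_gain:.5%}")
print("linear agent takes the gamble:    ", expected(linear, gamble) > expected(linear, sure_thing))
print("saturating agent takes the gamble:", expected(saturating, gamble) > expected(saturating, sure_thing))
```

With these assumptions the linear agent takes the trade and the saturating agent refuses it, even though the reward is 99% likely.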

This is the exact opposite situation of Pascal's mugging! The probability of the reward is very high. Why are we refusing such an obviously good trade?

What we wanted to do was make it ignore really low probability bets. What we actually did was just make it stop caring about big rewards, regardless of the probability.

No modification to it can fix that. Because the utility function is totally indifferent to probability. That's what the decision procedure is for. That's where the real problem is.


In researching this topic I've seen all kinds of crazy resolutions to Pascal's Mugging. Some try to attack the exact thought experiment of an actual mugger. And miss the general problem of low probability events with large rewards. Others try to come up with clever arguments why you shouldn't pay the mugger. But not any general solution to the problem. And not one that works under the stated premises, where you care about saving human lives equally, and where you assign the mugger less than 1/3↑↑↑3 probability.

In fact Pascal's Mugger was originally written just to be a formalization of Pascal's original wager. Pascal's wager was dismissed for reasons like involving infinite utilities, and the possibility of an "anti-god" that exactly cancels the benefits out. Or that God wouldn't reward fake worshippers. People mostly missed the whole point about whether or not you should take low probability, high reward bets.

Pascal's Mugger showed that, no, it works fine in finite cases, and the probabilities do not have to exactly cancel each other out.

Some people tried to fix the problem by adding hacks on top of the probability or utility functions. I argued against these solutions above. The problem is fundamentally with the decision procedure of expected utility.

I've spoken to someone who decided to just bite the bullet. He accepted that our intuition about big numbers is probably wrong, and we should just do what the math tells us.

But even that doesn't work. One of the points made in the original Pascal's Mugging post is that EU doesn't even converge. There is a hypothesis which has even less probability than the mugger, but promises 3↑↑↑↑3 utility. And a hypothesis even smaller than that which promises 3↑↑↑↑↑3 utility, and so on. Expected utility is utterly dominated by increasingly improbable hypotheses. The expected utility of all actions approaches positive or negative infinity.

Expected utility is at the heart of the problem. We don't really want the average of our utility function over all possible worlds, no matter how big the numbers are or how improbable they may be. We don't really want to trade away utility from the majority of our probability mass to infinitesimal slices of it.

The whole justification for EU being optimal in the infinite case doesn't apply to the finite real world. The axioms that imply you need it to be consistent aren't true if you don't assume independence. So it's not sacred, and we can look at alternatives.

Median utility is just a first attempt at an alternative. We probably don't really want to maximize median utility either. Stuart Armstrong suggests using the mean of quantiles. There are probably better methods too. In fact there is an entire field of summary statistics and robust statistics, that I've barely looked at yet.

We can generalize and think of agents as having two utility functions. The regular utility function, which just gives a numerical value representing how preferable an outcome is. And a probability preference function, which gives a numerical value to each probability distribution of utilities.

Imagine we want to create an AI which acts the same as the agent would, given the same knowledge. Then we would need to know both of these functions. Not just the utility function. And they are both subjective, with no universally correct answer. Any function, so long as it converges (unlike expected utility), should produce perfectly consistent behavior.

Summoning the Least Powerful Genie

-1 Houshalter 16 September 2015 05:10AM

Stuart Armstrong recently posted a few ideas about restraining a superintelligent AI so that we can get useful work out of it. They are based on another idea of his, reduced impact. This is a quite elaborate and complicated way of limiting the amount of optimization power an AI can exert on the world. Basically, it tries to keep the AI from doing things that would make the world look too different from how it already is.

First, why go to such great lengths to limit the optimization power of a superintelligent AI? Why not just not make it superintelligent to begin with? We only really want human level AI, or slightly above human level. Not a god-level being we can't even comprehend.

We can control the computer it is running on, after all. We can just give it slower processors, less memory, and perhaps even purposely throttle its code, e.g. by restricting the size of its neural network or other parameters that affect its intelligence.

The counterargument to this is that it might be quite tricky to limit AI intelligence. We don't know how much computing power is enough. We don't know where "above human level" ends and "dangerous superintelligence" begins.

The simplest way would be to just run copies of the AI repeatedly, increasing its computing power each time, until it solves the problem.
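That brute-force loop might look like the sketch below. The names `least_budget_solution` and `toy_solver` are hypothetical, the doubling schedule is one arbitrary choice, and a bounded search for a divisor stands in for "the AI" so that the example actually runs.

```python
def toy_solver(n, budget):
    # Stand-in for the AI: tries at most `budget` candidate divisors of n.
    for d in range(2, 2 + budget):
        if d >= n:
            break
        if n % d == 0:
            return d
    return None  # ran out of budget (or no divisor exists)


def least_budget_solution(solver, task, start=1, max_budget=1 << 20):
    # Rerun the solver from scratch with a doubled budget until it succeeds.
    budget = start
    while budget <= max_budget:
        result = solver(task, budget)
        if result is not None:
            return budget, result
        budget *= 2
    return None


print(least_budget_solution(toy_solver, 91))  # budgets tried: 1, 2, 4, 8
```

Doubling only finds the smallest sufficient budget to within a factor of two; a finer schedule would trade more reruns for a tighter bound.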

I have come up with a more elegant solution. Put a penalty on the amount of computing power the AI uses. This is put in its utility function. The more computing power - and therefore intelligence and optimization - the AI uses, the more it is penalized. So it has an incentive to be as stupid as possible, only using the intelligence necessary to solve the problem.

But we do want the AI to use as much computational resources as it needs to solve the problem. Just no more. So the penalty should be conditional on actually solving the problem it is given.

If the solution is probabilistic, then the penalty is only applied after reaching a plan that has a certain probability of success. This might need to be measured by another AI which is not resource constrained, but only does prediction.

To give a concrete example, let's say I give the AI a task. Say, I ask it to come up with a plan to collect 10,080 paperclips.

Another AI is given unlimited resources to do purely prediction. Given a plan, it predicts the probability that it will succeed or fail. This AI is safe in the sense that it is not an agent. It has no goals, and just makes predictions about things.

The main AI does the optimization. It tries to generate a plan that has the highest probability of succeeding.

Normally, this might involve extremely overoptimized plans for building nanotechnology and taking over the world. It uses all of its available computing power. It tries to become as intelligent as possible by rewriting its code. Perhaps it becomes thousands of times more intelligent than humans. Or millions of times more. It finds an answer that has 99.99% probability of succeeding.

However, now we give it a different utility function. We instead have it minimize the time it takes to get to a plan that has a 90% chance of succeeding.
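A toy comparison of the two utility functions, as a sketch only: the `Plan` fields, the plan list, and all of the numbers are invented, and `steps` is a hypothetical measure of how much computation it takes to arrive at each plan.

```python
from typing import List, NamedTuple, Optional


class Plan(NamedTuple):
    name: str
    steps: int        # assumed: compute spent before this plan is found
    p_success: float  # assumed: estimate from the prediction-only AI


def maximizer(plans: List[Plan]) -> Plan:
    # Original goal: highest probability of success, whatever it costs.
    return max(plans, key=lambda p: p.p_success)


def time_minimizer(plans: List[Plan], threshold: float = 0.90) -> Optional[Plan]:
    # New goal: the cheapest plan that clears the threshold, then stop.
    good_enough = [p for p in plans if p.p_success >= threshold]
    return min(good_enough, key=lambda p: p.steps) if good_enough else None


plans = [
    Plan("ask around the office", steps=10, p_success=0.40),
    Plan("order from a supplier", steps=1_000, p_success=0.93),
    Plan("build a paperclip factory", steps=1_000_000, p_success=0.97),
    Plan("nanotech and world takeover", steps=10**12, p_success=0.9999),
]

print(maximizer(plans).name)
print(time_minimizer(plans).name)
```

Probability above the threshold is worth nothing to the second agent, so the expensive plans are never preferred.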

Under a time constraint, the AI races to get to a plan. It tries to be as efficient as possible. It doesn't invest in any meta level improvements unless they really help it. It doesn't try to engineer complicated nanotechnology. That would take precious time.

Effectively, we have summoned a genie that is only just as powerful as it needs to be to fulfill our wish. And not any more powerful. It actually tries to be as stupid as possible.

There are other possible constraints we could use, or use in addition to this. Minimizing time limits intelligence because it gets fewer CPU cycles. We could also have it minimize memory or hard drive space, or any other computing resource.

We could also put a penalty on the complexity of the plan it produces, perhaps measuring that by its length. The simplest solution might prevent certain kinds of over-optimization, e.g. inserting plans for nanotechnology into it.

It's worth noting that you can't even create a paperclip maximizer in this system. You can't say "collect as many paperclips as possible". It has to be bounded. There needs to be a pass or fail test. E.g. "come up with a plan to collect 10,080 paperclips."

It's been noted in the past that bounding the goal isn't enough. The AI might then start maximizing the probability that it will achieve its goal. E.g. building elaborate sensors to make sure it hasn't miscounted. Making as many redundant paperclips as possible, just in case something happens to them. You are still summoning an incredibly powerful genie, which might overoptimize.

This gets around that by only having it care about having a >90% chance of getting 10,080 paperclips. After that it stops optimizing.

Now this is not a solution to FAI, or even necessarily a usable reduced impact strategy. It will still try to find any loopholes it can in your task. What it does is limit the power of the genie trying to exploit them. It always takes the stupidest, simplest strategy that has a decent chance of working.

One failure case that bothers me is the AI just shoving a slight modification of its own source code into the output, one which doesn't have a resource constraint. You may only be able to use it in cases where that isn't possible.

There are many cases where that isn't possible. The AI needs to also create a plan to obtain a computer to run it. And it needs to solve its own version of the value loading problem, to get the copy to care about paperclips. So that adds complexity to that plan in some cases.

If we apply the complexity penalty I described above, then the source code for the AI must also be very short, or it can't do this.

And this requires obtaining its own source code, an ability we can potentially restrict (you, as a human, can't access your own source code, can you?). Without that it would need to reinvent AI, which takes resources.

But it is an interesting idea I had and I thought I'd share it.

Reposted from my blog.

Approximating Solomonoff Induction

6 Houshalter 29 May 2015 12:23PM

Solomonoff Induction is a sort of mathematically ideal specification of machine learning. It works by trying every possible computer program and testing how likely they are to have produced the data. Then it weights them by their probability.
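A toy version of the weighting step, as a sketch only. Real Solomonoff Induction enumerates every program; here a hand-picked list of four hypothetical "programs" stands in for that, their bit lengths are made up, and each surviving program is weighted by the usual 2^-length prior.

```python
from fractions import Fraction

# (name, assumed description length in bits, function giving the n-th bit)
HYPOTHESES = [
    ("all zeros", 3, lambda n: 0),
    ("all ones", 3, lambda n: 1),
    ("alternating 0101...", 5, lambda n: n % 2),
    ("0101 then all ones", 12, lambda n: n % 2 if n < 4 else 1),
]


def posterior(data):
    # Deterministic programs either reproduce the data exactly or are ruled out.
    weights = {}
    for name, length, program in HYPOTHESES:
        if all(program(i) == bit for i, bit in enumerate(data)):
            weights[name] = Fraction(1, 2 ** length)  # shorter programs get more weight
    total = sum(weights.values())
    return {name: w / total for name, w in weights.items()}


def prob_next_is_one(data):
    # Mix the surviving programs' predictions for the next bit.
    post = posterior(data)
    n = len(data)
    return sum(post[name] for name, _, program in HYPOTHESES
               if name in post and program(n) == 1)


observed = [0, 1, 0, 1]
print(posterior(observed))
print(prob_next_is_one(observed))
```

Both surviving programs fit the data perfectly, but the shorter one gets almost all of the weight, so the prediction for the next bit is close to (not exactly) that program's prediction.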

Obviously Solomonoff Induction is impossible to do in the real world. But it forms the basis of AIXI and other theoretical work in AI. It's a counterargument to the no free lunch theorem: we don't care about the space of all possible datasets, only the ones which are generated by some algorithm. It's even been proposed as a basis for a universal intelligence test.

Many people believe that trying to approximate Solomonoff Induction is the way forward in AI. And any machine learning algorithm that actually works, to some extent, must be an approximation of Solomonoff Induction.

But how do we go about trying to approximate true Solomonoff Induction? It's basically an impossible task. Even if you make restrictions to remove all the obvious problems like infinite loops/non-halting behavior. The space of possibilities is just too huge to reasonably search through. And it's discrete - you can't just flip a few bits in a program and find another similar program.

We can simplify the problem a great deal by searching through logic circuits. Some people disagree about whether logic circuits should be classified as Turing complete, but it's not really important. We still get the best property of Solomonoff Induction: it allows most interesting problems to be modelled much more naturally. In the worst case you have some overhead to specify the memory cells you need to emulate a Turing machine.

Logic circuits have some nicer properties compared to arbitrary computer programs, but they still are discrete and hard to do inference on. To fix this we can easily make continuous versions of logic circuits. Go back to analog. It's capable of doing all the same functions, but also working with real valued states instead of binary.

Instead of flipping between discrete states, we can slightly increase connections between circuits, and it will only slightly change the behavior. This is very nice, because we have algorithms like MCMC that can efficiently approximate true Bayesian inference on continuous parameters.

And we are no longer restricted to boolean gates; we can use any function that takes real numbers. Like a function that takes a sum of all of its inputs, or one that squishes a real number between 0 and 1.
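Here is one way the continuous gates could look, as a sketch. The product-based relaxations of AND, OR, and NOT are one common choice among several (an assumption, not something fixed by the argument), and `unit` combines the weighted sum with the squashing function.

```python
import math


def soft_not(a):
    return 1.0 - a


def soft_and(a, b):
    return a * b


def soft_or(a, b):
    return a + b - a * b


def sigmoid(x):
    # Squishes any real number into the range (0, 1).
    return 1.0 / (1.0 + math.exp(-x))


def unit(inputs, weights, bias):
    # Weighted sum of all inputs, then squashed.
    return sigmoid(sum(w * x for w, x in zip(weights, inputs)) + bias)


# On exact 0/1 inputs the soft gates agree with the boolean ones...
print(soft_and(1.0, 0.0), soft_or(1.0, 0.0), soft_not(1.0))
# ...but a small change in an input now gives a small change in the output.
print(soft_and(0.9, 0.9), soft_and(0.9, 0.91))
# With these (hand-picked) weights the unit behaves roughly like AND.
print(unit([1.0, 1.0], [10.0, 10.0], -15.0), unit([1.0, 0.0], [10.0, 10.0], -15.0))
```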

We can also look at how much slightly changing the input of a circuit changes the output. Then we can go to all the circuits that connect to it in the previous time step. And we can see how much changing each of their inputs changes their output, and therefore the output of the first logic gate.

And we can go to those gates' inputs, and so on, chaining it all the way through the whole circuit. Finding out how much a slight change to each connection will change the final output. This is called the gradient, and we can then do gradient descent. Basically change each parameter slightly in the direction that increases the output the way we want.
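A minimal sketch of that procedure on the smallest possible circuit: a single squashing unit fit to the OR truth table. The squared-error loss, the learning rate, and the step count are arbitrary assumptions; a deeper circuit would repeat the same chain-rule step once per layer.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


# Truth table for OR: the function we want the unit to fit.
DATA = [((0.0, 0.0), 0.0), ((0.0, 1.0), 1.0), ((1.0, 0.0), 1.0), ((1.0, 1.0), 1.0)]


def predict(w, b, x):
    return sigmoid(w[0] * x[0] + w[1] * x[1] + b)


def loss(w, b):
    return sum((predict(w, b, x) - t) ** 2 for x, t in DATA)


def train(steps=5000, lr=0.5):
    w, b = [0.0, 0.0], 0.0
    for _ in range(steps):
        grad_w, grad_b = [0.0, 0.0], 0.0
        for x, t in DATA:
            y = predict(w, b, x)
            d_loss_d_y = 2.0 * (y - t)         # error vs. the unit's output
            d_y_d_z = y * (1.0 - y)            # output vs. the summed input
            d_loss_d_z = d_loss_d_y * d_y_d_z  # chain the two together
            grad_w[0] += d_loss_d_z * x[0]
            grad_w[1] += d_loss_d_z * x[1]
            grad_b += d_loss_d_z
        # Nudge every parameter slightly against its gradient.
        w = [w[0] - lr * grad_w[0], w[1] - lr * grad_w[1]]
        b -= lr * grad_b
    return w, b


initial_loss = loss([0.0, 0.0], 0.0)
w, b = train()
final_loss = loss(w, b)
print(initial_loss, final_loss)
print([int(predict(w, b, x) > 0.5) for x, _ in DATA])
```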

This is a very efficient optimization algorithm. With it we can rapidly find circuits that fit functions we want. Like predicting the price of a stock given the past history, or recognizing a number in an image, or something like that.

But this isn't quite Solomonoff Induction. Since we are finding the best single model, instead of testing the space of all possible models. This is important because essentially each model is like a hypothesis. There can be multiple hypotheses which also fit the data yet predict different things.

There are many tricks we can do to approximate this. For example, you can randomly turn off each gate with 50% probability and then optimize the whole circuit to deal with this (the technique known as dropout). For some reason this somewhat approximates the results of true Bayesian inference. You can also fit a distribution over each parameter, instead of a single value, and approximate Bayesian inference that way.
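The random switching-off step can be sketched as below. This shows only the masking, not the training loop around it; the function names are hypothetical, and rescaling the surviving gates by 1/keep is a common convention assumed here so that the expected total signal stays the same.

```python
import random


def dropout(activations, p_drop=0.5, rng=random):
    # Turn each gate off with probability p_drop; rescale the survivors.
    keep = 1.0 - p_drop
    return [a / keep if rng.random() < keep else 0.0 for a in activations]


def average_over_masks(activations, n_samples=1000, seed=0):
    # Averaging many randomly thinned circuits, loosely like averaging many models.
    rng = random.Random(seed)
    totals = [0.0] * len(activations)
    for _ in range(n_samples):
        for i, v in enumerate(dropout(activations, rng=rng)):
            totals[i] += v
    return [t / n_samples for t in totals]


gates = [1.0, 2.0, 3.0]
print(dropout(gates, rng=random.Random(1)))
print(average_over_masks(gates))
```

Each call to `dropout` effectively samples a different sub-circuit, which is where the loose analogy to testing many hypotheses comes from.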

Although I never said it, everything I've mentioned about continuous circuits is equivalent to Artificial Neural Networks. I've shown how they can be derived from first principles. My goal was to show that ANNs do approximate true Solomonoff Induction. I've found the Bayes-Structure.

It's worth mentioning that Solomonoff Induction has some problems. It's still an ideal way to do inference on data; it just has problems with self-reference. An AI based on SI might do bad things like believe in an afterlife, or replace its reward signal with an artificial one (e.g. drugs). It might not fully comprehend that it's just a computer, and exists inside the world that it is observing.

Interestingly, humans also have these problems to some degree.

Reposted from my blog here.