Less Wrong is a community blog devoted to refining the art of human rationality. Please visit our About page for more information.

Comment author: tukabel 15 April 2017 12:46:40PM 0 points [-]

What if someone proves that advanced AGI (or even some dumb but sophisticated AI) cannot be "contained" nor reliably guaranteed to be "friendly"/"aligned"/etc. (whatever it may mean)? It could be something vaguely Gödelian, along the lines of "any sufficiently advanced system ...".

Comment author: Stuart_Armstrong 16 April 2017 11:56:21AM 0 points [-]

Those Gödelian-style arguments (like the no-free-lunch theorems) work much better in theory than in practice. We only need a reasonably high probability of the AI being contained or friendly or aligned...

Comment author: Stuart_Armstrong 14 April 2017 09:04:39AM 11 points [-]

"X has personally distanced himself from Y" is the sneaky way to say "X has nothing to do with Y, but we'll imply that he might be connected somehow".

ALBA: can you be "aligned" at increased "capacity"?

3 Stuart_Armstrong 13 April 2017 07:23PM

Crossposted at the Intelligent Agents Forum.

I think that Paul Christiano's ALBA proposal is good in practice, but has conceptual problems in principle.

Specifically, I don't think it makes sense to talk about bootstrapping an "aligned" agent to one that is still "aligned" but that has an increased capacity.

The main reason is that I don't see "aligned" as a definition that makes sense distinct from capacity.


These are not the lands of your forefathers

Here's a simple example: let r be a reward function that is perfectly aligned with human happiness within ordinary circumstances (and within a few un-ordinary circumstances that humans can think up).

Then the initial agent - B0, a human - trains a reward r1 for an agent A1. This agent is limited in some way - maybe it doesn't have much speed or time - but the aim is for r1 to ensure that A1 is aligned with B0.

Then the capacity of A1 is increased to B1, a slow powerful agent. It computes the reward r2 to ensure the alignment of A2, and so on.

The nature of the Bj agents is not defined - they might be algorithms calling Ai for i ≤ j as subroutines, humans may be involved, and so on.

If the humans are unimaginative and don't deliberately seek out more extreme and exotic test cases, the best case scenario is for ri → r as i → ∞.

And eventually there will be an agent An that is powerful enough to overwhelm the whole system and take over. It will do this in full agreement with Bn-1, because they share the same objective. And then An will push the world into extra-ordinary circumstances and proceed to maximise r, with likely disastrous results for us humans.


The nature of the problem

So what went wrong? At what point did the agents go out of alignment?

In one sense, at An. In another sense, at A1 (and, in another interesting sense, at B0, the human). The reward r was aligned, as long as the agent stayed near the bounds of the ordinary. As soon as it was no longer restricted to that, it went out of alignment, not because of a goal drift, but because of a capacity increase.
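A toy sketch of that last point, assuming made-up numbers throughout: `proxy_reward` stands in for r, `true_value` stands in for what humans actually want, and "capacity" is simply how far along a one-dimensional axis the agent can push the world. None of these names or functions come from the ALBA proposal; they are only meant to show a fixed reward going out of alignment purely because capacity grows.

```python
# Toy illustration only: every function here is a hypothetical stand-in.

def proxy_reward(x: float) -> float:
    """The learned reward r: more x is always better."""
    return x

def true_value(x: float) -> float:
    """What humans actually want: agrees with r on the ordinary range [0, 1],
    then falls off once the world is pushed beyond it."""
    return x if x <= 1.0 else 2.0 - x

def best_action(capacity: float, steps: int = 1000) -> float:
    """An agent of a given capacity can reach any x in [0, capacity];
    it picks the reachable x that maximises the proxy reward."""
    candidates = [capacity * i / steps for i in range(steps + 1)]
    return max(candidates, key=proxy_reward)

for capacity in (0.5, 1.0, 10.0):
    x = best_action(capacity)
    print(f"capacity={capacity}: r={proxy_reward(x)}, true value={true_value(x)}")
```

The reward function is identical in all three runs; only the reachable range changes. Within [0, 1] the agent looks perfectly aligned, and at capacity 10 the same objective gives a badly negative true value.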

Comment author: Yosarian2 12 April 2017 02:24:11PM *  0 points [-]

If it's capable of self-modifying, then it could do weirder things.

For example, let's say the AI knows that news source X will almost always push stories in favor of action Y. (Fox News will almost always push information that supports the argument we should bomb the Middle East, The Guardian will almost always push information that supports us becoming more socialist, whatever.) If the AI wants to bias itself in favor of thinking that action Y will create more utility, what if it self-modifies to first convince itself that news source X is a much more reliable source of information than it actually is and to weigh that information more heavily in its future analysis?

If it can't self-modify directly, it could maybe do tricky things involving only observing the desired information source at key moments with the goal of increasing its own confidence in that information source, and then, once it has modified its own confidence sufficiently, it looks at that information source to find the information it is looking for.

(Again, this sounds crazy, but keep in mind humans do this stuff to themselves all the time.)

Etc. Basically what this all boils down to is the AI doesn't really care about what happens in the real world, it's not trying to actually accomplish a goal; instead its primary objective is to make itself think that it has an 80% chance of accomplishing the goal (or whatever), and once it does that it doesn't really matter if the goal happens or not. It has a built-in motivation to try to trick itself.

Comment author: Stuart_Armstrong 13 April 2017 10:37:06AM 0 points [-]

If it's capable of self-modifying, then it could do weirder things.

Yes, but much weirder than you're imagining :-) This agent design is highly unstable, and will usually self-modify into something else entirely, very fast (see the top example where it self-modifies into a non-transitive agent).

If the AI wants to bias itself in favor of [...], what if it self-modifies to first convince itself that news source [...]

What is the expected utility from that bias action (given the expected behaviour after)? The AI has to make that bias decision while not being biased. So this doesn't get round the conservation of expected evidence.

Comment author: entirelyuseless 12 April 2017 01:36:05PM 2 points [-]

You can't look for evidence of X rather than evidence against, in the sense of conservation of expected evidence. But this just means that the size of each possible update, multiplied by its probability, will be equal in both directions. It does not mean that the probability of finding evidence in favor is 50% and the probability of finding evidence against is 50%. So in that sense, you can indeed look for evidence in favor, by looking in places that have a very high probability of evidence in favor and low probability of evidence against; it is just that if you unluckily happen to find evidence against, it will be extremely strong evidence against.
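A small worked check of this, as a sketch with invented numbers: the prior of 1/2 and the two report probabilities are assumptions, chosen to model a source that almost always says "supports X" whether or not X is true.

```python
from fractions import Fraction

def posteriors(prior, p_obs_if_true, p_obs_if_false):
    """Bayes update for a binary observation.
    Returns (P(obs), posterior if obs is seen, posterior if it is not)."""
    p_obs = prior * p_obs_if_true + (1 - prior) * p_obs_if_false
    post_if_obs = prior * p_obs_if_true / p_obs
    post_if_not = prior * (1 - p_obs_if_true) / (1 - p_obs)
    return p_obs, post_if_obs, post_if_not

prior = Fraction(1, 2)
# Hypothetical source: reports "supports X" 99% of the time if X is true
# and 90% of the time if X is false.
p_obs, up, down = posteriors(prior, Fraction(99, 100), Fraction(90, 100))

print("P(evidence in favor) =", float(p_obs))
print("posterior after evidence in favor =", float(up))
print("posterior after evidence against  =", float(down))
print("expected posterior =", float(p_obs * up + (1 - p_obs) * down))
```

Under these assumed numbers, evidence in favor turns up most of the time but moves the belief only slightly, while the rare evidence against moves it a long way. The two probability-weighted moves cancel exactly, so the expected posterior equals the prior.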

Comment author: Stuart_Armstrong 13 April 2017 10:32:41AM 0 points [-]

Yes. I call that informally playing the variance. You want to look somewhere where there is the highest probability of a swing into the range that you want.

Comment author: Yosarian2 11 April 2017 03:06:09PM 0 points [-]

Ah, interesting, I understand better now what you're saying. That makes more sense, thank you.

Here's another possible failure mode then: if the AI's goal is just to manipulate its own expected utility, and it calculates expected utility using some Bayesian method of modifying priors with new information, could it selectively seek out new information to convince itself that what it was already going to do is going to have an expected utility in the range of .8 and game the system that way? I know that sounds strange but humans do stuff like that all the time.

Comment author: Stuart_Armstrong 12 April 2017 07:22:44AM 2 points [-]

It can't bias its information search (looking for evidence for X rather than evidence against it), but it can play on the variance.

Suppose you want to have a belief in X in the 0.4 to 0.6 range, and there's a video tape that would clear the matter up completely. Then not watching the video is a good move! If you currently have a belief of 0.3, then you can't bias your video watching, but you could get an idiot to watch the video and recount it vaguely to you; then you might end up with a higher chance (say 20%) of being in the 0.4 to 0.6 range.
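Here is a sketch of that video example with concrete numbers. The reporter's reliabilities (says "X happened" with probability 0.6 if X is true, 0.3 if X is false) are invented for illustration, so the resulting chance differs from the "say 20%" above; the point is only that it is greater than zero.

```python
from fractions import Fraction

def prob_in_range(outcomes, lo, hi):
    """outcomes: list of (probability, posterior) pairs.
    Returns the total probability of ending with a posterior in [lo, hi]."""
    return sum(p for p, post in outcomes if lo <= post <= hi)

prior = Fraction(3, 10)
lo, hi = Fraction(4, 10), Fraction(6, 10)

# Watching the tape yourself settles the matter: the posterior becomes 1 or 0.
watch = [(prior, Fraction(1)), (1 - prior, Fraction(0))]

# Vague retelling (hypothetical reliabilities, see lead-in).
say_if_true, say_if_false = Fraction(6, 10), Fraction(3, 10)
p_say = prior * say_if_true + (1 - prior) * say_if_false
vague = [
    (p_say, prior * say_if_true / p_say),                  # reporter says "X happened"
    (1 - p_say, prior * (1 - say_if_true) / (1 - p_say)),  # reporter says it didn't
]

print("watch the tape:  ", float(prob_in_range(watch, lo, hi)))
print("vague retelling: ", float(prob_in_range(vague, lo, hi)))
print("expected posterior after retelling:", float(sum(p * post for p, post in vague)))
```

Neither option changes the expected posterior, which stays at 0.3. What changes is how the probability mass is spread around that mean, and the noisy channel puts some of it inside the target range.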

Comment author: Yosarian2 10 April 2017 01:45:05PM 0 points [-]

Its failure mode, though, is that it doesn't preclude, for instance, a probabilistic mix of extreme optimised policy with a random inefficient one.

I think there is a more serious failure mode here.

If an AI wants to keep a utility function within a certain range, what's to stop it from dramatically increasing its own intelligence, access to resources, etc. towards infinity just to increase the probability of staying within that range in the future from 99.9% up to 99.999%? You still might run into the same "instrumental goals" problem.

Comment author: Stuart_Armstrong 11 April 2017 08:53:41AM 3 points [-]

I'd call that a mix of extreme optimised policy with inefficiency (not in the exact technical sense, but informally).

There's nothing to stop the agent from doing that, but it's also not required. This is expected utility we're talking about, so "expected utility in the range 0.8-1" is achieved - with certainty - by a policy that has a 90% probability of achieving 1 utility (and 10% of achieving 0). You may say there's also a tiny chance of the AI's estimates being wrong, its sensors, its probability calculation... but all that would just be absorbed into, say, an 89% chance of success.
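The arithmetic can be spelled out in a few lines. This is a minimal sketch: `expected_utility` and `in_target_range` are illustrative helper names, and policies are represented simply as lists of (probability, utility) pairs.

```python
from fractions import Fraction

def expected_utility(lottery):
    """E(U) for a policy described as (probability, utility) pairs."""
    return sum(p * u for p, u in lottery)

def in_target_range(eu, lo=Fraction(8, 10), hi=Fraction(1)):
    """The target is a condition on the single number E(U), not on outcomes."""
    return lo <= eu <= hi

# 90% chance of utility 1, 10% chance of utility 0.
policy = [(Fraction(9, 10), 1), (Fraction(1, 10), 0)]
# The same policy with residual doubts folded into the success probability.
policy_with_doubt = [(Fraction(89, 100), 1), (Fraction(11, 100), 0)]

for name, lottery in (("90/10", policy), ("89/11", policy_with_doubt)):
    eu = expected_utility(lottery)
    print(name, "E(U) =", float(eu), "in range:", in_target_range(eu))
```

Since the condition is on E(U) itself, a gamble whose outcomes are 0 or 1 still meets it once the expectation is computed; there is nothing further to make more certain.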

In a sense, this was the hope for the satisficer - that it would make a half-assed effort. But it can choose to do an optimal maximising policy instead. This type of agent can also choose a maximising-style policy, but mix it with deliberate inefficiency; i.e. it isn't really any better.

Comment author: simon 07 April 2017 04:53:25PM *  0 points [-]

I would think a satisficer would maximize E(g(U)) not g(E(U)).

I assume you are avoiding maximizing E(f(U)) because doing so would result in the AI seeking super-high certainty that U is at the maximum of f, leading to side effects?

Edit: it seems to me that once the AI got the expected value it wanted, it would be incentivized to not seek new information since that would adjust the expected value away from that value. So e.g. it might arrange things so that the expected value conditional on it committing suicide is at the intended level, then commit suicide. Or maybe that's a feature not a bug if we want self-limiting AI?

Comment author: Stuart_Armstrong 10 April 2017 06:14:05AM 1 point [-]

Maximising E(f(U)) is just expected utility maximisation on the utility function f(U).
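To make the distinction concrete, here is a sketch comparing f(E(U)) with E(f(U)) on two lotteries. The particular f (a tent function peaking at U = 1/2) and both lotteries are assumptions picked for illustration.

```python
from fractions import Fraction

def f(u):
    """Hypothetical non-monotone f: peaks at u = 1/2, zero at u = 0 and u = 1."""
    return 1 - abs(2 * u - 1)

def expectation(lottery, g=lambda u: u):
    """E[g(U)] for a lottery given as (probability, utility) pairs."""
    return sum(p * g(u) for p, u in lottery)

half = Fraction(1, 2)
sure_thing = [(Fraction(1), half)]                      # U = 1/2 with certainty
coin_flip = [(half, Fraction(0)), (half, Fraction(1))]  # U = 0 or U = 1, evenly

for name, lottery in (("sure thing", sure_thing), ("coin flip", coin_flip)):
    print(name, "f(E(U)) =", f(expectation(lottery)),
          " E(f(U)) =", expectation(lottery, f))
```

Both lotteries have E(U) = 1/2, so an agent scoring policies by f(E(U)) is indifferent between them. An E(f(U)) maximiser strictly prefers the sure thing, which is the pull towards certainty described in the question above.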

Comment author: whpearson 07 April 2017 05:09:13PM 1 point [-]

You could make f take a tuple of the expectation and the variance if you wanted to and provide ranges for both of them.

FWIW I don't find these discussions likely to be useful. I feel they are sweeping what U is (how it is built, and thus how it changes, is it maximised?) under the carpet and missing some things because of it.

To explain what I mean: if U refers in some way to a human's existence/happiness, you have to be able to classify primary sense data (cameras etc.) so as to distinguish:

- humans from non-human dolls/animatronics
- dead humans from ones that are in a coma (what does it mean to be alive?)
- a free, happy human from one that is in prison trying to put on a brave face (what is eudaimonia?)

Encoding solutions to a bunch of philosophical questions just doesn't seem like a realistic way to do things!

We do not start with superintelligence; we start with nothing and have to build an AGI. We cannot rely on a pre-human system being smart enough to self-improve without mucking up U (I'm not sure we can rely on a superhuman one either). So we have to do it. The more content we have to put in an unchanging U, the harder it will be to make it (and make it bug free).

It seems to preclude all the kind of learning that humans do in this field. We do things differently so there is obviously something that is being missed.

Comment author: Stuart_Armstrong 10 April 2017 06:12:59AM 1 point [-]

This is intended to be a counterexample to some naive assumptions about self-modifying systems.

Agents that don't become maximisers

8 Stuart_Armstrong 07 April 2017 12:56PM

Cross-posted at the Intelligent Agent forum.

According to the basic AI drives thesis, (almost) any agent capable of self-modification will self-modify into an expected utility maximiser.

The typical examples are the inconsistent utility maximisers, the satisficers, unexploitable agents, and it's easy to think that all agents fall roughly into these broad categories. There's also the observation that when looking at full policies rather than individual actions, many biased agents become expected utility maximisers (unless they want to lose pointlessly).

Nevertheless... there is an entire category of agents that generically seem to not self-modify into maximisers. These are agents that attempt to maximise f(E(U)) where U is some utility function, E(U) is its expectation, and f is a function that is neither wholly increasing nor decreasing.

