Comment author: Houshalter 03 February 2016 01:50:32PM *  -1 points [-]

I'm not saying the situation is impossible, just really, really unlikely. The AI would need to output big binary files like images, know that someone intended to decode them, and somehow get around statistical detection by AI 2 (steganography is detectable since the lowest-order bits of an image are not uniformly random).

You might have a point that it's probably not best to publish things produced by the AI on the internet. If this is a serious risk, then it could still be done safely with a small group.

Comment author: paulfchristiano 03 February 2016 06:25:28PM 1 point [-]

The general lesson from steganography is that it is computationally easier to change a distribution in an important way than to detect such a change. In order to detect a change, you need to consider all possible ways in which a distribution could be meaningfully altered, while in order to make a change you just have to choose one. From a theory perspective, this is a huge asymmetry that favors the attacker.
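As a toy illustration of that asymmetry (my own sketch, not code from the thread; the function names and the naive LSB-bias detector are hypothetical): embedding a payload in the least-significant bits of a signal is a one-line transformation, while a defender has to guess which statistic the attacker is exploiting.

```python
import random

def embed(cover, message_bits):
    """Hide message bits in the least-significant bits of a cover signal.
    The attacker only needs one such channel to exist."""
    stego = list(cover)
    for i, bit in enumerate(message_bits):
        stego[i] = (stego[i] & ~1) | bit
    return stego

def lsb_bias_detector(signal):
    """One possible statistical test: check whether the LSBs look uniform.
    The defender must anticipate *every* such channel; the attacker picks one."""
    ones = sum(b & 1 for b in signal)
    return abs(ones / len(signal) - 0.5)  # large deviation -> suspicious

random.seed(0)
cover = [random.randrange(256) for _ in range(1000)]

# A crudely biased payload trips this particular detector...
stego = embed(cover, [1] * 500)
print(lsb_bias_detector(cover))  # near 0: LSBs roughly uniform
print(lsb_bias_detector(stego))  # clearly larger: detected

# ...but a payload whose statistics match the cover's slips right past it.
stealthy = embed(cover, [random.randrange(2) for _ in range(500)])
print(lsb_bias_detector(stealthy))  # back near 0: undetected
```

The last case is the asymmetry in miniature: defeating this one detector took the attacker a single design choice, while the defender would need a new test for every alternative encoding.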

This point doesn't seem directly relevant though, unless someone offers any good reason to actually include the non-imitation goal, rather than simply imitating the successful human trials. (Though there are more subtle reasons to care about problematic behavior that is neither penalized nor rewarded by your training scheme. It would be nicer to have positive pressure to do only those things you care about. So maybe the point ends up being relevant after all.)

Actually, in the scheme as you wrote it there is literally no reason to include this second goal. The distinguisher is already trying to distinguish the generator's behavior from [human conditioned on success], so the generator already has to succeed in order to win the game. But this doesn't introduce any potentially problematic optimization pressure, so it just seems better.

Comment author: paulfchristiano 03 February 2016 06:16:54PM 1 point [-]

I think this is a good idea, though it's not new. I have written about this at some length (Jessica linked to a few examples, but much of the content here is relevant), and it's what people are usually trying to do in apprenticeship learning. I agree there is probably no realistic scenario where you would use the reduced-impact machinery instead of doing this the way you describe (i.e. the way people already do it).

Having the AI try to solve the problem (rather than simply trying to mimic the human) doesn't really buy you that much, and has big costs. If the human can't solve the problem with non-negligible probability, then you simply aren't going to get a good result using this technique. And if the human can solve the problem, then you can just train on instances where the human successfully solves it. You don't save anything computationally with the conditioning.

Bootstrapping seems like the most natural way to improve performance to superhuman levels. I expect bootstrapping to work fine, if you could get the basic protocol off the ground.

The connection to adversarial networks is not really a "parallel." They are literally the same thing (modulo your extra requirement that the system do the task, which is equivalent to Jessica's quantilization proposal but which I think should definitely be replaced with bootstrapping).

I think the most important problem is that AI systems do tasks in inhuman ways, such that imitating a human entails a significant disadvantage. Put a different way, it may be harder to train an AI to imitate a human than to simply do the task. So I think the main question is how to get over that problem. I think this is the baseline to start from, but it probably won't work in general.

Overall I feel more optimistic about approval-direction than imitation for this reason. But approval-direction has its own (extremely diluted) versions of the usual safety concerns, and imitation is pretty great since it literally avoids them altogether. So if it could be fixed that would be great.

One post covers the basic idea of collecting training data with low probability online. Another post describes why this might result in very low overhead for aligned AI systems.

Comment author: Wei_Dai 08 May 2013 11:15:25PM 14 points [-]

I have a problem with calling this a "semi-open FAI problem", because even if Eliezer's proposed solution turns out to be correct, it's still a wide open problem to develop arguments that can allow us to be confident enough in it to incorporate it into an FAI design. This would be true even if nobody could see any holes in it or had any better ideas, and doubly true given that some FAI researchers consider a different approach (which assumes that there is no such thing as "reality-fluid", that everything in the multiverse simply exists, and that as a matter of preference we do not / cannot care about all parts of it in equal measure; #4 in this post) to be at least as plausible as Eliezer's current approach.

Comment author: paulfchristiano 31 January 2016 08:09:14PM *  0 points [-]

In my view, we could make act-based agents without answering this or any similar questions. So I'm much less interested in answering them than I used to be. (There are possible approaches that do have to answer all of these questions, but at this point they seem much less promising to me.)

We've briefly discussed this issue in the abstract, but I'm curious to get your take in a concrete case. Does that seem right to you? Do you think that we need to understand issues like this one, and have confidence in that understanding, prior to building powerful AI systems?

Comment author: gwern 20 January 2016 04:59:48PM 5 points [-]

It is not totally clear why humans are this bad at math. It is almost certainly unrelated to brains computing using neurons instead of transistors.

Why do you think that? Adding numbers is highly challenging for RNNs and is a standard challenge in recent papers investigating various kinds of differentiable memory and attention mechanisms, precisely because RNNs do so badly at it (like humans).

Comment author: paulfchristiano 20 January 2016 05:38:47PM 4 points [-]

It's a bit hard for RNNs to learn, but they can end up much better at it than humans. (Also, the reason it is used as a challenge is that it is a bit tricky but not very tricky.)

It is probably also easy to "teach" humans (over evolutionary time) to be much better at math than we currently are; there's just no selection pressure for math performance. That seems like the most likely source of the difference between humans and computers.

Comment author: Vaniver 06 December 2015 06:23:28PM 3 points [-]

I agree that those directions look promising.


My impression is that act-based approaches are good for human replacements, but not good for human meta-replacements. That is, if we consider the problem of fulfilling orders in an Amazon warehouse, we have a number of different problems:

  1. Move a worker and a crate of goods adjacent to each other.

  2. Move a single good from the crate to the order box.

  3. Organize the facility to make the above two as cost-efficient as possible.

The first is already handled by robots; the second seems like a good candidate for imitation (we need robot eyes and hands that are about as good as human eyes and hands, and they can work basically the same way) or low-level control theory.

But the third is a problem of cost functions, models, simulation, calculation, and possibility enumeration. It's where creativity most comes into play, and that's something I'm pessimistic about getting out of interpolative systems rather than extrapolative ones.

There are still major gains made by the reduced scope--a warehouse management system seems easier to make safe than a generic moneymaker--but I think there's a fundamental connection between the opportunity and danger of AI (the ability to extrapolate into previously unseen realms).

Comment author: paulfchristiano 06 December 2015 09:17:52PM *  5 points [-]

Some things can be done by imitation based on our current understanding (and will get better as machine learning improves). The interesting part of the project is figuring out how to do the trickier things, which will require new ideas.

It's not clear that imitation impairs your ability to generalize to new domains. An RL agent faces the question: in this new domain, how should I behave to receive rewards? It has not been trained in the domain, but must learn to reason about the domain and figure out what policies will work well. An imitation learner faces the question: in this new domain, how would the expert behave / what behavior would they approve of? The two questions seem similar in difficulty, and indeed you could use the same algorithmic ingredients.

It's also not clear that it's relevant if a task involves thinking about cost functions, models, simulation, calculation, etc... These are techniques one could apply either to achieve a high reward, or to produce actions the expert would approve of / like / do themselves. You might say that at that point these rich internal behaviors must be guided by some non-trivial internal dynamic. But then we will just have the same discussion a level lower.

My research priorities for AI control

17 paulfchristiano 06 December 2015 01:57AM

I've been thinking about what research projects I should work on, and I've posted my current view. Naturally, I think these are also good projects for other people to work on.

The post contains brief summaries of the projects I find most promising.

The post briefly discusses where I am coming from, and links to a good deal more clarification. I'm always interested in additional thoughts and criticisms, since changing my views on these questions would directly influence what I spend my time on.

 

Comment author: Stuart_Armstrong 16 November 2015 12:00:13PM -1 points [-]

? I don't see why the world needs to be sufficiently convenient to allow (1). And the problem resurfaces with huge-but-bounded utilities, so invoking (2) is not enough.

Comment author: paulfchristiano 18 November 2015 01:39:10AM *  0 points [-]

You cited avoiding the "immense potential damage of being known to be Pascal muggable" as a motivating factor for actual humans, suggesting that you were talking about the real world. There might be some damage from being "muggable," but it's not clear why being known to be muggable is a disadvantage, given that here in the real world we don't pay the mugger regardless of our philosophical views.

I agree that you can change the thought experiment to rule out (1). But if you do, it loses all of its intuitive force. Think about it from the perspective of someone in the modified thought experiment:

You are 100% sure there is no other way to get as much utility as the mugger promises at any other time in the future of the universe. But somehow you aren't so sure about the mugger's offer. So this is literally the only possible chance in all of history to get an outcome this good, or even anywhere close. Do you pay then?

"Yes" seems like a plausible answer (even before the mugger opens her mouth). The real question is how you came to have such a bizarre state of knowledge about the world, not why you are taking the mugger seriously once you do!

Comment author: jsteinhardt 10 November 2015 08:29:52AM *  1 point [-]

Yeah, I should be a bit more careful about number 4. The point is that many papers arguing that a given NN is learning "natural" representations do so by looking at what an individual hidden unit responds to (as opposed to looking at the space spanned by the hidden layer as a whole). Any such argument seems dubious to me without further support, since it relies on a sort of delicate symmetry-breaking which can only come from either the training procedure or noise in the data, rather than the model itself. But I agree that if such an argument were accompanied by a justification of why the training procedure, data noise, or some other factor led to the symmetry being broken in a natural way, then I would potentially be happy.

Comment author: paulfchristiano 15 November 2015 01:15:19AM 0 points [-]

delicate symmetry-breaking which can only come from either the training procedure or noise in the data, rather than the model itself

I'm still not convinced. The pointwise nonlinearities introduce a preferred basis, and cause the individual hidden units to be much more meaningful than linear combinations thereof.

Comment author: Stuart_Armstrong 13 November 2015 12:37:11PM 0 points [-]

It's not really clear why you would have the searching process be more powerful than the evaluating process

Because the first supposes a powerful AI, while the second supposes an excellent evaluation process (essentially a solved value-alignment problem).

Your post motivated this in part, but it's a more general issue with optimisation processes and searches.

Comment author: paulfchristiano 15 November 2015 01:11:46AM 0 points [-]

Neither the search nor the evaluation presupposes an AI when a hypothetical process is used as the definition of "good."

Comment author: Stuart_Armstrong 13 November 2015 12:41:17PM 0 points [-]

An unbounded utility function does not literally make you "Pascal's muggable"; there are much better ways to seek infinite utility than to pay a mugger.

Have you solved that problem, then? Most people I've talked to don't seem to believe it's solved.

except as an argument against taking extreme and irreversible actions on the basis of a simple model of your values that looks appealing at the moment.

The approach I presented is designed so that you can get as close as possible to your simple model while reducing the risks of doing so.

Comment author: paulfchristiano 15 November 2015 01:09:02AM *  0 points [-]

Have you solved that problem, then? Most people I've talked to don't seem to believe it's solved.

You aren't supposed to literally pay the mugger; it's an analogy. Either:

  1. You do something more promising to capture the hypothetical massive utility (e.g. this happens if we have a plausible world model and place a finite but massive upper bound on utilities), or

  2. You are unable to make a decision because all payoffs are infinite.
