eli_sennesh comments on Debunking Fallacies in the Theory of AI Motivation - Less Wrong

Post author: Richard_Loosemore, 05 May 2015 02:46AM (8 points)

Comments (343)

Comment author: jessicat, 05 May 2015 10:24:34PM (7 points)

Thanks for your response.

The AI can quickly assess the "forcefulness" of any candidate action plan by asking itself whether the plan will involve giving choices to people vs. forcing them to do something whether they like it or not. If a plan is of the latter sort, more care is needed, so it will canvass a sample of people to see if their reactions are positive or negative.

So, I think this touches on the difficult part. As humans, we have a good idea of what "giving choices to people" vs. "forcing them to do something" looks like. This concept would need to resolve some edge cases, such as putting psychological manipulation in the "forceful" category (even though it can be done with only text). A sufficiently advanced AI's concept space might contain a similar concept. But how do we pinpoint this concept in the AI's concept space? Very likely, the concept space will be very complicated and difficult for humans to understand. It might very well contain concepts that look a lot like the "giving choices to people" vs. "forcing them to do something" distinction on multiple examples, but are different in important ways. We need to pinpoint it in order to make this concept part of the AI's decision-making procedure.

It will also be able to model people (as it must be able to do, because all intelligent systems must be able to model the world pretty accurately or they don't qualify as 'intelligent') so it will probably have a pretty shrewd idea already of whether people will react positively or negatively toward some intended action plan.

This seems pretty similar to Paul's idea of a black-box human in the counterfactual loop. I think this is probably a good idea, but the two problems here are (1) setting up this (possibly counterfactual) interaction in a way that it approves a large class of good plans and rejects almost all bad plans (see the next section), and (2) having a good way to predict the outcome of this interaction, usually without actually performing it. While we could say that (2) will be solved by virtue of the superintelligence being a superintelligence, in practice we'll probably get AGI before we get uploads, so we'll need some sort of semi-reliable way to predict humans without actually simulating them. Additionally, the AI might need to self-improve to be anywhere near smart enough to consider this complex hypothetical, and so we'll need some kind of low-impact self-improvement system. Again, I think this is probably a good idea, but there are quite a lot of issues with it, and we might need to do something different in practice. Paul has written about problems with black-box approaches based on predicting counterfactual humans here and here. I think it's a good idea to develop both black-box solutions and white-box solutions, so we are not over-reliant on the assumptions involved in one or the other.

In all of that procedure I just described, why would the explanation of the plans to the people be problematic? People will ask questions about what the plans involve. If there is technical complexity, they will ask for clarification. If the plan is drastic there will be a world-wide debate, and some people who find themselves unable to comprehend the plan will turn to more expert humans for advice.

What language will people's questions about the plans be in? If it's a natural language, then the AI must be able to translate its concept space into the human concept space, and we have to solve a FAI-complete problem to do this. If it's a more technical language, then humans themselves must be able to look at the AI's concept space and understand it. Whether this is possible very much depends on how transparent the AI's concept space is. Something like deep learning is likely to produce concepts that are very difficult for humans to understand, while probabilistic programming might produce more transparent models. How easy it is to make transparent AGI (compared to opaque AGI) is an open question.

We should also definitely be wary of a decision rule of the form "find a plan that, if explained to humans, would cause humans to say they understand it". Since people are easy to manipulate, raw optimization for this objective will produce psychologically manipulative plans that people will incorrectly approve of. There needs to be some way to separate "optimize for the plan being good" from "optimize for people thinking the plan is good when it is explained to them", or else some way of ensuring that humans' judgments about these plans are accurate.
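
As a rough illustration of that gap, here is a toy sketch in Python. Everything in it is a made-up assumption for the sake of the example: the `Plan` fields, the `predicted_approval` formula, and the numbers are hypothetical and are not taken from any real system or from the post being discussed. It only shows that when approval is partly driven by how persuasive an explanation is, picking the plan with the highest predicted approval can select a different plan than picking the one with the highest actual value.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Plan:
    name: str
    true_value: float      # hypothetical ground truth: how good the plan really is
    persuasiveness: float  # hypothetical: how much the explanation sways people


def predicted_approval(plan: Plan, manipulability: float = 2.0) -> float:
    """Toy approval model (an assumption): real value plus a manipulation bonus."""
    return plan.true_value + manipulability * plan.persuasiveness


plans = [
    Plan("honest, modest", 0.6, 0.0),
    Plan("honest, ambitious", 0.8, 0.1),
    Plan("manipulative, harmful", -0.5, 1.0),
]

# "Optimize for people thinking the plan is good when it is explained to them"
best_by_approval = max(plans, key=predicted_approval)

# "Optimize for the plan being good" (only possible here because the toy
# model hands us true_value directly, which a real system would not have)
best_by_value = max(plans, key=lambda p: p.true_value)

print("chosen by approval:", best_by_approval.name)
print("chosen by value:   ", best_by_value.name)
```

In this toy setup the two selection rules disagree, and the hard part described above is exactly what the sketch assumes away: there is no `true_value` field to read off in practice.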

Again, it's quite plausible that the AI's concept space will contain some kind of concept that distinguishes between these different types of optimization; however, humans will need to understand the AI's concept space in order to pinpoint this concept so it can be integrated into the AI's decision rule.

I should mention that I don't think that these black-box approaches to AI control are necessarily doomed to failure; rather, I'm pointing out that there are lots of unresolved gaps in our knowledge of how they can be made to work, and it's plausible that they are too difficult in practice.

Comment author: [deleted], 07 May 2015 02:43:33PM (1 point)

Something like deep learning is likely to produce concepts that are very difficult for humans to understand, while probabilistic programming might produce more transparent models. How easy it is to make transparent AGI (compared to opaque AGI) is an open question.

Maybe I'm biased as an open proponent of probabilistic programming, but I think the latter can actually produce AGI, while the former would not only yield an opaque system but basically can't produce a successful real-world AGI at all.

I don't think you can get away from the need to do hierarchical inference on complex models in Turing-complete domains (in short: something very like certain models expressible in probabilistic programming). A deep neural net is basically just drawing polygons in a hierarchy of feature spaces, and hoping your polygons have enough edges to approximate the shape you really mean but not so many edges that they take random noise in the training data to be part of the shape -- given just the right conditions, it can approximate the right thing, but it can't even describe how to do the right thing in general.
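
To make the "polygons" picture concrete, here is a minimal Python sketch. It is a simplification under stated assumptions: it uses a single hidden layer of ReLU units in one dimension rather than a deep network, and it sets the weights by hand instead of training them. The helper names (`make_relu_interpolant`, `max_error`) are hypothetical. It shows only the approximation half of the claim: a ReLU network computes a piecewise-linear function, and more pieces ("edges") fit a smooth target more closely. It does not show the overfitting half.

```python
def relu(z: float) -> float:
    return max(0.0, z)


def make_relu_interpolant(f, knots):
    """Hand-built one-hidden-layer ReLU 'network' that linearly interpolates f.

    g(x) = f(k_0) + sum_i c_i * relu(x - k_i), where c_0 is the first slope
    and each later c_i is the change in slope at knot k_i.
    """
    slopes = [(f(b) - f(a)) / (b - a) for a, b in zip(knots, knots[1:])]
    coeffs = [slopes[0]] + [s1 - s0 for s0, s1 in zip(slopes, slopes[1:])]
    hinge_points = knots[:-1]
    y0 = f(knots[0])

    def g(x: float) -> float:
        return y0 + sum(c * relu(x - k) for c, k in zip(coeffs, hinge_points))

    return g


def max_error(f, g, n: int = 1000) -> float:
    """Largest gap between f and g on an evenly spaced grid over [0, 1]."""
    return max(abs(f(i / n) - g(i / n)) for i in range(n + 1))


def target(x: float) -> float:
    return x * x  # the smooth "shape you really mean"


coarse = make_relu_interpolant(target, [i / 2 for i in range(3)])    # 2 linear pieces
fine = make_relu_interpolant(target, [i / 16 for i in range(17)])    # 16 linear pieces

print("max error, 2 pieces: ", max_error(target, coarse))
print("max error, 16 pieces:", max_error(target, fine))
```

The finer network tracks the curve more closely, but in both cases the result is still a stack of straight segments fitted to the target, which is the sense of "approximate" used above.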