
Indifference and compensatory rewards

3 Stuart_Armstrong 15 February 2017 02:49PM

Crossposted at the Intelligent Agents Forum

It's occurred to me that there is a framework where we can see all "indifference" results as corrective rewards, both for the utility function change indifference and for the policy change indifference.

Imagine that the agent has reward R0 and is following policy π0, and we want to change it to having reward R1 and following policy π1.

Then the corrective reward we need to pay it, so that it doesn't attempt to resist or cause that change, is simply the difference between the two expected values:

V(R0|π0) − V(R1|π1),

where V is the agent's own valuation of the expected reward, conditional on the policy.
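As a rough illustration of the difference of expected values described above, here is a minimal sketch on a made-up deterministic two-state environment. The transition table, the rewards, the policies, the fixed horizon and all function names are assumptions chosen for the example; none of them come from the post.

```python
from typing import Dict, Tuple

State = int
Action = str
Reward = Dict[Tuple[State, Action], float]
Policy = Dict[State, Action]

# Hypothetical deterministic toy environment: (state, action) -> next state.
TRANSITIONS: Dict[Tuple[State, Action], State] = {
    (0, "stay"): 0, (0, "go"): 1,
    (1, "stay"): 1, (1, "go"): 0,
}

def value(reward: Reward, policy: Policy, start: State = 0, horizon: int = 3) -> float:
    """V(R | pi): undiscounted return of following `policy` for `horizon` steps."""
    state, total = start, 0.0
    for _ in range(horizon):
        action = policy[state]
        total += reward[(state, action)]
        state = TRANSITIONS[(state, action)]
    return total

def compensatory_reward(r0: Reward, pi0: Policy, r1: Reward, pi1: Policy) -> float:
    """V(R0 | pi0) - V(R1 | pi1): what the agent is paid when the change happens."""
    return value(r0, pi0) - value(r1, pi1)

# Illustrative rewards and policies (assumed, not from the post).
r0 = {(0, "stay"): 0.0, (0, "go"): 1.0, (1, "stay"): 2.0, (1, "go"): 0.0}
pi0 = {0: "go", 1: "stay"}
r1 = {(0, "stay"): 1.0, (0, "go"): 0.0, (1, "stay"): 0.0, (1, "go"): 0.0}
pi1 = {0: "stay", 1: "stay"}

payment = compensatory_reward(r0, pi0, r1, pi1)    # 5.0 - 3.0
no_change = compensatory_reward(r0, pi0, r0, pi0)  # nothing changes, nothing is owed
```

With the payment added, the agent expects the same total whether or not the change happens, which is the sense in which it is indifferent.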

This explains why off-policy reward-based agents are already safely interruptible: since we change the policy, not the reward, R0=R1. And since off-policy agents have value estimates that are indifferent to the policy followed, V(R0|π0)=V(R1|π1), and the compensatory rewards are zero.
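To make the off-policy point concrete, the sketch below compares the bootstrap targets of Q-learning and SARSA, used here as standard textbook examples of off-policy and on-policy updates; the post itself does not name either algorithm. The Q-table, the state and action names, and the numbers are hypothetical.

```python
from typing import Dict

QTable = Dict[str, Dict[str, float]]

def q_learning_target(q: QTable, reward: float, next_state: str, gamma: float) -> float:
    """Off-policy target: bootstraps from the best next action, whatever is actually done."""
    return reward + gamma * max(q[next_state].values())

def sarsa_target(q: QTable, reward: float, next_state: str,
                 next_action: str, gamma: float) -> float:
    """On-policy target: bootstraps from the action the behaviour policy really takes."""
    return reward + gamma * q[next_state][next_action]

# Hypothetical value estimates for a single successor state.
q: QTable = {"s1": {"work": 2.0, "halt": 0.0}}
reward, gamma = 1.0, 0.5

# The Q-learning target takes no "next action" argument, so an interruption
# that forces "halt" cannot change it.
off_policy = q_learning_target(q, reward, "s1", gamma)

# The SARSA target depends on whether the agent was left alone or interrupted.
on_policy_free = sarsa_target(q, reward, "s1", "work", gamma)
on_policy_interrupted = sarsa_target(q, reward, "s1", "halt", gamma)
```

In this toy case the off-policy estimate is the same under the original and the interrupted policy, so the difference of values is zero, while the on-policy estimate drops when the interruption is followed.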

Comment author: Gurkenglas 07 February 2017 09:11:48PM 0 points

Imagine doing this with one AI: It reformulates each question, then it gets the new prior, then it talks to the human. Ignore that it has to do N work in the first step. That might make this easier to see: Why do you think bringing the questions into a form that allows for easy memorization by humans has anything to do with understanding? It could just do the neural-net equivalent of zip compression of a hashmap from reformulated questions to probabilities.

Comment author: Stuart_Armstrong 08 February 2017 06:50:43AM 0 points

> It could just do the neural-net equivalent of zip compression of a hashmap from reformulated questions to probabilities.

But that hashmap has to run on a human mind, and understanding helps us run things like that.

Comment author: Dagon 07 February 2017 02:30:18PM 0 points

Ok. The obvious followup is "under what conditions is it a bad thing?" Your college example is a good one - are you saying you want to prevent AIs from making similar changes (but on a perhaps larger scale) that university does to students?

In response to comment by Dagon on Hacking humans
Comment author: Stuart_Armstrong 07 February 2017 02:44:33PM 0 points

Well, there's a formal answer: if an AI can, in condition C, convince any human of belief B for any B, then condition C is not sufficient to constrain the AI's power, and the process is unlikely to be truth-tracking.

That's a sufficient condition for C being insufficient, but not a necessary one.

Comment author: Houshalter 07 February 2017 02:04:36PM 1 point

Same as with the GAN thing. You condition it on producing a correct answer (or whatever the goal is). So if you are building a question-answering AI, you have it model a probability distribution something like P(human types this character | human correctly answers question). This could be done simply by feeding it only examples of correctly answered questions as its training set. Or you could have it predict what a human might respond if they had n days to think about it.

Though even that may not be necessary. What I had in mind was just having the AI read MIRI papers and produce new ones just like them. Like a superintelligent version of what people do today with Markov chains or RNNs to produce writing in the style of an author.

Yes, these methods do limit the AI's ability a lot. It can't do anything a human couldn't do, in principle. But it can automate the work of humans and potentially do our job much faster. And if human ability isn't enough to build an FAI, well, you could always set it to do intelligence augmentation research instead.
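The idea of conditioning on correctness by training only on correctly answered examples can be sketched with a toy next-character model. This is a deliberately tiny stand-in: a bigram counter plays the role of the predictor, and the labelled transcripts and function names are invented for the illustration.

```python
from collections import Counter, defaultdict
from typing import Dict, Iterable, Tuple

def train_next_char_model(examples: Iterable[Tuple[str, bool]]) -> Dict[str, Counter]:
    """Count next-character frequencies, using only transcripts marked correct."""
    counts: Dict[str, Counter] = defaultdict(Counter)
    for text, is_correct in examples:
        if not is_correct:
            continue  # conditioning on correctness = dropping incorrect transcripts
        for prev, nxt in zip(text, text[1:]):
            counts[prev][nxt] += 1
    return counts

def next_char_distribution(counts: Dict[str, Counter], prev: str) -> Dict[str, float]:
    """Estimate P(next character | previous character) from the filtered counts."""
    seen = counts.get(prev, Counter())
    total = sum(seen.values())
    return {ch: n / total for ch, n in seen.items()} if total else {}

# Hypothetical labelled transcripts: (what the human typed, was it correct?).
examples = [("2+2=4", True), ("2+2=5", False), ("3+1=4", True)]

model = train_next_char_model(examples)
after_equals = next_char_distribution(model, "=")
```

Because the incorrect transcript never enters the counts, the model's prediction after "=" puts all of its mass on "4"; without the filter, "5" would receive a third of it.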

Comment author: Stuart_Armstrong 07 February 2017 02:43:30PM 0 points

I see that working. But we still have the problem that if the number of answers is too large, somewhere there is going to be an answer X, such that the most likely behaviour for a human that answers X is to write something dangerous. Now, that's ok if the AI has two clearly defined processes: first find the top answer, independently of how it's written up, then write up as a human. If those goals are mixed, it will go awry.

Comment author: dogiv 02 February 2017 05:47:50PM 0 points

I don't really understand when this would be useful. Is it for an oracle that you don't trust, or that otherwise has no incentive to explain things to you? Because in that case, nothing here constrains the explanation (or the questions) to be related to reality. It could just choose to explain something that is easy for the human to understand, and then the questions will all be answered correctly. If you do trust the oracle, the second AI is unnecessary--the first one could just ask questions of the human to confirm understanding, like any student-teacher dynamic. What am I missing here?

Comment author: Stuart_Armstrong 07 February 2017 10:08:28AM 0 points
Comment author: Viliam 06 February 2017 02:32:54PM * 1 point

This feels like an attempt to reverse the Dunning–Kruger effect. Not exactly, but there is a similar assumption that "people who believe an incorrect X are usually unaware that answers other than X exist (otherwise they would start doubting whether X is the correct answer)".

Which probably works well for non-controversial topics. You may be wrong about the capital of Australia, but you don't expect there to be a controversy about this topic. If you are aware that many people disagree with you on what "the capital of Australia" is, you are aware there is a lot of ignorance about this topic, and you have probably double-checked your answer. People who get it wrong probably don't even think about the alternatives.

But, like in the example whpearson gave, there are situations where people are aware that others disagree with them, but they have a handy explanation, such as "it's all a Big Pharma conspiracy", in which case they will neither reduce their certainty, nor research the topic impartially.

In other words, this may work for honest mistakes, but not for tribalism.

Comment author: Stuart_Armstrong 07 February 2017 10:07:27AM 0 points

> "it's all a Big Pharma conspiracy", in which case they will neither reduce their certainty, nor research the topic impartially.

The method presupposes rational actors, but is somewhat resilient to non-rational ones. If the majority of people know of the conspiracy theorists, then the conspiracy theory will not be a surprisingly popular option.
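A small sketch may help show why a widely known conspiracy theory does not come out as "surprisingly popular". It uses one common formulation of the idea: score each option by its actual share of answers minus the average share that respondents predicted for it. The function name, the option labels and all the numbers are made up for illustration.

```python
from collections import Counter
from typing import Dict, List, Tuple

Response = Tuple[str, Dict[str, float]]  # (own answer, predicted share per option)

def surprisingly_popular(responses: List[Response]) -> Tuple[str, Dict[str, float]]:
    """Return the option whose actual share most exceeds its predicted share."""
    n = len(responses)
    actual = Counter(answer for answer, _ in responses)
    options = set(actual)
    for _, predicted in responses:
        options.update(predicted)
    scores = {}
    for option in sorted(options):
        actual_share = actual[option] / n
        predicted_share = sum(p.get(option, 0.0) for _, p in responses) / n
        scores[option] = actual_share - predicted_share
    return max(scores, key=scores.get), scores

# Honest mistake: people answering "Sydney" do not expect much disagreement,
# while people answering "Canberra" know the error is common.
capital = ([("Sydney", {"Sydney": 0.9, "Canberra": 0.1})] * 6
           + [("Canberra", {"Sydney": 0.8, "Canberra": 0.2})] * 4)

# Known conspiracy theory: everyone expects a sizeable conspiracy vote,
# so its actual share does not exceed what was predicted.
conspiracy = ([("mainstream", {"mainstream": 0.6, "conspiracy": 0.4})] * 7
              + [("conspiracy", {"mainstream": 0.5, "conspiracy": 0.5})] * 3)

capital_winner, capital_scores = surprisingly_popular(capital)
conspiracy_winner, conspiracy_scores = surprisingly_popular(conspiracy)
```

In the first toy poll the minority answer wins because it is more common than people predicted; in the second, the conspiracy answer is less common than predicted, so it gets a negative score and the mainstream answer is selected.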

Comment author: Houshalter 06 February 2017 04:22:30PM 1 point

Selecting from a list of predetermined answers severely limits the AI's ability. Which isn't good if we want it to actually solve very complex problems for us! And that method by itself doesn't make the AI safe; it just makes it much harder for it to do anything at all.

Note someone found a way to simplify my original idea in the comments. Instead of using the somewhat complicated GAN thing, you can just have it try to predict the next letter a human would type. In theory these methods are exactly equivalent.

Comment author: Stuart_Armstrong 07 February 2017 10:06:44AM 0 points

> Instead of using the somewhat complicated GAN thing, you can just have it try to predict the next letter a human would type.

How do you trade that off against giving an actually useful answer?

Comment author: Dagon 06 February 2017 03:57:54PM 1 point

How much have you explored the REASONS that brainwashing is seen as not cool, while quiet rational-seeming chat is perfectly fine? Are you sure it's only about efficacy?

I worry that there's some underlying principle missing from the conversation, about agentiness and "free will" of humans, which you're trying to preserve without defining. It'd be much stronger to identify the underlying goals and include them as terms in the AI's utility function(s).

In response to comment by Dagon on Hacking humans
Comment author: Stuart_Armstrong 07 February 2017 10:04:38AM 0 points

> Are you sure it's only about efficacy?

No, but I'm pretty sure efficacy plays a role. Look at the (stereotypical) freakout from some conservative parents about their kids attending university; the worry is not really about the content or the methods, but about the fact that some change in values or beliefs is expected.

Comment author: dogiv 06 February 2017 05:15:59PM 1 point

Thank you, this is clearer than it was before, and it does seem like a potentially useful technique. I see a couple of limitations:

First, it still seems that the whole plan rests on having a good selection of questions, and the mechanism for choosing them is unclear. If they are chosen by some structured method that thoroughly covers the AI's representation of the prior, the questions asked of the human are unlikely to capture the most important aspects of the update from new evidence. Most of the differences between the prior and the posterior could be insignificant from a human perspective, and so even if the human "understands" the posterior in a broad sense, they are unlikely to have the answers to all of these. Even if they can figure out those answers correctly, that does not necessarily test whether they are aware of the differences that are most important.

Second, the requirement for the two AIs to have a common prior, and differ only by some known quantum of new evidence, seems like it might restrict the applications considerably. In simple cases you might handle this by "rolling back" a copy of the first AI to a time when it had not yet processed the new evidence, and making that the starting point for the second AI. But if the processing of the evidence occurred before some other update that you want included in the prior, then you would need some way of working backward to a state that never previously existed.

Comment author: Stuart_Armstrong 07 February 2017 10:01:59AM 0 points

Your first point is indeed an issue, and I'm thinking about it. The second is less of a problem, because now we have a goal description, so implementing the goal is less of an issue.

Comment author: whpearson 06 February 2017 08:21:47PM 1 point

I much prefer to write things down in code, if I am writing things down.

I shall think about the discipline of formulating the problem for others.

Comment author: Stuart_Armstrong 07 February 2017 10:00:06AM 1 point

Writing in code is good. Writing for others is to make sure you have the concepts right (syntax vs semantics, if you want).
