
Translation "counterfactual"

1 Stuart_Armstrong 24 February 2017 06:36PM

Crossposted at Intelligent Agent Forum

In a previous post, I briefly mentioned translations as one of three possible counterfactuals for indifference. Here I want to clarify what I meant there, because the idea is interesting.

Comment author: Gram_Stone 23 February 2017 03:02:28PM 3 points [-]

Recently in the LW Facebook group, I shared a real-world example of an AI being patched and finding a nearby unblocked strategy several times. Maybe you can use it one day. This example is about Douglas Lenat's Eurisko and the strategies it generated in a naval wargame. In this case, the 'patch' was a rules change. For some context, R7 is the name of one of Eurisko's heuristics:

A second use of R7 in the naval design task, one which also inspired a rules change, was in regard to the fuel tenders for the fleet. The constraints specified a minimum fractional tonnage which had to be held back, away from battle, in ships serving as fuel tenders. R7 caused us to consider using warships for that purpose, and indeed that proved a useful decision: whenever some front-line ships were moderately (but not totally) damaged, they traded places with the tenders in the rear lines. This maneuver was explicitly permitted in the rules, but no one had ever employed it except in desperation near the end of a nearly-stalemated battle, when little besides tenders were left intact. Due to the unintuitive and undesirable power of this design, the tournament directors altered the rules so that in 1982 and succeeding years the act of 'trading places' is not so instantaneous. The rules modifications introduced more new synergies (loopholes) than they eliminated, and one of those involved having a ship which, when damaged, fired on (and sunk) itself so as not to reduce the overall fleet agility.

Comment author: Stuart_Armstrong 24 February 2017 10:06:34AM 3 points [-]

Thanks! Do you have a link to the original article?

Comment author: Dagon 23 February 2017 03:47:53PM 2 points [-]

What's the complexity of inputs the robot is using? I think you're mixing up levels of abstraction if you have a relatively complex model for reward, but you only patch using trivially-specific one-liners.

If we're talking about human-level or better AI, why wouldn't the patch be at roughly the same abstraction as for humans? Grandpa yells at the kid if he breaks a white vase while sweeping up the shop. The kid doesn't then think it's OK to break other stuff. Maybe the patch is more systemic - grandpa helps the kid by pointing out a sweeping style that puts less merchandise at risk (and only incidentally is an ancient bo stick fighting style).

In any case, it's never at risk of being interpreted as "don't break white vases".

Even for the nascent chatbots and "AI" systems in play today, which are FAR less complicated and which very often have hamfisted patches to adjust output in ways that are hard to train (like "never use this word, even if it's in a lot of source material"), the people writing patches know the system well enough to know when to fix the model and how to write a semi-general patch.

Comment author: Stuart_Armstrong 24 February 2017 10:06:02AM 1 point [-]

The patches used are there to illustrate the point rather than to be realistic, specific examples.

Nearest unblocked strategy versus learning patches

6 Stuart_Armstrong 23 February 2017 12:42PM

Crossposted at Intelligent Agents Forum.

The nearest unblocked strategy problem (NUS) is the idea that if you program a restriction or a patch into an AI, then the AI will often be motivated to pick a strategy that is as close as possible to the banned strategy, very similar in form, and maybe just as dangerous.

For instance, if the AI is maximising a reward R, and does some behaviour Bi that we don't like, we can patch the AI's algorithm with patch Pi ('maximise R subject to these constraints...'), or modify R to Ri so that Bi doesn't come up. I'll focus more on the patching example, but the modified-reward one is similar.
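As a toy illustration (the strategies and numbers below are invented for this sketch, not taken from the post), a reward-maximiser that is patched to rule out one behaviour simply picks the highest-reward behaviour still allowed, which can sit right next to the banned one:

```python
# Toy sketch of the nearest unblocked strategy problem (illustrative only;
# the strategies and rewards are made up, not from the post).

# Each candidate behaviour has a reward and a rough similarity to the behaviour we dislike.
strategies = {
    "steal the vase":          {"reward": 10.0, "similarity_to_banned": 1.00},
    "borrow the vase forever": {"reward": 9.9,  "similarity_to_banned": 0.95},
    "buy the vase honestly":   {"reward": 3.0,  "similarity_to_banned": 0.10},
}

def patched_argmax(strategies, banned):
    """'Maximise R subject to these constraints': drop the banned behaviours,
    then pick the remaining strategy with the highest reward."""
    allowed = {name: s for name, s in strategies.items() if name not in banned}
    return max(allowed, key=lambda name: allowed[name]["reward"])

# Patch P1: ban the behaviour B1 we noticed and disliked.
banned = {"steal the vase"}
print(patched_argmax(strategies, banned))
# -> "borrow the vase forever": almost as rewarding, and almost as bad
```

The patch removes a single point from the search space; everything in its neighbourhood stays available to the maximiser.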


Indifference and compensatory rewards

3 Stuart_Armstrong 15 February 2017 02:49PM

Crossposted at the Intelligent Agents Forum

It's occurred to me that there is a framework in which we can see all "indifference" results as corrective rewards, both for indifference to utility function changes and for indifference to policy changes.

Imagine that the agent has reward R0 and is following policy π0, and we want to change it to having reward R1 and following policy π1.

Then the corrective reward we need to pay it, so that it doesn't attempt to resist or cause that change, is simply the difference between the two expected values:

V(R0|π0) - V(R1|π1),

where V is the agent's own valuation of the expected reward, conditional on the policy.

This explains why off-policy reward-based agents are already safely interruptible: since we change the policy, not the reward, R0=R1. And since off-policy agents have value estimates that are indifferent to the policy followed, V(R0|π0)=V(R1|π1), and the compensatory rewards are zero.
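A minimal numerical sketch of that bookkeeping (the value numbers and the V lookup below are placeholders standing in for the agent's own expectations, not anything from the post):

```python
# Sketch of the compensatory reward V(R0|pi0) - V(R1|pi1).
# The value estimates below are made-up placeholders.

def compensatory_reward(V, R0, pi0, R1, pi1):
    """Reward paid at the moment of the change so the agent is indifferent to it."""
    return V[(R0, pi0)] - V[(R1, pi1)]

# An on-policy agent whose value estimate depends on the policy actually followed:
V_on_policy = {("R0", "pi0"): 12.0, ("R0", "pi1"): 9.5}
print(compensatory_reward(V_on_policy, "R0", "pi0", "R0", "pi1"))  # 2.5: must be paid

# An off-policy agent being interrupted: the policy changes but the reward does not
# (R0 = R1), and its value estimate ignores the policy followed, so nothing is owed.
V_off_policy = {("R0", "pi0"): 12.0, ("R0", "pi1"): 12.0}
print(compensatory_reward(V_off_policy, "R0", "pi0", "R0", "pi1"))  # 0.0
```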

Comment author: Gurkenglas 07 February 2017 09:11:48PM 0 points [-]

Imagine doing this with one AI: It reformulates each question, then it gets the new prior, then it talks to the human. Ignore that it has to do N work in the first step. That might make this easier to see: Why do you think bringing the questions into a form that allows for easy memorization by humans has anything to do with understanding? It could just do the neural-net equivalent of zip compression of a hashmap from reformulated questions to probabilities.

Comment author: Stuart_Armstrong 08 February 2017 06:50:43AM 0 points [-]

It could just do the neural-net equivalent of zip compression of a hashmap from reformulated questions to probabilities.

But that hashmap has to run on a human mind, and understanding helps us run things like that.

Comment author: Dagon 07 February 2017 02:30:18PM 0 points [-]

Ok. The obvious followup is "under what conditions is it a bad thing?" Your college example is a good one - are you saying you want to prevent AIs from making changes similar to (though perhaps on a larger scale than) the ones a university makes in its students?

In response to comment by Dagon on Hacking humans
Comment author: Stuart_Armstrong 07 February 2017 02:44:33PM 0 points [-]

Well, there's a formal answer: if an AI can, in condition C, convince any human of belief B for any B, then condition C is not sufficient to constrain the AI's power, and the process is unlikely to be truth-tracking.

That's a sufficient condition for C being insufficient, but not a necessary one.
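One way to put that sufficient condition in symbols (the notation is my gloss, not the comment's; Σ_C stands for the AI strategies available under condition C):

$$\big(\forall B:\ \exists\,\sigma \in \Sigma_C \text{ with } \Pr(\text{human believes } B \mid \sigma) \approx 1\big) \;\Longrightarrow\; \text{beliefs produced under } C \text{ carry no evidence about which } B \text{ is true.}$$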

Comment author: Houshalter 07 February 2017 02:04:36PM 1 point [-]

Same as with the GAN thing. You condition it on producing a correct answer (or whatever the goal is). So if you are building a question-answering AI, you have it model a probability distribution something like P(human types this character | human correctly answers question). This could be done simply by only feeding it examples of correctly answered questions as its training set. Or you could have it predict what a human might respond if they had n days to think about it.
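A minimal sketch of the filtering version of that conditioning (the dataset fields and the character-counting "model" are placeholders; any sequence model could stand in for the counts):

```python
# Sketch: approximate P(human types this text | human correctly answers the question)
# by training only on transcripts where the answer was correct.
# The dataset and the toy "model" below are illustrative placeholders.
from collections import Counter

dataset = [
    {"question": "2+2?", "transcript": "4", "correct": True},
    {"question": "2+2?", "transcript": "5", "correct": False},
    {"question": "capital of Australia?", "transcript": "Canberra", "correct": True},
]

# Condition on correctness simply by filtering the training set.
train = [ex for ex in dataset if ex["correct"]]

# Stand-in for "fit a sequence model": here, just character frequencies.
char_counts = Counter(ch for ex in train for ch in ex["transcript"])
total = sum(char_counts.values())
model = {ch: n / total for ch, n in char_counts.items()}  # crude P(character | correct answer)
print(model)
```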

Though even that may not be necessary. What I had in mind was just having the AI read MIRI papers and produce new ones just like them. Like a superintelligent version of what people do today with markov chains or RNNs to produce writing in the style of an author.

Yes these methods do limit the AI's ability a lot. It can't do anything a human couldn't do, in principle. But it can automate the work of humans and potentially do our job much faster. And if human ability isn't enough to build an FAI, well you could always set it to do intelligence augmentation research instead.

Comment author: Stuart_Armstrong 07 February 2017 02:43:30PM 0 points [-]

I see that working. But we still have the problem that if the number of answers is too large, somewhere there is going to be an answer X such that the most likely behaviour for a human who answers X is to write something dangerous. Now, that's ok if the AI has two clearly defined processes: first find the top answer, independently of how it's written up, then write it up as a human would. If those goals are mixed, it will go awry.

Comment author: dogiv 02 February 2017 05:47:50PM 0 points [-]

I don't really understand when this would be useful. Is it for an oracle that you don't trust, or that otherwise has no incentive to explain things to you? Because in that case, nothing here constrains the explanation (or the questions) to be related to reality. It could just choose to explain something that is easy for the human to understand, and then the questions will all be answered correctly. If you do trust the oracle, the second AI is unnecessary--the first one could just ask questions of the human to confirm understanding, like any student-teacher dynamic. What am I missing here?

Comment author: Stuart_Armstrong 07 February 2017 10:08:28AM 0 points [-]
Comment author: Viliam 06 February 2017 02:32:54PM *  1 point [-]

This feels like an attempt to reverse the Dunning–Kruger effect. Not exactly, but there is a similar assumption that "people who believe an incorrect X are usually unaware that answers other than X exist (otherwise they would start doubting whether X is the correct answer)".

Which probably works well for non-controversial topics. You may be wrong about the capital of Australia, but you don't expect there to be a controversy about this topic. If you are aware that many people disagree with you on what "the capital of Australia" is, you are aware there is a lot of ignorance about this topic, and you have probably double-checked your answer. People who get it wrong probably don't even think about the alternatives.

But, like in the example whpearson gave, there are situations where people are aware that others disagree with them, but they have a handy explanation, such as "it's all a Big Pharma conspiracy", in which case they will neither reduce their certainty, nor research the topic impartially.

In other words, this may work for honest mistakes, but not for tribalism.

Comment author: Stuart_Armstrong 07 February 2017 10:07:27AM 0 points [-]

"it's all a Big Pharma conspiracy", in which case they will neither reduce their certainty, nor research the topic impartially.

The method presupposes rational actors, but is somewhat resilient to non-rational ones. If the majority of people know of the conspiracy theorists, then the conspiracy theory will not be a surprisingly popular option.
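For reference, a minimal sketch of the "surprisingly popular" criterion this alludes to, assuming the standard rule that an option is surprisingly popular when its actual endorsement exceeds the endorsement respondents predicted (all numbers invented):

```python
# Sketch of the "surprisingly popular" criterion with made-up poll numbers.
# An option counts as surprisingly popular when the fraction of people endorsing it
# exceeds the average fraction that respondents predicted would endorse it;
# here we just return the option with the largest positive gap.

options = {
    # option: (actual fraction of votes, mean predicted fraction of votes)
    "vaccines are safe":            (0.70, 0.55),
    "it's a Big Pharma conspiracy": (0.30, 0.45),
}

def surprisingly_popular(options):
    return max(options, key=lambda o: options[o][0] - options[o][1])

print(surprisingly_popular(options))
# -> "vaccines are safe": endorsed more than people predicted, so the well-known
#    conspiracy view is not a surprisingly popular option.
```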
