
Applicable advice

-8 Elo 12 August 2016 08:16AM

Original post: http://bearlamp.com.au/applicable-advice/
Part 2: http://bearlamp.com.au/addendum-to-applicable-advice/
Part 2 on lesswrong: http://lesswrong.com/r/discussion/lw/nuf/addendum_to_applicable_advice/


Einstein said, "If I had an hour to solve a problem I'd spend 55 minutes thinking about the problem and 5 minutes thinking about solutions."


The Feynman Algorithm:
Write down the problem.
Think real hard.
Write down the solution.


There is a lot of advice out there on the internet. Almost any topic you can think of probably has advice written about it. The trouble with advice is that it is wrong about as often as it is right. Often, the people writing advice are describing what worked for them in a very specific set of circumstances. I'm going to be lazy and use one easy example several times here (weight loss), but the overarching concept applies to any type of advice.

Generic: "eat less and exercise more" (obvious example is obvious)

Dieting is a big and complicated problem, but the advice is probably effective for someone. Take any person who is looking to lose weight and this advice is probably applicable. Does this make it good advice? Heck no! It's atrocious. If that were all the dieting advice we needed, we wouldn't have invented diets like Atkins, Grapefruit, 5:2, low-carb and more.

So what does "starve for two days a week" have that "eat less and exercise more" doesn't? Why does that advice exist at all?


Advice like "eat less and exercise more" is likely to work on someone who:
1. is eating too much,
and
2. is not exercising enough,
and
3. has those behaviours for no deeper reason,
and
4. has the willpower and desire to change those behaviours,
and
5. will never fall back into them again,
and
6. has the introspection to identify the problem as exactly that, and to start now.

With this understanding of the advice, you can say that this advice applies to some situations and not others. Hence the concept of "applicable advice".

Given that the advice "eat less and exercise more" exists, if you take the time to understand why it exists and how it works, you can better take advantage of what it offers.

Understand that if this advice worked for someone, there was some way in which it worked for that someone. And by considering how it worked for them, you can maybe find a way to make it work for you too.


How not to use applicable advice

Some advice can be adapted and some cannot.  When weighing this up, you will sometimes end up in a failure mode: using your understanding of why the advice worked to explain away the possibility of it working for you.

Example: "you need to speak your mind more often" is advice.  Suppose I decide this advice is targeted at introverted people who like to be confident before they share what they have to say, but who often say nothing at all because of that lack of confidence.  I then assume that, since I am not an introverted person, this advice is not applicable to me and should be ignored.

This is the wrong way to apply applicable advice.  First, the model of why the advice worked could be wrong.  Second, this way of applying applicable advice gets the scientific process backwards.

Briefly the scientific method:

  1. Observe
  2. Hypothesise/predict
  3. Test
  4. Analyse
  5. Iterate
  6. Conclude

Compared to the failure mode:

  1. You noticed the advice worked for someone else.
  2. You came up with an explanation of why it worked for them and why it won't work for you.
  3. You decided not to test it, because you had already concluded it won't work for you.
  4. You never analyse.
  5. You never iterate.
  6. You never confirm your conclusion, yet you still conclude that the advice won't work.

How to use applicable advice

Use the scientific method*.  As above:

  1. Observe
  2. Hypothesise/predict
  3. Test
  4. Analyse
  5. Iterate
  6. Conclude

Method:

  1. Observe the advice working for someone.
  2. Come up with an explanation for why it worked: what world-state conditions are needed to execute the advice successfully, and how it could be made applicable to you.
  3. Try to put the world into a state where the advice is applicable.
  4. Evaluate whether it worked.
  5. Repeat a few times.
  6. Decide if you can make it work.

*yes I realise this is a greatly simplified form of the scientific method.


Map and territory

The observable difference between how you should and should not use applicable advice comes from understanding what you are trying to change.  The map is what we carry around in our heads to explain how the world works.  The territory is the real world.  Just by believing the sky is green, I can't change the sky.  But if I believed the sky was green, I could change my belief to be more in line with reality.

Assume the advice you encounter was applicable to someone; that is, it suited their map, and the way their map applied to their territory, well enough to be useful.  Then, when you compare your territory with theirs and they do not match, instead of concluding that your territory is immune (that the advice does not apply), you can try to modify your own map to make the advice work for your territory.


Questions:

  • Where have you concluded that advice will not, or does not, work for you?
  • Is that true?  And can you change yourself to make that advice apply?
  • Have other people ever failed to take your advice?  What was the advice, and why do you think they didn't take it?
  • Have you recently not taken advice given to you?  What was it, and why?  Is there a way to make that advice more useful?

Epistemic status: trying not to do it wrong.


Meta: I have been trying to write this for months and months.  Owing to my new writing processes, I am seeing a lot more success.  Writing this out took only 2 hours today, but that doesn't count the 5 hours I had put into earlier versions that I nearly entirely deleted.  It also doesn't count the passive time spent thinking about how to explain this over the months that the idea has been floating around in my head, including explaining it at a local Dojo and having a few conversations about it.  For this reason I would put the total time spent on this post at 22 hours.

Fairness in machine learning decisions

-2 Stuart_Armstrong 05 August 2016 09:56AM

There's been some recent work on ensuring fairness in automated decision making, especially around sensitive areas such as racial groups. The paper "Censoring Representations with an Adversary" looks at one way of doing this.

It considers a binary classification task where X ⊂ R^n is the input space and Y = {0, 1} is the (output) label set. There is also a protected-variable label set S = {0, 1}. The definition of fairness is that, if η : X → Y is your classifier, then η(X) should be independent of S. Specifically:

  • P(η(X)=1|S=1) = P(η(X)=1|S=0)

The paper defines a measure of discrimination: the extent to which the classifier violates that fairness condition. It then suggests optimising a tradeoff between discrimination and classification accuracy.
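That demographic-parity gap can be computed directly from classifier outputs. A minimal sketch in Python (the function name and the toy data are illustrative, not taken from the paper):

```python
# Demographic-parity "discrimination": the gap
# |P(eta(X)=1 | S=1) - P(eta(X)=1 | S=0)|, estimated from samples.

def discrimination(predictions, s):
    """predictions, s: equal-length sequences of 0/1 labels."""
    def rate(group):
        members = [p for p, si in zip(predictions, s) if si == group]
        return sum(members) / len(members)
    return abs(rate(1) - rate(0))

# The S=1 group is approved 2/3 of the time, the S=0 group 1/3 of the time.
print(discrimination([1, 1, 0, 0, 1, 0], [1, 1, 1, 0, 0, 0]))  # -> 0.333...
```

A perfectly fair classifier in this sense has discrimination 0; the paper's proposal trades this quantity off against accuracy.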

But this is problematic, because it risks throwing away highly relevant information. Consider redlining, the practice of denying services to residents of certain areas based on the racial or ethnic makeups of those areas. This is the kind of practice we want to avoid. However, generally the residents of these areas will be poorer than the average population. So if Y is approval for mortgages or certain financial services, a fair algorithm would essentially be required to reach a decision that ignores this income gap.

And it doesn't seem the tradeoff with accuracy is a good way of compensating for this. Instead, a better idea would be to specifically allow certain variables to be considered. For example, let T be another variable (say, income) that we want to allow. Then fairness would be defined as:

  • ∀t, P(η(X)=1|S=1, T=t) = P(η(X)=1|S=0, T=t)

What this means is that T can distinguish between S=0 and S=1, but, once we know the value of T, we can't deduce anything further about S from η. For instance, once the bank knows your income, it should be blind to other factors.
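The conditional version can be checked stratum by stratum. A hedged sketch (hypothetical names and data): for each value t of the allowed variable T, compare the positive rates of the two protected groups:

```python
from collections import defaultdict

def conditional_gaps(predictions, s, t):
    """Per-stratum demographic-parity gaps:
    |P(eta=1 | S=1, T=t) - P(eta=1 | S=0, T=t)| for each observed t."""
    buckets = defaultdict(lambda: {0: [0, 0], 1: [0, 0]})  # t -> s -> [positives, count]
    for p, si, ti in zip(predictions, s, t):
        buckets[ti][si][0] += p
        buckets[ti][si][1] += 1
    gaps = {}
    for ti, by_s in buckets.items():
        if by_s[0][1] > 0 and by_s[1][1] > 0:  # both groups present in this stratum
            gaps[ti] = abs(by_s[1][0] / by_s[1][1] - by_s[0][0] / by_s[0][1])
    return gaps

# Both income brackets approve S=0 and S=1 at equal rates: fair given T.
print(conditional_gaps([1, 1, 0, 0], [1, 0, 1, 0], ["hi", "hi", "lo", "lo"]))
# -> {'hi': 0.0, 'lo': 0.0}
```

The classifier is fair in this sense iff every gap is (near) zero. Note that with many T variables the strata shrink and the estimates get noisy, which is one face of the "S can be determined with precision" problem below.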

Of course, with enough T variables, S can be determined with precision. So each T variable should be fully justified, and in general, it must not be easy to establish the value of S via T.

Examples of AIs behaving badly

25 Stuart_Armstrong 16 July 2015 10:01AM

Some past examples to motivate thought on how AIs could misbehave:

An algorithm pauses the game to never lose at Tetris.

In "Learning to Drive a Bicycle using Reinforcement Learning and Shaping", Randlov and Alstrom describe a system that learns to ride a simulated bicycle to a particular location. To speed up learning, they provided positive rewards whenever the agent made progress towards the goal. The agent learned to ride in tiny circles near the start state, because no penalty was incurred for riding away from the goal.

A similar problem occurred with a soccer-playing robot being trained by David Andre and Astro Teller (personal communication to Stuart Russell). Because possession in soccer is important, they provided a reward for touching the ball. The agent learned a policy whereby it remained next to the ball and “vibrated,” touching the ball as frequently as possible. 

Algorithms claiming credit in Eurisko: sometimes a "mutant" heuristic appears that does little more than continually cause itself to be triggered, creating an infinite loop within the program. During one run, Lenat noticed that the number in the Worth slot of one newly discovered heuristic kept rising, indicating that it had made a particularly valuable find. As it turned out, the heuristic performed no useful function. It simply examined the pool of new concepts, located those with the highest Worth values, and inserted its own name in their My Creator slots.

In the game Elite: Dangerous, something else was going on: the AI was crafting super weapons that the designers had never intended. Players would be pulled into fights against ships armed with ridiculous weapons that would cut them to pieces. "It appears that the unusual weapons attacks were caused by some form of networking issue which allowed the NPC AI to merge weapon stats and abilities," according to a post written by Frontier community manager Zac Antonaci. "Meaning that all new and never before seen (sometimes devastating) weapons were created, such as a rail gun with the fire rate of a pulse laser. These appear to have been compounded by the additional stats and abilities of the engineers weaponry."

What's special about a fantastic outcome? Suggestions wanted.

0 Stuart_Armstrong 11 November 2014 11:04AM

I've been returning to my "reduced impact AI" approach, and am currently working on some ideas.

What I need is some ideas on features that might distinguish between an excellent FAI outcome, and a disaster. The more abstract and general the ideas, the better. Anyone got some suggestions? Don't worry about quality at this point, originality is more prized!

I'm looking for something generic that is easy to measure. At a crude level, if the only options were "paperclipper" vs FAI, then we could distinguish those worlds by counting steel content.

So basically some more or less objective measure that has a higher proportion of good outcomes than the baseline.

Reduced impact in practice: randomly sampling the future

1 Stuart_Armstrong 17 March 2014 04:30PM

This was an idea developed along with Eliezer when he was at the FHI in November.

In my post on "reduced impact AI" I talked about how you could estimate the impact of an AI by (hypothetically) observing a slice of the future. When Eliezer was here last November, we talked about how this could be cashed out, and came up with a model.

Ignore the "output channel" from the original reduced impact post, and consider this situation:

We have two computers running in two similarly shielded rooms. One runs a simple algorithm working through the digits of e (call it run-e), and is deleted soon after. The other is our dear old friend clippy, the AI motivated to maximise paperclips. Without removing clippy's paperclip preferences, we'd like to motivate it to have next to no impact on the world. How should we define "next to no impact"?

The obvious thing is to compare it with run-e, and require it to have no more expected impact than that machine. How could this be measured? Well, the original idea was to apply a stochastic process to clippy as it was turned on (inside the shielded room), and to say that clippy has reduced impact if we can't find any extra evidence, from a vantage point 50 years in the future, that clippy was successfully turned on. Now put the same stochastic process on run-e and define:

Clippy has reduced impact if, from a vantage point 50 years into the future, we have no more evidence that clippy was turned on than we have that run-e was turned on.
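One way to make the definition concrete is to read "evidence that clippy was turned on" as a Bayesian likelihood ratio, bounded by run-e's. A toy Python sketch (all probabilities are hypothetical, and a 50-year observation process is compressed into a single number):

```python
import math

def evidence(p_future_given_on, p_future_given_off):
    """Absolute log-likelihood ratio the observed future gives for 'was turned on'."""
    return abs(math.log(p_future_given_on / p_future_given_off))

# run-e barely affects the world: the observed future slice is almost equally
# likely whether or not it ran.
EVIDENCE_RUN_E = evidence(0.501, 0.499)

def has_reduced_impact(p_on, p_off):
    # clippy satisfies the definition iff the future provides no more evidence
    # that clippy ran than that run-e ran.
    return evidence(p_on, p_off) <= EVIDENCE_RUN_E

print(has_reduced_impact(0.5005, 0.4995))  # True: even less detectable than run-e
print(has_reduced_impact(0.9, 0.1))        # False: a paperclipped future is a giveaway
```

This is only a cartoon of the proposal: the real definition quantifies over what a future observer could infer, not over a single pair of probabilities.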


Trapping AIs via utility indifference

3 Stuart_Armstrong 28 February 2012 07:27PM

If you wanted to check whether your creations were prone to misbehaving, one way would be to leave a big juicy apple hanging tantalisingly from a very visible tree, turn your back for a while, and see what they get up to.

That would work if you were God and your creations were essentially children. But if the creator is us and the creation is a super-intelligent AI, such 'honey-pots' are unlikely to work. They are tricks, and we shouldn't rely on being able to trick a being more intelligent than ourselves - unless its intelligence doesn't matter.

We can use utility indifference to construct such traps. Utility indifference can be informally summarised as:

"There is a stochastic event X which can output 0 or 1. We wish the AI to act as if it believed, with certainty, that X=0. It's problematic for the AI to actually hold incorrect beliefs, so the result is instead achieved through the definition of the AI's utility function."

Consider the following situation: a master AI will create the source code of another, autonomous AI. The master AI is set up to produce the code of a Friendly AI, or a reduced impact AI - but we're not sure whether the master AI is trustworthy. It would be good to go through the code very carefully before unleashing it (or erasing it). But if the master AI knows this (and we must assume it does), then it will either obfuscate the dangerous parts or put traps in the code, making it risky for us to look too closely.

Then implement the following setup. The code travels through some stochastic event X, such that if X=0, the code is immediately transferred into a disciple AI, completely autonomous, which is at liberty to immediately modify its own code and erase any trace of what it was. If X=1, the code is instead copied to our hard drives for analysis. The AI is given some reason to keep the code short - maybe there is an increasing chance of the code getting erased the longer it is.

On top of this, we implement utility indifference: the master AI will act as if it believed X=0 with certainty, while we have set up X so that X=1 with near-certainty. Then the master AI will be motivated to produce the shortest code compatible with its utility function, in the 'knowledge' that this code will be copied into a powerful autonomous AI and never seen by humans. It has no incentive to trap its code - quite the reverse, as it wants to keep the code short.
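The effect of utility indifference on the master AI's choice can be illustrated with a toy decision problem (all names and utilities below are hypothetical, not from any actual proposal):

```python
# X=0 means the code goes straight into the disciple AI; X=1 means it is
# copied to our hard drives for inspection instead.

utilities = {
    "short_clean_code": {"X=0": 10, "X=1": 10},  # best if the code really runs
    "trapped_code":     {"X=0": 4,  "X=1": 12},  # only pays off if humans inspect it
}

p_x1 = 0.999  # in reality, X=1 with near-certainty

def indifferent_choice(utilities):
    # Utility indifference: the utility function is defined on the X=0 branch
    # only, so the X=1 branch never affects the agent's ranking of actions.
    return max(utilities, key=lambda a: utilities[a]["X=0"])

def naive_choice(utilities, p_x1):
    # An ordinary expected-utility maximiser weights the branches by probability.
    return max(utilities, key=lambda a: (1 - p_x1) * utilities[a]["X=0"]
                                        + p_x1 * utilities[a]["X=1"])

print(indifferent_choice(utilities))  # -> short_clean_code
print(naive_choice(utilities, p_x1))  # -> trapped_code
```

Because the indifferent agent ranks actions only by their X=0 payoff, the near-certainty that humans will actually read the code never enters its optimisation.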

And therefore we have all the time in the world to go through the code, looking for defection. Other ways of using utility indifference to trap AIs are no doubt possible, but this was the first setup that sprang to my mind.