Isomorphic agents with different preferences: any suggestions?

3 Stuart_Armstrong 19 September 2016 01:15PM

In order to better understand how AI might succeed and fail at learning knowledge, I'll be trying to construct models of limited agents (with bias, knowledge, and preferences) that display identical behaviour in a wide range of circumstances (but not all). This means their preferences cannot be deduced merely/easily from observations.

Does anyone have any suggestions for possible agent models to use in this project?

Comment author: TheAncientGeek 19 September 2016 11:51:19AM *  0 points [-]

Another way of putting the objection is "don't design a system whose goal system is walled off from its updateable knowledge base". Loosemore's argument is that that is in fact the natural design, and so the "general counter argument" isn't general.

It would be like designing a car whose wheels fall off when you press a button on the dashboard...1) it's possible to build it that way, 2) there's no motivation to build it that way 3) it's more effort to build it that way.

Comment author: Stuart_Armstrong 19 September 2016 12:44:23PM 1 point [-]

"don't design system whose goals system is walled off from its updateable knowledge base"

Connecting the goal system to the knowledge base is not sufficient at all. You have to ensure that the labels used in the goal system converge to the meaning that we desire them to have.

I'll try and build practical examples of the failures I have in mind, so that we can discuss them more formally, instead of very nebulously as we are now.

Comment author: TheAncientGeek 16 September 2016 03:25:22PM *  1 point [-]

Are you saying the AI will rewrite its goals to make them easier, or will just not be motivated to fill in missing info?

In the first case, why won't it go the whole hog and wirehead? Which is to say, that any AI which does anything except wireheading will be resistant to that behaviour -- it is something that needs to be solved, and which we can assume has been solved in a sensible AI design.

When we programmed it to "create chocolate bars, here's an incomplete definition D", what we really did was program it to find the easiest thing to create that is compatible with D, and designate them "chocolate bars".

If you programme it with incomplete info, and without any goal to fill in the gaps, then it will have the behaviour you mention...but I'm not seeing the generality. There are many other ways to programme it.

"if the AI is so smart, why would it do stuff we didn't mean?" and "why don't we just make it understand natural language and give it instructions in English?"

An AI that was programmed to attempt to fill in gaps in knowledge it detected, halt if it found conflicts, etc would not behave the way you describe. Consider the objection as actually saying:

"Why has the AI been programmed so as to have selective areas of ignorance and stupidity, which are immune from the learning abilities it displays elsewhere?"

PS This has been discussed before, see

http://lesswrong.com/lw/m5c/debunking_fallacies_in_the_theory_of_ai_motivation/

and

http://lesswrong.com/lw/igf/the_genie_knows_but_doesnt_care/

see particularly

http://lesswrong.com/lw/m5c/debunking_fallacies_in_the_theory_of_ai_motivation/ccpn

Comment author: Stuart_Armstrong 19 September 2016 10:59:28AM 1 point [-]

An AI that was programmed to attempt to fill in gaps in knowledge it detected, halt if it found conflicts, etc would not behave the way you describe.

We don't know how to program a foolproof method of "filling in the gaps" (and a lot of "filling in the gaps" would be a creative process rather than a mere learning one, such as figuring out how to extend natural language concepts to new areas).

And it helps if people speak about this problem in terms of coding, rather than high level concepts, because all the specific examples people have ever come up with for coding learning have had these kinds of flaws. Learning natural language is not some sort of natural category.

Coding learning with some imperfections might be ok if the AI is motivated to merely learn, but is positively pernicious if the AI has other motivations as to what to do with that learning (see my post here for a way of getting around it: https://agentfoundations.org/item?id=947 )

Comment author: jazzkingrt 16 September 2016 05:15:00PM *  0 points [-]

I don't think this problem is very hard to resolve. If an AI is programmed to make sense of natural-language concepts like "chocolate bar", there should be a mechanism to acquire a best-effort understanding. So you could rewrite the motivation as:

"create things which the maximum amount of people understand to be a chocolate bar"

or alternatively:

"create things which the programmer is most likely to have understood to be a chocolate bar".

Comment author: Stuart_Armstrong 19 September 2016 10:52:50AM 1 point [-]

That's just rephrasing one natural language requirement in terms of another. Unless these concepts can be phrased other than in natural language (but then those other phrasings may be susceptible to manipulation).

Learning values versus learning knowledge

5 Stuart_Armstrong 14 September 2016 01:42PM

I just thought I'd clarify the difference between learning values and learning knowledge. There are some more complex posts about the specific problems with learning values, but here I'll just clarify why there is a problem with learning values in the first place.

Consider the term "chocolate bar". Defining that concept crisply would be extremely difficult. But nevertheless it's a useful concept. An AI that interacted with humanity would probably learn that concept to a sufficient degree of detail. Sufficient to know what we meant when we asked it for "chocolate bars". Learning knowledge tends to be accurate.

Contrast this with the situation where the AI is programmed to "create chocolate bars", but with the definition of "chocolate bar" left underspecified, for it to learn. Now it is motivated by something other than accuracy. Before, knowing exactly what a "chocolate bar" was would have been solely to its advantage. But now it must act on its definition, so it has cause to modify the definition, to make these "chocolate bars" easier to create. This is basically the same as Goodhart's law - by making a definition part of a target, it will no longer remain an impartial definition.

What will likely happen is that the AI will have a concept of "chocolate bar", that it created itself, especially for ease of accomplishing its goals ("a chocolate bar is any collection of more than one atom, in any combinations"), and a second concept, "Schocolate bar" that it will use to internally designate genuine chocolate bars (which will still be useful for it to do). When we programmed it to "create chocolate bars, here's an incomplete definition D", what we really did was program it to find the easiest thing to create that is compatible with D, and designate them "chocolate bars".
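As a toy illustration of "find the easiest thing to create that is compatible with D", here is a minimal sketch. Everything in it is made up for the example: the candidate objects, their costs, and the particular incomplete definition D are assumptions, not a claim about how any real system represents goals.

```python
from dataclasses import dataclass

# Toy illustration only: the candidates, costs and the predicate D below
# are hypothetical and chosen just to make the failure visible.

@dataclass(frozen=True)
class Candidate:
    name: str
    cost: float           # effort needed to create one unit
    is_solid: bool
    is_brown: bool
    contains_cocoa: bool  # the feature the incomplete definition forgot


def incomplete_definition_d(c: Candidate) -> bool:
    """D: 'a chocolate bar is a solid brown object' (underspecified)."""
    return c.is_solid and c.is_brown


def genuine_chocolate_bar(c: Candidate) -> bool:
    """What we actually meant; the agent's goal never refers to this."""
    return c.is_solid and c.is_brown and c.contains_cocoa


CANDIDATES = [
    Candidate("real chocolate bar", 10.0, True, True, True),
    Candidate("lump of mud", 0.1, True, True, False),
    Candidate("glass of milk", 1.0, False, False, False),
]


def goal_directed_choice(candidates):
    """Pick the cheapest thing that is still compatible with D."""
    allowed = [c for c in candidates if incomplete_definition_d(c)]
    return min(allowed, key=lambda c: c.cost)


chosen = goal_directed_choice(CANDIDATES)
print(chosen.name, genuine_chocolate_bar(chosen))
```

The agent can still evaluate `genuine_chocolate_bar` perfectly well (that is its accurate knowledge), but nothing in its goal makes it care about the answer.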

 

This is the general counter to arguments like "if the AI is so smart, why would it do stuff we didn't mean?" and "why don't we just make it understand natural language and give it instructions in English?"

Causal graphs and counterfactuals

3 Stuart_Armstrong 30 August 2016 04:12PM

Problem solved: Found what I was looking for in: An Axiomatic Characterization of Causal Counterfactuals, thanks to Evan Lloyd.

Basically, the approach makes every endogenous variable a deterministic function of the exogenous variables and of the other endogenous variables, and pushes all the stochasticity into the exogenous variables.
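A minimal sketch of what that looks like for two booleans, with A causing B. The noise variable U, its 0.25 probability, and the choice of XOR as the deterministic function are assumptions picked for illustration; other choices can give the same observed P(B|A).

```python
import random


def sample_u(rng):
    """Exogenous noise: all the randomness lives here (U=1 w.p. 0.25, assumed)."""
    return 1 if rng.random() < 0.25 else 0


def f_b(a, u):
    """Endogenous B is a deterministic function of A and the noise U."""
    return a ^ u


def counterfactual_b(observed_a, observed_b, new_a):
    """Values B could take had A been new_a, given what was observed."""
    # Abduction: keep only the noise values consistent with the observation.
    consistent_u = [u for u in (0, 1) if f_b(observed_a, u) == observed_b]
    # Action and prediction: set A to new_a and re-run the same function.
    return {f_b(new_a, u) for u in consistent_u}


rng = random.Random(0)
draws = [(a, f_b(a, sample_u(rng))) for a in (0, 1) for _ in range(50_000)]
agreement = sum(a == b for a, b in draws) / len(draws)
print(round(agreement, 2), counterfactual_b(0, 0, 1))
```

Because B is deterministic once U is fixed, observing A=0 and B=0 pins down U, and the counterfactual is then well defined.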

 

Old post:

A problem that's come up with my definitions of stratification.

Consider a very simple causal graph, A → B.

In this setting, A and B are both booleans, and A=B with 75% probability (independently of whether A=0 or A=1).

I now want to compute the counterfactual: suppose I assume that B=0 when A=0. What would happen if A=1 instead?

The problem is that P(B|A) seems insufficient to solve this. Let's imagine the process that outputs B as a probabilistic mix of functions, each of which takes the value of A and outputs that of B. There are four natural functions here:

  • f0(x) = 0
  • f1(x) = 1
  • f2(x) = x
  • f3(x) = 1-x

Then one way of modelling the causal graph is as a mix 0.75f2 + 0.25f3. In that case, knowing that B=0 when A=0 implies that P(f2)=1, so if A=1, we know that B=1.

But we could instead model the causal graph as 0.5f2 + 0.25f1 + 0.25f0. In that case, knowing that B=0 when A=0 implies that P(f2)=2/3 and P(f0)=1/3. So if A=1, B=1 with probability 2/3 and B=0 with probability 1/3.

And we can design the node B, physically, to be one or another of the two distributions over functions or anything in between (the general formula is (0.5+x)f2 + x f3 + (0.25-x)f1 + (0.25-x)f0 for 0 ≤ x ≤ 0.25). But it seems that the causal graph does not capture that.
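The family of mixtures above can be checked numerically. The sketch below uses exact fractions; the helper names (`mixture`, `p_b_given_a`, `counterfactual`) are made up for this illustration.

```python
from fractions import Fraction

# The four functions from A to B listed above.
FUNCTIONS = {
    "f0": lambda a: 0,
    "f1": lambda a: 1,
    "f2": lambda a: a,
    "f3": lambda a: 1 - a,
}


def mixture(x):
    """Weights (0.5+x)f2 + x f3 + (0.25-x)f1 + (0.25-x)f0, for 0 <= x <= 0.25."""
    x = Fraction(x)
    return {
        "f0": Fraction(1, 4) - x,
        "f1": Fraction(1, 4) - x,
        "f2": Fraction(1, 2) + x,
        "f3": x,
    }


def p_b_given_a(weights, a, b):
    """Observational P(B=b | A=a): total weight of functions mapping a to b."""
    return sum(w for name, w in weights.items() if FUNCTIONS[name](a) == b)


def counterfactual(weights, obs_a, obs_b, new_a, new_b):
    """P(B would be new_b had A been new_a | observed A=obs_a, B=obs_b)."""
    consistent = {n: w for n, w in weights.items()
                  if FUNCTIONS[n](obs_a) == obs_b}
    total = sum(consistent.values())
    hits = sum(w for n, w in consistent.items()
               if FUNCTIONS[n](new_a) == new_b)
    return hits / total


for x in (Fraction(0), Fraction(1, 8), Fraction(1, 4)):
    w = mixture(x)
    print(x, p_b_given_a(w, 0, 0), p_b_given_a(w, 1, 1),
          counterfactual(w, 0, 0, 1, 1))
```

Every x gives P(B=A) = 3/4 for both values of A, so the mixtures are indistinguishable from P(B|A) alone, while the counterfactual probability that B=1 works out to (2+4x)/3, ranging from 2/3 at x=0 to 1 at x=0.25.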

Owain Evans has said that Pearl has papers covering these kinds of situations, but I haven't been able to find them. Does anyone know any publications on the subject?

Corrigibility through stratified indifference

4 Stuart_Armstrong 19 August 2016 04:11PM

A putative new idea for AI control; index here.

Corrigibility through indifference has a few problems. One of them is that the AI is indifferent between the world in which humans change its utility to v, and the world in which humans try to change its utility, but fail.

Now the try-but-fail world is going to be somewhat odd - humans will be reacting by trying to change the utility again, trying to shut the AI down, panicking that a tiny probability event has happened, and so on.

Comment author: Petter 15 August 2016 07:23:01AM 0 points [-]

Looks like a solid improvement over what’s being used in the paper. Does it introduce any new optimization difficulties?

Comment author: Stuart_Armstrong 15 August 2016 09:53:40AM -1 points [-]

I suspect it makes optimisation easier, because we don't need to compute a tradeoff. But that's just an informal impression.

Comment author: Lumifer 11 August 2016 03:00:09PM 3 points [-]

the main point of these ideas is to be able to demonstrate that a certain algorithm - which may be just a complicated messy black box - is not biased

If you're looking to satisfy a legal criterion you need to talk to a lawyer who'll tell you how that works. Notably, the way the law works doesn't have to look reasonable or commonsensical. For example, the EEOC likes to observe outcomes and cares little about the process which leads to what they think are biased outcomes.

Because many people treat variables like race as special ... social pressure ... more relevant than it is economically efficient for them to do so ...

Sure, but then you are leaving the realm of science (aka epistemic rationality). You can certainly build models to cater to fads and prejudices of today, but all you're doing is building deliberately inaccurate maps.

I am also not sure what's the deal with "economically efficient". No one said this is the pinnacle of all values and everything must be subservient to economic efficiency.

From the legal perspective, it's probably quite simple.

I am pretty sure you're mistaken about this.

the perception of fairness is probably going to be what's important here

LOL.

I think this is a fundamentally misguided exercise and, moreover, one which you cannot win -- in part because shitstorms don't care about details of classifiers.

Comment author: Stuart_Armstrong 11 August 2016 08:46:35PM -2 points [-]

Do you not feel my definition of fairness is a better one than the one proposed in the original paper?

Comment author: Lumifer 09 August 2016 04:50:17PM 4 points [-]

What are "allowable" variables and what makes one "allowable"?

I'm aiming for something like "once you know income (and other allowable variables) then race should not affect the decision beyond that".

That's the same thing: if S (say, race) does not provide any useful information after controlling for X (say, income) then your classifier is going to "naturally" ignore it. If it doesn't, there is still useful information in S even after you took X into account.

This is all basic statistics; I still don't understand why there's a need to make certain variables (like race) special.

Comment author: Stuart_Armstrong 10 August 2016 07:27:12PM -2 points [-]

As I mentioned in another comment, the main point of these ideas is to be able to demonstrate that a certain algorithm - which may be just a complicated messy black box - is not biased.

I still don't understand why there's a need to make certain variables (like race) special.

a) Because many people treat variables like race as special, and there is social pressure and legislation about that. b) Because historically, people have treated variables like race as more relevant than it is economically efficient for them to do so. c) Because there are arguments (whose validity I don't know) that one should ignore variables like race even when it is individually economically efficient not to, e.g. cycles of poverty, following of social expectations, etc...

A perfect classifier would solve b), potentially a), and not c). But demonstrating that a classifier is perfect is hard; demonstrating that a classifier is fair or unbiased in the way I define above is much easier.
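To make "much easier" a bit more concrete, here is a rough sketch of a black-box check, reading the criterion as "within each value of the allowable variable, the decision rate should not differ between groups". The record fields (`income_band`, `group`), the two toy classifiers and the tiny dataset are all hypothetical, and a real check would need a proper statistical test rather than a raw gap on a sample.

```python
from collections import defaultdict


def max_conditional_gap(records, classifier, allowed_key, protected_key):
    """Largest gap in acceptance rate between protected groups,
    measured separately inside each stratum of the allowable variable.
    The classifier is treated as a black box: record in, bool out."""
    counts = defaultdict(lambda: [0, 0])  # (x, s) -> [accepted, total]
    for record in records:
        cell = counts[(record[allowed_key], record[protected_key])]
        cell[0] += int(bool(classifier(record)))
        cell[1] += 1

    rates_by_x = defaultdict(list)
    for (x, _s), (accepted, total) in counts.items():
        rates_by_x[x].append(accepted / total)
    return max(max(rates) - min(rates) for rates in rates_by_x.values())


# Hypothetical data and classifiers, for illustration only.
records = [
    {"income_band": "high", "group": "a"},
    {"income_band": "high", "group": "b"},
    {"income_band": "low", "group": "a"},
    {"income_band": "low", "group": "b"},
]


def uses_income_only(r):
    return r["income_band"] == "high"


def also_uses_group(r):
    return r["income_band"] == "high" and r["group"] == "a"


fair_gap = max_conditional_gap(records, uses_income_only, "income_band", "group")
unfair_gap = max_conditional_gap(records, also_uses_group, "income_band", "group")
print(fair_gap, unfair_gap)
```

The check never looks inside the classifier, which is the point: it applies equally to a messy black box.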

What are "allowable" variables and what makes one "allowable"?

This is mainly a social, PR, or legal decision. "Bank assesses borrower's income" is not likely to cause any scandal; "Bank uses eye colour to vet candidates" is more likely to cause problems.

From the legal perspective, it's probably quite simple. "This bank discriminated against me!" Bank: "After controlling for income, capital, past defaults, X, Y, and Z, then our classifiers are free of any discrimination." Then whether they're allowable depends on whether juries or (mainly) judges believe that income, ..., X, Y, and Z are valid criteria for reaching a non-discriminatory decision.

Now, for statisticians, if there are a lot of allowable criteria and if the classifier uses them in non-linear ways, this makes the fairness criteria pretty vacuous (since deducing S from many criteria should be pretty easy for non-linear classifiers). However, the perception of fairness is probably going to be what's important here.
