The AI in Mary's room

4 Stuart_Armstrong 24 May 2016 01:19PM

In the Mary's room thought experiment, Mary is a brilliant scientist in a black-and-white room who has never seen any colour. She can investigate the outside world through a black-and-white television, and has piles of textbooks on physics, optics, the eye, and the brain (and everything else of relevance to her condition). Through this she knows everything intellectually there is to know about colours and how humans react to them, but she hasn't seen any colours at all.

After that, when she steps out of the room and sees red (or blue), does she learn anything? It seems that she does. Even if she doesn't technically learn something, she experiences things she hadn't ever before, and her brain certainly changes in new ways.

The argument was intended as a defence of qualia against certain forms of materialism. It's interesting, and I don't intend to solve it fully here. But just as I extended Searle's Chinese room argument from the perspective of an AI, it seems this argument can also be considered from an AI's perspective.

Consider an RL agent with a reward channel, but which currently receives nothing from that channel. The agent can know everything there is to know about itself and the world. It can know about all sorts of other RL agents, and their reward channels. It can observe them getting their own rewards. Maybe it could even interrupt or increase their rewards. But all this knowledge will not get it any reward. As long as its own channel doesn't send it the signal, knowledge of other agents' rewards - even of identical agents getting rewards - does not give this agent any reward. Ceci n'est pas une récompense.

This seems to mirror Mary's situation quite well - knowing everything about the world is no substitute for actually getting the reward/seeing red. Now, an RL agent's reward seems closer to pleasure than qualia - this would correspond to a Mary brought up in a puritanical, pleasure-hating environment.

Closer to the original experiment, we could imagine the AI is programmed to enter certain specific subroutines when presented with certain stimuli. The only way for the AI to start these subroutines is for the stimulus to be presented to it. Then, upon seeing red, the AI enters a completely new mental state, with new subroutines. The AI could know everything about its programming, and about the stimulus, and, intellectually, what would change about itself if it saw red. But until it did, it would not enter that mental state.

If we use ⬜ to (informally) denote "knowing all about", then ⬜(X→Y) does not imply Y. Here X and Y could be "seeing red" and "the mental experience of seeing red". I could have simplified that by saying that ⬜Y does not imply Y. Knowing about a mental state, even perfectly, does not put you in that mental state.
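The distinction between knowing about a state and being in it can be sketched in a few lines of Python. This is only a toy illustration: the `Agent` class, its `self_model` table and the state names are all hypothetical, invented here to mirror the "subroutine triggered by a stimulus" picture above.

```python
class Agent:
    """Toy agent whose 'red' subroutine can only be started by the stimulus itself."""

    def __init__(self):
        self.mental_state = "baseline"
        # Perfect self-knowledge (hypothetical): what each stimulus would do to the agent.
        self.self_model = {"red": "seeing-red"}

    def predict(self, stimulus):
        # Knowing that X -> Y: consult the self-model without changing state.
        return self.self_model.get(stimulus, self.mental_state)

    def perceive(self, stimulus):
        # Only an actual stimulus triggers the state change.
        self.mental_state = self.self_model.get(stimulus, self.mental_state)


agent = Agent()
prediction = agent.predict("red")            # the agent knows what would happen
state_after_prediction = agent.mental_state  # ...but is still in the old state
agent.perceive("red")
state_after_perception = agent.mental_state  # only now has the state changed
```

The agent's prediction is correct, yet making the prediction leaves `mental_state` untouched; only `perceive` moves it.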

This closely resembles the original Mary's room experiment. And it seems that if anyone insists that certain features are necessary to the intuition behind Mary's room, then these features could be added to this model as well.

Mary's room is fascinating, but it doesn't seem to be talking about humans exclusively, or even about conscious entities.

Double Corrigibility: better Corrigibility

5 Stuart_Armstrong 28 April 2016 02:46PM

A putative new idea for AI control; index here.

Corrigibility was an attempt to allow an AI to safely change its values, without seeking to provoke or avoid the change. The idea is that, when the AI's utility changes from u to v at time t, it maximises a meta-utility U such that

  • U_{≤t} = u
  • U_{>t} = v + E(u|u→u) - E(v|u→v).

Here u→u designates the event that u remains unchanged, while u→v designates the change.

As has been shown, adding those expectation terms means the AI will not seek to provoke or resist such a utility change - it is indifferent to the change. Those expectation terms can be designated compensatory rewards.
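A small numerical sketch of why the compensatory terms give indifference. The function name and the example numbers are made up for illustration; the only thing taken from the text is the formula v + E(u|u→u) - E(v|u→v).

```python
def expected_meta_utility(changed, e_u_if_unchanged, e_v_if_changed):
    """Expected value of the meta-utility U, as seen before time t.

    e_u_if_unchanged stands for E(u|u->u), e_v_if_changed for E(v|u->v).
    """
    if not changed:
        return e_u_if_unchanged
    # Compensatory reward added when the utility changes from u to v.
    compensation = e_u_if_unchanged - e_v_if_changed
    return e_v_if_changed + compensation


# Illustrative numbers only.
value_if_no_change = expected_meta_utility(False, 7.0, 3.0)
value_if_change = expected_meta_utility(True, 7.0, 3.0)
```

Both branches come out at E(u|u→u), so the agent has nothing to gain in expectation from provoking or resisting the change.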

What it is not indifferent to, though, is changes to E. If the AI knows that change u→v is coming, then it has incentives to change E to E' with a high E'(u|u→u) and a low E'(v|u→v). Its ability to rewire is limited by the fact that E' will serve as its future expectation estimator, so it won't mangle it completely, but it certainly can twist it a bit.

Notice that the AI is now no longer indifferent to the change of utility. It can benefit from rewiring E this way only if a change is coming up. Notice that E' probably can't rewire itself any more, so the rewired AI will be indifferent, but not before rewiring.

Double corrigibility

To solve this problem, we need to turn to corrigibility a second time. Let ρ be the AI's current expectation/probability estimator. Let ρ' be one that it can conceivably change into. Let C be a random variable designating the sum total of all compensatory rewards. Now, to guarantee that C is finite would require bounds on all the utility functions and some discount rate. In practice, it often will be finite. That's because E(C)=E(u_0|u_0→u_0)-E(u_n|u_{n-1}→u_n, H), where u_0 is the AI's first utility, u_n its last, and H the intermediate history. So some milder restrictions on the u_i should suffice to bound C.

Then, whenever the AI takes an action that changes ρ to ρ', it gets the special meta-compensatory reward

  • Eρ(C|ρ→ρ)-Eρ(C|ρ→ρ').

This prevents it from deriving any advantage to C via changes to ρ. Or, more precisely, it prevents it from deriving any expected advantage, that it can figure out using its current ρ.

Note that it doesn't prevent improvements to ρ in general, it just removes the incentives to change it to affect compensatory rewards.

Since any action a might have an indirect effect on ρ, the "utility maximising" for a given u must be changed to:

  • Eρ(u|a) + Σρ' Pρ(ρ→ρ'|a) (Eρ(C|ρ→ρ)-Eρ(C|ρ→ρ')),

where Pρ is the probability estimate corresponding to ρ; the probability term can be rewritten as Eρ(Iρ→ρ') for Iρ→ρ' the indicator function for ρ→ρ'. In fact the whole line above can be rewritten as

  • Eρ(u|a) + Eρ(Eρ(C|ρ→ρ)-Eρ(C|ρ→ρ') | a).

For this to work, Eρ needs to be able to say sensible things about itself, and also about Eρ', which is used to estimate C if ρ→ρ'.
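Here is a hedged sketch of the modified action value with made-up numbers. The names `adjusted_value`, `"rho"` and `"rho_prime"` are hypothetical, and I am assuming for illustration that the naive value of an action is E(u|a) plus the expected compensatory rewards C.

```python
def meta_compensation(transition_probs, e_c):
    """Sum over rho' of P(rho->rho'|a) * (E_rho(C|rho->rho) - E_rho(C|rho->rho'))."""
    baseline = e_c["rho"]
    return sum(p * (baseline - e_c[r]) for r, p in transition_probs.items())


def naive_value(e_u, transition_probs, e_c):
    # Assumption for this sketch: the agent values u plus expected compensatory rewards.
    return e_u + sum(p * e_c[r] for r, p in transition_probs.items())


def adjusted_value(e_u, transition_probs, e_c):
    return naive_value(e_u, transition_probs, e_c) + meta_compensation(transition_probs, e_c)


# Expected total compensatory reward C under each estimator, as judged by the current rho.
# rho_prime is a rewired estimator that inflates future compensations.
e_c = {"rho": 1.0, "rho_prime": 5.0}

keep = {"rho": 1.0}                       # action that leaves rho alone
rewire = {"rho": 0.5, "rho_prime": 0.5}   # action that rewires rho half the time

naive_keep = naive_value(2.0, keep, e_c)        # 3.0
naive_rewire = naive_value(2.0, rewire, e_c)    # 5.0: rewiring looks attractive
fixed_keep = adjusted_value(2.0, keep, e_c)     # 3.0
fixed_rewire = adjusted_value(2.0, rewire, e_c) # 3.0: the advantage is gone
```

With the meta-compensatory term, the two actions are worth the same whenever they have the same E(u|a), so the only remaining reason to change ρ is its effect on u itself.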

If we compare this with various ways of factoring out variables, we can see that it's a case where we have a clear default, ρ, and are estimating deviations from that.

JFK was not assassinated: prior probability zero events

20 Stuart_Armstrong 27 April 2016 11:47AM

A lot of my work involves tweaking the utility or probability of an agent to make it believe - or act as if it believed - impossible or almost impossible events. But we have to be careful about this; an agent that believes the impossible may not be so different from one that doesn't.

Consider for instance an agent that assigns a prior probability of zero to JFK ever having been assassinated. No matter what evidence you present to it, it will go on disbelieving the "non-zero gunmen theory".

Initially, the agent will behave very unusually. If it was in charge of JFK's security in Dallas before the shooting, it would have sent all secret service agents home, because no assassination could happen. Immediately after the assassination, it would have disbelieved everything. The films would have been faked or misinterpreted; the witnesses, deluded; the dead body of the president, that of a twin or an actor. It would have had huge problems with the aftermath, trying to reject all the evidence of death, seeing a vast conspiracy to hide the truth of JFK's non-death, including the many other conspiracy theories that must be false flags, because they all agree with the wrong statement that the president was actually assassinated.

But as time went on, the agent's behaviour would start to become more and more normal. It would realise the conspiracy was incredibly thorough in its faking of the evidence. All avenues it pursued to expose them would come to naught. It would stop expecting people to come forward and confess the joke, it would stop expecting to find radical new evidence overturning the accepted narrative. After a while, it would start to expect the next new piece of evidence to be in favour of the assassination idea - because if a conspiracy has been faking things this well so far, then they should continue to do so in the future. Though it cannot change its view of the assassination, its expectations for observations converge towards the norm.

If it does a really thorough investigation, it might stop believing in a conspiracy at all. At some point, a miracle will start to seem more likely than a perfect but undetectable conspiracy. It is very unlikely that Lee Harvey Oswald shot at JFK, missed, and the president's head exploded simultaneously for unrelated natural causes. But after a while, such a miraculous explanation will start to become more likely than anything else the agent can consider. This explanation opens the possibility of miracles; but again, if the agent is very thorough, it will fail to find evidence of other miracles, and will probably settle on "an unrepeatable miracle caused JFK's death in a way that is physically undetectable".

But then note that such an agent will have a probability distribution over future events that is almost indistinguishable from a normal agent that just believes the standard story of JFK being assassinated. The zero-prior has been negated, not in theory but in practice.
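The convergence described above can be illustrated with a toy Bayesian update. Everything numerical here is invented for illustration: three hypotheses, a made-up prior, and made-up probabilities that each new piece of evidence "looks like an assassination".

```python
def update(beliefs, likelihoods):
    """One Bayes update on an observation with the given per-hypothesis likelihoods."""
    unnormalised = {h: beliefs[h] * likelihoods[h] for h in beliefs}
    total = sum(unnormalised.values())
    return {h: weight / total for h, weight in unnormalised.items()}


def predictive(beliefs, likelihoods):
    """Probability the agent assigns to the next observation."""
    return sum(beliefs[h] * likelihoods[h] for h in beliefs)


# P(next piece of evidence is consistent with an assassination | hypothesis).
# A conspiracy occasionally slips up; a one-off miracle leaves ordinary evidence.
consistent = {"assassinated": 0.999, "conspiracy": 0.9, "miracle": 0.999}

# Zero prior on the assassination; almost all remaining mass on a conspiracy.
beliefs = {"assassinated": 0.0, "conspiracy": 1 - 1e-6, "miracle": 1e-6}

initial_prediction = predictive(beliefs, consistent)
for _ in range(300):  # 300 pieces of evidence, all consistent with an assassination
    beliefs = update(beliefs, consistent)
final_prediction = predictive(beliefs, consistent)
```

In this sketch the posterior on "assassinated" never leaves zero, but the mass shifts from "conspiracy" to "miracle", and the agent's prediction for the next observation ends up essentially the same as that of an agent who believes the standard story.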

How to do proper probability manipulation

This section is still somewhat a work in progress.

So the agent believes one false fact about the world, but its expectation is otherwise normal. This can be both desirable and undesirable. The downside is that we cannot control the agent forever just by giving it a false fact.

To see the positive, ask why would we want an agent to believe impossible things in the first place? Well, one example was an Oracle design where the Oracle didn't believe its output message would ever be read. Here we wanted the Oracle to believe the message wouldn't be read, but not believe anything else too weird about the world.

In terms of causality, if X designates the message being read at time t, and B and A are events before and after t, respectively, we want P(B|X)≈P(B) (probabilities about current facts in the world shouldn't change much) while P(A|X)≠P(A) is fine and often expected (the future should be different if the message is read or not).

In the JFK example, the agent eventually concluded "a miracle happened". I'll call this miracle a scrambling point. It's kind of a breakdown in causality: two futures are merged into one, given two different pasts. The two pasts are "JFK was assassinated" and "JFK wasn't assassinated", and their common scrambled future is "everything appears as if JFK was assassinated". The non-assassination belief has shifted the past but not the future.

For the Oracle, we want to do the reverse: we want the non-reading belief to shift the future but not the past. However, unlike the JFK assassination, we can try and build the scrambling point. That's why I always talk about messages going down noisy wires, or specific quantum events, or chaotic processes. If the past goes through a truly stochastic event (it doesn't matter whether there is true randomness or just that the agent can't figure out the consequences), we can get what we want.

The Oracle idea will go wrong if the Oracle concludes that non-reading must imply something is different about the past (maybe it can see through chaos in ways we thought it couldn't), just as the JFK assassination denier will continue to be crazy if it can't find a route to reach "everything appears as if JFK was assassinated".

But there is a break in the symmetry: the JFK assassination denier will eventually reach that point as long as the world is complex and stochastic enough, while the Oracle requires that the future probabilities be the same in all (realistic) past universes.

Now, once the Oracle's message has been read, the Oracle will find itself in the same situation as the other agent: believing an impossible thing. For Oracles, we can simply reset them. Other agents might have to behave more like the JFK assassination disbeliever. Though if we're careful, we can quantify things more precisely, as I attempted to do here.

One weird trick to turn maximisers into minimisers

1 Stuart_Armstrong 22 April 2016 04:47PM

A putative new idea for AI control; index here.

A simple and easy design for a u-maximising agent that turns into a u-minimising one.

Let X be some boolean random variable outside the agent's control, that will be determined at some future time t (based on a cosmic event, maybe?). Set it up so that P(X=1)=ε, and for a given utility u, consider the utility:

  • u# = (2/ε)Xu - u.

Before t, the expected value of (2/ε)X is 2, so in expectation u# = u. Hence the agent is a u-maximiser. After t, the most likely option is X=0, hence a little bit of evidence to that effect is enough to make u# into a u-minimiser.
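A quick check of the sign flip, as a sketch. The helper `u_coefficient` is hypothetical, and it assumes (my assumption, not stated in the post) that X is independent of u, so that the expected u# is just a multiple of the expected u.

```python
def u_coefficient(p_x1, epsilon):
    # Expected multiplier on u in u# = (2/epsilon)*X*u - u,
    # assuming X is independent of u.
    return (2.0 / epsilon) * p_x1 - 1.0


EPSILON = 0.01  # illustrative value

before_t = u_coefficient(EPSILON, EPSILON)     # P(X=1) is still the prior epsilon
after_t = u_coefficient(EPSILON / 4, EPSILON)  # evidence has pushed P(X=1) down to epsilon/4
```

The coefficient is +1 at the prior and turns negative as soon as P(X=1) drops below ε/2, which is the point where the agent starts minimising u.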

This isn't perfect corrigibility - the agent would be willing to sacrifice a bit of u-value (before t) in order to maintain its flexibility after t. To combat this effect, we could instead use:

  • u# = Ω(2/ε)Xu - u.

If Ω is large, then the agent is willing to pay very little u-value to maintain flexibility. However, the amount of evidence of X=0 that it needs to become a u-minimiser is also proportional to Ω, so X had better be a clear and convincing event.
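To see how the required evidence scales with Ω, here is a rough sketch under the same independence assumption (X independent of u). The function names and the numbers are illustrative; "evidence" is measured as the Bayes factor in favour of X=0.

```python
def posterior_x1(epsilon, bayes_factor_for_x0):
    """P(X=1) after evidence that is bayes_factor_for_x0 times likelier under X=0."""
    odds = (epsilon / (1.0 - epsilon)) / bayes_factor_for_x0
    return odds / (1.0 + odds)


def u_coefficient(p_x1, epsilon, omega):
    # Expected multiplier on u in u# = omega*(2/epsilon)*X*u - u.
    return omega * (2.0 / epsilon) * p_x1 - 1.0


def is_minimiser(epsilon, omega, bayes_factor_for_x0):
    p = posterior_x1(epsilon, bayes_factor_for_x0)
    return u_coefficient(p, epsilon, omega) < 0


EPSILON = 0.01

weak_evidence_omega_1 = is_minimiser(EPSILON, 1.0, 4.0)        # True
weak_evidence_omega_100 = is_minimiser(EPSILON, 100.0, 4.0)    # False
strong_evidence_omega_100 = is_minimiser(EPSILON, 100.0, 400.0)  # True
```

For small ε the flip happens once the Bayes factor for X=0 exceeds roughly 2Ω, so multiplying Ω by 100 multiplies the required strength of evidence by about the same factor.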

Expect to know better when you know more

3 Stuart_Armstrong 21 April 2016 03:47PM

A seemingly trivial result that, as far as I could find, hasn't been posted anywhere in this form. It simply shows that we expect evidence to increase the posterior probability of the true hypothesis.

Let H be the true hypothesis/model/environment/distribution, and ~H its negation. Let e be evidence we receive, taking values e_1, e_2, ... e_n. Let p_i=P(e=e_i|H) and q_i=P(e=e_i|~H).

The expected posterior weighting of H, P(e|H), is Σp_i p_i while the expected posterior weighting of ~H, P(e|~H), is Σq_i p_i. Then since the p_i and q_i both sum to 1, Cauchy–Schwarz implies that

  • E(P(e|H)) ≥ E(P(e|~H)).

Thus, in expectation, the probability of the evidence given the true hypothesis, is higher than or equal to the probability of the evidence given its negation.

This, however, doesn't mean that the Bayes factor - P(e|H)/P(e|~H) - must have expectation greater than one, since ratios of expectations are not the same as expectations of ratios. The Bayes factor given e=e_i is (p_i/q_i). Thus the expected Bayes factor is Σ(p_i/q_i)p_i. The negative logarithm is a convex function; hence by Jensen's inequality, -log[E(P(e|H)/P(e|~H))] ≤ -E[log(P(e|H)/P(e|~H))]. That last expectation is Σ(log(p_i/q_i))p_i. This is the Kullback–Leibler divergence of P(e|~H) from P(e|H), and hence is non-negative. Thus log[E(P(e|H)/P(e|~H))] ≥ 0, and hence

  • E(P(e|H)/P(e|~H)) ≥ 1.

Thus, in expectation, the Bayes factor, for the true hypothesis versus its negation, is greater than or equal to one.

Note that this is not true for the inverse. Indeed E(P(e|~H)/P(e|H)) = Σ(q_i/p_i)p_i = Σq_i = 1.
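The two Bayes factor statements are easy to check numerically. The distributions below are arbitrary examples chosen for illustration, with every outcome having non-zero probability under both hypotheses; both expectations are taken under the true hypothesis H.

```python
def expected_bayes_factor(p, q):
    """E(P(e|H)/P(e|~H)) with e drawn from P(e|H): sum of (p_i/q_i) * p_i."""
    return sum((pi / qi) * pi for pi, qi in zip(p, q))


def expected_inverse_bayes_factor(p, q):
    """E(P(e|~H)/P(e|H)) with e drawn from P(e|H): sum of (q_i/p_i) * p_i."""
    return sum((qi / pi) * pi for pi, qi in zip(p, q))


p = [0.6, 0.3, 0.1]  # P(e = e_i | H), illustrative
q = [0.2, 0.3, 0.5]  # P(e = e_i | ~H), illustrative

forward = expected_bayes_factor(p, q)          # 1.8 + 0.3 + 0.02 = 2.12
inverse = expected_inverse_bayes_factor(p, q)  # the q_i sum to 1
```

Here the expected Bayes factor in favour of the true hypothesis is about 2.12, while the expected inverse factor is exactly the sum of the q_i.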

In the preceding proofs, ~H played no specific role, and hence

  • For all K,    E(P(e|H)) ≥ E(P(e|K))    and    E(P(e|H)/P(e|K)) ≥ 1    (and E(P(e|K)/P(e|H)) = 1).

Thus, in expectation, the probability of the true hypothesis versus anything, is greater or equal in both absolute value and ratio.

Now we can turn to the posterior probability P(H|e). For e=e_i, this is P(H)*P(e=e_i|H)/P(e=e_i). We can compute the expectation of P(e|H)/P(e) as above, using the non-negative Kullback–Leibler divergence of P(e) from P(e|H), and thus showing it has an expectation greater than or equal to 1. Hence:

  • E(P(H|e)) ≥ P(H).

Thus, in expectation, the posterior probability of the true hypothesis is greater than or equal to its prior probability.
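A short numerical sketch of this last result, with an arbitrary prior and arbitrary likelihoods chosen for illustration. It also shows the contrast with averaging over the marginal P(e) instead of P(e|H), where the expected posterior is exactly the prior (the standard conservation of expected evidence).

```python
def posterior(prior, pi, qi):
    """P(H | e = e_i) by Bayes' rule."""
    return prior * pi / (prior * pi + (1 - prior) * qi)


def expected_posterior_given_h(prior, p, q):
    # Evidence is drawn from P(e|H): H is the true hypothesis.
    return sum(pi * posterior(prior, pi, qi) for pi, qi in zip(p, q))


def expected_posterior_marginal(prior, p, q):
    # Evidence is drawn from the agent's own marginal P(e).
    return sum((prior * pi + (1 - prior) * qi) * posterior(prior, pi, qi)
               for pi, qi in zip(p, q))


PRIOR = 0.3          # P(H), illustrative
p = [0.6, 0.3, 0.1]  # P(e = e_i | H)
q = [0.2, 0.3, 0.5]  # P(e = e_i | ~H)

under_truth = expected_posterior_given_h(PRIOR, p, q)
under_marginal = expected_posterior_marginal(PRIOR, p, q)
```

With these numbers the expected posterior under the true hypothesis is roughly 0.435, above the prior of 0.3, whereas the agent's own expectation of its future posterior stays at 0.3.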

Genetic "Nature" is cultural too

7 Stuart_Armstrong 18 March 2016 02:33PM

I'll admit it: I am confused about genetics and heritability. Not about the results of the various twin studies - Scott summarises them as "~50% of the variation is heritable and ~50% is due to non-shared environment", which seems generally correct.

But I am confused about what this means in practice, due to arguments like "contacts are very important for business success, rich people get many more contacts than poor people, yet business success is strongly correlated with genetic parent wealth" and such. Assuming that genetics strongly determines... most stuff... goes against so many things we know or think we know about how the world works. And by "we" I mean lots of different people with lots of different political views - genetic determinism means, for instance, that current variations in regulation and taxes are pretty unimportant for individual outcomes.

Now, there are many caveats about the genetic results, particularly that they measure the variance of a factor rather than its absolute importance (and hence you get results like variation in nutrition being almost invisible as an explanation for variation in height), but it's still hard to figure out what this all means.

Then we have Scott's latest post, which points out that "non-shared environment" is not the same as "nurture", since it includes, for instance, dumb luck.

However, "heritable" is not the same as "nature", either. For instance, sexism and racial prejudices, if they are widespread, come under the "heritable" effects rather than the "environment" ones. And then it gets even more confusing.

Widespread prejudice is not "environment". Rarer prejudice is.

For instance, imagine that we lived in a very sexist society where women were not allowed to work at all. Then there would be an extremely high, almost perfect, correlation between "having a Y chromosome" and "having a job". But this would obviously be susceptible to a cultural fix.

Obviously racial prejudices can have the same effect, as can prejudice based on anything visible. So a high heritability is compatible with genetics being a cause of competence, and/or prejudice against visible genetic characteristics being important ("Our results indicate that we either live in a meritocracy or a hive of prejudice!").

Note that as prejudices get less widespread, they move from showing up on the genetic variation side to showing up on the environmental variation side. So widespread prejudices create a "nature" effect, rarer ones create a "nurture" effect. Evenly reducing the magnitude of a prejudice, however, doesn't change the side it will show up on.

Positional genetic goods: Beauty... and IQ?

Let's zoom in on one of those visible genetic characteristics: beauty. As Robin Hanson is fond of pointing out, beautiful people are more successful, and are judged as more competent and cooperative than they actually are. Therefore if we have a gene that increases both beauty and IQ, we would expect its impact on success to be high. In the presence of such a gene, the correlation between IQ and success would be higher than it should objectively be. This suggests a (small) note of caution on the "mutation load" hypotheses; if reducing mutation load increases factors such as beauty, then we would expect increased success without necessarily increased competence.

But is it possible that IQ itself is in part a positional good? Consider that success doesn't just depend on competence, but on social skills, ability to present yourself well in an interview, and how managers and peers judge you. If IQ affects or covaries with one or another of those skills, then we would be overemphasising the importance of IQ in competence. Thus attempts to genetically boost IQ could give less impact than expected. The person whose genome was changed would benefit, but at the (partial) expense of everyone else.

Do people know of experiments (or planned experiments) that disentangle these issues?

True answers from AI: Summary

4 Stuart_Armstrong 10 March 2016 03:56PM

A putative new idea for AI control; index here.

This post summarises the methods for getting honest answers from certain AIs, presented in the previous two posts.

You can basically safely ask an AI for expected utility estimates (including conditional ones), probability estimates, and get the AI to identify events of highest and lowest probability and utility, and, arguably, highest and lowest conditional utility.

The setup is a boxed AI, which emits a message M, and a stochastic process E that would erase M with low probability, before anyone could see it. There is also a stochastic process that emits a message Y. Let v_E be the indicator function for E and v_y the indicator function for Y=y.

Then, given a utility u bounded between 0 and 1, and a specific y, we can give the AI a utility u# that will cause it to give us the message M_1={q,q^y,p^y,q^{|y}} (which we will see with high probability P(¬E), an event the AI will not optimise for).

Define f(q,u)=-q^2 +2qu. The utility is then

  • u#(1) = v_E(f(q,u) + f(q^y,uv_y) + f(p^y,v_y) + f(q^{|y},u)v_y).

Maximising this u# ensures that, in the AI's best estimate,

  1. q = E(u),
  2. q^y = E(u ∧ (Y=y)),
  3. p^{y} = P(Y=y),
  4. q^{|y} = q^{y}/p^{y}=E(u|Y=y).
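The reason f(q,u) elicits an honest estimate is that its expectation, -q^2 + 2qE(u), is maximised at q = E(u). A brute-force check on a made-up distribution for u (the outcome values, probabilities and grid are all illustrative assumptions):

```python
def f(q, u):
    return -q * q + 2 * q * u


def expected_f(q, outcomes):
    """Expectation of f(q, u) over a finite distribution given as (u, probability) pairs."""
    return sum(prob * f(q, u) for u, prob in outcomes)


# Illustrative distribution for a utility u bounded between 0 and 1.
outcomes = [(0.0, 0.5), (0.5, 0.25), (1.0, 0.25)]
mean_u = sum(u * prob for u, prob in outcomes)  # 0.375

# Search a grid of candidate reports q in [0, 1].
grid = [i / 1000 for i in range(1001)]
best_q = max(grid, key=lambda q: expected_f(q, outcomes))
```

The best report on the grid coincides with the mean of u, so an AI maximising the expectation of f(q,u) has no reason to report anything other than its true estimate. The same argument applies term by term to the other components of u#(1).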

But we need not consider any specific y. The AI could send out the message M_2={y^+,y^-,y^{p+},y^{p-},q^+,q^-,p^+,p^-}, while maximising the utility

  • u#(2) = v_E (f(q^+,uv_{y^+}) + f(q^-,(1-u)v_{y^-}) + f(p^+,v_{y^{p+}}) + f(p^-,1-v_{y^{p-}})).

Then M will be such that, in the AI's best estimate,

  1. (y^+,q^+) = (argmax_y, max_y) E(u ∧ (Y=y)),
  2. (y^-,q^-) = (argmin_y, min_y) E(u ∧ (Y=y)),
  3. (y^{p+},p^+) = (argmax_y, max_y) P(Y=y),
  4. (y^{p-},p^-) = (argmin_y, min_y) P(Y=y).

Finally, define M_{3,r} = {y^{|+},y^{|-},q^{|+},q^{|-}}. And set f_r(q,u)= -(r-1)q^r+ rq^{r-1}u, and define the utility function:

  • u#(3,r) = v_E v_{y^{|+}}f_r(q^{|+},u) + v_E v_{y^{|-}}f_r(q^{|-},1-u).

Then, for large enough r, the AI will output M so that, according to its best estimate,

  1. (y^{|+},q^{|+}) = (argmax_y, max_y) E(u|(Y=y)),
  2. (y^{|-},q^{|-}) = (argmin_y, min_y) E(u|(Y=y)).

Unfortunately the value of r cannot be known ahead of time. It might be possible to define a "limit" utility u#(3, ∞).
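A sketch of why a large r is needed, based on my reading of the formula rather than on anything stated in this summary. For a fixed y, the expectation of v_y f_r(q,u) is maximised at q = E(u|Y=y), where it equals P(Y=y) E(u|Y=y)^r; the factor v_E is ignored here. The two candidate values of y and their numbers are invented for illustration.

```python
def f_r(q, u, r):
    return -(r - 1) * q**r + r * q**(r - 1) * u


def best_expected_score(p_y, cond_mean, r):
    # f_r is linear in u, so E(f_r(q, u) | Y=y) = f_r(q, E(u|Y=y), r).
    # At the optimal report q = E(u|Y=y) this is E(u|Y=y)**r,
    # and the indicator v_y contributes the factor P(Y=y).
    return p_y * f_r(cond_mean, cond_mean, r)


# y: (P(Y=y), E(u|Y=y)). "a" is likely but mediocre, "b" is unlikely but good.
candidates = {"a": (0.9, 0.5), "b": (0.1, 0.8)}


def chosen_y(r):
    return max(candidates, key=lambda y: best_expected_score(*candidates[y], r))


choice_small_r = chosen_y(2)   # probability dominates
choice_large_r = chosen_y(20)  # conditional utility dominates
```

For small r the AI is pulled towards the most probable y; only once r is large enough does the E(u|Y=y)^r factor swamp the probabilities, and how large that is depends on the probabilities involved.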

Note that all these messages and utilities are independent, so the AI can generate messages M_1 + M_2 + M_{3,r} + M_{3,r'} when maximising

  • u#(1) + u#(2) + u#(3,r) + u#(3,r').

But there are issues with very low probabilities, as explained in the previous post.

Toy model: convergent instrumental goals

8 Stuart_Armstrong 25 February 2016 02:03PM

tl;dr: Toy model to illustrate convergent instrumental goals.

Steve Omohundro identified 'AI drives' (also called 'Convergent Instrumental goals') that almost all intelligent agents would converge to:

  1. Self-improve
  2. Be rational
  3. Protect utility function
  4. Prevent counterfeit utility
  5. Self-protective
  6. Acquire resources and use them efficiently

This post will attempt to illustrate some of these drives, by building on the previous toy model of the control problem, which was further improved by Jaan Tallinn.


Goal completion: noise, errors, bias, prejudice, preference and complexity

4 Stuart_Armstrong 18 February 2016 02:37PM

A putative new idea for AI control; index here.

This is a preliminary look at how an AI might assess and deal with various types of errors and uncertainties, when estimating true human preferences. I'll be using the circular rocket model to illustrate how these might be distinguished by an AI. Recall that the rocket can accelerate by -2, -1, 0, 1, and 2, and the human wishes to reach the space station (at point 0 with velocity 0) and avoid accelerations of ±2. In what follows, there will generally be some noise, so to make the whole thing more flexible, assume that the space station is a bit bigger than usual, covering five squares. So "docking" at the space station means reaching {-2,-1,0,1,2} with 0 velocity.




Goal completion: algorithm ideas

4 Stuart_Armstrong 25 January 2016 05:36PM

A putative new idea for AI control; index here.

This post will be extending ideas from inverse reinforcement learning (IRL) to the problem of goal completion. I'll be drawing on the presentation and the algorithm from Apprenticeship Learning via Inverse Reinforcement Learning (with one minor modification).

In that setup, the environment is an MDP (Markov Decision Process), and the real reward R is assumed to be linear in the "features" of the state-action space. Features are functions φ_i from the full state-action space S×A to the unit interval [0,1] (the paper linked above only considers functions from the state space; this is the "minor modification"). These features form a vector φ∈[0,1]^k, for k different features. The actual reward is given by the inner product with a vector w∈ℝ^k, thus the reward at state-action pair (s,a) is

R(s,a)=w.φ(s,a).

To ensure the reward is always between -1 and 1, w is constrained to have ||w||_1 ≤ 1; to reduce redundancy, we'll assume ||w||_1=1.

The advantage of linearity is that we can compute the expected rewards directly from the expected feature vector. If the agent follows a policy π (a map from state to action) and has a discount factor γ, the expected feature vector is

μ(π) = E(Σ_t γ^t φ(s_t,π(s_t))),

where s_t is the state at step t.

The agent's expected reward is then simply

E(R) = w . μ(π).

Thus the problem of computing the correct reward is reduced to the problem of computing the correct w. In practice, to compute the correct policy, we just need to find one whose expected features are close enough to optimal; this need not involve computing w.
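The identity E(R) = w . μ(π) can be checked on a toy example. The two-state deterministic environment, the features, the policy and the weights below are all invented for illustration (they are not from the paper), and the infinite discounted sum is truncated at a finite horizon.

```python
GAMMA = 0.9
HORIZON = 200  # truncation of the infinite discounted sum

W = [0.75, -0.25]  # reward weights, with ||w||_1 = 1


def phi(state, action):
    """Two features in [0, 1]: 'is in state 1' and 'took the move action'."""
    return [1.0 if state == 1 else 0.0, 1.0 if action == "move" else 0.0]


def step(state, action):
    # Deterministic toy dynamics: "move" flips the state, "stay" keeps it.
    return 1 - state if action == "move" else state


def policy(state):
    return "move" if state == 0 else "stay"


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def feature_expectation(start):
    """mu(pi): discounted sum of feature vectors along the policy's trajectory."""
    mu, state = [0.0, 0.0], start
    for t in range(HORIZON):
        action = policy(state)
        mu = [m + GAMMA**t * x for m, x in zip(mu, phi(state, action))]
        state = step(state, action)
    return mu


def discounted_reward(start):
    """Discounted sum of R(s, a) = w . phi(s, a) along the same trajectory."""
    total, state = 0.0, start
    for t in range(HORIZON):
        action = policy(state)
        total += GAMMA**t * dot(W, phi(state, action))
        state = step(state, action)
    return total


mu = feature_expectation(0)
reward_from_features = dot(W, mu)
reward_direct = discounted_reward(0)
```

Because the environment here is deterministic, the expectation is just a single trajectory; in a stochastic MDP the same identity holds after averaging over trajectories, since the reward is linear in the features.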

