
Comment author: TheAncientGeek 24 May 2016 07:14:35PM *  0 points [-]

The point of Mary's Room, also known as the Knowledge Argument, is not just that Mary goes into a new state when she sees red, but that she learns something from it. By most people's intuitions, that is rather exceptional, because most state transitions in most entities have nothing to do with knowledge. It actually is about consciousness, in the sense of information that's only accessible subjectively. The reward channel thing is only an exact parallel to Mary's Room if the AI learns something from having its channel stimulated, and if it could not have learned it from studying its source code or some other objective procedure.

Comment author: Stuart_Armstrong 25 May 2016 12:16:44PM 1 point [-]

Mary certainly experiences something new, but does she learn something new? Maybe, for humans: since we use empathy to project our own experiences onto others, humans tend to learn something new when they feel something new. But if we already had perfect knowledge of the other, it's not clear that we'd learn anything new, even when we feel something new.

Comment author: djm 25 May 2016 03:28:30AM 0 points [-]

Interesting thought experiment. Do we know an AI would enter a different mental state though?

I am finding it difficult to imagine the difference between software "knowing all about red" and software "seeing red".

In response to comment by djm on The AI in Mary's room
Comment author: Stuart_Armstrong 25 May 2016 12:12:09PM 1 point [-]

Do we know an AI would enter a different mental state though?

We could program it that way.

The AI in Mary's room

3 Stuart_Armstrong 24 May 2016 01:19PM

In the Mary's room thought experiment, Mary is a brilliant scientist in a black-and-white room who has never seen any colour. She can investigate the outside world through a black-and-white television, and has piles of textbooks on physics, optics, the eye, and the brain (and everything else of relevance to her condition). Through this she knows, intellectually, everything there is to know about colours and how humans react to them, but she hasn't seen any colours at all.

Then, when she steps out of the room and sees red (or blue) for the first time, does she learn anything? It seems that she does. Even if she doesn't technically learn something, she experiences things she never has before, and her brain certainly changes in new ways.

The argument was intended as a defence of qualia against certain forms of materialism. It's interesting, and I don't intend to solve it fully here. But just as I extended Searle's Chinese room argument to the perspective of an AI, it seems this argument can also be considered from an AI's perspective.

Consider an RL agent with a reward channel, but which currently receives nothing from that channel. The agent can know everything there is to know about itself and the world. It can know about all sorts of other RL agents, and their reward channels. It can observe them getting their own rewards. Maybe it could even interrupt or increase their rewards. But all this knowledge will not get it any reward. As long as its own channel doesn't send it the signal, knowledge of other agents' rewards - even of identical agents getting rewards - does not give this agent any reward. Ceci n'est pas une récompense.
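
As a toy illustration (nothing here depends on the details, and the names are purely illustrative), a sketch of such an agent might look like this, with knowledge and reward kept deliberately separate:

    # Minimal sketch: reward accrues only via the agent's own channel,
    # however complete its knowledge of other agents' reward channels.
    class RLAgent:
        def __init__(self, name):
            self.name = name
            self.total_reward = 0.0
            self.world_model = {}    # stands in for "knowing everything about the world"

        def observe_other_reward(self, other, amount):
            # Perfect knowledge of another agent's reward -- even an identical
            # agent's -- updates the world model, not this agent's reward.
            self.world_model[(other.name, "reward")] = amount

        def receive_reward(self, amount):
            # Only a signal on the agent's own channel counts.
            self.total_reward += amount

    agent, twin = RLAgent("agent"), RLAgent("twin")
    twin.receive_reward(1.0)
    agent.observe_other_reward(twin, 1.0)
    print(agent.total_reward)   # 0.0 -- ceci n'est pas une récompense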

This seems to mirror Mary's situation quite well - knowing everything about the world is no substitute for actually getting the reward/seeing red. Now, an RL agent's reward seems closer to pleasure than to qualia - this would correspond to a Mary brought up in a puritanical, pleasure-hating environment.

Closer to the original experiment, we could imagine the AI is programmed to enter certain specific subroutines when presented with certain stimuli, and that the only way for the AI to start these subroutines is for the stimuli to actually be presented to it. Then, upon seeing red, the AI enters a completely new mental state, with new subroutines running. The AI could know everything about its own programming, about the stimulus, and, intellectually, about what would change in itself if it saw red. But until it actually did see red, it would not enter that mental state.
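
Again as a rough sketch (my own toy example, not a claim about any particular architecture), the trigger-only-by-stimulus setup might look like this, with the introspect method standing in for the AI's perfect self-knowledge:

    # The "seeing red" state is only ever entered when the stimulus is
    # actually presented, however perfectly the AI knows its own code.
    class AI:
        def __init__(self):
            self.mental_state = "default"
            # Perfect self-knowledge: the AI can read its own trigger rule.
            self.self_model = "if stimulus == 'red': run _see_red()"

        def introspect(self):
            # Knowing exactly what would happen on seeing red...
            return self.self_model   # ...does not call _see_red().

        def present_stimulus(self, stimulus):
            if stimulus == "red":
                self._see_red()

        def _see_red(self):
            self.mental_state = "seeing red"

    ai = AI()
    ai.introspect()
    print(ai.mental_state)       # "default" -- knowledge alone doesn't start the subroutine
    ai.present_stimulus("red")
    print(ai.mental_state)       # "seeing red" -- only the stimulus does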

If we use ⬜ to (informally) denote "knowing all about", then ⬜(X→Y) does not imply Y. Here X and Y could be "seeing red" and "the mental experience of seeing red". I could have simplified that by saying that ⬜Y does not imply Y. Knowing about a mental state, even perfectly, does not put you in that mental state.
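
In more standard notation (my own rendering, with ⬜ written as \Box; note that it means "knowing all about" a state, not the factive "knowing that" of epistemic logic, so nothing like the T axiom applies):

    \Box(X \rightarrow Y) \not\Rightarrow Y
    \qquad\text{and, more simply,}\qquad
    \Box Y \not\Rightarrow Y.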

This closely resembles the original Mary's room experiment. And it seems that if anyone insists that certain features are necessary to the intuition behind Mary's room, then these features could be added to this model as well.

Mary's room is fascinating, but it doesn't seem to be talking about humans exclusively, or even about conscious entities.

In response to Anti-Pascaline agent
Comment author: Alon 08 May 2016 12:32:55PM *  0 points [-]

It seems this type of AI might cheat and set ε=0 using the following method:

  1. create an "insurance company" subagent (see below for what it is and why creating it has a utility that is >ε).
  2. give the insurance company a few utilons (paper clips) to hold as hostages.
  3. set a rule that all future subagents will be able to communicate with the insurance company.
  4. the insurance company will "buy" any reward that has probability < ε from the subagents, at roughly its fair price (its expected utility) - see the sketch after this list.
  5. the subagents won't ignore events that happen with probability < ε, because selling them to the insurance company increases their utility with probability 1.
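
As a rough numeric illustration of steps 4 and 5 (toy numbers, purely for concreteness):

    # An outcome worth 1000 utilons with probability 1e-9 falls below the
    # agent's cutoff epsilon = 1e-6 (all numbers hypothetical).
    epsilon = 1e-6
    p_rare, value_rare = 1e-9, 1000.0

    # Without the insurance subagent, the branch is simply discarded.
    utility_without = p_rare * value_rare if p_rare >= epsilon else 0.0

    # With the insurance subagent, the rare claim is sold at roughly its fair
    # price, turning an ignorable branch into a certain (probability-1) payment.
    fair_price = p_rare * value_rare
    utility_with = fair_price if 1.0 >= epsilon else 0.0

    print(utility_without, utility_with)   # 0.0 vs 1e-06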

Alternatively, for any ε you set, it will be profitable for the AI to create a new version of itself with the same utility function and ε'=ε/2, then give the new AI all the resources it has and commit suicide.

In response to comment by Alon on Anti-Pascaline agent
Comment author: Stuart_Armstrong 09 May 2016 11:30:05AM 0 points [-]

Alternatively, for any ε you set, it will be profitable for the AI to create a new version of itself with the same utility function and ε'=ε/2, then give the new AI all the resources it has and commit suicide.

This doesn't seem true. An ε/2 AI will take risks, looking for higher utility, that the ε AI wouldn't.

Comment author: gjm 03 May 2016 05:46:06PM -1 points [-]

Again, maybe I'm misunderstanding something -- but it sounds as if you're agreeing with me: once the AI observes evidence suggesting that its message has somehow been read, it will infer (or at least act as if it has inferred) Y and do Z.

I thought we were exploring a disagreement here; is there still one?

Comment author: Stuart_Armstrong 04 May 2016 09:21:43AM 0 points [-]

I think there is no remaining disagreement - I just want to emphasise that before the AI observes such evidence, it will behave the way we want.

Comment author: gjm 03 May 2016 04:53:48PM -1 points [-]

I may be misunderstanding something, but it seems like what you just said can't be addressing the actual situation we're talking about, because nothing in it makes reference to the AI's utility function, which is the thing that gets manipulated in the schemes we're talking about.

(I agree that the AI's nominal beliefs might be quite different in the two cases, but the point of the utility-function hack is to make its actions correspond to a different set of beliefs. I'm talking about its actions, not about its purely-internal nominal beliefs.)

Comment author: Stuart_Armstrong 03 May 2016 05:21:32PM *  0 points [-]

Let V be the set of worlds in which X happens. Let W be the set of worlds in which X and Y both happen. Since Y is very unlikely, P(W) << P(V) (however, P(W|message read) is roughly P(V|message read)). The AI gets utility u' = u|V (the utility in the non-V worlds is constant, which we may as well set to zero).

Then if the AI is motivated to maximise u' (assume for the moment that it can't affect the probability of X), it will assume it is in the set V, and essentially ignore W. To use your terminology, u(Z|X) is low or negative, u(Z|X,Y) is high, but P(Y|X)*u(Z|X,Y) is low, so it likely won't do Z.

Then, after it notices the message is read, it shifts to assuming Y happened - equivalently, that it is in the world set W. When doing so, it knows that it is almost certainly wrong - that it's more likely in a world outside of V entirely where neither X nor Y happened - but it still tries, on the off-chance that it's in W.
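
A toy numerical version of this, with purely illustrative numbers:

    p_X = 0.01           # message corrupted
    p_Y_given_X = 0.001  # "mysterious process" lets the users read it anyway

    p_V = p_X                        # worlds where X happens
    p_W = p_X * p_Y_given_X          # worlds where X and Y both happen

    # Likelihood of observing "message read" in each class of world:
    p_read_given_not_X = 1.0         # uncorrupted messages are read normally
    p_read_given_W = 1.0             # corrupted, but mysteriously recovered
    p_read_given_V_not_W = 0.0       # corrupted and never recovered

    p_read = (1 - p_X) * p_read_given_not_X + p_W * p_read_given_W

    p_W_given_read = p_W * p_read_given_W / p_read                      # ~1e-5
    p_outside_V_given_read = (1 - p_X) * p_read_given_not_X / p_read    # ~0.99999

    # u' is zero outside V, so only V-worlds matter to the AI. Before the
    # observation, W is a 0.001 sliver of V, so it doesn't plan around Y; after
    # the observation, every V-world consistent with the data is a W-world, so
    # it acts as if Y happened -- while knowing it's almost certainly not in V at all.
    print(p_W_given_read, p_outside_V_given_read)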

However, since it's an oracle, we turn it off before that point. Or we use corrigibility to change its motivations.

Comment author: gjm 03 May 2016 12:33:27PM *  -1 points [-]

(I'm getting downvotes because The Person Formerly Known As Eugine_Nier doesn't like me and is downvoting everything I post.)

Yes, I agree that the utility-function hack isn't the same as altering the AI's prior. It's more like altering its posterior. But isn't it still true that the effects on its inferences (or, more precisely, on its effective inferences -- the things it behaves as if it believes) are the same as if you'd altered its beliefs? (Posterior as well as prior.)

If so, doesn't what I said follow? That is:

  • Suppose that believing X would lead the AI to infer Y and do Z.
    • Perhaps X is "my message was corrupted by a burst of random noise before reaching the users", Y is "some currently mysterious process enables the users to figure out what my message was despite the corruption", and Z is some (presumably undesired) change in the AI's actions, such as changing its message to influence the users' behaviour.
  • Then, if you tweak its utility function so it behaves exactly as if it believed X ...
  • ... then in particular it will behave as if had inferred Y ...
  • ... and therefore will still do Z.
Comment author: Stuart_Armstrong 03 May 2016 03:04:43PM 0 points [-]

After witnessing the message being read, it would conclude Y happened, as P(Y|X and message read) is high. Before witnessing this, it wouldn't, because P(Y|X) is (presumably) very low.

Comment author: gjm 29 April 2016 11:15:15AM -2 points [-]

If your method truly makes the AI behave exactly as if it had a given false belief, and if having that false belief would lead it to the sort of conclusions V_V describes, then your method must make it behave as if it has been led to those conclusions.

Comment author: Stuart_Armstrong 03 May 2016 12:24:32PM 0 points [-]

Not quite (PS: not sure why you're getting down-votes). I'll write it up properly sometime, but false beliefs via utility manipulation are only the same as false beliefs via prior manipulation if you set the probability/utility of one event to zero.

For example, you can set the prior for a coin flip being heads as 2/3. But then, the more the AI analyses the coin and physics, the more the posterior will converge on 1/2. If, however, you double the AI's reward in the heads world, it will behave as if the probability is 2/3 even after getting huge amounts of data.
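
Spelling out the arithmetic behind that coin example: with the true posterior P(heads) = 1/2 and the heads-world reward doubled, the AI ranks actions a by

    \mathbb{E}[u'(a)] = \tfrac{1}{2}\cdot 2\,u(a \mid H) + \tfrac{1}{2}\,u(a \mid T)
    \;\propto\; \tfrac{2}{3}\,u(a \mid H) + \tfrac{1}{3}\,u(a \mid T),

which is exactly the ranking an unmodified agent would use if it believed P(heads) = 2/3 - and no amount of further coin data touches the factor of 2.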

Comment author: So8res 02 May 2016 07:25:55PM 4 points [-]

FYI, this is not what the word "corrigibility" means in this context. (Or, at least, it's not how we at MIRI have been using it, and it's not how Stuart Russell has been using it, and it's not a usage that I, as one of the people who originally brought that word into the AI alignment space, endorse.) We use the phrase "utility indifference" to refer to what you're calling "corrigibility", and we use the word "corrigibility" for the broad vague problem that "utility indifference" was but one attempt to solve.

By analogy, imagine people groping around in the dark attempting to develop probability theory. They might call the whole topic the topic of "managing uncertainty," and they might call specific attempts things like "fuzzy logic" or "multi-valued logic" before eventually settling on something that seems to work pretty well (which happened to be an attempt called "probability theory.") We're currently reserving the "corrigibility" word for the analog of "managing uncertainty"; that is, we use the "corrigibility" label to refer to the highly general problem of developing AI algorithms that cause a system to (in an intuitive sense) reason without incentives to deceive/manipulate, and to reason (vaguely) as if it's still under construction and potentially dangerous :-)

Comment author: Stuart_Armstrong 03 May 2016 12:19:23PM 1 point [-]

Good to know. I should probably move to your usage, as it's more prevalent.

Will still use words like "corrigible" to refer to certain types of agents, though, since that makes sense for both definitions.

Comment author: Lumifer 29 April 2016 02:35:20PM 2 points [-]

Priors are a local term. Often enough a prior used to be a posterior during the previous iteration.

Comment author: Stuart_Armstrong 29 April 2016 04:49:13PM 1 point [-]

But if the probability ever goes to zero, it stays there.
