This is a shorter summary of the post "mAIry's room: AI reasoning to solve philosophical problems". Its aim is to explain the key concepts of that post without going into full technical details.

Can you feel my pain?

People can be more or less empathetic, but we have trouble really feeling the pain of another person. If the person is close to us, and we get more details of what they're going through, then we can start to feel a sympathetic pain; but almost nobody reacts viscerally to a sentence like "and then someone in the country started to feel a lot of pain."

So, though we can feel our own pain, and though we can to some extent feel the pain of others if given enough detail, the information that someone else is in pain is very different from the experience of pain.

Now, this might be where you could start talking about "qualia"; but the key insight is that this behaviour is very akin to that of reinforcement learning agents.

The "pain" of RL agents

You can see "low reward" as a crude analogue of "pain" for a reinforcement learning agent.

It has some of the features mentioned above. An RL agent's own reward channel is much more important to it than the reward channel of another agent. If it gets some information about low reward for other agents, this will not be of any interest to it unless this affects its own reward.

We can make the analogy closer if we imagine more limited, boundedly rational RL agents[1], maybe powered by neural nets under the hood. These neural nets are function approximators, but we can also think of them as heuristic generators. They generate approximate answers to specific questions (e.g. "what action best increases my reward?") and update themselves based on inputs and outputs.

Now let's imagine this agent has been in a rather unchanging environment for some time, getting medium rewards, with the neural net refining itself but not changing much, as its predictions are very decent. In such an environment, the agent has the luxury of making long-term plans.

And then, suddenly, the rewards drop to something very low. The agent's neural net starts to update massively, since its predictions are now quite wrong. Weights and connections that haven't changed in a long time are suddenly shifting. Meanwhile, the agent's behaviour changes radically, as it tries to figure out a behaviour, any behaviour, that will bring its rewards back up. The long-term is ignored - it's not clear that long-term plans will pan out, anyway, if the short-term predictions are so wrong.

To over-anthropomorphise, we could say that the agent is panicking, that it's trying desperately to avoid the low reward, focusing on little else, and attempting anything it can to get away from the low reward.

That seems akin to a human pain response. More properly, it seems to share features with that pain response, even if the analogy is not exact.
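The sudden-drop scenario can be illustrated with a deliberately tiny sketch. It assumes the agent's whole "neural net" is a single scalar reward prediction, nudged towards each observed reward by a fixed learning rate; the function name, reward values and learning rate are all hypothetical, chosen only to make the effect visible.

```python
def simulate_reward_drop(steps=200, drop_at=100, medium=1.0, low=-5.0, lr=0.1):
    """Track one scalar reward prediction; return the size of every update."""
    prediction = 0.0
    update_sizes = []
    for t in range(steps):
        # Stable medium reward at first, then a sudden drop to a low reward.
        reward = medium if t < drop_at else low
        error = reward - prediction      # how wrong the prediction was
        delta = lr * error               # update proportional to the error
        prediction += delta
        update_sizes.append(abs(delta))
    return update_sizes


sizes = simulate_reward_drop()
settled_update = sizes[99]   # last step before the drop: prediction is nearly right
shock_update = sizes[100]    # first step after the drop: prediction is badly wrong
print(settled_update, shock_update)
```

In this toy setting the update just after the drop is several orders of magnitude larger than the update just before it, which is the "weights that haven't changed in a long time are suddenly shifting" pattern in miniature. A real agent would of course have many weights and a far richer update rule.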

As agents learn...

What is learning for an RL agent? Well, if we assume that it has neural nets under the hood, then learning new information is mainly updating the weights in these nets.

If it wanted to keep track of whether it was learning, then it would keep track of these weights. So, again abusing terminology slightly, the "feeling of learning" for an RL agent is it detecting major shifts in its weights.
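As a rough sketch of what "detecting major shifts in its weights" could mean, one can imagine a small monitor that compares successive snapshots of the weights and flags a large jump. The class name, the Euclidean distance measure and the threshold are illustrative assumptions, not something specified in the post.

```python
import math


class LearningMonitor:
    """Hypothetical monitor: flags when the weights move a lot between snapshots."""

    def __init__(self, threshold=0.5):
        self.threshold = threshold
        self.previous = None  # no snapshot seen yet

    def observe(self, weights):
        """Return True if the weights shifted by more than the threshold."""
        if self.previous is None:
            shift = 0.0
        else:
            # Euclidean distance between the old and new weight vectors.
            shift = math.dist(weights, self.previous)
        self.previous = list(weights)
        return shift > self.threshold


monitor = LearningMonitor(threshold=0.5)
first = monitor.observe([0.0, 0.0])   # first snapshot, nothing to compare against
small = monitor.observe([0.1, 0.0])   # minor refinement
large = monitor.observe([3.1, 4.0])   # major shift
print(first, small, large)
```

Under this reading, the "feeling of learning" is just the monitor's output being fed back to the rest of the agent as one more input.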

mAIry's room

The above showed the similarity between "pain" for a human and "low reward" for an RL agent. So it is worth asking if there are other features of qualia that might appear in such an artificial context.

In the original "Mary's room" thought experiment, Mary has been confined to a grey room from birth, exploring the outside world only through a black-and-white monitor.

Though isolated, Mary is a brilliant scientist, and has learnt all there is to know about light, the eye, colour theory, human perception, and human psychology. It would seem that she has all possible knowledge that there could be about colour, despite having never seen it.

Then one day she gets out of her room, and says "wow, so that's what purple looks like!".

Has she learnt anything new here? If not, what is her exclamation about? If so, what is this knowledge - Mary was supposed to know everything there was to know about colour already?

mAIry and the content of "wow"

Let's replace Mary with an artificial agent mAIry, and see if we can reproduce the "wow" experience for it - and hope that this might give some insights into what Mary is experiencing.

Now, mAIry has a collection of neural nets, dedicated to interpreting its input streams and detecting the features of the world that are then fed to its mental models. So we could imagine the setup as follows:

Here mAIry's input cameras are looking at a purple object. These inputs go through various neural nets that detect the presence of various features. The colour purple is just one of these features, that may or may not be triggered.

These features then go to the mental model and more rational part of mAIry, which can analyse them and construct a picture of the outside world, and of itself.

So, to get back to mAIry seeing purple for the first time. As a brilliant scientist, mAIry has learnt everything there is to know about the properties of purple, and the properties of its own neural nets and mental models. So, it can model itself within itself:

And then it sees purple, for real, for the first time. This triggers its "purple detector", which has never triggered until now. This new data will also shift the weights in this part of its neural nets, so mAIry will "detect that it is learning". So it will have a new "experience" (the triggering of its purple feature detector) and a feeling of learning (as it detects that its weights have shifted).

That is the "wow" experience for mAIry: new experiences, and a feeling of learning. Since it knows everything about itself and the world, it knows ahead of time that it will "feel" this. But, even with this knowledge, it will not actually feel these until it actually sees purple.
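The distinction between knowing about the "wow" and undergoing it can be sketched in a toy model. Everything here is a hypothetical illustration: the class, its fields and the single "purple weight" stand in for mAIry's feature detectors and self-model, and are not taken from the post.

```python
class MAIry:
    """Toy agent with a purple detector and a self-model that predicts its own reactions."""

    def __init__(self):
        self.purple_weight = 0.0          # detector weight, never updated so far
        self.detector_has_fired = False
        self.events = []                  # what has actually happened inside the agent

    def predict_reaction_to_purple(self):
        # Self-model: knowledge *about* what would happen.
        # Calling this changes nothing in the detector or its weights.
        return ["purple detector fires", "weights shift"]

    def see(self, colour):
        if colour == "purple":
            # The detector actually triggers...
            self.detector_has_fired = True
            self.events.append("purple detector fires")
            # ...and the associated weights actually change.
            self.purple_weight += 1.0
            self.events.append("weights shift")


agent = MAIry()
predicted = agent.predict_reaction_to_purple()
state_before = (agent.detector_has_fired, agent.purple_weight, list(agent.events))
agent.see("purple")
state_after = (agent.detector_has_fired, agent.purple_weight, list(agent.events))
print(predicted, state_before, state_after)
```

The prediction is correct in every detail, yet the agent's internal state only changes when `see("purple")` is called. That gap between an accurate self-model and the triggered state is what the "wow" is pointing at in this account.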

This account of mAIry's experiences seems close enough to Mary's to suggest that both are purely physical processes, and that they are comparable to some extent.


  1. Another way we could make the RL agent more similar to humans is to make it more empathetic, by including the rewards of other agents into its own reward. If we combine that with some imperfect heuristics for estimating the "pain" of other agents, then we could start to get the human-like behaviour where nearby evident pain affects the agent more than distant, formally described pain. ↩︎
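One possible reading of this footnote, as a minimal sketch: the agent's reward is its own reward plus a weighted sum of other agents' rewards, where each weight is an imperfect "salience" estimate that is high for nearby, evident pain and low for distant, formally described pain. The function name, the `empathy` coefficient and the salience values are all assumptions made for illustration.

```python
def empathetic_reward(own_reward, others, empathy=0.5):
    """Combine own reward with salience-weighted rewards of other agents.

    others: list of (reward, salience) pairs, with salience in [0, 1]
            standing in for how vividly the other agent's state is perceived.
    """
    felt = sum(salience * reward for reward, salience in others)
    return own_reward + empathy * felt


# Same underlying "pain" (-10) in both cases; only its salience differs.
nearby = empathetic_reward(1.0, [(-10.0, 0.9)])    # evident pain, close at hand
distant = empathetic_reward(1.0, [(-10.0, 0.01)])  # a sentence about someone far away
print(nearby, distant)
```

With these made-up numbers the nearby case drags the agent's reward well below zero, while the distant case barely moves it.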

2 comments

The topic of risks related to morally relevant computations seems very important, and I hope a lot more work will be done on it!

My tentative intuition is that learning is not directly involved here. If the weights of a trained RL agent are no longer being updated after some point[1], my intuition is that the model is similarly likely to experience pain before and after that point (assuming the environment stays the same).

Consider the following hypothesis, which does not involve a direct relationship between learning and pain: at sufficiently large scale (and in sufficiently complex environments), TD learning tends to create components within the network, call them "evaluators", that evaluate certain metrics that correlate with expected return. In practice the model is trained to optimize directly for the output of the evaluators (and maximizing the output of the evaluators becomes the mesa objective). Suppose we label possible outputs of the evaluators with "pain" and "pleasure". We get something that seems analogous to humans. A human cares directly about pleasure and pain (which are things that correlated with expected evolutionary fitness in the ancestral environment), even when those things don't affect their evolutionary fitness accordingly (e.g. pleasure from eating chocolate, and pain from getting a vaccine shot).


  1. In TD learning, if from some point the model always perfectly predicted the future, the gradient would always be zero and no weights would be updated. Also, if an already-trained RL agent is being deployed, and there's no longer reinforcement learning going on after deployment (which seems like a plausible setup in products/services that companies sell to customers), the weights would obviously not be updated. ↩︎

This isn't key for your point, but:

"In TD learning, if from some point the model always perfectly predicted the future"

If it's a perfect predictor of a deterministic world, sure. But if the world is stochastic, or you can't assume realizability, your network can be at a global optimum and still receive non-zero gradient updates. The gradient is zero only in expectation, so if you update on sufficiently small batches, individual updates can still be non-zero.
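A small numerical sketch of this point, under simplifying assumptions: a single state followed immediately by a terminal reward, so the TD error reduces to reward minus predicted value. The helper name and the reward distributions are hypothetical.

```python
import random


def td_errors(value, sample_reward, n, seed=0):
    """One-step TD errors (reward - value) for n sampled terminal rewards."""
    rng = random.Random(seed)
    return [sample_reward(rng) - value for _ in range(n)]


# Deterministic world: reward is always 0.5 and the prediction is exactly 0.5.
deterministic = td_errors(0.5, lambda rng: 0.5, 1000)

# Stochastic world: reward is -1 or +1 with equal probability.
# The best possible prediction is the mean, 0.0.
stochastic = td_errors(0.0, lambda rng: rng.choice([-1.0, 1.0]), 10000)
mean_stochastic = sum(stochastic) / len(stochastic)

print(max(abs(e) for e in deterministic))  # every error is zero
print(min(abs(e) for e in stochastic))     # every single-sample error is non-zero
print(mean_stochastic)                     # but the average is close to zero
```

In the deterministic case every error, and hence every gradient, is exactly zero. In the stochastic case the optimal prediction still produces an error of magnitude 1 on every single sample, and only the average over many samples is near zero, so updates on single samples or small batches keep moving the weights.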