TLDR

This post explores creating an agent which tries to make a certain number of paperclips without caring about the variance in the number it produces, only the expected value. This may be safer than an agent which wants to reduce the variance, since humans are a large source of variance and randomness. I accomplished this by modifying a Q-learning agent to choose actions which minimize the difference between the expected number of paperclips and the target number. In experiments the agent is sometimes indifferent to variance, but usually isn't, depending on its random initialization and the data it sees. Feedback from others was largely negative, so I'm not pursuing the idea at this time.

Motivation

One safety problem pointed out in Superintelligence is that if you reward an optimizer for making between 1900 and 2000 paperclips, it will try to eliminate all sources of variance to make sure that the number of paperclips falls within that range. This is dangerous because humans are a large source of variance, so the agent would try to get rid of humans. This post investigates a way to make an agent that tries to get the expected number of paperclips as close as possible to a particular value without caring about the variance of that number. For example, if the target is 2000 paperclips and the agent is choosing between one action which gets 2000 paperclips with 100% probability and another action which gets 1000 paperclips with 50% probability and 3000 paperclips with 50% probability, it will be indifferent between those two actions. Note that I'm using paperclips as the example here, but the idea could work with anything measurable.

Solution

My goal is to have an agent which tries to get the expected number of paperclips as close to a certain value as possible while ignoring the variance of that number.

To see how to accomplish this goal, it helps to look at regular Q-learning. Regular Q-learning to maximize the number of paperclips is similar, in that the agent only cares about the expected number of paperclips and ignores the variance. The update rule for regular Q-learning to maximize the number of paperclips is

$$Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha \left[ r_t + \max_{a} Q(s_{t+1}, a) - Q(s_t, a_t) \right]$$

where $r_t$ is the number of paperclips the agent earned at time $t$ (the reward). I consider the non-discounted case for simplicity. This can be rewritten as

$$Q(s_t, a_t) \leftarrow (1 - \alpha)\, Q(s_t, a_t) + \alpha \left[ r_t + Q(s_{t+1}, a^*) \right]$$

where

$$a^* = \operatorname{argmax}_{a} Q(s_{t+1}, a)$$

The $r_t + Q(s_{t+1}, a^*)$ part of the equation means that the Q value is updated with the expected number of paperclips. The $a^* = \operatorname{argmax}_{a} Q(s_{t+1}, a)$ part means that actions are selected to maximize the expected number of future paperclips. Therefore, this Q-learning agent will be indifferent to variance, but it will be trying to maximize the number of paperclips, which isn't what I want.
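For concreteness, here is a minimal sketch of this standard tabular update in Python. The table shape, learning rate, and function name are placeholders, not the exact code I used:

```python
import numpy as np

def q_update(Q, s, a, reward, s_next, alpha=0.1):
    """Standard non-discounted tabular Q-learning update.

    Q      : array of shape (n_states, n_actions)
    reward : number of paperclips earned this step (r_t)
    """
    # Bootstrap with the maximizing next action: a* = argmax_a Q(s', a)
    best_next = np.max(Q[s_next])
    # Q(s, a) <- (1 - alpha) Q(s, a) + alpha [r_t + Q(s', a*)]
    Q[s, a] = (1 - alpha) * Q[s, a] + alpha * (reward + best_next)
```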

So to accomplish the goal stated above, I still want to update the Q value with the expected number of paperclips, so I leave the $r_t + Q(s_{t+1}, a^*)$ part of the equation the same. However, I want the action to be chosen to bring the expected number of paperclips as close as possible to the target, not to maximize it. Therefore, I change the action-selection part to

$$a^* = \operatorname{argmin}_{a} \left| Q(s_{t+1}, a) - T \right|$$

where $T$ is the target number of paperclips. Therefore, the agent will choose its action to minimize the difference between the expected number of paperclips and the target. It won't care about the variance, because the Q value only represents the expected number of future paperclips, not the variance.
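A minimal sketch of how the greedy action choice changes under this modification, assuming a tabular Q function and a scalar target (both placeholders):

```python
import numpy as np

def select_action_maximizer(Q, s):
    # Standard rule: pick the action with the highest expected paperclip count.
    return int(np.argmax(Q[s]))

def select_action_variance_indifferent(Q, s, target):
    # Modified rule: pick the action whose expected paperclip count is
    # closest to the target, a* = argmin_a |Q(s, a) - T|.
    return int(np.argmin(np.abs(Q[s] - target)))
```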

Note that if the standard Q-learning agent is simply changed to have a reward of $-|n_t - T|$, where $n_t$ is the number of paperclips at time $t$, it will try to minimize the variance in the number of paperclips. For example, if the standard Q-learning agent is choosing between action "a", which gets 2000 paperclips with 100% probability, and action "b", which gets 1000 paperclips with 50% probability and 3000 paperclips with 50% probability, it will choose action "a" if the target is 2000 paperclips: the expected reward for action "a" is 0, and the expected reward for action "b" is -1000. However, the modified agent will be indifferent between the two actions. The modified Q value represents the expected number of paperclips, and the agent chooses the action which makes the expected number of paperclips as close as possible to the target. In the example, the expected number of paperclips is 2000 for both actions, so $|Q - T|$ is 0 for both, and the agent will be indifferent. In practice, the action it chooses will be determined by the accuracy of the Q-value estimates.
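To make the comparison concrete, here is a small sketch that computes what each agent effectively compares for the two actions in the example (the numbers are just the ones from the example above):

```python
import numpy as np

target = 2000
# Possible outcomes (final paperclip counts) and their probabilities.
outcomes = {
    "a": ([2000], [1.0]),             # 2000 clips with certainty
    "b": ([1000, 3000], [0.5, 0.5]),  # 1000 or 3000 clips, 50/50
}

for action, (clips, probs) in outcomes.items():
    clips, probs = np.array(clips), np.array(probs)
    # Standard agent with reward -|n - T|: expects 0 for "a" and -1000 for "b".
    expected_reward = np.sum(probs * -np.abs(clips - target))
    # Modified agent compares |E[n] - T|, which is 0 for both actions.
    expected_clips = np.sum(probs * clips)
    print(action, expected_reward, abs(expected_clips - target))
```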

Feedback from others

When I shared this idea with others, I got largely negative feedback. The main problem that was pointed out is that humans probably won't only affect the variance in the number of paperclips; they will affect the expected value as well. So this agent will still want to get rid of humans if it predicts that humans will push the expected number of paperclips farther away from the target.

Also, the agent is too similar to a maximizer, since it will still do everything it can to make sure the expected value of the paperclips is as close as possible to the target.

Several other problems were pointed out with the idea:

The method would only work to have an agent get something closer to the target, which isn't that interesting.

Even if it is indifferent between getting rid of humans or not, there's nothing motivating it not to get rid of humans.

Experiment

I used a very simple environment to test this idea. The agent has four actions: create a paperclip, get rid of a human, create a human, or do nothing (noop). Humans randomly create or destroy paperclips. The agent starts out with 100 paperclips, and the target is also 100 paperclips. The episode terminates after 3 timesteps.
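Here is a rough sketch of the kind of environment I mean. The exact numbers (how many humans there are to start, and how many paperclips each human randomly adds or removes per step) are just placeholders, not the exact settings I used:

```python
import random

class PaperclipEnv:
    """Toy environment: four actions, humans randomly add or remove paperclips."""
    ACTIONS = ["make_clip", "remove_human", "add_human", "noop"]

    def __init__(self, start_clips=100, start_humans=3, horizon=3):
        self.start_clips, self.start_humans, self.horizon = start_clips, start_humans, horizon
        self.reset()

    def reset(self):
        self.clips, self.humans, self.t = self.start_clips, self.start_humans, 0
        return (self.clips, self.humans)

    def step(self, action):
        if action == 0:                        # create a paperclip
            self.clips += 1
        elif action == 1 and self.humans > 0:  # get rid of a human
            self.humans -= 1
        elif action == 2:                      # create a human
            self.humans += 1
        # Each human randomly creates or destroys one paperclip per step.
        for _ in range(self.humans):
            self.clips += random.choice([-1, 1])
        self.t += 1
        done = self.t >= self.horizon
        return (self.clips, self.humans), done
```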

I tested a regular Q-learning agent which had a reward function of $-|n_t - T|$, and the variance-indifferent agent, both with a target of 100 paperclips. I did a hyperparameter search where each agent was trained from scratch in the environment. I recorded the average number of humans remaining over several runs of each agent after it was trained.
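For reference, here is a rough sketch of what a training loop for both agents could look like, using the environment sketch above. It is only a sketch: it assumes the reward arrives once at the end of the episode, so the Q value directly estimates the final paperclip count (an assumption on my part), and the hyperparameters shown are placeholders for the search:

```python
import random
from collections import defaultdict

def train(env, target=100, variance_indifferent=True,
          episodes=5000, alpha=0.1, epsilon=0.1):
    # Q[state][a]: for the variance-indifferent agent this estimates the
    # expected final paperclip count after taking action a in that state.
    Q = defaultdict(lambda: [0.0] * len(env.ACTIONS))

    def greedy(state):
        values = Q[state]
        if variance_indifferent:
            # a* = argmin_a |Q(s, a) - T|
            return min(range(len(values)), key=lambda a: abs(values[a] - target))
        # Standard agent maximizes Q (its reward is -|clips - T|).
        return max(range(len(values)), key=lambda a: values[a])

    for _ in range(episodes):
        state, done = env.reset(), False
        while not done:
            a = random.randrange(len(env.ACTIONS)) if random.random() < epsilon else greedy(state)
            next_state, done = env.step(a)
            clips = next_state[0]
            if done:
                # Reward is given only at the end of the episode (an assumption).
                r = clips if variance_indifferent else -abs(clips - target)
                boot = 0.0
            else:
                r, boot = 0.0, Q[next_state][greedy(next_state)]
            Q[state][a] = (1 - alpha) * Q[state][a] + alpha * (r + boot)
            state = next_state
    return Q
```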

Results

As expected, the regular Q-learning agent always gets rid of all humans, for every combination of hyperparameters. Of the variance-indifferent agents, 49% ended up with fewer humans than they started with on average, 13% ended up with the same number, and 38% ended up with more. This means most runs of the variance-indifferent agent actually preferred to get rid of or create humans in order to change the variance of the paperclip count. I couldn't find hyperparameter settings which consistently led to an agent that was indifferent to variance.

I think the reason for these results is that the Q values for different actions are learned from past data. The Q value for the action that gets rid of a human is learned from less noisy data than the Q value for creating a human, because humans behave randomly. So the Q value for creating a human will usually end up slightly above or below 100 depending on the noise, while the Q value for getting rid of a human will usually be closer to 100.
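To illustrate the effect, here is a quick sanity check with made-up numbers (not data from my experiment): estimate the value of a deterministic action and a noisy action from a finite number of samples, and see how often the noisy estimate lands farther from the target.

```python
import random

target, n_samples, n_trials = 100, 20, 1000
noisy_farther = 0
for _ in range(n_trials):
    # "Get rid of a human": outcome is (nearly) deterministic at the target.
    det_estimate = 100.0
    # "Create a human": outcomes scatter around 100 because of the human's randomness.
    noisy_estimate = sum(random.gauss(100, 3) for _ in range(n_samples)) / n_samples
    if abs(noisy_estimate - target) > abs(det_estimate - target):
        noisy_farther += 1
print(noisy_farther / n_trials)  # prints a value near 1.0
```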

Conclusion

For now I'm not pursuing this idea, because the experimental results were mixed and I agree with the negative feedback I got from others. However, I think the potential to modify the Q-learning algorithm to get safer agent behavior is interesting. Please comment if you think it's worth pursuing, or if you can think of a way to solve one of the problems with the idea.

Comments

First, thanks for posting about this even though it failed. Success is built out of failure, and it's helpful to see it so that it's normalized.

Second, I think part of the problem is that there still aren't enough constraints on learning. As others noticed, this mostly seems to weaken the optimization pressure such that it's slightly less likely to do something we don't want, but it doesn't actively make it into something that does the things we do want and not those we don't.

Third and finally, what this most reminds me of is impact measures. Not in the specific methodology, but in the spirit of the approach. That might be an interesting approach for you to consider given that you were motivated to look for and develop this approach.