I had an interesting debate recently, about whether we could make smart AIs safe just by focusing on their structure and their task. Specifically, we were pondering something like:
- "Would an algorithm be safe if it was a neural net-style image classifier, trained on examples of melanoma to detect skin cancer, with no other role than to output a probability estimate for a given picture? Even if "superintelligent", could such an algorithm be an existential risk?"
Whether it's an existential risk was not resolved; but I have a strong intuition that they would like be manipulative. Let's see how.
The requirements for manipulation
For an algorithm to be manipulative, it has to derive some advantage from manipulation, and it needs to be able to learn to manipulate - for that, it needs to be able to explore situations where it engages in manipulation and this is to its benefit.
There are certainly very simple situations where manipulation can emerge. But that example, though simple, had an agent that was active in the world. Can a classifier display the same sort of behaviour?
Manipulation emerges naturally
To show that, picture the following design. The programmers have a large collection of slightly different datasets, and want to train the algorithm on all of them. The loss function is an error rate, which can vary between and . Many of the hyperparameters are set by a neural net, which itself takes a more "long-term view" of the error rate, trying to improve it from day to day rather than from run to run.
How have the programmers set up the system? Well, they run the algorithm on batched samples from ten datasets at once, and record the error rate for all ten. The hyperparameters are set to minimise average error over each run of ten. When the performance on one dataset falls below error for a few runs, they remove it from the batches, and substitute in a new one to train the algorithm on[1].
So, what will happen? Well, the system will initially start to minimise the loss rate. Then, after a while, it will start "noticing" that when this error is below , then the dataset is removed and the average error rate shoots up. Actually, it won't be "noticing" this; instead, if there are hyperparameters that prevent the error rate going below , these will be selected by gradient descent as giving a better performance.
So, this problem of optimisation of image classification will result in the algorithm manipulating its programmers, by "underperforming" on key datasets, so that they stay in the batch.
Generic manipulation
Of course, the manipulation above happened because the programmers didn't understand what the algorithm's true loss function was. They thought it was "minimise overall loss on classification", but it was actually "keep each dataset loss just above ".
But that kind of mistake is easy to make - by carelessness, or because they thought it would improve the situation, the loss they thought they had diverged from the actual loss, and this resulted in manipulative behaviour.
Here is another example: suppose the programmers build in a precaution for things going wrong. If the error rate suddenly doubles on some dataset, then the algorithm is run exclusively on that dataset for a while, until the error rate declines again (or, if it doesn't, the programmers are called).
Now suppose that, of the ten datasets it has, nine have an error rate around , and one has an error rate of . Then it is in the algorithm's interest to boost the error rate on that last dataset, suddenly, to . Then it will be run for a time only on that one dataset, giving it a lower average error rate. Thus hyperparameters that allow this kind of behaviour, will get selected for (and if the algorithm is run on a lot of datasets, it will have many opportunities to learn about this kind of behaviour).
What about the "if it doesn't, the programmers are called" clause? This depends on what happens when the programmers are called, and how the error rate is computed in that case. But, depending on what does happen, this incentivises the algorithm to learn to call the programmers often (double an error rate and keep it high) or rarely (if there's a risk of the error rate for a dataset going up suddenly, then don't lower it too much in the first place).
So, manipulative behaviour can emerge in smart algorithms, even if they're merely classifying images in a way the programmers feel is entirely harmless.
Existential risks
Can this mild kind of manipulation lead to extreme, existential risk outcomes? Of course it; it's not hard to create toy models where everything ends disastrously. But it's also not hard to create toy models where manipulation remains mild.
But, in conclusion:
- Manipulative behaviours can emerge in seemingly innocuous algorithms.
- There is no clear division, currently, between mild manipulation and disastrous manipulation.
Assume that either catastrophic forgetting has been solved, or that they'll re-run the removed datasets occasionally, to refresh the algorithm's performance on that dataset. ↩︎
It intuitively seems like you need merely make the interventions run at higher permissions/clearance than the hyperparameter optimizer.
What do I mean by that? In Haskell, so-called monad transformers can add features like nondeterminism and memory to a computation. The natural conflict that results ("Can I remember the other timelines?") is resolved through the order in which the monad transformers were applied. (One way is represented as a function from an initial memory state to a list of timelines and a final memory state, the other as a list of functions from an initial memory state to a timeline and a final memory state.) Similarly, a decent type system should just not let the hyperparameter optimizer see the interventions.
What this might naively come out to is that the hyperparameter optimizer just does not return a defined result unless its training run is finished as it would have been without intervention. A cleverer way I could imagine it being implemented is that the whole thing runs on a dream engine, aka a neural net trained to imitate a CPU at variable resolution. After an intervention, the hyperparameter optimizer would be run to completion on its unchanged dataset at low resolution. For balance reasons, this may not extract any insightful hyperparameter updates from the tail of the calculation, but the intervention would remain hidden. The only thing we would have to prove impervious to the hyperparameter optimizer through ordinary means is the dream engine.
Have fun extracting grains of insight from these mad ramblings :P
This specific problem could easily be fixed, but the problem of the goal not being what we think it is, remains.
See also Kaj's example: https://www.lesswrong.com/posts/Ez4zZQKWgC6fE3h9G/almost-every-powerful-algorithm-would-be-manipulative#vhZ9uvMwiMCepp6jH