New essay summarizing some of my latest thoughts on AI safety, ~3500 words. I explain why I think that some of the thought experiments that have previously been used to illustrate the dangers of AI are flawed and should be used very cautiously, why I'm less worried about the dangers of AI than I used to be, and some of the remaining reasons why I do continue to be somewhat worried.


Backcover celebrity endorsement: "Thanks, Kaj, for a very nice write-up. It feels good to be discussing actually meaningful issues regarding AI safety. This is a big contrast to discussions I've had in the past with MIRI folks on AI safety, wherein they have generally tried to direct the conversation toward bizarre, pointless irrelevancies like "the values that would be held by a randomly selected mind", or "AIs with superhuman intelligence making retarded judgments" (like tiling the universe with paperclips to make humans happy), and so forth.... Now OTOH, we are actually discussing things of some potential practical meaning ;p ..." -- Ben Goertzel


Few people, when learning their values in childhood, ended up considering examples such as this one and explicitly learning that they were wrong. Yet the persuasive power of that example comes from the fact that most people instantly reject the desirability of the dopamine drip scenario when it’s suggested to them.

I for one don't "instantly reject" the desirability of this scenario. I think it's a difficult philosophy problem as to whether the dopamine drip is desirable or not. My worry is that either the AI will not be as uncertain as I am about it, or it will not handle or resolve the normative uncertainty in the same way as I would or should.

Today's machine learning algorithms tend to be unreasonably certain (and wrong) about inputs very different from their training data, but that is perhaps just due to machine learning researchers currently focusing mostly on commercial settings where inputs are rarely very different from training data, and there aren't terrible consequences for getting things wrong. So maybe we can expect this to improve in the future as researchers start to focus more on safety.
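To illustrate what I mean with a toy example (two-dimensional made-up data, nothing realistic): a standard classifier will happily report essentially total confidence on an input that looks nothing like anything it was trained on, because nothing in the usual training objective asks it to notice that.

```python
# Toy illustration: a classifier trained on a narrow region of input space
# still reports near-total confidence on inputs far outside anything it has seen.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.RandomState(0)
X = np.vstack([rng.normal(-1, 0.3, (200, 2)),   # class 0 cluster
               rng.normal(+1, 0.3, (200, 2))])  # class 1 cluster
y = np.array([0] * 200 + [1] * 200)
clf = LogisticRegression().fit(X, y)

far_away = np.array([[50.0, -80.0]])            # nothing like the training data
print(clf.predict_proba(far_away))              # one class still gets probability ~1.0
```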

But even if we manage to build an AI that is properly uncertain about whether something like the dopamine drip scenario is good or bad, how do we get it to resolve its uncertainty in the right way, especially if its creators/owners are also uncertain or possibly wrong so it can't just ask? Resolving the uncertainty incorrectly or getting the uncertainty permanently frozen into its utility function seem to be two big risks here. So I worry just as much about the reverse maverick nanny scenario, where we eventually, after centuries of philosophical progress, figure out that we actually do want to be put on dopamine drips, but the AI says "Sorry, I can't let you do that."

Today's machine learning algorithms tend to be unreasonably certain (and wrong) about inputs very different from their training data

Read about covariate shift. (More generally, ML people are getting into systematic biases now, including causal inference, in a big way.)
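For the unfamiliar, the textbook correction for covariate shift is importance weighting: reweight each training example by an estimate of p_test(x)/p_train(x), which you can get from a classifier trained to tell the two input distributions apart. A minimal sketch (the data and models here are toy stand-ins):

```python
# Covariate-shift correction via importance weighting: estimate
# p_test(x) / p_train(x) with a domain classifier, then reweight the training data.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.RandomState(0)
X_train = rng.normal(0.0, 1.0, (500, 3))       # training inputs
y_train = (X_train[:, 0] > 0).astype(int)      # labels (only available for training data)
X_test = rng.normal(1.5, 1.0, (500, 3))        # unlabeled inputs from a shifted distribution

# 1. Train a classifier to distinguish training inputs from test inputs.
X_domain = np.vstack([X_train, X_test])
d_domain = np.array([0] * len(X_train) + [1] * len(X_test))
domain_clf = LogisticRegression().fit(X_domain, d_domain)

# 2. Importance weight w(x) = p(test | x) / p(train | x) for each training example.
p_test_given_x = domain_clf.predict_proba(X_train)[:, 1]
weights = p_test_given_x / (1.0 - p_test_given_x)

# 3. Fit the actual task model with those weights, so it focuses on the
#    region of input space the test distribution actually covers.
task_clf = LogisticRegression().fit(X_train, y_train, sample_weight=weights)
```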


This has little to do with AGI, though.

You have understood Loosemore's point but you're making the same mistake he is. The AI in your example would understand the intent behind the words "maximize human happiness" perfectly well but that doesn't mean it would want to obey that intent. You talk about learning human values and internalizing them as if those things naturally go together. The only way that value internalization naturally follows from value learning is if the agent already wants to internalize these values; figuring out how to do that is (part of) the Friendly AI problem.

Yes, I'm quite aware of that problem. It was outside the scope of this particular essay, though it's somewhat implied by the deceptive turn and degrees of freedom hypotheses.

I think it would help the discussion to distinguish more between knowing what human values are and caring about them--that is, between acquiring instrumental values and acquiring terminal ones. The "human enforcement" section touches on this, but I think too weakly: it seems indisputable that an AI trained naively via a reward button would acquire only instrumental values, and drop them as soon as it could control the button. This is a counterexample to the Value Learning Thesis if interpreted as referring to terminal values.

An obvious programmer strategy would be to cause the AI to acquire our values as instrumental values, then try to modify the AI to make them terminal.

The important part here is emphasizing that this is an argument aimed at laymen

That's not how EY sold this; he thinks AI researchers are dangerous idiots, essentially. But also: why are you aiming anything at laymen? Who cares what laymen think? Is this for fund-raising?


I liked your essay.

But also: why are you aiming anything at laymen?

Charitable answers:

  • That's "laymen relative to understanding general AI considerations"; you could still be e.g. a philosopher or narrow-AI focused researcher whose opinion was relevant, but who didn't have the background knowledge to realize this particular bit.
  • At the time when that paper was published, MIRI was still more focused on outreach than research, and getting laymen interested in the field so they could eventually become non-laymen was important. Also, the article that the "dopamine drip" example came from was in The New Yorker, so it was obviously aimed at a popular audience.

Less charitable answers:

  • At least some of us were also laymen at the time when some of those articles were written, and didn't have enough knowledge to realize that this argument was kinda silly from a more sophisticated perspective. I don't want to imply this for any others since I don't know what was going on in their heads, but I did personally read drafts of e.g. IE&ME and had a chance to comment on it, but didn't catch this bit. And I'm pretty sure that my failure to catch it wasn't because of a conscious "well this is a little off but it's okay as an argument for laymen" calculation.

I liked your essay.

Thanks!

Option 3: most human beings would (at best) drug inconvenient people into submission if they had the power, and the ones talking as if we had a known way to avoid this are the ones who look naive.

Thanks for writing this; a couple quick thoughts:

For example, it turns out that a learning algorithm tasked with some relatively simple tasks, such as determining whether or not English sentences are valid, will automatically build up an internal representation of the world which captures many of the regularities of the world – as a pure side effect of carrying out its task.

I think I've yet to see a paper that convincingly supports the claim that neural nets are learning natural representations of the world. For some papers that refute this claim, see e.g.

http://arxiv.org/abs/1312.6199
http://arxiv.org/abs/1412.6572
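To give a sense of how cheap these failures are to construct, here's the linear-model version of the second paper's argument as a toy sketch of my own (not the papers' actual code or models): because the logit is a dot product over many dimensions, a per-coordinate change of size eps can shift it by up to eps times the L1 norm of the weights, which in high dimensions is plenty to change the prediction.

```python
# Adversarial perturbation on a linear classifier, in the spirit of the
# fast-gradient-sign construction: each coordinate moves only slightly, but
# across many dimensions the changes add up in the dot product and flip the label.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=2000, n_features=500,
                           n_informative=100, random_state=0)
clf = LogisticRegression(max_iter=2000).fit(X, y)

x = X[0]
w, b = clf.coef_[0], clf.intercept_[0]
logit = w @ x + b

# Per-coordinate step needed to push the logit across the decision boundary:
eps = 1.1 * abs(logit) / np.abs(w).sum()
x_adv = x - eps * np.sign(logit) * np.sign(w)

print("per-coordinate change:", eps)  # typically a small fraction of the (unit-ish) feature scale
print("labels:", clf.predict([x])[0], "->", clf.predict([x_adv])[0])  # the label flips
```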

I think the Degrees of Freedom thesis is a good statement of one of the potential problems. Since it's essentially making a claim about whether a certain very complex statistical problem is identifiable, I think it's very hard to know whether it's true or not without either some serious technical analysis or some serious empirical research --- which is a reason to do that research, because if the thesis is true then that has some worrisome implications about AI safety.

http://rocknrollnerd.github.io/ml/2015/05/27/leopard-sofa.html is also relevant--tl;dr Google Photos classifies a leopard-print sofa as a leopard. I think this lends credence to the 'treacherous turn' insofar as it's an example of a classifier seeming to perform well and breaking down in edge cases.

The classifier isn't breaking down - it was trained to do well across the entire training set using a small amount of computation for each inference and a reasonable (larger) amount of computation for training.

Humans' fastest recognition capability still takes 100 ms or so, and operating in that mode (rapid visual presentation), human inference is considerably less capable than modern ANNs - which classify using less time and also around 1000x fewer neurons/synapses.

I would bet that humans often make similar mistakes in fast recognition. And even if humans don't make this specific mistake, it doesn't matter because they make more total mistakes in other categories.

The fact that humans can do better given considerably more time and enormously more neural resources is hardly surprising (since that involves more complex multi-step inference).

Also, the ImageNet training criterion is not really a good match for human visual intuitions. It assigns the same penalty for mistaking a dog for a cat as it does for mistaking two closely related species of dogs. Humans have a more sensible hierarchical error allocation. This may be relatively low-hanging fruit for ANNs that is easy to improve - not sure, but someone is probably working on that if it hasn't already been done.
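To make "hierarchical error allocation" concrete, one simple version (the labels and hierarchy below are made up for illustration, not a description of any existing setup) is to charge a misclassification by the tree distance between the true and predicted labels, so that confusing two dog breeds is much cheaper than confusing a dog with a sofa:

```python
# Toy hierarchy-aware penalty: misclassification cost grows with the
# distance between predicted and true labels in a class tree.
# (Illustrative only; the labels and hierarchy are made up.)
hierarchy = {
    "husky": "dog", "labrador": "dog",
    "tabby": "cat", "siamese": "cat",
    "dog": "animal", "cat": "animal",
    "sofa": "furniture",
    "animal": "object", "furniture": "object",
}

def ancestors(label):
    chain = [label]
    while label in hierarchy:
        label = hierarchy[label]
        chain.append(label)
    return chain

def tree_cost(true_label, predicted_label):
    """Distance to the lowest common ancestor, used as the misclassification cost."""
    a, b = ancestors(true_label), ancestors(predicted_label)
    common = next(x for x in a if x in b)
    return a.index(common) + b.index(common)

print(tree_cost("husky", "labrador"))  # 2: sibling breeds, small penalty
print(tree_cost("husky", "sofa"))      # 5: dog vs. furniture, large penalty
```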

Humans' fastest recognition capability still takes 100 ms or so, and operating in that mode (rapid visual presentation), human inference is considerably less capable than modern ANNs.

This doesn't seem right, assuming that "considerably less capable" means "considerably worse accuracy at classifying objects not drawn from ImageNet". Do you have a study in mind that shows this? In either case, I don't think this is strong enough to support the claim that the classifier isn't breaking down --- it's pretty clearly making mistakes where humans would find the answer obvious. I don't think that saying that the ANN answers more quickly is a very strong defense.

Do you have a study in mind that shows this?

Comparing different recognition systems is complex, and it's important to compare apples to apples. CNNs are comparable only to rapid feedforward recognition in the visual system which can be measured with rapid serial presentation experiments. In an untimed test the human brain can use other modules, memory fetches, multi-step logical inferences, etc (all of which are now making their way into ANN systems, but still).

The RSP setup ensures that the brain can only use a single feedforward pass from V1 to PFC, without using more complex feedback and recurrent loops. It forces the brain to use a network configuration similar to what current CNNs use - CNNs descend from models of that pathway, after all.

In those tests, CNNs from 2013 rivaled primate IT cortex representations [1], and 2015 CNNs are even better.

That paper uses a special categorization task with monkeys, but the results generalize to humans as well. There are certainly some mistakes that a CNN will make which a human would not make even with the 150ms time constraint, but the CNNs make fewer mistakes on the more complex tasks with lots of categories, whereas humans presumably still have lower error for basic recognition tasks (but to some extent that is because researchers haven't focused much on getting to > 99.9% accuracy on simpler recognition tasks).

Cool, thanks for the paper, interesting read!

and breaking down in edge cases

Except that from a human point of view a leopard-print sofa isn't an edge case at all.

I think I've yet to see a paper that convincingly supports the claim that neural nets are learning natural representations of the world

Taboo natural representations?

Without defining a natural representation (since I don't know how to), here are 4 properties that I think a representation should satisfy before it's called natural (I also give these in my response to Vika):

(1) Good performance on different data sets in the same domain.

(2) Good transference to novel domains.

(3) Robustness to visually imperceptible perturbations to the input image.

(4) "Canonicality": replacing the learned features with a random invertible linear transformation of the learned features should degrade performance.

Thanks.

So to clarify, my claim was not that we'd yet have algorithms producing representations that would fulfill all of these criteria. But it would seem to me that something like word embeddings would be moving in the direction of fulfilling these. E.g. something like this bit from the linked post:

Recently, deep learning has begun exploring models that embed images and words in a single representation.

The basic idea is that one classifies images by outputting a vector in a word embedding. Images of dogs are mapped near the “dog” word vector. Images of horses are mapped near the “horse” vector. Images of automobiles near the “automobile” vector. And so on.

The interesting part is what happens when you test the model on new classes of images. For example, if the model wasn’t trained to classify cats – that is, to map them near the “cat” vector – what happens when we try to classify images of cats?

It turns out that the network is able to handle these new classes of images quite reasonably. Images of cats aren’t mapped to random points in the word embedding space. Instead, they tend to be mapped to the general vicinity of the “dog” vector, and, in fact, close to the “cat” vector. Similarly, the truck images end up relatively close to the “truck” vector, which is near the related “automobile” vector.

This was done by members of the Stanford group with only 8 known classes (and 2 unknown classes). The results are already quite impressive. But with so few known classes, there are very few points to interpolate the relationship between images and semantic space off of.

The Google group did a much larger version – instead of 8 categories, they used 1,000 – around the same time (Frome et al. (2013)) and has followed up with a new variation (Norouzi et al. (2014)). Both are based on a very powerful image classification model (from Krizhevsky et al. (2012)), but embed images into the word embedding space in different ways.

The results are impressive. While they may not get images of unknown classes to the precise vector representing that class, they are able to get to the right neighborhood. So, if you ask it to classify images of unknown classes and the classes are fairly different, it can distinguish between the different classes.

Even though I’ve never seen an Aesculapian snake or an Armadillo before, if you show me a picture of one and a picture of the other, I can tell you which is which because I have a general idea of what sort of animal is associated with each word. These networks can accomplish the same thing.

sounds to me like it would represent clear progress towards at least #1 and #2 of your criteria.

I agree that the papers on adversarial examples that you cited earlier are evidence that many current models are still not capable of meeting criterion #3, but on the other hand the second paper does seem to present clear signs that the reasons for the pathologies are being uncovered and addressed, and that future algorithms will be able to avoid this class of pathology. (Caveat: I do not yet fully understand those papers, so may be interpreting them incorrectly.)
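To spell out the mechanics of the setup described in the quote, since it's relevant to why I find it encouraging: the network is trained to output a point in the word-embedding space rather than a class index, and classification (including of classes it never saw during training) is then just a nearest-neighbor lookup among the word vectors. A minimal sketch of that lookup step, where the word vectors and the image embedding are made-up stand-ins for the outputs of trained models:

```python
# Sketch of zero-shot classification in a word-embedding space. The word
# vectors and the image embedding below are made-up stand-ins for the outputs
# of a trained word-embedding model and a trained image-to-embedding network.
import numpy as np

word_vectors = {
    "dog": np.array([0.9, 0.1, 0.0]),
    "cat": np.array([0.8, 0.3, 0.1]),         # close to "dog": both animals
    "truck": np.array([0.0, 0.2, 0.9]),
    "automobile": np.array([0.1, 0.1, 0.8]),  # close to "truck"
}

def classify(image_embedding, candidate_labels):
    """Pick the label whose word vector is nearest (by cosine similarity)."""
    def cosine(u, v):
        return u @ v / (np.linalg.norm(u) * np.linalg.norm(v))
    return max(candidate_labels,
               key=lambda label: cosine(image_embedding, word_vectors[label]))

# Suppose the network was never trained on cats, but maps a cat image into the
# general "dog-like" region of the space; the nearest word vector can still be
# "cat", because the word embedding knows cats and dogs are related.
cat_image_embedding = np.array([0.82, 0.28, 0.08])
print(classify(cat_image_embedding, ["cat", "dog", "truck", "automobile"]))  # -> cat
```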

I think I've yet to see a paper that convincingly supports the claim that neural nets are learning natural representations of the world. For some papers that refute this claim, see e.g.

My impression is that they can in fact learn "natural" representations of the world; a good example is here: http://arxiv.org/abs/1311.2901

On the other hand, since they tend to be task-specific learners they might take shortcuts that we wouldn't perceive as "natural"; our "natural object" ontology is optimized for a much more general task than most NNets are.

If I'm correct about this I would expect NNets to become more "natural" as the tasks get closer to being "AI-complete", such as question-answering systems and scene description networks.

Here's an example of recurrent neural nets learning intuitive / interpretable representations of some basic aspects of text, like keeping track of quotes and brackets: http://arxiv.org/abs/1506.02078

I know there are many papers that show that neural nets learn features that can in some regimes be given nice interpretations. However in all cases of which I am aware where these representations have been thoroughly analyzed, they seem to fail obvious tests of naturality, which would include things like:

(1) Good performance on different data sets in the same domain.

(2) Good transference to novel domains.

(3) Robustness to visually imperceptible perturbations to the input image.

Moreover, ANNs almost fundamentally cannot learn natural representations because they fail what I would call the "canonicality" test:

(4) Replacing the learned features with a random invertible linear transformation of the learned features should degrade performance.

Note that the reason for (4) is that if you want to interpret an individual hidden unit in an ANN as being meaningful, then it can't be the case that a random linear combination of lots of units is equally meaningful (since a random linear combination of e.g. cats and dogs and 100 other things is not going to have much meaning).

That was a bit long-winded, but my question is whether the linked paper or any other papers provide representations that you think don't fail any of (1)-(4).
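In case the canonicality point is unclear, here's the kind of check I have in mind, run on simulated features rather than on a real network (everything below is made up purely for illustration): construct features where one unit is genuinely class-selective, measure how well the best single unit separates the class, then apply a random invertible linear map and measure again. The mixed features carry exactly the same information, but no individual unit is "about" the class anymore.

```python
# Toy version of the "canonicality" test: per-unit meaningfulness should not
# survive a random invertible linear mix of the units, even though the mix
# preserves all the information (the spanned subspace is identical).
import numpy as np
from sklearn.metrics import roc_auc_score

rng = np.random.RandomState(0)
n, k = 2000, 100
y = rng.randint(0, 2, n)            # binary label, e.g. "contains a cat"
H = rng.normal(0.0, 1.0, (n, k))    # simulated hidden-unit activations
H[:, 0] += 3.0 * y                  # unit 0 is a selective "cat unit"

def best_single_unit_auc(features, labels):
    """How well does the single most informative unit separate the class?"""
    aucs = [roc_auc_score(labels, features[:, j]) for j in range(features.shape[1])]
    return max(max(a, 1.0 - a) for a in aucs)   # the sign of a unit doesn't matter

R = rng.normal(size=(k, k))         # random linear map, invertible with probability 1
H_mixed = H @ R

print(best_single_unit_auc(H, y))        # ~0.98: one unit is clearly class-selective
print(best_single_unit_auc(H_mixed, y))  # noticeably lower: no single mixed unit is
```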

I don't see that (4) should be necessary; I may be misunderstanding it.

If you apply a change of basis to the inputs to a non-linearity, then I'm sure it will destroy performance. If you apply a change of basis to the outputs, then those outputs will cease to look meaningful, but it won't stop the algorithm from working well. But just because the behavior of the algorithm is robust to applying a particular linear scrambling doesn't mean that the representation is not natural, or that all of the scrambled representations must be just as natural as the one we started with.

Yeah I should be a bit more careful on number 4. The point is that many papers which argue that a given NN is learning "natural" representations do so by looking at what an individual hidden unit responds to (as opposed to looking at the space spanned by the hidden layer as a whole). Any such argument seems dubious to me without further support, since it relies on a sort of delicate symmetry-breaking which can only come from either the training procedure or noise in the data, rather than the model itself. But I agree that if such an argument was accompanied by justification of why the training procedure or data noise or some other factor led to the symmetry being broken in a natural way, then I would potentially be happy.

delicate symmetry-breaking which can only come from either the training procedure or noise in the data, rather than the model itself

I'm still not convinced. The pointwise nonlinearities introduce a preferred basis, and cause the individual hidden units to be much more meaningful than linear combinations thereof.
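The non-commutativity is easy to check numerically (toy numbers, just to make the point explicit):

```python
# A pointwise nonlinearity does not commute with a change of basis, which is
# exactly what makes the coordinate axes of a hidden layer special.
import numpy as np

rng = np.random.RandomState(0)
relu = lambda v: np.maximum(v, 0.0)

h = rng.normal(size=5)                        # a hidden pre-activation vector
Q, _ = np.linalg.qr(rng.normal(size=(5, 5)))  # a random orthogonal change of basis

print(np.allclose(relu(Q @ h), Q @ relu(h)))  # False: ReLU singles out the unit basis
```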

Yeah; I discussed this with some others and came to the same conclusion. I do still think that one should explain why the preferred basis ends up being as meaningful as it does, but agree that this is a much more minor objection.

Thanks for the handy list of criteria. I'm not sure how (3) would apply to a recurrent neural net for language modeling, since it's difficult to make an imperceptible perturbation of text (as opposed to an image).

Regarding (2): given the impressive performance of RNNs in different text domains (English, Wikipedia markup, LaTeX code, etc.), it would be interesting to see how an RNN trained on English text would perform on LaTeX code, for example. I would expect it to carry over some representations that are common to the training and test data, like the aforementioned brackets and quotes.

The Deceptive Turn Thesis seems almost unavoidable if you start from the assumptions "the AI doesn't place an inhumanly high value on honesty" and "the AI is tested on inputs vaguely resembling the real world". That latter assumption is probably unavoidable, unless it turns out that human values can be so generalized as to be comprehensible in inhuman settings. If we're stuck testing an AI in a sandbox that resembles reality then it can probably infer enough about reality to know when it would benefit by dissembling.