jsteinhardt comments on [link] New essay summarizing some of my latest thoughts on AI safety - Less Wrong

Post author: Kaj_Sotala 01 November 2015 08:07AM


Comments (27)


Comment author: Vika 09 November 2015 01:48:57AM 0 points

Here's an example of recurrent neural nets learning intuitive / interpretable representations of some basic aspects of text, like keeping track of quotes and brackets: http://arxiv.org/abs/1506.02078

Comment author: jsteinhardt 10 November 2015 04:04:05AM 2 points

I know there are many papers showing that neural nets learn features which, in some regimes, can be given nice interpretations. However, in all cases I am aware of where these representations have been thoroughly analyzed, they seem to fail obvious tests of naturality, which would include things like:

(1) Good performance on different data sets in the same domain.

(2) Good transference to novel domains.

(3) Robustness to visually imperceptible perturbations to the input image.

Moreover, ANNs almost fundamentally cannot learn natural representations because they fail what I would call the "canonicality" test:

(4) Replacing the learned features with a random invertible linear transformation of the learned features should degrade performance.

Note that the reason for (4) is that if you want to interpret an individual hidden unit in an ANN as being meaningful, then it can't be the case that a random linear combination of lots of units is equally meaningful (since a random linear combination of e.g. cats and dogs and 100 other things is not going to have much meaning).
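To make the canonicality test concrete, here is a minimal sketch (the two-layer network and its weights are hypothetical toy values, not taken from any of the papers under discussion): substituting R @ h for the learned features h, with a random invertible R and the downstream readout frozen, changes the network's outputs, which is the sense in which performance degrades.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-in for a trained network: weights are random, purely illustrative.
W1 = rng.standard_normal((8, 4))   # input -> hidden
W2 = rng.standard_normal((3, 8))   # hidden -> output (frozen readout)

def relu(z):
    return np.maximum(z, 0.0)

x = rng.standard_normal(4)
h = relu(W1 @ x)                   # the "learned features"

# Canonicality test (4): substitute R @ h for h without retraining anything.
R = rng.standard_normal((8, 8))    # random invertible (a.s.) linear map
out_original = W2 @ h
out_scrambled = W2 @ (R @ h)

# The outputs differ: the scrambled features are no longer the ones the
# frozen readout was fit to, so task performance degrades.
print(np.allclose(out_original, out_scrambled))  # False
```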

That was a bit long-winded, but my question is whether the linked paper or any other papers provide representations that you think don't fail any of (1)-(4).

Comment author: paulfchristiano 10 November 2015 05:36:17AM 2 points

I don't see that (4) should be necessary; I may be misunderstanding it.

If you apply a change of basis to the inputs to a non-linearity, then I'm sure it will destroy performance. If you apply a change of basis to the outputs, then those outputs will cease to look meaningful, but it won't stop the algorithm from working well. But just because the behavior of the algorithm is robust to applying a particular linear scrambling doesn't mean that the representation is not natural, or that all of the scrambled representations must be just as natural as the one we started with.

Comment author: jsteinhardt 10 November 2015 08:29:52AM *  1 point [-]

Yeah, I should be a bit more careful about (4). The point is that many papers arguing that a given NN learns "natural" representations do so by looking at what an individual hidden unit responds to (as opposed to looking at the space spanned by the hidden layer as a whole). Any such argument seems dubious to me without further support, since it relies on a sort of delicate symmetry-breaking which can only come from either the training procedure or noise in the data, rather than the model itself. But I agree that if such an argument were accompanied by a justification of why the training procedure, noise in the data, or some other factor led to the symmetry being broken in a natural way, then I would potentially be happy.

Comment author: paulfchristiano 15 November 2015 01:15:19AM 0 points

"delicate symmetry-breaking which can only come from either the training procedure or noise in the data, rather than the model itself"

I'm still not convinced. The pointwise nonlinearities introduce a preferred basis, and cause the individual hidden units to be much more meaningful than linear combinations thereof.
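This is straightforward to verify numerically (a toy check, not from any of the papers discussed): a pointwise ReLU commutes with permutations and positive rescalings of the units, but not with a generic rotation, so the model's symmetry group is broken down to roughly permutation-plus-scaling of the unit basis.

```python
import numpy as np

rng = np.random.default_rng(2)
relu = lambda z: np.maximum(z, 0.0)
h = rng.standard_normal(6)

# Permutation of units: commutes with the pointwise nonlinearity.
P = np.eye(6)[rng.permutation(6)]
print(np.allclose(relu(P @ h), P @ relu(h)))   # True

# Positive rescaling of units: also commutes, since each d_i > 0.
D = np.diag(rng.uniform(0.5, 2.0, size=6))
print(np.allclose(relu(D @ h), D @ relu(h)))   # True

# Generic rotation: does not commute, so mixing units changes the model.
Q, _ = np.linalg.qr(rng.standard_normal((6, 6)))
print(np.allclose(relu(Q @ h), Q @ relu(h)))   # False
```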

Comment author: jsteinhardt 15 November 2015 07:48:24AM 0 points

Yeah; I discussed this with some others and came to the same conclusion. I do still think that one should explain why the preferred basis ends up being as meaningful as it does, but agree that this is a much more minor objection.

Comment author: Vika 14 November 2015 12:30:01AM 0 points

Thanks for the handy list of criteria. I'm not sure how (3) would apply to a recurrent neural net for language modeling, since it's difficult to make an imperceptible perturbation of text (as opposed to an image).

Regarding (2): given the impressive performance of RNNs on different text domains (English, Wikipedia markup, LaTeX code, etc.), it would be interesting to see how an RNN trained on English text would perform on LaTeX code, for example. I would expect it to carry over some representations that are common to the training and test data, like the aforementioned brackets and quotes.
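One cheap way to probe that kind of transfer is to compare a model's per-character cross-entropy on in-domain versus out-of-domain text. The sketch below uses an add-one-smoothed bigram character model as a stand-in for an RNN, and the miniature corpora are made up purely for illustration:

```python
import math
from collections import Counter

def bigram_model(text):
    """Add-one-smoothed character bigram model; returns P(c2 | c1)."""
    alphabet = sorted(set(text) | set(" {}\\$abcdefghijklmnopqrstuvwxyz"))
    counts = Counter(zip(text, text[1:]))   # bigram counts
    unigrams = Counter(text[:-1])           # context counts
    V = len(alphabet)
    def prob(c1, c2):
        return (counts[(c1, c2)] + 1) / (unigrams[c1] + V)
    return prob

def cross_entropy(prob, text):
    """Average negative log2-probability per character transition."""
    total = sum(math.log2(prob(a, b)) for a, b in zip(text, text[1:]))
    return -total / (len(text) - 1)

# Hypothetical miniature corpora, purely for illustration.
english_train = "the quick brown fox jumps over the lazy dog " * 20
english_test  = "the lazy dog jumps over the quick brown fox "
latex_test    = "\\begin{equation} x = \\frac{a}{b} \\end{equation} "

model = bigram_model(english_train)
print(cross_entropy(model, english_test))  # lower: in-domain
print(cross_entropy(model, latex_test))    # higher: out-of-domain
```

An RNN trained on English should show the same qualitative gap, with shared structure (spaces, brackets, quotes) keeping the out-of-domain loss from being maximal.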