Vika comments on [link] New essay summarizing some of my latest thoughts on AI safety

Post author: Kaj_Sotala | 01 November 2015 08:07AM | 14 points


Comment author: jsteinhardt | 10 November 2015 04:04:05AM | 2 points

I know there are many papers showing that neural nets learn features that can, in some regimes, be given nice interpretations. However, in every case I am aware of where these representations have been thoroughly analyzed, they seem to fail obvious tests of naturality, such as:

(1) Good performance on different data sets in the same domain.

(2) Good transfer to novel domains.

(3) Robustness to visually imperceptible perturbations to the input image.
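
To make (3) concrete, here is a minimal sketch of the kind of perturbation test involved, in the spirit of fast-gradient-sign adversarial examples (Goodfellow et al., 2014), applied to a toy softmax classifier. Everything here (the weights, the input, the epsilon) is a random illustrative stand-in rather than a trained model:

```python
import numpy as np

rng = np.random.default_rng(0)
n_pixels, n_classes = 784, 10
W = rng.normal(scale=0.1, size=(n_classes, n_pixels))  # stand-in for trained weights
b = np.zeros(n_classes)
x = rng.uniform(size=n_pixels)   # stand-in "image" with pixels in [0, 1]
y = np.eye(n_classes)[3]         # one-hot true label

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

# Analytic gradient of the cross-entropy loss w.r.t. the input
# for a linear softmax model.
grad_x = W.T @ (softmax(W @ x + b) - y)

# "Imperceptible" perturbation: a tiny step in the sign direction
# of the gradient. A robust representation should leave the
# prediction unchanged under such a step.
eps = 0.01
x_adv = np.clip(x + eps * np.sign(grad_x), 0.0, 1.0)

print("prediction before:", softmax(W @ x + b).argmax())
print("prediction after: ", softmax(W @ x_adv + b).argmax())
```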

Moreover, ANNs almost fundamentally cannot learn natural representations because they fail what I would call the "canonicality" test:

(4) Replacing the learned features with a random invertible linear transformation of the learned features should degrade performance.

Note that the reason for (4) is that if you want to interpret an individual hidden unit in an ANN as being meaningful, then it can't be the case that a random linear combination of lots of units is equally meaningful (since a random linear combination of e.g. cats and dogs and 100 other things is not going to have much meaning).
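
A minimal sketch of why a standard ANN fails this canonicality test, using random stand-in weights rather than a trained network: mixing the hidden features by a random invertible matrix M and folding M^{-1} into the next layer's weights leaves the outputs exactly unchanged, so the mixed "units" perform exactly as well as the originals:

```python
import numpy as np

rng = np.random.default_rng(0)
d_in, d_hidden, d_out = 20, 50, 10
W1 = rng.normal(size=(d_hidden, d_in))   # stand-in first-layer weights
W2 = rng.normal(size=(d_out, d_hidden))  # stand-in second-layer weights
x = rng.normal(size=d_in)

h = np.tanh(W1 @ x)          # original hidden features (one "unit" per coordinate)
out = W2 @ h                 # original network output

M = rng.normal(size=(d_hidden, d_hidden))  # random mixing matrix (a.s. invertible)
h_mixed = M @ h                            # each new "unit" is a random combination
out_mixed = (W2 @ np.linalg.inv(M)) @ h_mixed  # fold M^{-1} into the next layer

print(np.allclose(out, out_mixed))  # True: outputs identical, so the per-unit
                                    # basis carries no privileged meaning
```

The same compensation trick works at any hidden layer, as long as the mixing is applied after the nonlinearity, which is why individual-unit interpretations are not preserved under this transformation.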

That was a bit long-winded, but my question is whether the linked paper or any other papers provide representations that you think don't fail any of (1)-(4).

Comment author: Vika | 14 November 2015 12:30:01AM | 0 points

Thanks for the handy list of criteria. I'm not sure how (3) would apply to a recurrent neural net for language modeling, since it's difficult to make an imperceptible perturbation of text (as opposed to an image).

Regarding (2): given the impressive performance of RNNs on different text domains (English, Wikipedia markup, LaTeX code, etc.), it would be interesting to see how an RNN trained on English text would perform on LaTeX code, for example. I would expect it to carry over some representations that are common to the training and test data, like the aforementioned brackets and quotes.
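
A rough sketch of what that transfer measurement could look like, using a smoothed character-bigram model as a cheap stand-in for an RNN language model and tiny placeholder corpora (all names and data here are illustrative assumptions): train on "English" text, then compare per-character cross-entropy on English versus LaTeX-like text. Shared structure such as brackets and quotes should show up as a lower cross-domain loss than one would get on unrelated symbol streams:

```python
import math
from collections import Counter

def train_bigram(text, alpha=1.0):
    """Fit bigram counts with add-alpha smoothing."""
    counts = Counter(zip(text, text[1:]))
    context = Counter(text[:-1])
    vocab = set(text)
    return counts, context, vocab, alpha

def cross_entropy(model, text):
    """Average bits per character under the bigram model."""
    counts, context, vocab, alpha = model
    V = len(vocab) + 1  # +1 slot for unseen symbols
    nll = 0.0
    for a, b in zip(text, text[1:]):
        p = (counts[(a, b)] + alpha) / (context[a] + alpha * V)
        nll -= math.log2(p)
    return nll / max(len(text) - 1, 1)

# Placeholder corpora standing in for real training/test sets.
english = 'the quick brown fox (he said "hello") jumped over the lazy dog. ' * 50
latex = '\\begin{equation} f(x) = \\frac{a}{b} \\end{equation} % "quoted" text ' * 50

model = train_bigram(english)
print("English -> English:", cross_entropy(model, english))
print("English -> LaTeX:  ", cross_entropy(model, latex))
```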