
Manfred comments on Open thread, Aug. 03 - Aug. 09, 2015 - Less Wrong Discussion

Post author: MrMind 03 August 2015 07:05AM


Comment author: Manfred 08 August 2015 05:31:31AM

I don't think he suggests Bayesian networks (which, to me, mean the causal networks of Pearl et al.). Rather, he is literally suggesting learning by Bayesian inference. I think his comments about nonlinearity are just to the effect that one shouldn't have to introduce nonlinearity with sigmoid activation functions; the nonlinearity should arise naturally from Bayesian updates. But yeah, I think it's quite impractical.
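(A minimal sketch of that reading, my own and not from the comment: write the two-class posterior in log-odds form and the sigmoid appears on its own.)

```python
import math

# My sketch of one way to read that (not from the comment itself): for a
# two-class problem, the Bayesian posterior written in log-odds form is
# exactly a logistic sigmoid of summed evidence, so the nonlinearity
# comes out of Bayes' rule rather than being added by hand.

def posterior(log_prior_odds, log_likelihood_ratios):
    log_odds = log_prior_odds + sum(log_likelihood_ratios)
    return 1.0 / (1.0 + math.exp(-log_odds))  # the logistic sigmoid

print(posterior(0.0, [1.2, -0.3, 0.8]))  # about 0.85
```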

E.g. suppose you wanted to build an email spam filter that outputs P(spam). A (non-naive) Bayesian approach to this classification problem might involve a prior over some large population of email-generating processes. Every time you get a training email, you update your probability that a generic email comes from a particular process, and your estimate of that process's probability of producing spam. When run on a test email, the spam filter goes through every single hypothesis, evaluates its probability of producing this email, and then takes a weighted average of the spam probabilities of those hypotheses to get its spam / not-spam verdict. This seems like too much work.
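To make the amount of work concrete, here is a toy sketch of that scheme (my own construction, purely illustrative): a fixed population of hypothesized email-generating processes, each pairing a word distribution with a spam rate, reweighted by Bayes on training emails and averaged over at test time.

```python
import math
import random

# Toy sketch (mine, purely illustrative) of the fully Bayesian classifier
# described above. Each "hypothesis" is one possible email-generating
# process: a distribution over words plus that process's spam rate.

random.seed(0)
VOCAB = ["cheap", "meeting", "viagra", "lunch", "winner"]

def random_hypothesis():
    raw = [random.random() for _ in VOCAB]
    total = sum(raw)
    return {
        "word_probs": {w: p / total for w, p in zip(VOCAB, raw)},
        "p_spam": random.random(),  # this process's chance of emitting spam
        "log_weight": 0.0,          # uniform prior over hypotheses
    }

hypotheses = [random_hypothesis() for _ in range(1000)]

def log_lik(h, words):
    # log P(email words | this process), bag-of-words toy model
    return sum(math.log(h["word_probs"][w]) for w in words)

def train(words, is_spam):
    # Bayesian update: reweight every hypothesis by how well it
    # predicts the observed (email, label) pair.
    for h in hypotheses:
        label_ll = math.log(h["p_spam"] if is_spam else 1.0 - h["p_spam"])
        h["log_weight"] += log_lik(h, words) + label_ll

def p_spam(words):
    # Weight each hypothesis by P(hypothesis) * P(email | hypothesis),
    # then average the hypotheses' spam rates under those weights.
    log_joint = [h["log_weight"] + log_lik(h, words) for h in hypotheses]
    m = max(log_joint)
    weights = [math.exp(lj - m) for lj in log_joint]
    return sum(w * h["p_spam"] for w, h in zip(weights, hypotheses)) / sum(weights)

train(["meeting", "lunch"], is_spam=False)
train(["cheap", "winner"], is_spam=True)
print(p_spam(["viagra", "winner"]))
```

Even this toy needs a pass over the entire hypothesis population for every training and test email, which is the "too much work" point.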

Comment author: Houshalter 08 August 2015 09:57:43PM

I don't know, that comment really seemed to suggest Bayesian networks. I guess you could allow for a distribution over possible activation functions, but that doesn't really fit what he said about learning the "exact" nonlinear function for every possible function. That fits better with Bayes nets, which use a lookup table (a conditional probability table) for every node.
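(For concreteness, here is what "a lookup table for every node" amounts to; a toy conditional probability table of my own, not anything from the thread.)

```python
# Toy illustration (my own) of a Bayes net node's lookup table: each node
# stores a conditional probability table (CPT) indexed by its parents'
# values, so any nonlinear parent-child relationship can be written down
# entry by entry. Network: Spam -> ContainsWord("winner").

p_spam = 0.4  # root node: its "table" is just a prior

# CPT for the child node, keyed by the parent's value:
p_word_given_spam = {
    True: 0.30,   # P(word present | spam)
    False: 0.02,  # P(word present | not spam)
}

# Exact inference by enumeration: P(spam | word present)
joint_spam = p_spam * p_word_given_spam[True]
joint_ham = (1.0 - p_spam) * p_word_given_spam[False]
print(joint_spam / (joint_spam + joint_ham))  # about 0.91
```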

Your example sounds like a Bayesian net. But it doesn't really fit his description of learning optimal nonlinearities for functions.