I don't know, that comment really seemed to suggest Bayesian networks. I guess you could allow for a distribution over possible activation functions, but that doesn't really fit what he said about learning the "exact" nonlinear function for every possible function. That fits better with Bayes nets, which use a lookup table (a conditional probability table) for every node.
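To make the "lookup table for every node" point concrete, here is a minimal sketch in Python. The network (Rain and Sprinkler as parents of WetGrass), the variable names, and all the probabilities are made up for illustration; none of it comes from the comment being discussed. The point is only that a node's behavior is given by a table with one entry per combination of parent values, rather than by a parametric nonlinearity.

```python
from itertools import product

# Hypothetical network: Rain -> WetGrass <- Sprinkler (all numbers invented).
p_rain = {True: 0.2, False: 0.8}
p_sprinkler = {True: 0.1, False: 0.9}

# The node's lookup table: P(WetGrass=True | Rain, Sprinkler),
# one entry for every combination of parent values.
p_wet_given = {
    (True, True): 0.99,
    (True, False): 0.9,
    (False, True): 0.8,
    (False, False): 0.0,
}

def joint(rain, sprinkler, wet):
    """Joint probability of one full assignment, read straight from the tables."""
    p_wet = p_wet_given[(rain, sprinkler)]
    return p_rain[rain] * p_sprinkler[sprinkler] * (p_wet if wet else 1.0 - p_wet)

def marginal_wet():
    """P(WetGrass=True), summing the joint over both parents."""
    return sum(joint(r, s, True) for r, s in product([True, False], repeat=2))
```

Since the table can hold any value for each parent combination, it can represent an arbitrary function of the parents, which is why the "exact function" wording reads more like a Bayes net than a network with a fixed activation.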
Your example sounds like a Bayesian net, but it doesn't really fit his description of learning optimal nonlinearities for functions.
If it's worth saying, but not worth its own post (even in Discussion), then it goes here.
Notes for future OT posters:
1. Please add the 'open_thread' tag.
2. Check if there is an active Open Thread before posting a new one. (Immediately before; refresh the list-of-threads page before posting.)
3. Open Threads should be posted in Discussion, and not Main.
4. Open Threads should start on Monday, and end on Sunday.