I find CNNs a lot less intuitive than RNNs. In which context was training many filters, then successively applying pooling and further filters to smaller versions of the output, an intuitive idea?
In the context of vision. Pooling is not strictly necessary, but it makes things go a bit faster. The real trick of CNNs is to lock the weights of different parts of the network together, so that the exact same process is used to recognize an object wherever it appears in the image (rather than having separate recognition processes for different parts of the image).
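To make the weight-locking point concrete, here is a rough 1D sketch in plain Python (no libraries). The names `conv1d` and `max_pool1d`, the two-tap kernel, and the toy signals are all made up for illustration and are not taken from any particular framework. The idea it tries to show: because one kernel is reused at every position, moving the pattern in the input just moves the response in the output.

```python
def conv1d(signal, kernel):
    """Slide one shared kernel over the signal (valid cross-correlation)."""
    k = len(kernel)
    return [
        sum(kernel[j] * signal[i + j] for j in range(k))
        for i in range(len(signal) - k + 1)
    ]


def max_pool1d(values, size):
    """Non-overlapping max pooling; a trailing partial window is dropped."""
    return [
        max(values[i:i + size])
        for i in range(0, len(values) - size + 1, size)
    ]


# Hypothetical "edge detector": the same two weights are used everywhere.
kernel = [1, -1]

# The same bump, placed at two different positions.
pattern = [0, 0, 3, 0, 0, 0, 0, 0]
shifted = [0, 0, 0, 0, 3, 0, 0, 0]

resp_a = conv1d(pattern, kernel)
resp_b = conv1d(shifted, kernel)

# Shifting the input by two steps shifts the response by two steps.
print(resp_a)  # [0, -3, 3, 0, 0, 0, 0]
print(resp_b)  # [0, 0, 0, -3, 3, 0, 0]

# Pooling shrinks the response, so later filters work on a smaller input.
print(max_pool1d(resp_a, 2))  # [0, 3, 0]
print(max_pool1d(resp_b, 2))  # [0, 0, 3]
```

A fully connected layer would need separate weights for "bump at position 2" and "bump at position 4"; here both cases are handled by the same two numbers. In an image model the kernels would be 2D and learned rather than hand-picked, but the sharing works the same way.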
If it's worth saying, but not worth its own post (even in Discussion), then it goes here.
Notes for future OT posters:
1. Please add the 'open_thread' tag.
2. Check if there is an active Open Thread before posting a new one. (Immediately before; refresh the list-of-threads page before posting.)
3. Open Threads should be posted in Discussion, and not Main.
4. Open Threads should start on Monday, and end on Sunday.