In the context of vision. Pooling is not strictly necessary, but it makes things go a bit faster. The real trick of CNNs is to lock the weights of different parts of the network together, so that the exact same process recognizes an object wherever it appears in the image (rather than having a separate recognition process for each part of the image).
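To make the weight locking concrete, here is a minimal sketch in plain NumPy. It is not a trained CNN: the "weights" are a single hand-written 2x2 template, and the names `correlate2d_valid`, `place_pattern`, and `peak_location` are made up for illustration. The point is only that one shared set of weights is slid over every image location, so moving the object moves the response by the same amount.

```python
import numpy as np

def correlate2d_valid(image, kernel):
    """Slide the SAME kernel over every location (no padding)."""
    kh, kw = kernel.shape
    out_h = image.shape[0] - kh + 1
    out_w = image.shape[1] - kw + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            # Identical weights at every (i, j): this is the weight sharing.
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

def place_pattern(pattern, shape, row, col):
    """Return a blank image with `pattern` pasted at (row, col)."""
    image = np.zeros(shape)
    ph, pw = pattern.shape
    image[row:row + ph, col:col + pw] = pattern
    return image

def peak_location(image, kernel):
    """Location of the strongest response of `kernel` on `image`."""
    response = correlate2d_valid(image, kernel)
    row, col = np.unravel_index(np.argmax(response), response.shape)
    return int(row), int(col)

# Hypothetical "object": a small diagonal stroke, also used as the template.
pattern = np.array([[1.0, 0.0],
                    [0.0, 1.0]])

print(peak_location(place_pattern(pattern, (8, 8), 1, 2), pattern))
print(peak_location(place_pattern(pattern, (8, 8), 4, 5), pattern))
```

The two printed locations should track wherever the pattern was pasted, without any location-specific weights. A network with separate weights per image region would have to learn the same detector again for each position.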
OK, so the motivation is to learn templates and correlate them with the image at each location. But where would you get the idea to do the same thing again with the resulting correlation map? That seems non-obvious to me. Or do you mean biological vision?
If it's worth saying, but not worth its own post (even in Discussion), then it goes here.
Notes for future OT posters:
1. Please add the 'open_thread' tag.
2. Check if there is an active Open Thread before posting a new one. (Immediately before; refresh the list-of-threads page before posting.)
3. Open Threads should be posted in Discussion, and not Main.
4. Open Threads should start on Monday, and end on Sunday.