On the side - could you elaborate why you think "relu better than sigmoid" is a "weird trick", if that is implied by this question?
The reason I thought was commonly agreed upon is that it helps with the vanishing gradient problem (this can be seen from the graphs of the two functions' derivatives).
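As a minimal sketch of that argument: the sigmoid's derivative never exceeds 0.25, so backprop through many sigmoid layers multiplies many small factors, while an active ReLU unit passes the gradient through with a factor of 1. (The depth of 20 layers here is just an illustrative choice.)

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def sigmoid_grad(x):
    # Derivative of sigmoid: s * (1 - s), maximized at x = 0 where it is 0.25.
    s = sigmoid(x)
    return s * (1.0 - s)

def relu_grad(x):
    # Derivative of ReLU: 1 for positive inputs, 0 otherwise.
    return 1.0 if x > 0 else 0.0

# Backprop multiplies one activation derivative per layer. With sigmoid,
# each factor is at most 0.25, so the product shrinks geometrically with depth.
depth = 20
sig_product = sigmoid_grad(0.0) ** depth    # best case for sigmoid: 0.25 ** 20
relu_product = relu_grad(1.0) ** depth      # active ReLU units: 1.0 ** 20

print(sig_product)   # ~9.1e-13, effectively vanished
print(relu_product)  # 1.0, gradient preserved
```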
I read this, and it said:
have we found anything like this? at all? have we seen any "weird tricks" discovered that make AI way more powerful for no reason?