I'm not sure whether these would be classed as "weird tricks", and I do think they have reasons for working, but some recent architecture changes that one might not expect to work a priori include:
How were these discovered? Slow, deliberate thinking, or someone trying some random thing to see what it does and suddenly the AI is a zillion times smarter?
As an aside: could you elaborate on why you think "ReLU better than sigmoid" is a "weird trick", if that is what this question implies?
The commonly agreed reason, as far as I know, is that ReLU helps with the vanishing gradient problem (this can be seen from the graphs of the two activations' derivatives).
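A toy sketch (my own illustration, not from the thread) of the vanishing gradient point: backprop multiplies each layer's local derivative together, the sigmoid's derivative is at most 0.25, so the gradient shrinks geometrically with depth, while an active ReLU unit's derivative is exactly 1:

```python
import math

def sigmoid_grad(x):
    # derivative of sigmoid: s(x) * (1 - s(x)), which peaks at 0.25 when x = 0
    s = 1.0 / (1.0 + math.exp(-x))
    return s * (1.0 - s)

def relu_grad(x):
    # derivative of ReLU: 1 for positive inputs, 0 otherwise
    return 1.0 if x > 0 else 0.0

# Backprop through n layers multiplies the local derivatives together.
# Even in the sigmoid's best case (x = 0), each layer scales the
# gradient by at most 0.25; an active ReLU passes it through unchanged.
n_layers = 20
sigmoid_signal = 1.0
relu_signal = 1.0
for _ in range(n_layers):
    sigmoid_signal *= sigmoid_grad(0.0)  # best case for sigmoid
    relu_signal *= relu_grad(1.0)        # active ReLU unit

print(f"sigmoid gradient after {n_layers} layers: {sigmoid_signal:.3e}")
print(f"ReLU gradient after {n_layers} layers: {relu_signal:.3e}")
```

After 20 layers the best-case sigmoid gradient is 0.25**20 (about 1e-12), while the ReLU gradient is still 1.0. Real networks are messier (weights also scale the gradient, and ReLU units can be inactive), but this is the basic mechanism the graphs show.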
I read this, and it said:
have we found anything like this? at all? have we seen any "weird tricks" discovered that make AI way more powerful for no reason?