I'm not sure these would be classed as "weird tricks", and I do think they all have reasons for working, but some recent architecture changes that one might not expect to work a priori include:
- SwiGLU: Replaces the usual feed-forward activation with a gated linear unit, where a SiLU/Swish-activated projection gates a second learnable projection (sketch below).
- Grouped Query Attention: Uses fewer Key and Value heads than Query heads, so each K/V head is shared by a group of query heads and the KV cache shrinks (sketch below).
- RMSNorm: LayerNorm without the mean-centering step: activations are only rescaled by their root mean square and a learnable gain (sketch below).
- Rotary Position Embeddings: Rotates query and key vectors by position-dependent angles so that attention scores depend on relative position (sketch below).
- Quantization: Storing weights in fewer bits (e.g. 8-bit or 4-bit instead of 16-bit floats) with little drop in quality.
- Flash Attention: Computes exact attention much faster by tiling the computation so the full attention matrix never has to be written out to slow GPU memory.
- Various sparse attention schemes, where each token attends to only a subset of positions.
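
Since a few of these are simpler than they sound, here are minimal PyTorch sketches (all layer names and sizes are illustrative, not taken from any particular model). First, a SwiGLU feed-forward block:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SwiGLUFeedForward(nn.Module):
    def __init__(self, d_model: int, d_hidden: int):
        super().__init__()
        self.w_gate = nn.Linear(d_model, d_hidden, bias=False)  # gate projection
        self.w_up = nn.Linear(d_model, d_hidden, bias=False)    # value projection
        self.w_down = nn.Linear(d_hidden, d_model, bias=False)  # back to model dim

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # SiLU (a.k.a. Swish) of one projection gates the other, elementwise.
        return self.w_down(F.silu(self.w_gate(x)) * self.w_up(x))

x = torch.randn(2, 16, 512)         # (batch, seq, d_model)
ffn = SwiGLUFeedForward(512, 1365)  # hidden dim often shrunk vs. a plain FFN
                                    # to keep the parameter count similar
print(ffn(x).shape)                 # torch.Size([2, 16, 512])
```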
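
Grouped query attention, assuming 8 query heads sharing 2 key/value heads (the ratio here is arbitrary). After expanding the shared K/V heads, the rest is ordinary attention; the point is that the KV cache only ever stores the 2 heads:

```python
import torch
import torch.nn.functional as F

batch, seq, head_dim = 2, 16, 64
n_q_heads, n_kv_heads = 8, 2                    # 4 query heads share each K/V head

q = torch.randn(batch, n_q_heads, seq, head_dim)
k = torch.randn(batch, n_kv_heads, seq, head_dim)
v = torch.randn(batch, n_kv_heads, seq, head_dim)

# Expand K and V so each group of query heads reuses the same K/V head.
group = n_q_heads // n_kv_heads
k = k.repeat_interleave(group, dim=1)           # -> (batch, n_q_heads, seq, head_dim)
v = v.repeat_interleave(group, dim=1)

# From here on it is plain scaled dot-product attention.
scores = q @ k.transpose(-2, -1) / head_dim ** 0.5
out = F.softmax(scores, dim=-1) @ v
print(out.shape)                                # torch.Size([2, 8, 16, 64])
```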
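
RMSNorm is probably the simplest of the bunch:

```python
import torch
import torch.nn as nn

class RMSNorm(nn.Module):
    """Rescale by the root mean square of the features; no mean subtraction, no bias."""
    def __init__(self, dim: int, eps: float = 1e-6):
        super().__init__()
        self.eps = eps
        self.weight = nn.Parameter(torch.ones(dim))  # learnable gain

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        inv_rms = torch.rsqrt(x.pow(2).mean(dim=-1, keepdim=True) + self.eps)
        return x * inv_rms * self.weight

x = torch.randn(2, 16, 512)
print(RMSNorm(512)(x).shape)  # torch.Size([2, 16, 512])
```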
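
And a bare-bones rotary embedding, using the "rotate half the dimensions" formulation common in open-source implementations; in practice this is applied to the queries and keys inside attention, not to the token embeddings themselves:

```python
import torch

def rope(x: torch.Tensor, base: float = 10000.0) -> torch.Tensor:
    """Rotate pairs of feature dims by an angle proportional to position.
    x: (batch, seq, dim) with dim even. A minimal sketch, not an optimized kernel."""
    _, seq, dim = x.shape
    half = dim // 2
    # One rotation frequency per feature pair, geometrically spaced.
    freqs = base ** (-torch.arange(0, half, dtype=torch.float32) / half)       # (half,)
    angles = torch.arange(seq, dtype=torch.float32)[:, None] * freqs[None, :]  # (seq, half)
    cos, sin = angles.cos(), angles.sin()
    x1, x2 = x[..., :half], x[..., half:]  # each feature pair gets rotated in 2D
    return torch.cat([x1 * cos - x2 * sin,
                      x1 * sin + x2 * cos], dim=-1)

q = torch.randn(2, 16, 64)
print(rope(q).shape)  # torch.Size([2, 16, 64])
```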
How were these discovered? Slow, deliberate thinking, or someone trying some random thing to see what it does and suddenly the AI is a zillion times smarter?