Activation Magnitudes Matter On Their Own: Insights from Language Model Distributional Analysis
In my previous post, I explored the distributional properties of transformer activations, finding that they follow mixture distributions dominated by logistic-like or even heavier-tailed primary components, with minor modes in the tails and sometimes in the shoulders. Note that I have entirely ignored dimension here, treating each value in...
Jan 10, 2025