Standard literature in mechanistic interpretability, specifically Representation Engineering (RepE), suggests a strict "perplexity ceiling" for activation steering. We attempted to find the breaking point of the semantic weights of a Pythia-1.4B model by applying a "Sledgehammer" penalty (α = 10.0) through a custom Centroid Repulsion loss. Contrary to expectations of catastrophic forgetting, the model remained stable. This post details the anomaly, proposes a distinction between the fragility of activations and the plasticity of weights, describes the saturation curve we observed, and examines the "Sticky Prior" failure mode that emerged under extreme geometric steering.
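The post does not give the exact form of the Centroid Repulsion loss, so here is a minimal hypothetical sketch of what such a penalty could look like: it scales the negative squared distance between two concept centroids by α, so that minimizing the loss pushes the centroids apart. The function name and signature are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def centroid_repulsion_loss(acts_a, acts_b, alpha=10.0):
    """Hypothetical sketch of a centroid-repulsion penalty.

    acts_a, acts_b: (n, d) arrays of hidden activations for two concepts.
    Returns -alpha * squared distance between the two centroids, so that
    gradient descent on this loss drives the centroids apart.
    """
    mu_a = acts_a.mean(axis=0)
    mu_b = acts_b.mean(axis=0)
    return -alpha * float(np.sum((mu_a - mu_b) ** 2))

# Toy example: two clusters in 2-D.
a = np.array([[1.0, 0.0], [1.0, 2.0]])   # centroid (1, 1)
b = np.array([[3.0, 0.0], [3.0, 2.0]])   # centroid (3, 1)
loss = centroid_repulsion_loss(a, b, alpha=10.0)
print(loss)  # -40.0: squared centroid distance 4.0, scaled by -10
```

With α = 10.0, even a modest centroid separation already dominates the objective, which is why the post calls this setting a "Sledgehammer" penalty.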
If you have worked with Activation Steering (adding a vector to the residual stream at inference time), you know the drill: you identify a vector representing a concept (e.g., "Honesty"), and you add it with a coefficient...