Overview
One of the active research areas in interpretability involves distilling neural network activations into clean, labeled features. This is made difficult by superposition, where a neuron may fire in response to multiple disparate signals, making that neuron polysemantic. To date, research has focused on one type of superposition: compressive superposition, in which a network represents more features than it has neurons. I report on another type of superposition that can arise when a network has more neurons than features: “symmetric mixtures”. Essentially, this is a form of “favored basis” that allows a network to reinforce the magnitude of its logits via parallelism. I believe understanding this concept can help flesh...
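As a minimal sketch of the parallelism idea (my own toy construction, not taken from the analysis below): if a network has more neurons than features, it can represent one feature redundantly across k hidden neurons and sum the copies at the readout, scaling the logit magnitude by k compared to a single-neuron basis.

```python
import numpy as np

# Hypothetical toy model: one input feature copied into k hidden
# neurons, with a readout that sums all k copies. The resulting
# logit grows linearly in k -- reinforcing magnitude via parallelism.
def logit_with_redundancy(k: int, x: float = 1.0) -> float:
    W_in = np.ones((k, 1))       # each of k neurons reads the same feature
    w_out = np.ones((1, k))      # readout sums the k parallel copies
    h = W_in @ np.array([[x]])   # hidden activations, shape (k, 1)
    return float((w_out @ h)[0, 0])

print(logit_with_redundancy(1))  # 1.0 -- single-neuron basis
print(logit_with_redundancy(4))  # 4.0 -- four parallel copies, 4x the logit
```

This is the opposite regime from compressive superposition, where scarce neurons are shared across features; here, surplus neurons are spent amplifying a single feature's signal.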