Freshman’s dream sparsity loss
A similar regularizer is known as Hoyer-Square.
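For comparison, here is a minimal sketch of Hoyer-Square as I understand its usual definition: the squared ratio of the L1 norm to the L2 norm, which is scale-invariant and ranges from 1 (one nonzero entry) to n (all entries equal in magnitude). The function name and the `eps` guard are my own assumptions, not part of any library.

```python
import numpy as np

def hoyer_square(x, eps=1e-12):
    """Hoyer-Square: (sum_i |x_i|)^2 / sum_i x_i^2.

    eps is an assumed guard against division by zero for the all-zero vector.
    """
    x = np.asarray(x, dtype=float)
    l1 = np.abs(x).sum()
    l2_sq = np.square(x).sum()
    return l1 ** 2 / (l2_sq + eps)

one_hot = np.array([0.0, 0.0, 3.0, 0.0])  # maximally sparse
dense = np.ones(4)                        # maximally dense

print(hoyer_square(one_hot))  # close to 1
print(hoyer_square(dense))    # close to 4 (the vector length)
```

Minimizing this quantity pushes a vector toward having few large entries, without rewarding it for simply shrinking everything.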
Pick a value for and a small . Then define the activation function in the following way. Given a vector , let be the value of the th-largest entry in . Then define the vector by
Is in the following formula a typo?
To clarify, I thought it was about superposition happening inside the projection afterwards.
This happens in transformer MLP layers. Note that the hidden dimen
Is the point that transformer MLPs blow up the hidden dimension in the middle?
Activation additions in generative models
Also related: https://arxiv.org/abs/2210.10960. They use a small neural network to generate steering vectors for the U-Net bottleneck of a diffusion model, and use these to edit images with CLIP.
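To make "adding a steering vector at a bottleneck" concrete, here is a toy sketch. Everything in it is hypothetical: the two-matrix "model", the names `W_enc`, `W_dec`, `forward`, and the random steering vector are stand-ins. In the linked paper the vector comes from a small network rather than being fixed, and the model is a diffusion U-Net rather than this toy.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy model: input (8) -> bottleneck (4) -> output (8).
W_enc = rng.normal(size=(8, 4))
W_dec = rng.normal(size=(4, 8))

def forward(x, steering=None, scale=1.0):
    """Run the toy model, optionally adding a steering vector at the bottleneck."""
    h = np.tanh(x @ W_enc)        # bottleneck activation
    if steering is not None:
        h = h + scale * steering  # the activation addition itself
    return h @ W_dec

x = rng.normal(size=8)
steer = rng.normal(size=4)        # assumed fixed vector, for illustration only

baseline = forward(x)
steered = forward(x, steering=steer, scale=2.0)
print(np.linalg.norm(steered - baseline))  # how far the output moved
```

Because the toy decoder is linear, the output shift is exactly `scale * steer @ W_dec`; in a real network the effect of the added vector passes through further nonlinear layers.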
From a conversation on Discord:
Do you have in mind a way to weigh sequential learning into the actual prior?
Dmitry:
good question! We haven't thought about an explicit complexity measure that would give this prior, but a very loose approximation that we've been keeping in the back of our minds could be a Turing machine/Boolean circuit version of the "BIMT" weight penalty from this paper https://arxiv.org/abs/2305.08746 (which they show encourages modularity at least in toy models)
Response:
Hmm, BIMT seems to only be about intra-layer locality. It would certainly encourage learning an ensemble of features, but I'm not sure if it would capture the interesting bit, which I think is the fact that features are built up sequentially from earlier to later layers and changes are only accepted if they improve local loss.
I'm thinking about something like the existence of a relatively smooth scaling law (?) as the criterion.
So, just some smoothness constraint that would basically integrate over paths SGD could take.
You could literally go through some giant corpus with an LLM and see which samples have gradients similar to those from training on a spelling task.
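A rough sketch of what that search could look like, using a tiny logistic-regression model in place of an LLM so that it runs anywhere. The model, the synthetic "task" and "corpus" data, and the helper names `per_sample_grad` and `cosine` are all assumptions made for illustration. With a real LLM you would compute per-sample gradients of the language-modelling loss instead.

```python
import numpy as np

def per_sample_grad(w, x, y):
    """Gradient of the logistic loss on a single (x, y) pair w.r.t. weights w."""
    p = 1.0 / (1.0 + np.exp(-x @ w))
    return (p - y) * x

def cosine(a, b, eps=1e-12):
    """Cosine similarity; eps is an assumed guard against zero norms."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + eps))

rng = np.random.default_rng(0)
w = rng.normal(size=5)  # stand-in for the model's current parameters

# Stand-in for the spelling task: a small labelled set.
task_X = rng.normal(size=(32, 5))
task_y = (task_X[:, 0] > 0).astype(float)
task_grad = np.mean(
    [per_sample_grad(w, x, y) for x, y in zip(task_X, task_y)], axis=0
)

# Stand-in for the giant corpus.
corpus_X = rng.normal(size=(200, 5))
corpus_y = rng.integers(0, 2, size=200).astype(float)

# Score each corpus sample by how well its gradient aligns with the task gradient.
scores = np.array(
    [cosine(per_sample_grad(w, x, y), task_grad) for x, y in zip(corpus_X, corpus_y)]
)
top = np.argsort(-scores)[:10]  # indices of the ten most task-like samples
print(top, scores[top])
```

The samples in `top` are the ones whose updates would push the parameters in roughly the same direction as training on the task.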