I agree that reducing superposition is probably valuable even if it requires a significantly larger network. I still don't understand why the transition from float to binary would cause a dramatic reduction in superposition capacity. But if it does prevent superposition, great! I'll just give it more parameters as needed. But if we still get superposition, I will need to apply other techniques to make it stop.
(I have not yet finished my closer re-read of Toy Models of Superposition after my initial skimming. Perhaps once I do I will understand better.)
Hopefully in a few months I will have empirical data regarding how much more neurons we need. Then I can stop hand waving about vague intuitions.
If we can get the unwanted cognition/behaviors to sit entirely in their own section of weights, we can then ablate the unwanted behaviors without losing wanted capability. That's my hope anyway.
My thoughts and hope as well.
I'm glad we agree that RNNs are nice.
So if I understand correctly, you are saying:
Consider that most of the trits will be 0 and thus removable, and that we will be replacing the activations with boolean logic and applying logic simplification transformations to discard even more nodes. The number of trits in the weights is not the same as the number of gates in the resulting logic graph. I think it plausible that even if we are forced to start with a LLM of greater than chinchilla size to achieve comparable accuracy, after sparsification and logic simplification we will end up with significantly fewer gates. Would such a LLM still be less interpretable?
If you want to be competitive with SOTA, a more quantized net will need a lot more neurons (have you read the new article on superposition?).
I agree that lower precision weights will likely requires somewhat more weights, however I do not see the connection to superposition. It is possible to embed >n features in n bits (assuming some feature sparsity). The features will be on the unit corners, but most of the area is there anyway, I do not think it would be a very large decrease in available space.
and that you would still need specialized tools to get anywhere.
I agree with this. I am currently attempting to build the needed tooling. It's nontrivial work, but I think it is doable.
Discretized weights/activation are very much not amenable to the usual gradient descent. :) Hence the usual practice is to train in floating point, and then quantize afterwords. Doing this naively tends to cause a big drop in accuracy, but there are tricks involving gradually quantizing during training, or quantizing layer by layer.
Current NN matrices are dense and continuous weighted. A significant part of the difficulty of interpretability is that they have all to all connections; it is difficult to verify that one activation does or does not affect another activation.
However we can quantize the weights to 3 bit and then we can probably melt the whole thing into pure combinational logic. While I am not entirely confident that this form is strictly better from an interpretability perspective, it is differently difficult.
"Giant inscrutable matrices" are probably not the final form of current NNs, we can potentially turn them into different and nicer form.