Interpretable by Design - Constraint Sets with Disjoint Limit Points
How can we constrain our models to be interpretable? Convex, linear sets make for more interpretable parameter spaces, and the simplex and the Birkhoff polytope are great examples of this that also have other desirable properties.

An interpretation is something explicit, something discrete, something that compresses, something that summarizes. Our current paradigms do not lend themselves well to this. We may be able to fine-tune models and interpretations via approaches built on Provable Guarantees for Model Performance via Mechanistic Interpretability, but in some sense we are fighting an uphill battle against "an uninterpretable base". In the same way that we want to create models that are inherently incapable of deception, rather than having to evaluate whether an unknown model is deceptive, we should aim to create models that are interpretable by default rather than applying interpretability post-hoc.

Building interpretable architectures and models from scratch, with the explicit goal of "simple" explanations, isn't impossible. Interpretability in most cases is compression and discretization, and given both 1) evidence that models can be compressed to effectively one bit per parameter, and 2) now-mature training schemes that work across a wide breadth of model architectures (e.g., Adam), creating networks with these properties is likely possible.

This post is a rough exploration of one direction that seems promising, motivated by some rough set theory and geometry. I tried to keep it short, and there's an appendix with a bunch of other related ideas and follow-ups.

Unbounded Sets are Hard to Interpret

Let's say we have a variable or parameter θ∈R. How do we interpret it? We can see what its value is and how far it is from some other number, like 0 or 1. We can also see how it relates to some other parameter. But we don't have any way to interpret it without adding additional points, or by imposing a group or field to give us relevance to the identity elements.
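As a minimal sketch of the kind of constraint I have in mind, assuming we simply reparameterize unconstrained values through a softmax so they land on the simplex (one choice among many; the function name here is just illustrative):

```python
import numpy as np

def to_simplex(theta):
    """Softmax: map unconstrained reals onto the interior of the probability simplex."""
    z = np.exp(theta - theta.max())  # subtract max for numerical stability
    return z / z.sum()

theta = np.array([-3.0, 0.5, 2.0, 2.0])  # unbounded parameters: hard to read on their own
p = to_simplex(theta)

# Each coordinate now lies in [0, 1], the coordinates sum to 1, and the vertices
# (one-hot vectors) act as reference points: p reads as "mostly items 2 and 3".
print(p, p.sum())
```

Once the parameters live on a bounded, convex set, every value can be read relative to the set's vertices rather than floating somewhere in R.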
I think this is roughly right. I think of it more as: a single layer would be a permutation, and composing these permutations would give you complex behaviors (that break down in these nice ways). As a starting point, having the hidden/model dimension equal to the input and output dimension would allow a "reasonable" first interpretation: you are using convex combinations of your discrete vocabulary to compose behaviors and come up with a prediction for your output. Then intermediate layers can map directly to your vocab space (this won't be true by default, though; you'd still need some sort of diagonalized prior or similar to make each basis direction correspond to an input vocab token).
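As a minimal sketch of that setup, assuming the hidden dimension equals the vocabulary size and each layer's weight matrix is pushed toward the Birkhoff polytope (convex combinations of permutations) via Sinkhorn normalization, so intermediate activations stay convex combinations of vocabulary basis vectors:

```python
import numpy as np

def sinkhorn(logits, n_iters=50):
    """Map an unconstrained square matrix toward a doubly stochastic one
    (a point in the Birkhoff polytope, i.e. a convex combination of permutations)."""
    M = np.exp(logits - logits.max())      # strictly positive entries
    for _ in range(n_iters):
        M /= M.sum(axis=1, keepdims=True)  # normalize rows
        M /= M.sum(axis=0, keepdims=True)  # normalize columns
    return M

vocab_size, n_layers = 8, 3
rng = np.random.default_rng(0)

# Hidden/model dimension equals the vocab dimension; each layer's weights live
# (approximately) in the Birkhoff polytope.
layers = [sinkhorn(rng.normal(size=(vocab_size, vocab_size))) for _ in range(n_layers)]

# Start from a one-hot token: a vertex of the simplex over the vocabulary.
x = np.zeros(vocab_size)
x[2] = 1.0

# Doubly stochastic maps send the simplex to itself, so every intermediate
# activation remains a convex combination of vocabulary basis vectors.
for W in layers:
    x = W @ x
    print(x.sum(), x.min() >= 0)  # ~1.0 and True at every layer
```

Reading those intermediate activations directly in vocab space still depends on each basis direction actually lining up with a token, which is exactly the diagonalized-prior caveat above.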