I never understood the SAE literature, which came after my earlier work (2019-2020) on sparse inductive biases for feature detection (i.e., semi-supervised decomposition of feature contributions) and on interpretability-by-exemplar via model approximations (over the representation space of models), which I originally developed with the goal of bringing deep learning to medicine. Since the parameters of large neural networks are non-identifiable, the mechanisms for interpretability must shift from understanding individual parameter values to semi-supervised...