This is a linkpost for https://arxiv.org/abs/2406.18344
That's a fascinating idea. Using human brain voxel maps as guidance would presumably also be possible for text, as they did for images. It seems like it would help us assess how human-like a model's ontology and internal workings are, and to what extent the natural abstractions hypothesis is true, at least for LLMs.
Combining this with VAEs, and comparing the two, might also be very illuminating.
Alternatively, for guidance that is less costly to acquire than human brain data, how about picking a (large) reference model and attempting to use (smaller) models to predict its activations across layers at some granularity?
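To make that last suggestion concrete, here is a minimal sketch of what "predicting a reference model's activations" could look like at its simplest: a ridge regression from one layer of a small model to one layer of a large one, scored by held-out R² per target unit. Everything here is an assumption for illustration, not something from the paper: the arrays are synthetic stand-ins for real activations, and the names `fit_ridge`, `r2_per_unit`, `small_acts`, and `large_acts` are hypothetical.

```python
import numpy as np

def fit_ridge(X, Y, lam=1e-2):
    """Closed-form ridge regression: W = (X^T X + lam * I)^-1 X^T Y."""
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ Y)

def r2_per_unit(Y_true, Y_pred):
    """Coefficient of determination, computed separately for each target unit."""
    ss_res = ((Y_true - Y_pred) ** 2).sum(axis=0)
    ss_tot = ((Y_true - Y_true.mean(axis=0)) ** 2).sum(axis=0)
    return 1.0 - ss_res / ss_tot

# Synthetic placeholder data (an assumption): in practice these would be
# activations recorded from the two models on the same inputs.
rng = np.random.default_rng(0)
n, d_small, d_large = 2000, 32, 64
small_acts = rng.normal(size=(n, d_small))
true_map = rng.normal(size=(d_small, d_large))
large_acts = small_acts @ true_map + 0.1 * rng.normal(size=(n, d_large))

# Fit on one split, evaluate on held-out samples.
train, test = slice(0, 1500), slice(1500, n)
W = fit_ridge(small_acts[train], large_acts[train])
scores = r2_per_unit(large_acts[test], small_acts[test] @ W)
```

Repeating this for every pair of layers would give a layer-by-layer map of how well the small model predicts the reference model, which is roughly the role the voxel maps play in the brain-guided setup.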
Authors: Huzheng Yang, James Gee, Jianbo Shi.
Abstract: