This was more of a research strategy than a specific project, and my foci have shifted substantially since this post.
Thanks to John Wentworth for pointers on an early draft.
TL;DR: I'm starting work on the Natural Abstraction Hypothesis from an overly-general formalization, and narrowing until it's true. This will start off purely information-theoretic, but I expect to add other maths eventually.
Motivation
My first idea upon learning about interpretability was to retarget the search. After reading a lot about deception auditing and Gabor filters, and finding myself starved of teloi, I felt that dream begin to die.
That was, until I found John Wentworth's work. We seem to have similar intuitions about a lot of things. I think this is an opportunity for stacking.
Counter-proofs narrow our search
If we start off with an accurate and overly-general model, we can prune it with counter-proofs until we're certain where NAH works. I think this approach is a better fit for me than building up from the ground.
Pruning also yields clear insights about where our models break; maybe NAH doesn't work if the system has property X, in which case we should ensure frontier AI doesn't have X.
Examples
As Nate Soares mentioned, we don't currently have a finite-example theory of abstraction. If we could lower-bound the convergence of abstractions in terms of, say, capabilities, that would give us finite-domain convergence and capabilities scaling laws. I think that information theory has tools for measuring these convergences, and that those tools "play nicer" when most of the problem is specified in such terms.
KL-Divergence and causal network entropy
Let $P$ be some probability distribution (our environment) and $Q$ be our model of it, where $Q$ is isomorphic to a causal network with $n$ nodes and $m$ edges. Can we upper-bound the divergence of abstractions given model accuracy and compute?
The Kullback-Leibler divergence $D_{\mathrm{KL}}(P \| Q)$ of two probability distributions $P$ and $Q$ tells us how accurately $Q$ predicts $P$, on average. This gives us a measure of world-model accuracy; what about abstraction similarity?
We could take the cross-entropy $H(P, Q)$ of each causal network $Q$, and compare it to that of the training limit $Q^*$ (the network minimizing $D_{\mathrm{KL}}(P \| Q)$).
Then our question looks like: Can we parameterize $H(P, Q)$ in terms of $D_{\mathrm{KL}}(P \| Q)$, $n$, and $m$? Does $H(P, Q^*)$ (where $D_{\mathrm{KL}}(P \| Q^*)$ is minimal) converge?
Probably not the way I want to; this is my first order of business.
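To make these quantities concrete, here's a minimal toy sketch (not my actual setup; the distributions and parameters are made up) comparing an "environment" $P$ with a two-node causal model $Q$ via KL divergence and cross-entropy:

```python
# Toy illustration: an "environment" joint distribution P over two binary
# variables vs. a two-node causal model Q (X -> Y). All numbers are made up.
import numpy as np

# Environment P(X, Y): rows index X, columns index Y.
P = np.array([[0.35, 0.15],
              [0.10, 0.40]])

# Model Q: causal network X -> Y, specified by Q(X) and Q(Y | X).
Q_X = np.array([0.5, 0.5])                 # Q(X)
Q_Y_given_X = np.array([[0.8, 0.2],        # Q(Y | X=0)
                        [0.3, 0.7]])       # Q(Y | X=1)
Q = Q_X[:, None] * Q_Y_given_X             # joint Q(X, Y)

# D_KL(P || Q): expected extra surprise from predicting with Q instead of P.
kl = np.sum(P * np.log2(P / Q))

# Cross-entropy H(P, Q) = H(P) + D_KL(P || Q): average bits to encode draws
# from P using a code optimal for Q.
entropy_P = -np.sum(P * np.log2(P))
cross_entropy = entropy_P + kl

print(f"D_KL(P || Q) = {kl:.4f} bits")
print(f"H(P)         = {entropy_P:.4f} bits")
print(f"H(P, Q)      = {cross_entropy:.4f} bits")
```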
Also, I don't know whether it would be better to call the entire causal graph the "abstractions", or to just compare nodes, edges, or subsets thereof. I also need a better model for compute than the number of causal network nodes and edges, since each edge can be a computationally intractable function.
And I need to find what ensures that computationally limited models are isomorphic to causal networks; this is probably the second area where I'll narrow my search.
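As a small illustration of why edge count is a poor compute proxy, here's a sketch under the simplifying (and possibly wrong) assumption that each node is discrete and parameterized by a conditional probability table: a single node's table already grows exponentially in its number of parents.

```python
# Sketch: in a discrete causal network, a node with k parents (each of a given
# arity) needs a conditional probability table with arity**k rows, so the cost
# "per edge" is far from constant. The arities and parent counts are arbitrary.
def cpt_parameters(arity: int, num_parents: int) -> int:
    """Free parameters in one node's conditional probability table."""
    return (arity ** num_parents) * (arity - 1)

for k in [1, 2, 5, 10, 20]:
    print(f"{k:2d} parents -> {cpt_parameters(arity=2, num_parents=k):,} parameters")
```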
Modularity
I expect that capability measures like KL-divergence won't imply helpful convergence by themselves, because a capable model can still lack evolved modularity. I think that stochastic updates of some sort are needed to push environment latents into the model, and that they might, e.g., need to be approximately Bayes-optimal.
Evolved modularity is a big delta for my credence in NAH. A True Name for modularity would plausibly be a sufficiently tight foundation for the abstraction I want.
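For reference, one existing (and probably far too weak) formalization is Newman's graph modularity score over a partition of the network's nodes. I'm not claiming this is the True Name; it's just a baseline to compare candidates against. A sketch, treating the causal graph as undirected:

```python
# One strawman formalization of modularity: Newman's graph modularity score
# over detected communities. The toy graph below is arbitrary: two dense
# clusters joined by a single bridging edge.
import networkx as nx
from networkx.algorithms.community import greedy_modularity_communities, modularity

G = nx.Graph()
G.add_edges_from([(0, 1), (1, 2), (0, 2),   # cluster A
                  (3, 4), (4, 5), (3, 5),   # cluster B
                  (2, 3)])                  # bridge

communities = greedy_modularity_communities(G)
score = modularity(G, communities)
print("Detected communities:", [sorted(c) for c in communities])
print(f"Modularity score: {score:.3f}")
```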
Larger models
Using the same conditions, do sufficiently large models contain the abstractions of smaller models? I.e., do much larger causal graphs always contain a subset of abstractions that converges to the smaller graphs' abstractions? Can we parameterize that convergence?
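One crude way to operationalize "contains the smaller model's abstractions" (my own assumption here, not an established definition) is to marginalize the larger model down to the smaller model's variables and measure the KL divergence between the two. A toy sketch with made-up distributions:

```python
# Toy operationalization: say the large model "contains" the small model's
# abstraction over X if marginalizing out its extra variable Z yields a
# distribution close (in KL) to the small model's. All numbers are made up.
import numpy as np

# Small model: distribution over a single binary variable X.
small_X = np.array([0.6, 0.4])

# Large model: joint over (X, Z) with an extra binary latent Z.
large_XZ = np.array([[0.35, 0.22],
                     [0.18, 0.25]])   # rows index X, columns index Z

large_X = large_XZ.sum(axis=1)        # marginalize out Z

kl = np.sum(small_X * np.log2(small_X / large_X))
print("Large model's marginal over X:", large_X)
print(f"D_KL(small || marginal of large) = {kl:.4f} bits")
```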
That question is the telos of the project: the True Name of natural abstractions in superintelligences.