Epistemic status: Early-stage model, and I'm relatively new to AI safety. This is also my first LessWrong post, but please don't hold my ideas and writing to a lower bar because of that. Prioritize candidness over politeness.

Thanks to John Wentworth for pointers on an early draft.


TL;DR: I'm starting work on the Natural Abstraction Hypothesis from an over-general formalization and narrowing it down until it's true. This will start off purely information-theoretic, but I expect to add other maths eventually.

 

Motivation

My first idea upon learning about interpretability was to retarget the search. After I read a lot about deception auditing and Gabor filters, and found myself starved of teloi, that dream began to die.

That was the case until I found John Wentworth's work. We seem to have similar intuitions about a whole lot of things, and I think this is an opportunity for stacking.

 

If we start off with an accurate and over-general model, we can prune with counter-proofs until we're certain where NAH works. I think this approach is a better fit for me than starting from the ground up.

Pruning also yields clear insights about where our model breaks; maybe NAH doesn't work if the system has property X, in which case we should ensure frontier AI doesn't have X.

 

Examples

As Nate Soares mentioned, we don't currently have a finite-example theory of abstraction. If we could lower-bound the convergence of abstractions in terms of, say, capabilities, that would give us finite-domain convergence and capabilities scaling laws. I think that information theory has tools for measuring these convergences, and that those tools "play nicer" when most of the problem is specified in such terms.
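To illustrate the shape of statement I'm after (the distance $d_{\mathrm{abs}}$, the set $A(M)$ of abstractions learned by a model $M$, and the convergent set $A^*$ are placeholder notation of mine, not established terms), such a bound might look like

$$d_{\mathrm{abs}}\big(A(M),\, A^*\big) \;\le\; f\big(\mathrm{capability}(M)\big), \qquad f(c) \to 0 \text{ as } c \to \infty.$$

Exhibiting an explicit $f$ would give both the finite-domain convergence and the scaling law in one stroke.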

 

KL-divergence and causal network entropy

Let $P$ be some probability distribution (our environment) and $Q$ be our model of it, where $Q$ is isomorphic to a causal network $G$ with $N$ nodes and $E$ edges. Can we upper-bound the divergence of abstractions given model accuracy and compute?

The Kullback-Leibler divergence $D_{\mathrm{KL}}(P \| Q)$ of two probability distributions $P$ and $Q$ tells us how accurately $Q$ predicts $P$, on average. This gives us a measure of world-model accuracy; what about abstraction similarity?
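As a quick toy illustration of the accuracy half before getting to that (the three-outcome environment and the particular numbers are made up for the example):

```python
import numpy as np

def kl_divergence(p, q, eps=1e-12):
    """D_KL(P || Q) in bits, for discrete distributions given as arrays."""
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    p, q = p / p.sum(), q / q.sum()   # normalize, just in case
    return float(np.sum(p * np.log2((p + eps) / (q + eps))))

# Toy "environment" P and two candidate world-models: Q1 (better), Q2 (worse).
P  = [0.5, 0.3, 0.2]
Q1 = [0.45, 0.35, 0.2]
Q2 = [0.1, 0.1, 0.8]

print(kl_divergence(P, Q1))   # small: Q1 predicts P well on average
print(kl_divergence(P, Q2))   # large: Q2 predicts P poorly
```

Base-2 logs so the numbers read as bits; nothing hinges on that choice.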

We could take the cross-entropy of each causal network $G$ against the training limit $G^*$, written $H(G, G^*)$, and use that as our comparison.

Then our question looks like: can we parameterize $H(G, G^*)$ in terms of $D_{\mathrm{KL}}(P \| Q)$ and $(N, E)$? Does $H(G, G^*)$ (where $G^*$ is the $D_{\mathrm{KL}}$-minimal network) converge?

Probably not in the way I want it to; this is my first order of business.
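To make that first order of business concrete, here's a rough numerical sketch of the kind of probe I have in mind. The two-variable causal network, the noise family, and the choice to read $H(G, G^*)$ as a plain cross-entropy between the networks' joint distributions are all simplifying assumptions of mine:

```python
import numpy as np

def joint(p_x, p_y_given_x):
    """Joint distribution of a two-node causal network X -> Y (binary variables)."""
    j = np.zeros((2, 2))
    for x in (0, 1):
        px = p_x if x == 1 else 1 - p_x
        for y in (0, 1):
            py = p_y_given_x[x] if y == 1 else 1 - p_y_given_x[x]
            j[x, y] = px * py
    return j

def kl(p, q, eps=1e-12):
    return float(np.sum(p * np.log2((p + eps) / (q + eps))))

def cross_entropy(p, q, eps=1e-12):
    return float(-np.sum(p * np.log2(q + eps)))

# Environment P, and the training limit G* (here simply P's own causal network).
P = joint(0.7, [0.2, 0.9])
G_star = P

# A family of imperfect models Q whose parameters approach the environment's.
for noise in (0.25, 0.15, 0.10, 0.05, 0.01):
    Q = joint(0.7 + noise, [0.2 + noise, 0.9 - noise])
    print(f"noise={noise:.2f}  D_KL(P||Q)={kl(P, Q):.4f}  H(G, G*)={cross_entropy(Q, G_star):.4f}")
```

In this toy case both quantities shrink together, which proves nothing; it's only meant to show what parameterizing $H(G, G^*)$ by $D_{\mathrm{KL}}(P \| Q)$ could cash out to numerically.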

Also, I don't know whether it would be better to call the entire causal graph "abstractions", or to just compare nodes, edges, or subsets thereof. I also need a better model for compute than the number of causal network nodes and edges, since each edge can be a computationally intractable function.

And I need to find what ensures that computationally limited models are isomorphic to causal networks; this is probably the second area where I'll narrow my search.

 

Modularity

I expect that capability measures like KL-divergence won't imply helpful convergence because of a lack of evolved modularity. I think that stochastic updates of some sort are needed to push environment latents into the model, and that they might e.g. need to be approximately Bayes-optimal.

Evolved modularity is a big delta for my credence in NAH. A True Name for modularity would plausibly be a sufficiently tight foundation for the abstraction I want.
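For concreteness, one off-the-shelf baseline (emphatically not a True Name, just the kind of quantity a real formalization ought to beat) is Newman-style modularity of the causal graph treated as undirected; the sketch below uses networkx:

```python
import networkx as nx
from networkx.algorithms.community import greedy_modularity_communities, modularity

# Toy causal network skeleton: two densely connected clusters joined by one bridge.
G = nx.Graph()
G.add_edges_from([(0, 1), (1, 2), (0, 2),   # cluster A
                  (3, 4), (4, 5), (3, 5),   # cluster B
                  (2, 3)])                  # bridge

communities = greedy_modularity_communities(G)
print("communities:", [sorted(c) for c in communities])
print("modularity Q:", modularity(G, communities))
```

Whatever the right notion is, it presumably has to care about the functions sitting on the edges rather than just the wiring, which a purely graph-theoretic score like this ignores.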

 

Larger models

Under the same conditions, do sufficiently larger models contain abstractions of the smaller models? I.e., do much larger causal graphs always contain a subset of abstractions which converges to the smaller graphs' abstractions? Can we parameterize that convergence?
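A brute-force way to check this on toy examples, using my own stand-in for "contains an abstraction of" (some subset of the larger model's variables whose marginal is close in KL to the smaller model's joint distribution):

```python
import itertools
import numpy as np

def kl(p, q, eps=1e-12):
    return float(np.sum(p * np.log2((p + eps) / (q + eps))))

def marginal(joint, keep_axes):
    """Marginalize a joint distribution (one ndarray axis per variable) onto keep_axes."""
    drop = tuple(ax for ax in range(joint.ndim) if ax not in keep_axes)
    return joint.sum(axis=drop)

def contains_abstraction(large_joint, small_joint, tol=0.05):
    """Find a subset of the large model's variables whose marginal matches the small model."""
    k = small_joint.ndim
    for axes in itertools.combinations(range(large_joint.ndim), k):
        if kl(small_joint, marginal(large_joint, axes)) < tol:
            return axes
    return None

# Toy example: a 3-variable "large" model that extends a 2-variable "small" one.
small = np.array([[0.4, 0.1],
                  [0.1, 0.4]])                     # joint over (X, Y)
large = small[:, :, None] * np.array([0.5, 0.5])   # same (X, Y), plus an independent Z
print(contains_abstraction(large, small))          # -> (0, 1)
```

The question above then becomes whether the tolerance needed for a match can be bounded in terms of the size gap between the two graphs.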

This is the telos of the project: the True Name of natural abstractions in superintelligences.

Comments

since each edge can be a computationally intractable function.

 

Also, neural nets only compute to some precision, and often work pretty well if that precision is reduced to 4–8 bits, which is a pretty significant limit on their computational capacity compared to assuming arbitrary precision.

Yes, I agree. I expect abstractions, typically, to involve much more than 4–8 bits of information. On my model, any neural network, be it an MLP, a KAN, or something new, will approximate abstractions with multiple nodes in parallel when the network is wide enough. I.e., the causal graph I mentioned is very distinct from the NN which might be running it.

Though now that you mention it, I wonder if low-precision NN weights are acceptable because of some network property (maybe SGD is so stochastic that higher precision doesn't help) or because of the environment (maybe natural latents tend to be lower-entropy)?

Anyways, thanks for engaging. It's encouraging to see someone comment.

…in terms of $D_{\mathrm{KL}}(P \| Q)$ and $(N, E)$? Does $H(G, G^*)$ (where…

You're assuming a lot of familiarity with your chosen notation, i.e. you probably just lost a significant fraction of your readers — I'd suggest spending a few sentences defining terminology.

Shoot, thanks. Hopefully it's clearer now.

Yes: thanks!