Epistemic status: Early-stage model, and I'm relatively new to AI safety. This is also my first LessWrong post, but please don't hold my ideas and writing to a lower bar because of that. Prioritize candidness over politeness.

Thanks to John Wentworth for pointers on an early draft.


TL;DR: I'm starting work on the Natural Abstraction Hypothesis from an over-general formalization and narrowing it down until it's true. This will start off purely information-theoretic, but I expect to add other maths eventually.

 

Motivation

My first idea upon learning about interpretability was to retarget the search. After I read a lot about deception auditing and Gabor filters, and found myself starved of teloi, that dream began to die.

That was the case until I found John Wentworth's work. We seem to have similar intuitions about a whole lot of things, and I think this is an opportunity for stacking.

 

If we start off with an accurate and over-general model, we can prune with counter-proofs until we're certain where NAH works. I think this approach is a better fit for me than starting from the ground up.

Pruning also yields clear insights about where our model breaks; maybe NAH doesn't work if the system has property X, in which case we should ensure frontier AI doesn't have X.

 

Examples

As Nate Soares mentioned, we don't currently have a finite-example theory of abstraction. If we could lower-bound the convergence of abstractions in terms of, say, capabilities, that would give us finite-domain convergence and capabilities scaling laws. I think that information theory has tools for measuring these convergences, and that those tools "play nicer" when most of the problem is specified in such terms.
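To illustrate the shape of statement I'm after (the distance $d_{\mathrm{abs}}$, the set $A(M)$ of abstractions learned by a model $M$, and the convergent set $A^*$ are placeholder notation of mine, not established terms), such a bound might look like

$$d_{\mathrm{abs}}\big(A(M),\, A^*\big) \;\le\; f\big(\mathrm{capability}(M)\big), \qquad f(c) \to 0 \text{ as } c \to \infty.$$

Exhibiting an explicit $f$ would give both the finite-domain convergence and the scaling law in one stroke.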

 

KL-divergence and causal network entropy

Let $P$ be some probability distribution (our environment) and $Q$ be our model of it, where $Q$ is isomorphic to a causal network $G$ with $N$ nodes and $E$ edges. Can we upper-bound the divergence of abstractions given model accuracy and compute?

The Kullback-Leibler divergence $D_{\mathrm{KL}}(P \| Q)$ of two probability distributions $P$ and $Q$ tells us how accurately $Q$ predicts $P$, on average. This gives us a measure of world-model accuracy; what about abstraction similarity?
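As a quick toy illustration of the accuracy half before getting to that (the three-outcome environment and the particular numbers are made up for the example):

```python
import numpy as np

def kl_divergence(p, q, eps=1e-12):
    """D_KL(P || Q) in bits, for discrete distributions given as arrays."""
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    p, q = p / p.sum(), q / q.sum()   # normalize, just in case
    return float(np.sum(p * np.log2((p + eps) / (q + eps))))

# Toy "environment" P and two candidate world-models: Q1 (better), Q2 (worse).
P  = [0.5, 0.3, 0.2]
Q1 = [0.45, 0.35, 0.2]
Q2 = [0.1, 0.1, 0.8]

print(kl_divergence(P, Q1))   # small: Q1 predicts P well on average
print(kl_divergence(P, Q2))   # large: Q2 predicts P poorly
```

Base-2 logs so the numbers read as bits; nothing hinges on that choice.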

We could take the cross-entropy of each causal network $G$ against the training limit $G^*$, written $H(G, G^*)$, and use that as our comparison.

Then our question looks like: can we parameterize $H(G, G^*)$ in terms of $D_{\mathrm{KL}}(P \| Q)$ and $(N, E)$? Does $H(G, G^*)$ (where $G^*$ is the $D_{\mathrm{KL}}$-minimal network) converge?

Probably not in the way I want it to; this is my first order of business.
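To make that first order of business concrete, here's a rough numerical sketch of the kind of probe I have in mind. The two-variable causal network, the noise family, and the choice to read $H(G, G^*)$ as a plain cross-entropy between the networks' joint distributions are all simplifying assumptions of mine:

```python
import numpy as np

def joint(p_x, p_y_given_x):
    """Joint distribution of a two-node causal network X -> Y (binary variables)."""
    j = np.zeros((2, 2))
    for x in (0, 1):
        px = p_x if x == 1 else 1 - p_x
        for y in (0, 1):
            py = p_y_given_x[x] if y == 1 else 1 - p_y_given_x[x]
            j[x, y] = px * py
    return j

def kl(p, q, eps=1e-12):
    return float(np.sum(p * np.log2((p + eps) / (q + eps))))

def cross_entropy(p, q, eps=1e-12):
    return float(-np.sum(p * np.log2(q + eps)))

# Environment P, and the training limit G* (here simply P's own causal network).
P = joint(0.7, [0.2, 0.9])
G_star = P

# A family of imperfect models Q whose parameters approach the environment's.
for noise in (0.25, 0.15, 0.10, 0.05, 0.01):
    Q = joint(0.7 + noise, [0.2 + noise, 0.9 - noise])
    print(f"noise={noise:.2f}  D_KL(P||Q)={kl(P, Q):.4f}  H(G, G*)={cross_entropy(Q, G_star):.4f}")
```

In this toy case both quantities shrink together, which proves nothing; it's only meant to show what parameterizing $H(G, G^*)$ by $D_{\mathrm{KL}}(P \| Q)$ could cash out to numerically.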

Also, I don't know whether it would be better to call the entire causal graph "abstractions", or to just compare nodes, edges, or subsets thereof. I also need a better model for compute than the number of causal network nodes and edges, since each edge can be a computationally intractable function.

And I need to find what ensures that computationally limited models are isomorphic to causal networks; this is probably the second area where I'll narrow my search.

 

Modularity

I expect that capability measures like KL-divergence won't imply helpful convergence because of a lack of evolved modularity. I think that stochastic updates of some sort are needed to push environment latents into the model, and that they might e.g. need to be approximately Bayes-optimal.

Evolved modularity is a big delta for my credence in NAH. A True Name for modularity would plausibly be a sufficiently tight foundation for the abstraction I want.
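For concreteness, one off-the-shelf baseline (emphatically not a True Name, just the kind of quantity a real formalization ought to beat) is Newman-style modularity of the causal graph treated as undirected; the sketch below uses networkx:

```python
import networkx as nx
from networkx.algorithms.community import greedy_modularity_communities, modularity

# Toy causal network skeleton: two densely connected clusters joined by one bridge.
G = nx.Graph()
G.add_edges_from([(0, 1), (1, 2), (0, 2),   # cluster A
                  (3, 4), (4, 5), (3, 5),   # cluster B
                  (2, 3)])                  # bridge

communities = greedy_modularity_communities(G)
print("communities:", [sorted(c) for c in communities])
print("modularity Q:", modularity(G, communities))
```

Whatever the right notion is, it presumably has to care about the functions sitting on the edges rather than just the wiring, which a purely graph-theoretic score like this ignores.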

 

Larger models

Under the same conditions, do sufficiently larger models contain abstractions of the smaller models? I.e., do much larger causal graphs always contain a subset of abstractions which converges to the smaller graphs' abstractions? Can we parameterize that convergence?
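A brute-force way to check this on toy examples, using my own stand-in for "contains an abstraction of" (some subset of the larger model's variables whose marginal is close in KL to the smaller model's joint distribution):

```python
import itertools
import numpy as np

def kl(p, q, eps=1e-12):
    return float(np.sum(p * np.log2((p + eps) / (q + eps))))

def marginal(joint, keep_axes):
    """Marginalize a joint distribution (one ndarray axis per variable) onto keep_axes."""
    drop = tuple(ax for ax in range(joint.ndim) if ax not in keep_axes)
    return joint.sum(axis=drop)

def contains_abstraction(large_joint, small_joint, tol=0.05):
    """Find a subset of the large model's variables whose marginal matches the small model."""
    k = small_joint.ndim
    for axes in itertools.combinations(range(large_joint.ndim), k):
        if kl(small_joint, marginal(large_joint, axes)) < tol:
            return axes
    return None

# Toy example: a 3-variable "large" model that extends a 2-variable "small" one.
small = np.array([[0.4, 0.1],
                  [0.1, 0.4]])                     # joint over (X, Y)
large = small[:, :, None] * np.array([0.5, 0.5])   # same (X, Y), plus an independent Z
print(contains_abstraction(large, small))          # -> (0, 1)
```

The question above then becomes whether the tolerance needed for a match can be bounded in terms of the size gap between the two graphs.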

This is the telos of the project: the True Name of natural abstractions in superintelligences.

Comments

since each edge can be a computationally intractable function.

 

Also, neural nets only compute to some precision, and often work pretty well if that precision is reduced to 4–8 bits, which is a pretty significant limit on their computational capacity compared to assuming arbitrary precision.

Yes, I agree. I expect abstractions, typically, to involve much more than 4–8 bits of information. On my model, any neural network, be it an MLP, a KAN, or something new, will approximate abstractions with multiple nodes in parallel when the network is wide enough. I.e., the causal graph I mentioned is very distinct from the NN which might be running it.

Though now that you mention it, I wonder if low-precision NN weights are acceptable because of some network property (maybe SGD is so stochastic that higher precision doesn't help) or because of the environment (maybe natural latents tend to be lower-entropy)?

Anyways, thanks for engaging. It's encouraging to see someone comment.

…in terms of $D_{\mathrm{KL}}(P \| Q)$ and $(N, E)$? Does $H(G, G^*)$ (where…

You're assuming a lot of familiarity with your chosen notation, i.e. you probably just lost a significant fraction of your readers — I'd suggest spending a few sentences defining terminology.

Shoot, thanks. Hopefully it's clearer now.

Yes: thanks!