A Solomonoff inductor walks into a bar in a foreign land. (Stop me if you’ve heard this one before.) The bartender, who is also a Solomonoff inductor, asks “What’ll it be?”. The customer looks around at what the other patrons are having, points to an unfamiliar drink, and says “One of those, please.”. The bartender points to a drawing of the same drink on a menu, and says “One of those?”. The customer replies “Yes, one of those.”. The bartender then delivers a drink, and it matches what the first inductor expected.
What’s up with that?
The puzzle, here, is that the two Solomonoff inductors seemingly agree on a categorization - i.e. which things count as the Unnamed Kind Of Drink, and which things don’t, with at least enough agreement that the customer’s drink-type matches the customer’s expectations. And the two inductors reach that agreement without learning the category from huge amounts of labeled data - one inductor points at an instance, another inductor points at another instance, and then the first inductor gets the kind of drink it expected. Why (and when) are the two inductors able to coordinate on roughly the same categorization?
Most existing work on Solomonoff inductors, Kolmogorov complexity, or minimum description length can’t say much about this sort of thing. The problem is that the customer/bartender story is all about the internal structure of the minimum description - the (possibly implicit) “categories” which the two inductors use inside of their minimal descriptions in order to compress their raw data. The theory of minimum description length typically treats programs as black boxes, and doesn’t attempt to talk about their internal structure.
In this post, we’ll show one potential way to solve the puzzle - one potential way for two minimum-description-length-based minds to coordinate on a categorization.
Main Tool: Natural Latents for Minimum Description Length
Fundamental Theorem
Here’s the main foundational theorem we’ll use. (Just the statement for now, more later.)
We have a set of n data points (binary strings) {xi}, and a Turing machine TM. Suppose we find some programs/strings Λ,{ϕi},Λ′,{ϕ′i} such that:
Mediation: (Λ,ϕ1,…,ϕn) is an approximately-shortest string such that (TM(Λ,ϕi) = xi for all i)
Redundancy: For all i, (Λ′,ϕ′i) is an approximately-shortest string such that TM(Λ′,ϕ′i) = xi.[1]
Then: the K-complexity of Λ′ given Λ, K(Λ′|Λ), is approximately zero - in other words, Λ′ is approximately determined by Λ, in a K-complexity sense.
(As a preview: later we’ll assume that both Λ and Λ′ satisfy both conditions, so both K(Λ′|Λ) and K(Λ|Λ′) are approximately zero. In that case, Λ and Λ′ are “approximately isomorphic” in the sense that either can be computed from the other by a short program. We’ll eventually tackle the customer/bartender puzzle from the start of this post by suggesting that Λ and Λ′ each encode a summary of things in one category according to one inductor, so the theorem then says that their category summaries are “approximately isomorphic”.)
The Intuition
What does this theorem mean intuitively?
Let’s start with the first condition: (Λ,ϕ1,…,ϕn) is an approximately-shortest string such that (TM(Λ,ϕi) = xi for all i). Notice that there’s a somewhat-trivial way to satisfy that condition: take Λ to be a minimal description of the whole dataset {xi}, take ϕi=i, and then add a little bit of code to Λ to pick out the datapoint at index ϕi[2]. So TM(Λ,ϕi) computes all of {xi} from Λ, then picks out index i. Now, that might not be the only approximately-minimal description (though it does imply that whatever approximately-minimal Λ,ϕ we do use is approximately a minimal description for all of x). Conceptually, insofar as there’s information in xi which is irrelevant to all the other xj’s, we could “move” that information from Λ into ϕi. But any information which is relevant to more than one xi has to be in Λ.
Now let’s consider the second condition: for all i, (Λ′,ϕ′i) is an approximately-shortest string such that TM(Λ′,ϕ′i) = xi. Again, there’s a somewhat-trivial way to satisfy the condition: take Λ′ to be empty, and ϕ′i to be a minimal description of xi. Again, that might not be the only approximately-minimal description. Conceptually, insofar as there’s information which is redundantly represented across all the xi’s, that information can be “moved” into Λ′. But Λ′ can only contain information which is redundantly represented across all xi’s.
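As a toy illustration of those two extremes (our own addition, not part of the original argument), here’s a minimal sketch using zlib-compressed length as a crude, computable stand-in for description length; the real K-complexity quantities are uncomputable, and all names and strings below are made up for the example.

```python
import zlib

def desc_len(s: bytes) -> int:
    # Crude computable stand-in for "length of a shortest description" of s.
    return len(zlib.compress(s, 9))

# Toy "data points": three strings sharing a lot of structure.
xs = [b"vodka shot, 40% abv, 50ml, served cold, glass #" + str(i).encode() for i in range(3)]

# Somewhat-trivial way to satisfy Mediation: Lambda describes the whole dataset,
# and phi_i is just the index i, so *all* shared information lives in Lambda.
lam = zlib.compress(b"\n".join(xs), 9)
phis = [str(i).encode() for i in range(len(xs))]
print("mediation-style total:", len(lam) + sum(len(p) for p in phis))

# Somewhat-trivial way to satisfy Redundancy: Lambda' is empty, and each phi'_i
# describes x_i on its own, so *no* information is forced to be shared.
lam2 = b""
phis2 = [zlib.compress(x, 9) for x in xs]
print("redundancy-style per point:", [len(lam2) + len(p) for p in phis2])
```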
Put those two together, and we get a simple intuitive statement of the theorem: if Λ contains all information which is relevant to more than one datapoint, and Λ′ contains only information which is redundantly represented across all datapoints, then Λ contains all the information in Λ′.
Intuitively: imagine that Λ is a pipe between xi and xj, and the only way for information to move between xi and xj is through that pipe. Then if some information Λ′ is present in both xi and xj, it must have gone through the pipe. It’s the same idea as deterministic natural latents, but now in a minimum description length setting.
So that’s the idea. Now we’re ready for the actual math.
Background: Working with Kolmogorov Complexity
We kicked around the word “approximation” a lot, in stating the theorem. Approximation in what sense?
When working with Kolmogorov complexity, a prototypical theorem looks like:
Take in the value of some K-complexity, K(X|Y). That means assuming there exists a program of length K(X|Y) which outputs X on input Y. Call that program F (for “function”).
Construct some other program F′ which either uses F as a subroutine, or relies somehow on the existence of F. The length of the new program is either the length of F (i.e. K(X|Y)) plus a little bit of new code, or just a little bit of new code. The theorem says the new code is small in a big-O sense compared to the length of F - i.e. O(1) or O(logK(X|Y)) is typical.
As an example: if Y is a shortest program to compute X, then
K(Y|X) ≤ O(log K(X))[3]
Here’s a rough sketch of the program of size O(log K(X)) which outputs Y on input X. The program will have K(X) hardcoded (which takes O(log K(X)) bits). It will also have hardcoded a number k, such that among inputless programs of size K(X) which output X, Y is the kth to halt. It turns out that there can’t be very many minimal programs (though we won’t prove that part here), so log k≤O(log K(X)). The program itself runs all programs of length K(X) until the kth halts and returns X, then returns the kth program to halt and return X; that code requires only a constant number of bits to specify. So, the total size of the program is O(log K(X)) bits.
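To make the size accounting in that sketch explicit (our own tally, following the description above):

```latex
\underbrace{O(\log K(X))}_{\text{hardcode } K(X)}
\;+\;
\underbrace{O(\log K(X))}_{\text{hardcode } k,\ \text{using } \log k \le O(\log K(X))}
\;+\;
\underbrace{O(1)}_{\text{run-all-length-}K(X)\text{-programs search loop}}
\;=\; O(\log K(X))
```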
That’s what the programs involved in prototypical K-complexity theorems tend to look like. The key idea to remember is that, wherever you see O(blah) terms, that usually means there’s some simple wrapper code of the relevant size.
More Precise Statement of the Fundamental Theorem
We have a set of n data points (binary strings) {xi}, and a Turing machine TM. Suppose we find some programs/strings Λ,{ϕi},Λ′,{ϕ′i} such that:
Mediation: len(Λ,ϕ1,…,ϕn) ≤ ϵ + min_{Λ,ϕ1,…,ϕn} len(Λ,ϕ1,…,ϕn) subject to (TM(Λ,ϕi) = xi for all i)
Redundancy: For all i, len(Λ′,ϕ′i) ≤ ϵ′ + min_{Λ′,ϕ′i} len(Λ′,ϕ′i) subject to TM(Λ′,ϕ′i) = xi.
Then: O(ϵ+ϵ′+log K(x)) ≥ K(Λ′|Λ).
The proof has been relegated to the appendix.
Implications: Minimality, Maximality, Isomorphism
When a string Λ satisfies both the mediation and redundancy conditions over some strings x1,…,xn, we call Λ a natural latent over x1,…,xn (overloading our terminology for "natural latents" in the probabilistic setting). Equipped with the fundamental theorem, we can now carry over all the standard properties of deterministic natural latents in the probabilistic setting to natural latents in the minimum description length setting.
Minimality
Suppose the string Λ satisfies both the mediation and redundancy conditions over some strings x1…xn:
Mediation: there exists ϕ1,…,ϕn such that (Λ,ϕ) has minimal length to within ϵ bits subject to (∀i: TM(Λ,ϕi) = xi).
Redundancy: For all i, there exists ϕ′i such that (Λ,ϕ′i) has minimal length to within ϵ′ bits subject to TM(Λ,ϕ′i) = xi.
Then Λ satisfies the mediation condition, and for any other string Λ′ which satisfies the mediation condition to within ϵ bits, the Fundamental Theorem tells us
O(ϵ+ϵ′+ log K(x1,x2))≥K(Λ|Λ′)
i.e. there exists a short program to compute Λ from Λ′. So Λ is a “minimal” string which satisfies the mediation condition; it can be computed from any other via a short program.
Maximality
Suppose, again, that the string Λ satisfies both the mediation and redundancy conditions over some strings x1…xn:
Mediation: there exists ϕ1,…,ϕn such that (Λ,ϕ) has minimal length to within ϵ bits subject to (∀i: TM(Λ,ϕi) = xi).
Redundancy: For all i, there exists ϕ′i such that (Λ,ϕ′i) has minimal length to within ϵ′ bits subject to TM(Λ,ϕ′i) = xi.
Then Λ satisfies the redundancy condition, and for any other string Λ′ which satisfies the redundancy condition to within ϵ′ bits, the Fundamental Theorem tells us
O(ϵ+ϵ′+ log K(x1,x2))≥K(Λ′|Λ)
i.e. there exists a short program to compute Λ′ from Λ. So Λ is a “maximal” string which satisfies the redundancy condition; any other can be computed from Λ via a short program.
Isomorphism
Finally, suppose that both the string Λ and the string Λ′ satisfy both mediation and redundancy to within ϵ and ϵ′ bits, respectively. Then
O(ϵ+ϵ′+ log K(x1,x2))≥K(Λ|Λ′)
and
O(ϵ+ϵ′+ log K(x1,x2))≥K(Λ′|Λ)
So either of the strings Λ,Λ′ can be computed from the other via a short program. In that sense, the two are “isomorphic”.
Main Takeaway
The main upshot which we’ll use in the next section is the isomorphism property, so we’ll restate it here in full.
Suppose the string Λ satisfies both the mediation and redundancy conditions over some strings x1…xn:
Mediation: there exists ϕ1,…,ϕn such that (Λ,ϕ) has minimal length to within ϵ bits subject to (∀i: TM(Λ,ϕi) = xi).
Redundancy: For all i, there exists ϕ′i such that (Λ,ϕ′i) has minimal length to within ϵ′ bits subject to TM(Λ,ϕ′i) = xi.
Assume that Λ′ also satisfies both conditions:
Mediation: there exists ϕ1,…,ϕn such that (Λ′,ϕ) has minimal length to within ϵ bits subject to (∀i: TM(Λ′,ϕi) = xi).
Redundancy: For all i, there exists ϕ′i such that (Λ′,ϕ′i) has minimal length to within ϵ′ bits subject to TM(Λ′,ϕ′i) = xi.
Intuitively, these say that both Λ and Λ′ capture approximately all the information which is relevant to more than one xi, and approximately only the information which is redundantly represented across all xi.
Then:
O(ϵ+ϵ′+ log K(x1,x2))≥K(Λ|Λ′)
O(ϵ+ϵ′+ log K(x1,x2))≥K(Λ′|Λ)
In other words: a short program exists to compute Λ from Λ′, and a short program exists to compute Λ′ from Λ.
Back to Solomonoff’s Bar
Let’s talk about the problem faced by our two Solomonoff inductors - the customer and the bartender.
As a (very) simplified model, let’s say that they both start out with matching background models of what generally happens when one places an order in a bar - i.e. certain words are spoken, a liquid shows up in a container, all that jazz. We’ll assume that the two already break up the world into a bunch of individual instances of “drinks” (a pint on the bar over here, a glass on the table over there, …) all of which are observed by both inductors; that will be our most egregious simplification of the problem. We’ll call those individual drink-instances in the world x1,…,xn. The (very simplified) problem the two then face is how to categorize various physical instances of liquids-in-containers, such that when the customer asks for a drink in a certain category (i.e. “one of those” + pointing), they end up with the kind of drink they expected.
By contrast, what would be a bad outcome for the two? Well, maybe the bartender just categorizes drinks by booze content, and the customer just categorizes drinks by color, so the customer thinks they’re asking for a blue drink but the bartender thinks they’re asking for a drink with 10% alcohol content, and the customer unexpectedly-to-them ends up with a pink drink. That’s the sort of thing we want to avoid.
So the customer and the bartender face a coordination problem - both want the customer to not be surprised at their drink. And it’s a coordination problem in the Schelling sense, insofar as the two will not be able to exchange anywhere near enough data to brute-force learn each others’ categorizations.
How can the two use the theorems here to coordinate on a choice of categorization, without exponentially huge amounts of communication? We’ll tackle the problem in two steps:
First, an even simpler problem in which there’s just one drink type, in order to illustrate what parts of the problem are handled by the isomorphism argument.
Second, the simplified customer/bartender problem itself, with multiple drink types.
Even Simpler Subproblem: Just One Drink Type
Suppose there’s just one drink type - maybe this is Soviet Russia, and a shot of vodka is the only option. We still have a version of the customer/bartender problem: they should both roughly agree on what a shot of vodka is. If the bartender prepares a thing which they think of as a shot of vodka, the customer should not receive a thing which they think of as, say, a giraffe.
Well, both inductors see x1,…,xn, which are all shot-of-vodka-instances. Suppose they both look for a natural latent over those instances, and both find one: Λ for the customer, Λ′ for the bartender. Then we know that the two have found approximately-isomorphic strings.
Now, the bartender creates a new thing xn+1, which is approximately-optimally compressed as passing a new incompressible string (i.e. noise) to the generator Λ′, just like the bartender models all the other xi’s as having been generated. Λ′ is therefore natural over x1,…,xn+1. And since Λ′ is approximately isomorphic to Λ, Λ is natural over x1,…,xn+1. So, when the customer receives their drink, they see that it’s approximately-optimally compressed by Λ plus some incompressible noise, and it’s also approximately-optimally compressed jointly with x1,…,xn by Λ plus some incompressible noise. It is, in other words, basically what the customer expected.
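Here’s a minimal sketch of that story (our own illustration, not the post’s construction): zlib with a preset dictionary stands in for “description length given Λ”, a literal recipe string stands in for the shared latent, and make_drink is a made-up stand-in for TM(Λ, ϕ).

```python
import os
import zlib

def desc_len(s: bytes) -> int:
    # Stand-in for unconditional description length.
    return len(zlib.compress(s, 9))

def desc_len_given(s: bytes, latent: bytes) -> int:
    # Stand-in for "description length of s given the latent":
    # compress s using the latent as a preset dictionary.
    c = zlib.compressobj(level=9, zdict=latent)
    return len(c.compress(s) + c.flush())

def make_drink(latent: bytes, noise: bytes) -> bytes:
    # Toy stand-in for TM(Lambda, phi): shared recipe plus instance-specific noise.
    return latent + b" | batch " + noise.hex().encode()

# The customer's and bartender's latents; approximately isomorphic (here: literally equal).
lam_customer = lam_bartender = b"vodka shot, 40% abv, 50ml, served cold"

# The bartender generates a fresh drink from their latent plus new incompressible noise...
x_new = make_drink(lam_bartender, os.urandom(8))

# ...and the customer finds it cheap to describe given *their* latent (plus noise),
# i.e. roughly what they expected, rather than something wholly surprising.
print("given latent:", desc_len_given(x_new, lam_customer))
print("from scratch:", desc_len(x_new))
```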
So that’s the basic story. Now let’s kick it up a notch, and handle multiple drink-types.
Multiple Drink Types: Categorizing by Shared Natural Latent
Our two inductors both start out with the “data points” x1,…,xn, each of which is one physical instance of a drink. Then one obvious move is to look for a natural latent over these strings, or over various subsets of the strings.
In general, there’s no guarantee that any natural latent exists at all (to within reasonable approximation). And even if natural latents do exist over some subsets of the data points, they might have some complicated structure - e.g. maybe there’s a natural latent between x1 and x2 and another between x2 and x3 but those two have very different stuff in them, or maybe there’s hierarchical structure like e.g. a natural latent over the first half of the data and another over the second half and then a natural latent over those two natural latents, or ???. Most of those possibilities don’t look much like “categories of drinks”, at least not in a simple sense. But perhaps our intuitive notion of simple “categories” maps to a specific kind of natural latent structure?
Here’s a simple structure which seems like a decent fit.
Imagine that we have 8 drink-instances x1,…,x8. We find a natural latent Λa between x1 and x2. On further investigation, we find that the same string Λa is natural over all of x1,x2,x5, and x7.[4] Then we find another string Λb which is natural over x3 and x4, and a third string Λc which is natural over x6 and x8. Finally, we find that the strings {x1,x2,x5,x7} are independent[5] of the strings {x3,x4} and all those are independent of the strings {x6,x8}.
We can summarize all that with the diagrams:
where we use the “⊥” symbol to denote naturality, and unconnected diagrams next to each other indicate independence.
In this case, it makes a lot of intuitive sense to view each of the three latents as a generator or specification for a certain category of drink. There is no compressible shared information across drinks in different categories, and Λα captures all-and-only the information shared across drinks with category α, for each category α.
If such Λ’s exist, how can the customer and bartender use them to solve their coordination problem?
Well, assume they’re both looking for Λ’s with this structure - i.e. they both want to partition the set of drinks such that there exists a nonempty natural latent over each subset, and independence[6] across subsets. If that’s possible at all, then the partition is unique: pairwise independent strings must go in different subsets, and pairwise non-independent strings must go in the same subset. (If we consider a graph in which each node is one of the strings xi and two nodes are connected iff they are non-independent, then the connected components of that graph must be the subsets of the partition.)[7]
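As an illustrative sketch of that partitioning step (our own, with zlib-compressed length as a stand-in for description length and a hand-tuned slack threshold standing in for “approximately independent”): test each pair for non-independence, then take connected components.

```python
import zlib
from itertools import combinations

def desc_len(s: bytes) -> int:
    return len(zlib.compress(s, 9))

def non_independent(a: bytes, b: bytes, slack: int = 24) -> bool:
    # Proxy for "not independent in the MDL sense": the joint description is
    # noticeably shorter than the two separate descriptions combined.
    return desc_len(a + b) + slack < desc_len(a) + desc_len(b)

def partition(xs):
    # Connected components of the graph whose edges are the non-independent pairs.
    parent = list(range(len(xs)))
    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i
    for i, j in combinations(range(len(xs)), 2):
        if non_independent(xs[i], xs[j]):
            parent[find(i)] = find(j)
    groups = {}
    for i in range(len(xs)):
        groups.setdefault(find(i), []).append(i)
    return list(groups.values())

drinks = [
    b"house lager: golden, lightly hopped, 5% abv, chilled pint glass, foamy head",
    b"house lager: golden, lightly hopped, 5% abv, chilled pint glass, thin head",
    b"espresso martini: vodka, coffee liqueur, fresh espresso, shaken, coupe, three beans",
    b"espresso martini: vodka, coffee liqueur, fresh espresso, shaken, coupe, no garnish",
]
print(partition(drinks))  # expect the two lagers in one group, the two martinis in the other
```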
With the partition nailed down, we next invoke uniqueness (up to isomorphism) of natural latents within each subset, i.e. the “isomorphism” result we proved earlier.[8]
As long as the customer and bartender are both looking for this kind of structure in the strings x1,…,xn, and the structure exists to be found in the strings, they will find the same partition (modulo approximation) and approximately isomorphic latents characterizing each class of drinks.
So the customer walks into the bar. They point at a drink - let’s say it’s x7 - and say “One of those, please.”. The bartender has x7 in subset a, with generator Λa. Just to double check, they point to a picture of a drink also generated by Λa, and the customer confirms that’s the type of drink they want. The bartender generates a drink accordingly. And since the customer and bartender have approximately isomorphic Λa, the customer gets what they expected, to within reasonable approximation.
Formal Specification
Both inductors look for
some strings Λ1,…,Λk (conceptually: Λj encodes a generator for drink type j),
integers c1,…,cn in [1,…,k] (conceptually: the category of each drink-instance),
and bitstrings ϕ1,…,ϕn,ϕ′1,…,ϕ′n,ϕ′′1,…,ϕ′′n (conceptually: random noise encoding details of each drink-instance for each of the optimization problems)
such that
Mediation: for each category c∗, (Λc∗, {ϕi : ci=c∗}) is minimal to within ϵ subject to (∀i s.t. ci=c∗: TM(Λci,ϕi) = xi)
Redundancy: For all i, (Λci,ϕ′i) is minimal to within ϵ′ subject to TM(Λci,ϕ′i) = xi
Independence across categories: (Λ,ϕ′′) is minimal to within ϵ′′ subject to (∀i: TM(Λci,ϕ′′i) = xi)
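Here’s a rough sketch of how one might score a candidate categorization against a crude proxy for these conditions - our own illustration, with zlib preset-dictionary compression standing in for “length of ϕ such that TM(Λ,ϕ) = x”. It only tallies total description length, in the spirit of the mediation and independence conditions, and does not check redundancy.

```python
import zlib

def L(s: bytes) -> int:
    # Proxy for description length.
    return len(zlib.compress(s, 9))

def L_given(s: bytes, lam: bytes) -> int:
    # Proxy for the length of phi with TM(lam, phi) = s:
    # compress s with lam as a preset dictionary.
    c = zlib.compressobj(level=9, zdict=lam)
    return len(c.compress(s) + c.flush())

def total_description_length(xs, cats, lams):
    # Pay once for each category's latent, plus each data point given its category's latent.
    return sum(L(lam) for lam in lams.values()) + sum(L_given(x, lams[c]) for x, c in zip(xs, cats))

xs = [b"lager, golden, 5% abv, pint", b"lager, golden, 5.2% abv, pint", b"espresso martini, vodka, coffee, coupe"]
lams = {"a": b"lager, golden, 5% abv, pint", "b": b"espresso martini, vodka, coffee, coupe"}
print(total_description_length(xs, ["a", "a", "b"], lams))  # sensible assignment: smaller total
print(total_description_length(xs, ["b", "b", "a"], lams))  # scrambled assignment: larger total
```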
Note that all of these might be conditional on other background knowledge of the inductors - i.e. the Turing machine TM has access to a database full of all the other stuff the inductors know. However, they do need to condition only on things which they expect the other inductor to also know, since we’ve assumed the two inductors use the same Turing machine.[9]
If you’ve been wondering why on Earth we would ever expect to find such simple structures in the complicated real world, conditioning on background knowledge is the main answer. Furthermore, our current best guess is that most of that background knowledge is also itself organized via natural latents, so that the customer and bartender can also coordinate on which knowledge to condition on.
Conclusion & Takeaways
First, let’s recap the core theorem of this post.
Suppose that some string Λ satisfies both the following conditions over strings x1…xn for a Turing machine TM:
Mediation: there exists ϕ1,…,ϕn such that (Λ,ϕ) has minimal length to within ϵ bits subject to (∀i: TM(Λ,ϕi) = xi).
Redundancy: For all i, there exists ϕ′i such that (Λ,ϕ′i) has minimal length to within ϵ′ bits subject to TM(Λ,ϕ′i) = xi.
Further, suppose that some string Λ′ also satisfies those conditions over the same strings x1…xn and Turing machine TM. Then:
O(ϵ+ϵ′+ log K(x1,x2))≥K(Λ|Λ′)
O(ϵ+ϵ′+ log K(x1,x2))≥K(Λ′|Λ)
In other words: a short program exists to compute Λ from Λ′, and a short program exists to compute Λ′ from Λ.
Conceptually, we imagine two different agents each look for approximately-best compressions in the same form as the mediation and redundancy conditions. One agent finds Λ, one finds Λ′. Then the two have found approximately isomorphic compressions.
Crucially, these are approximately isomorphic “parts of” compressions - i.e. they separate out “signal” Λ from “noise” ϕ, and the two agents approximately agree on what the “signal” and “noise” each have to say about the data. So, they’ll approximately agree on what it would mean to sample new data points with the same “signal” but different random “noise” - e.g. the new drink-instance which the bartender prepares for the customer.
Conceptually, both the questions we tackle here and the methods used to answer them are similar to our semantics post, with two big differences. First, we’re now working with minimum description length rather than generic Bayesian probability, which means everything is uncomputable but conceptually simpler. Second, our method here for “multiple drink types” was actually totally different and non-analogous to the corresponding method in the semantics post, even though the two methods used similar low-level tools (i.e. natural latents).[10] One item for potential future work is to transport the method used in the semantics post over to the minimum description length context, and the method used in this post over to the generic Bayesian context. Another potential item is to unify the two; intuitively they seem closely related.
Our main hope is that other people will be able to pick up these methods and use them to characterize other kinds of concepts which different minds are able to converge upon and communicate about. We’ve only just scratched the surface here.
Appendix: Proof of the Fundamental Theorem
We’re going to use essentially the same diagrammatic proof as Deterministic Natural Latents (don’t worry, we don’t assume you’ve read that). Of course, now we’re working with Kolmogorov complexity, so we have a new interpretation of each diagram, and we’ll need to prove all the diagrammatic rules we need to use.
Interpretation & Manipulation of Diagrams
Let’s start with an example diagram: Y ← X → Z.
We say that this diagram “holds for” three strings X, Y, and Z iff
ϵ≥K(X)+K(Y|X)+K(Z|X)−K(X,Y,Z)
where ϵ will often be a big-O term.
More generally, a diagram is a directed acyclic graph in which each node i is a string Xi, along with an error term ϵ. The diagram holds iff
ϵ≥∑iK(Xi|Xpa(i))−K(X)
where pa(i) denotes the parents of node i in the graph.
Note that the same string can appear more than once! For instance, the diagram Y ← X → Y
says that ϵ≥K(X)+K(Y|X)+K(Y|X)−K(X,Y,Y). Using the chain rule, that simplifies to
O(log K(X,Y)) + ϵ ≥ K(Y|X)
i.e. Y has only logarithmic complexity given X (assuming the log term dominates ϵ). We’ll use diagrams like that one several times below, in order to indicate that one string has quite low complexity given another.
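As an illustrative sketch (ours, not the post’s) of what it means for a diagram to “hold”, here’s the defect ∑iK(Xi|Xpa(i))−K(X) computed with zlib-compressed length as a stand-in for K; the real quantities are uncomputable, and the strings are made up.

```python
import zlib

def K(s: bytes) -> int:
    # Proxy for K(s).
    return len(zlib.compress(s, 9))

def K_given(s: bytes, context: bytes) -> int:
    # Proxy for K(s | context): compress s with the context as a preset dictionary.
    c = zlib.compressobj(level=9, zdict=context)
    return len(c.compress(s) + c.flush())

def diagram_defect(strings, parents):
    # Proxy for sum_i K(X_i | X_pa(i)) - K(X); the diagram "holds to within eps"
    # when this quantity is at most eps. parents[i] lists the parent indices of node i.
    total = 0
    for i, s in enumerate(strings):
        ctx = b"".join(strings[j] for j in parents[i])
        total += K_given(s, ctx) if ctx else K(s)
    return total - K(b"".join(strings))

X = b"base recipe: vodka, 40% abv, 50ml"
Y = X + b", served cold"
Z = X + b", with a lemon twist"
# The example diagram Y <- X -> Z: X has no parents, Y and Z each have parent X.
print(diagram_defect([X, Y, Z], {0: [], 1: [0], 2: [0]}))
```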
Now for some rules. We’ll need three:
The Dangly Bit Lemma
Re-Rooting of Markov Chains
Marginalization in Markov Chains
The Dangly Bit Lemma
If Y←X→Y holds to within ϵ bits, and any other diagram D involving X holds to within ϵ′ bits, then we can create a new diagram D′ which is identical to D but has another copy of Y (the “dangly bit”) as a child of X. The new diagram D′ will hold to within ϵ+ϵ′ bits.
Proof:
Assume the diagram D is over X and some other variables Z; it asserts that
ϵ′≥∑iK((X,Z)i|(X,Z)pa(i))−K(X,Z)
Since Y←X→Y holds to within ϵ bits,
O(log K(X,Y)+ϵ)≥K(Y|X)
(as shown in the previous section). Let’s add those two inequalities:
ϵ′+O(log K(X,Y)+ϵ)≥K(Y|X)+∑iK((X,Z)i|(X,Z)pa(i))−K(X,Z)
And since K(X,Z)≤K(X,Y,Z)+O(1), we can replace the K(X,Z) term to get
ϵ′+O(log K(X,Y)+ϵ)≥K(Y|X)+∑iK((X,Z)i|(X,Z)pa(i))−K(X,Y,Z)
which is the inequality asserted by the new diagram D’.
Re-Rooting of Markov Chains
Suppose we have a “Markov chain” diagram X1←X2←…←Xk→…→Xn−1→Xn to within ϵ bits. That diagram asserts
ϵ≥K(Xk)+∑i<kK(Xi|Xi+1)+∑i>kK(Xi|Xi−1)−K(X)
Then we can move the root node right one step to form the diagram X1←X2←…←Xk+1→…→Xn−1→Xn to within ϵ+O(log K(Xk,Xk+1)) bits. Likewise for moving left one step.
Proof:
We’ll do the proof for moving right one step; the proof for moving left one step works the same way.
ϵ≥K(Xk)+∑i<kK(Xi|Xi+1)+∑i>kK(Xi|Xi−1)−K(X)
=K(Xk)+K(Xk+1|Xk)+∑i<kK(Xi|Xi+1)+∑i>k+1K(Xi|Xi−1)−K(X)
≥K(Xk,Xk+1)−O(1)+∑i<kK(Xi|Xi+1)+∑i>k+1K(Xi|Xi−1)−K(X)
≥K(Xk+1)+K(Xk|Xk+1)−O(logK(Xk,Xk+1))+∑i<kK(Xi|Xi+1)+∑i>k+1K(Xi|Xi−1)−K(X)
=K(Xk+1)+∑i<k+1K(Xi|Xi+1)+∑i>k+1K(Xi|Xi−1)−K(X)−O(logK(Xk,Xk+1))
By applying this rule repeatedly, we can arbitrarily re-root a Markov chain diagram.
Marginalization in Markov Chains
Suppose the diagram W→X→Y→Z holds to within ϵ bits. Then W→Y→Z holds to within ϵ+O(log K(W,X,Y,Z)) bits.
Proof:
ϵ≥K(W)+K(X|W)+K(Y|X)+K(Z|Y)−K(W,X,Y,Z)
≥K(W)+K(X|W)+K(Y|W,X)+K(Z|Y)−K(W,X,Y,Z)−O(1)
=K(Z|Y)−K(Z|W,X,Y)−O(log K(W,X,Y,Z))
≥K(Z|Y)−K(Z|W,Y)−O(log K(W,X,Y,Z))
=K(W)+K(Y|W)+K(Z|Y)−K(W,Y,Z)−O(log K(W,X,Y,Z))
This proof also extends straightforwardly to longer Markov chains; that’s a good exercise if you’re looking for one.
When working with a longer chain X1→…→Xn, note that reversing the direction of arrows in the Markov chain and further marginalization both also cost no more than O(log K(X)), so we can marginalize out multiple variables and put the root anywhere for the same big-O cost in the bounds.
Proof of the Fundamental Theorem
With all the foundations done, we’re now ready for the main proof. Here’s the full statement again:
We have a set of n data points (binary strings) {xi}, and a Turing machine TM. Suppose we find some programs/strings Λ,{ϕi},Λ′,{ϕ′i} such that:
Mediation: len(Λ,ϕ1,…,ϕn) ≤ ϵ + min_{Λ,ϕ1,…,ϕn} len(Λ,ϕ1,…,ϕn) subject to (TM(Λ,ϕi) = xi for all i)
Redundancy: For all i, len(Λ′,ϕ′i) ≤ ϵ′ + min_{Λ′,ϕ′i} len(Λ′,ϕ′i) subject to TM(Λ′,ϕ′i) = xi.
Then: O(ϵ+ϵ′+log K(x)) ≥ K(Λ′|Λ).
First, we’ll show that the mediation condition implies the diagram in which Λ is the sole parent of each xi, i.e. O(ϵ+log K(x)) ≥ K(Λ)+∑iK(xi|Λ)−K(Λ,x),
and the redundancy condition implies the diagram Λ′ ← xi → Λ′ for each i, i.e. O(ϵ′+log K(x)) ≥ K(Λ′|xi).
Following those two parts of the proof, the rest will proceed graphically.
Mediation -> Graph
Note that the mediation condition can always be satisfied to within O(n log n)[11] by taking Λ to be a shortest program for x plus O(1) code to index into the result at position ϕi, and taking ϕi=i. So:
1: If (Λ,ϕ) are to be minimal to within ϵ, they must satisfy len(Λ,ϕ) ≤K(x)+ϵ+O(log K(x))[12]
Furthermore, for (Λ,ϕ) to be minimal to within ϵ for anything:
2: (Λ,ϕ) must themselves be incompressible to within that same ϵ: len(Λ,ϕ) ≤K(Λ,ϕ)+ϵ+O(log K(Λ,ϕ)) = K(x)+ϵ+O(log K(x)). As a corollary, Λ and ϕ must individually be incompressible to within that same bound or else together they couldn’t be minimal either.
Additionally note that ϕi (plus some constant-size wrapper code) specifies xi given Λ, so:
3: K(xi|Λ) ≤ len(ϕi)+O(log len(ϕi)) = len(ϕi)+O(log K(xi|Λ))
Now we put all that together. (2) Λ is incompressible to within ϵ+O(log K(x)) and (1) Λ, ϕ together are a shortest-to-within ϵ+O(log K(x)) program for x:
K(Λ,x)≥ len(Λ) +∑i len(ϕi) −O(ϵ+log K(x))
and:
≥K(Λ)+∑iK(xi|Λ)−O(ϵ+logK(x))
since Λ specifies Λ (a string is a description of itself, up to constant-size wrapper code) and (3) ϕi specifies xi given Λ.
Redundancy -> Graph
Lemma: suppose Y is a minimal-to-within ϵ+O(log K(x)) program which outputs X. Then by the chain rule
K(X,Y)=K(X|Y)+K(Y)+O(log K(X,Y))≤0+K(X)+ϵ+O(log K(X))
Applying the chain rule in the other order:
K(X,Y)=K(Y|X)+K(X)+O(log K(X,Y))=K(Y|X)+K(X)+O(log K(X))
Equating and canceling, we find
K(Y|X)≤ϵ+O(log K(X))
Note that (Λ′,ϕ′i) together are a minimal-to-within ϵ′+O(log K(x)) program for xi, so by the previous lemma:
K(Λ′|xi)≤K(Λ′,ϕ′i|xi)+O(log K(Λ′))≤ϵ′+O(log K(xi))
Thus the diagram Λ′ ← xi → Λ′ holds, to within O(ϵ′+log K(x)), for each i.
The Graphical Part
From here, we just directly carry over the proof from Deterministic Natural Latents, using the lemmas and bounds proven in this appendix.
That completes the proof.
Footnotes
[1] Note that the quantifiers have importantly different scopes between these two conditions.
[2] Note that we’re mainly interested in the case where individual datapoints are large/rich. The O(n log n) overhead from the indexing is fine, since it doesn’t scale with the complexity of the individual datapoints.
[3] Read K(Y|X) in this example as the length of the shortest program which, given X as input, returns the shortest program for computing X.
[4] Aside: if a string is natural over some strings y1, …, yn, then it’s also natural over any subset consisting of two or more of those strings.
[5] Independent in the minimum description length sense, i.e. the length of the joint shortest description is approximately the sum of the lengths of the separate shortest descriptions.
[6] Reminder again that this is independence in the minimum description length sense, i.e. the length of the joint shortest description is approximately the sum of the lengths of the separate shortest descriptions.
[7] Note that we haven’t talked about how approximation plays with uniqueness of the partition, if there’s ambiguity about how “approximately unique” different strings are.
[8] We have established that the uniqueness of natural latents in each subset plays well with approximation. Furthermore, note that the uniqueness proof of a natural latent can use any two data points over which the latent is natural, ignoring all others. So the natural latent string itself is in some sense quite robust to classifying a small number of data points differently, as long as there are plenty of points in each class, so we typically expect a little variation in different agents’ partitions to be fine.
[9] We can allow for different Turing machines easily by incorporating into ϵ, ϵ′ the bit-cost to simulate one machine using the other. That’s not a focus in this post, though.
[10] Here we search for a natural latent (in the compression sense) across subsets of data points. In the semantics post, we looked for natural latents (in the Bayesian sense) across features of individual data points.
[11] Note that this has O(1) dependence on K(x).
[12] The O(log K(x)) term is to allow for the possibility that the lengths of the strings need to be passed explicitly. This is a whole thing when dealing with K-complexity. Basically we need to include an O(log K(x)) error term in everything all the time to allow for passing string-lengths. If you really want all the details on that, go read the K-complexity book.