Thanks for sharing this.
Would there be any advantages to substituting brute force search with metaheuristic algorithms like Ant Colony Optimization?
I do think that something could be done. I'm not very familiar with Ant Colony Optimization in particular, and it's not immediately obvious to me how it would be applied here. Specifically, how would it take advantage of the algorithmic complexity if that's all we know beforehand? Often, we're working with infinite parameter spaces where even reasonable subsets will grow exponentially large as we expand them. Is there something like a lazy version of Ant Colony Optimization which can be applied here?
I do think some kind of randomized, learning search would be helpful. I mentioned genetic programming and I think stochastic supercompilation like searches (a la Stoke) would likely work as well.
Introduction
This was originally posted here.
I've been researching, for quite some time, the prospect of machine learning on a wider variety of data types than normally considered; things other than tables of numbers and categories. In particular, I want to do ML for program and proof synthesis which requires, at the very least, learning the structures of trees or graphs which don't come from a differentiable domain. Normal ML algorithms can't handle these; though some recent methods, such as graph neural networks and transformers, can be adapted to this domain with some promising results. However, these methods still rely on differentiation. Is this really required? Are we forever doomed to map all our data onto a differentiable domain if we want to learn with it?
An alternative approach that has been bandied about for a while is the utilization of compression. It's not hard to find articles and talks about the relationship between compression and prediction. If you have a good predictor, then you can compress a sequence into a seed for that predictor and decompress by running said predictor. Going the other way is harder, but, broadly speaking, if you have a sequence that you want to make a prediction on and a good compressor, then whichever addition increases the compressed size the least should be considered the likeliest prediction. This approach is quite broad, applying to any information which can be represented on a computer and not requiring any assumptions whatsoever about the structure of our data beyond that. We could use this idea to, for example, fill in gaps in graphs, trees, sets of input-output pairs, etc.
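To make the prediction-by-compression idea concrete, here's a minimal sketch in Python. It uses zlib purely as a stand-in compressor (zlib is exactly the kind of statistical compressor discussed below, so this only illustrates the mechanics; a serious version would swap in something that approximates K):

```python
import zlib

def compressed_size(data: bytes) -> int:
    # Stand-in for "minimal compressed size"; zlib is only a statistical
    # compressor, so this is just to show the shape of the scheme.
    return len(zlib.compress(data, 9))

def predict_next(history: bytes, candidates) -> bytes:
    # Whichever candidate increases the compressed size the least is
    # treated as the likeliest continuation.
    return min(candidates,
               key=lambda c: compressed_size(history + c) - compressed_size(history))

history = b"abcabcabcabcabc"
print(predict_next(history, [b"a", b"b", b"z"]))
```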
It's important to understand what's actually required here. We don't actually need to compress our training data; we only need a way to estimate the change in minimal-compression-size as we add a prediction. This minimal-compression-size is called the Kolmogorov Complexity, denoted K(X). The minimal-compression-size of a program which outputs X on an input Y is called the Conditional Kolmogorov Complexity, denoted K(X|Y). The topic of Kolmogorov Complexity is quite broad, and I won't explain all its complexities here. A standard introduction is the textbook An Introduction to Kolmogorov Complexity and Its Applications by Ming Li and Paul Vitányi. If we have a good method for calculating K, then we don't need to actually make use of compression.

Making this practical is quite hard and under-researched, and there aren't many papers on the topic. But there is this one: Algorithmic Probability-guided Supervised Machine Learning on Non-differentiable Spaces, which reproduces some standard ML applications using this approach. I want to understand how doing ML this way works, and this post will basically be a collection of the notes I made while reading the paper. If I refer to "the paper", "this paper", etc. in this post, this is what I'm referring to. These notes will digress quite often, and I'm also quite critical of some aspects of the paper. This post was also written somewhat like a stream of consciousness, so I'll often say something which I correct later on. This post isn't intended to merely summarize the paper, but to describe what I learned and thought as I read it. Hopefully, you'll learn stuff too.
Why Not Use Ordinary Compression?
One of the most common suggestions for approximating K(X) is to simply use an already existing compression algorithm. The problem is that most "optimal" compression algorithms, such as arithmetic encoding, are only optimal up to the Shannon entropy of the data. That is, if we assume the data is sampled randomly from a distribution, the best we can do is estimate the shape of this distribution and give shorter encodings to more likely symbols. This is, asymptotically, about the same as counting substring occurrences to reduce redundancy. If our data is actually just randomly sampled, then this is great! But the real world isn't like this. Most real-world data can be construed as an essentially deterministic process with some added noise. Most compression potential comes from modeling this underlying process, not the noise.

Consider the sequence;

1, 2, 3, 4, ..., 1000
This is, obviously, very compressible. An optimal (truly optimal, not Shannon-entropy-optimal) compressor would be able to compress this into a program producing this output. Maybe Range@1000, or something even smaller, depending on what language it's using. But statistical compression will just try to find repetitive substrings. Even if we represent this list in binary and compress it, statistical methods won't be able to compress this much better than a truly random string, since there are few repetitious patterns.

There are lots of natural examples of this: compressing the digits of π, compressing the coordinates of regular geometric figures, compressing a list of positions from a simple physical-system simulation. It's obvious that these can have small algorithmic complexity, that they should be compressible into small programs that generate them, and yet statistical compression methods won't be able to take advantage of this.
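As a quick, hedged illustration of that gap, the sketch below packs the sequence 1..1000 into bytes and compares three sizes: the raw data, what a statistical compressor (zlib, as a stand-in) manages, and the length of a tiny generating program like Range@1000. The exact numbers will vary, but the compressed output stays far larger than the program:

```python
import struct
import zlib

# The sequence 1..1000, packed as 16-bit big-endian integers.
data = b"".join(struct.pack(">H", n) for n in range(1, 1001))

compressed = zlib.compress(data, 9)
program = "Range@1000"  # a ten-character program that regenerates the whole sequence

print(len(data), len(compressed), len(program))
```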
As a consequence, we must use compression methods that do something more sophisticated than statistical compression. Unfortunately, essentially all general-purpose compression methods are statistical in this sense. There are some ML-based methods that probably aren't: a lot of the text-compression algorithms which participated in the Large Text Compression Benchmark use RNNs, which are definitely doing something other than statistical compression.
Much of the paper is dedicated to explaining one method for approximating Kolmogorov Complexity. Kolmogorov Complexity isn't computable, and getting a good approximation is very hard. Some clever methods have been devised, but we can't get a good grasp of it as easily as we can perform statistical compression.
Methods for Approximating Kolmogorov Complexity
We, ultimately, need a way to approximate Kolmogorov complexity. The learning methods themselves should be largely independent of this, but choosing a method is essential for real-world applications. Here are a few methods I've found in the literature;
CTM - Coding Theorem Method
This is the method the paper endorses so I'll talk about it in more detail later on.
The idea is to enumerate all strings of a given length and run them as programs for some chosen model of computation. We collect all the input-output pairs into a database. We then use the coding theorem to justify using this database to estimate K. In particular, if we add together 2^-l(p), where l(p) is the length of p, for all programs p which output X, then -log₂ of that sum gets us an estimate for K(X). This is basically what the coding theorem says, hence the name of the method.
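Here's a minimal sketch of the CTM bookkeeping, under a big assumption: the `run` function below is a toy stand-in for the model of computation (real CTM enumerates small Turing machines), so the numbers it produces mean nothing. It just shows the shape of the database and the coding-theorem estimate:

```python
import math
from collections import defaultdict
from itertools import product

def all_programs(max_len):
    # Enumerate every binary string up to max_len; these are our "programs".
    for length in range(max_len + 1):
        for bits in product("01", repeat=length):
            yield "".join(bits)

def run(program: str) -> str:
    # Toy stand-in model of computation: "running" a program collapses runs of
    # repeated bits. Real CTM would run an enumeration of small Turing machines.
    out = []
    for b in program:
        if not out or out[-1] != b:
            out.append(b)
    return "".join(out)

def ctm_table(max_len):
    # Coding-theorem estimate: K(x) ≈ -log2( Σ 2^-l(p) over programs p outputting x ).
    # Note: with this naive enumeration the weights aren't a proper prefix-free
    # distribution, an issue the post comes back to later.
    mass = defaultdict(float)
    for p in all_programs(max_len):
        mass[run(p)] += 2.0 ** (-len(p))
    return {x: -math.log2(m) for x, m in mass.items()}

print(ctm_table(8).get("01"))
```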
BDM - Block Decomposition Method
This utilizes an existing CTM database to estimate Kolmogorov Complexity. It first tries finding algorithmically compressible substrings using CTM and then uses that information in conjunction with a Shannon-entropy-like calculation to estimate the complexity of the whole string. For small strings, BDM is close to the performance of CTM; for large strings, its average-case performance is close to statistical compression. Many large strings in practice, however, tend to be compressed better than with statistical methods.
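A sketch of the BDM side, assuming the usual formula of summing the CTM values of the distinct blocks plus log₂ of their multiplicities (the worked example later in this post is consistent with that formula). The CTM values here are made-up placeholders; real ones come from a precomputed database:

```python
import math
from collections import Counter

def partition(s: str, block: int, overlap: int):
    # The "partitioning strategy": length-`block` windows, with consecutive
    # windows sharing `overlap` characters.
    step = block - overlap
    return [s[i:i + block] for i in range(0, len(s) - block + 1, step)]

def bdm(s: str, ctm: dict, block: int, overlap: int) -> float:
    # Assumed BDM formula: Σ over distinct blocks r with count n of CTM(r) + log2(n).
    counts = Counter(partition(s, block, overlap))
    return sum(ctm[r] + math.log2(n) for r, n in counts.items())

ctm_db = {"01": 3.33, "10": 3.33}   # placeholder CTM values for tiny blocks
print(bdm("01" * 9, ctm_db, block=2, overlap=0))   # ≈ 3.33 + log2(9)
```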
List Approximation
List Approximation is based on optimizing a simple observation. While generating the smallest program generating X is not computable, generating a list guaranteed to contain the smallest program is. In particular, we can return a list enumerating all strings below and containing X. This will definitely contain the smallest program generating X, but it will be exponentially large in the length of X. How small can this list be?

Short lists with short programs in short time (also an improved version in Short lists for shortest descriptions in short time) show that this list can be made quadratically large (and asymptotically no smaller) in the length of the input while guaranteedly containing the smallest program. This makes searching for K(X) much more practical, as we only need to run a number of programs quadratic in the size of X.

If we are willing to accept only approximating K with a list, we can ensure an O(log(|X|)) penalty to our smallest generating program and make the list linear in the size of X, as shown in Linear list-approximation for short programs.

These methods seem promising, but the algorithms themselves are quite abstruse and some require exponential space, making them impractical. However, improvements may be possible.
It's unclear how much labor is actually saved when using the approximation lists. It may be the case that both the smallest possible representations of programs and everything else in the list require an absurd amount of work to normalize. The list may remove only those programs which were already easy to dismiss when using brute force while exclusively keeping the ones that are hard to assess anyway. The lists may also only have the smallest program which is hard to assess: if there's no second-best approximation to K, then we're stuck having to find the actual smallest value, with no backup if that's impractical to know. Without any practical demonstrations, it's hard to know if these are genuine problems.

Universal Almost Optimal Compression
This method is based on a generic property of compression-decompression pairs. As it turns out, we can, while incurring polylogarithmic overhead in the size of the compressed string, replace a (potentially non-computable) compression algorithm and its decompressor with a pair consisting of an efficient compressor and a potentially inefficient decompressor. By fixing our compressor-decompressor pair to be K and E (the function that simply evaluates a program), we can get a new compression-decompression pair that will compress inputs to a length which differs, at most, polylogarithmically from K. This compressor would not get us smaller, equivalent programs, but, if our goal is simply to approximate the size of a hypothetical Kolmogorov-compressed program, this should work fine.

The basic idea is the following: rather than trying to find the actual smallest compressed string, only generate a small "fingerprint" which would allow you to identify what the smallest compressed string might be. The decompressor then just generates a list of candidates, perhaps exhaustively, and uses the fingerprint to find and run it by brute force.

Depending on how much work we're willing to put into making this fingerprint, we can get it down to a pretty small size. According to the paper, it can be made within K(X) + O(log(X)) with only polynomial effort in the size of the string.

I don't fully understand how this technique works. It's tied up in a lot of the theory that List Approximation uses as well. Its concepts come from the theory of pseudorandomness, something I'll have to become more familiar with.
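Since I can't reproduce the actual construction, here's a toy caricature of the fingerprint idea under my own assumptions: the "fingerprint" is just a length bound plus a short hash, the evaluator is a trivial stand-in language, and decompression is brute force over candidate programs. None of this reflects how the pseudorandomness-based scheme really works; it only shows the division of labor between a cheap compressor and an expensive decompressor:

```python
import hashlib
from itertools import product

def evaluate(program: str) -> str:
    # Stand-in evaluator: a "program" of the form "k*w" outputs w repeated k times.
    k, w = program.split("*")
    return w * int(k)

def compress(x: str, program: str) -> tuple:
    # Ship only a fingerprint: a bound on the program length and a short hash of x.
    return (len(program), hashlib.sha256(x.encode()).hexdigest()[:8])

def decompress(fingerprint: tuple, alphabet="0123456789*ab") -> str:
    # Brute-force candidate programs up to the bound; the fingerprint picks out
    # which candidate's output is the string we wanted.
    bound, digest = fingerprint
    for length in range(1, bound + 1):
        for cand in map("".join, product(alphabet, repeat=length)):
            try:
                out = evaluate(cand)
            except ValueError:
                continue
            if hashlib.sha256(out.encode()).hexdigest()[:8] == digest:
                return out
    raise ValueError("no candidate matched")

fp = compress("ab" * 9, "9*ab")
print(decompress(fp))
```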
Incremental compression
Instead of calculating K(X) all at once, it can usually be done piecemeal. The idea is that, given some input X, we want to find a pair of functions F, D, such that F(D(X)) = X and |F| + |D(X)| < |X|. Specifically, we want to find the smallest F meeting this requirement. The idea is that D(X) reduces the size of X, deleting whatever information is in F from X. F is then that information, isolated from X. By repeating this over and over again, we can decompose X into a series F1(F2(F3(...(R)))), where R is the residual which wasn't compressed. In the limit, R should basically consist of all the random information present in X, while the Fs correspond to algorithmic "features" which can be isolated from X. So long as the Fs are always as small as possible, this construction will approach the actual Kolmogorov complexity.

I think this line of work hints towards a rich theory of "atomic" algorithmic information, but it's not ready for practical applications as of yet. The incremental compression is not computable, but it should be much quicker to approximate, on average, than K(X) while still approaching K(X).
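Here's a toy sketch of a single incremental-compression step, under a drastically simplified reading of my own: the only "feature" it can isolate is "X is some block repeated k times". F is the isolated feature, D(X) is X with that feature deleted, and F(D(X)) = X:

```python
def incremental_step(x: bytes):
    # Look for one algorithmic "feature": x being a single block repeated k times.
    # Returns (F, D) with F(D(x)) == x, or None if no such feature is found.
    for block_len in range(1, len(x) // 2 + 1):
        if len(x) % block_len == 0:
            block, k = x[:block_len], len(x) // block_len
            if block * k == x:
                F = lambda d: d[1] * d[0]                   # reapply the repetition feature
                D = lambda s, k=k, block=block: (k, block)  # the residual after deleting it
                return F, D
    return None

x = b"abcabcabcabcabcabc"
F, D = incremental_step(x)
assert F(D(x)) == x
print(D(x))   # (6, b'abc'): what's left of x once the "repeat" feature is isolated
```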
Higher-order compression
This is a method of compressing lambda expressions by observing a connection between grammar-based compression and lambda binding.
The procedure is very simple. Start with a lambda expression, then:

- Compress the expression using a tree grammar (using re-pair, for instance).
- Convert this tree grammar back into a lambda expression.
- Run a "simplification procedure" on the result.
- Repeat until the expression stops shrinking.
I honestly have a hard time believing this works. I'll have to think about it more carefully. To me, this doesn't seem like it should perform better than statistical compression, but the paper Functional Programs as Compressed Data claims otherwise. I don't buy the argument given in that paper, though, which just seems to argue that optimal compression should be possible in theory; it doesn't even mention the specifics of the algorithm they present. Nonetheless, I want to include this here since it makes a specific and relevant claim. Some follow-up work seems to be doing something more computationally interesting, such as Compaction of Church Numerals for Higher-Order Compression, so a future version of this might be better suited for the task at hand.
Similar grammar-based methods should work for other structured models of computation. For example, using re-pair for graphs as presented in Grammar-Based Graph Compression, a version should be possible for interaction nets.
Approximating Conditional Kolmogorov Complexity using CTM
While those methods allow approximating K(X), we usually actually want to approximate K(X|Y): the amount of information stored in X but not in Y; the amount of information X tells us if we already know Y. The paper tries giving a method for doing this, but the method seems very questionable. It says to do the following;

- Assume a relation M : x → y. Presumably the ys can be enumerated on any input x, but I don't know. M is also described as being a "Turing complete space", in the typical case (e.g. when using CTM). M cannot be both a space and a relation, so clearly "space" is meant in some loose sense, but it's unclear what a "Turing complete space" is supposed to be. I interpret this as meaning that M is supposed to be a relation from programs to outputs which, in the typical case, is a function, not a relation. But this framing implies that M could be broader. Perhaps M may be a relation in the case of a nondeterministic computation model, but this is not expanded upon in the paper.
- Define a set P such that (y, x) ∈ P iff y ∈ M(x). P would be the database in CTM of output-input pairs.
- Estimate K(X|Y) by taking the log₂ of the sum, for all (Y, X) ∈ P, of 1/|P|.
The problems start when we realize that, since P is a set, any pair occurs in P at most once, meaning that this value is log₂(1/|P|) if (Y, X) is in the database, or -∞ if it isn't. This is obviously not what's intended, but I also can't glean from the context what is. Furthermore, in the full expression given in the paper, both X and Y are bound twice, once in defining CTM and once by the sum itself. It seems like the second binder is trying to reference the first, but that makes no sense, syntactically. Alternatively, if we interpret the binders as entirely separate, then CTM does nothing with its arguments and just returns 0 on all inputs (since log₂(|P|/|P|) = 0), which is obviously wrong.

The simplest fix is to simply make P a multiset which may contain multiple copies of any given pair. The calculation should then be something like

CTM(x|y) = log₂( (number of occurrences of (y, x) in P) / |P| )

This may be off, but this is the closest thing to the original paper I could think of. It just calculates the log-likelihood of a random program covered by P outputting x on input y. Except, there are a few problems with this. Firstly, this quantity will always be negative, since log(X) < 0 when 0 < X < 1, and K can never be negative. Even if we fix this, there's still a bigger issue. This CTM definition assumes that all programs are uniformly random, but they aren't. Think of the procedure we'd go through when generating a random program. Assuming our choices are uniformly distributed, the programs we generate won't be. If we assume our programs are binary strings, then we will, at each point, make one of three choices; add a 0, add a 1, or end the string. If each choice is uniformly sampled, then there will be a one-third chance of generating the empty string, a one-in-nine chance of generating 1, and about a one in 60,000 chance of generating 001101001. The chance of generating a program decays exponentially with the length of the program. This observation is built into algorithmic probability, and it's weird that the CTM measure, as described in this paper, ignores that.

Digressing a bit, I feel like the authors may have some over-familiarity with one specific model of computation. One approach to defining M, used by the Complexity Calculator project, is to use an enumeration for Turing machines which, to my knowledge, was originally devised for investigating busy beaver numbers. I believe that the authors are imagining M as a function that enumerates all Turing machines, runs x on them all, and outputs a stream of all the ys that each Turing machine outputs. This will certainly be a function rather than some generic relation, though.

Let's think of what this measure means for other models of computation. If we were using, say, lambda expressions instead, M should enumerate all lambda expressions, take another lambda expression as input, and output the normal forms of the input applied to all possible lambda expressions. This does seem like it makes sense for any model of computation, but I'm not sure it makes sense as a measure of algorithmic similarity.

The justification for this procedure is supposed to come from the coding theorem, which states that K(X) + O(1) = -log₂(m(X)), where

m(X) = Σ_{p ↓ X} 2^(-l(p))

where p ↓ X means p normalizes to X and l(p) is the length of p. There's that exponential decay I was talking about.
Modifying this for lambda expressions,

m(X) = Σ_{l ↓ X} 2^(-I(l))

where l ↓ X means l normalizes to X and I(l) measures the bit information of l, essentially the number of binary decisions made when constructing l.
Incidentally, the length of a binary string doesn't actually give the information content of that string. If a string's length isn't fixed beforehand, then each additional digit incurs one trit of information, since at each stage of the construction we are choosing between one of three options: 0, 1, or stop constructing the string. From this, we can conclude that l(s) = log₃(2^I(s)) - 1; that is, the length of a string is one less than the number of trits in that string. If the string's length is fixed beforehand, if we cannot choose to end the construction of a string at our leisure, then each choice is actually binary and l(s) = I(s).

I think that using I(s) to calculate the information rather than the length is more theoretically correct than the usual expression in terms of length. It doesn't seem to matter too much in the case of strings, because the sum over all s of 2^(-l(s)-1) = the sum over all s of 2^(-I(s)) = 1, so both are valid ways of making a probability distribution over all programs with a similar exponential decay. That -l(p)-1 is there so that the empty string isn't given 100% of the distribution. The real problem is generalizability; the length calculation generally fails to make a coherent distribution if our computation model no longer accepts binary strings as inputs. The information, however, can always be adapted even if our computational model expects programs to be something esoteric, like graphs, as is the case with interaction nets.

As a side note, despite length being theoretically incorrect, it's been used in some papers for measuring the information of a lambda expression; see Computable Variants of AIXI which are More Powerful than AIXItl, for instance. But it seems like the theoretically wrong thing to do, especially since the actual information is so easy to calculate. I think many authors in this field don't think too carefully about the information content of the things they write about, which is quite ironic.
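The trit arithmetic from a couple of paragraphs up is easy to check. Under the three-way choice (0, 1, or stop), a string of length l carries (l + 1)·log₂(3) bits:

```python
import math

def information_bits(s: str) -> float:
    # Each construction step picks one of three options (0, 1, stop),
    # so a string of length l carries (l + 1) * log2(3) bits.
    return (len(s) + 1) * math.log2(3)

print(information_bits(""))           # ≈ 1.58 bits: a one-in-three chance
print(information_bits("1"))          # ≈ 3.17 bits: a one-in-nine chance
print(information_bits("001101001"))  # ≈ 15.85 bits: about one chance in 60,000
```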
The conditional coding theorem states that K(X|Y) + O(1) = -log₂(m(X|Y)), where

m(X|Y) = Σ_{p(Y) ↓ X} 2^(-l(p))
This is definitely not what that CTM measure is approximating. Because the original in the paper is so obviously wrong, and the nature of M is so poorly explained, it's hard to patch it up into whatever the authors intended. In fact, I'm not sure this is actually possible. The conditional coding theorem relies on the length of the program, p, which Y is being fed into. This would require us to incorporate the complexity of the Turing machine itself, but P doesn't store that information.
doesn't store that information.Let me try to offer a more sensible formulation of the CTM idea. Assume a computing function
M : x → y
which simply evaluates an input program into an output using a fixed computational model. LetP
be a finite subset of output-input pairs(y, x)
. The input type should satisfy the smn theorem, so we can format programs likef(x)
; have functions which can have variables substituted into them. For some Turing machines, application is often just done as list concatenation, though, this becomes squirley if we want to represent functions which take multiple arguments, nested function application, etc. For a more well-structured model of computation, such as the lambda calculus, application may be a fundamental operation. Regardless, we can then define our metric as;This would require us to be able to pattern match so as to detect
p(y)
. If application is just concatenation, then this is as simple as looking for the suffixy
, which is pretty trivial.This doesn't much resemble what's in the paper, but it makes much more sense.
The paper mentions that CTM(x), which approximates non-conditional Kolmogorov complexity, can be defined as CTM(x|""), x conditioned on the empty string. Well, actually it says that CTM(""|x) should do this, but that doesn't make any sense. Unclear as it is in the original, it should definitely be CTM(x|"") in my modified version, since it would just be summing over every program p = p ++ "" which outputs x; hence it's eminently compatible with concatenation-as-application. In general, a separate measure would need to be made for other models of computation, since application-as-concatenation doesn't even make sense in general for Turing machines (do you really think application should be associative?), much less other models of computation. More generically, we'd define the measure in terms of whatever notion of application the model provides.

Block Decomposition
In the section on BDM (Block Decomposition Method), the authors keep calling things "tensors", but it never explains what a tensor is. It definitely isn't in the ordinary mathematical sense since the only things described as tensors are just binary strings. The original paper on BDM talks about compressing tensors and vectors. However, neither of those two things are actually compressed in that paper. Instead, it seems like the authors think that any list is a vector and any list-of-lists is a tensor, which is what they actually compress. It's annoying when authors abuse terminology like this; it's just confusing. From that, I think "tensor" just means a 2-dimensional array of bits. Just call them arrays if that's what they are! This is a CS paper, after all.
I have a suspicion that the segment on block decomposition was copy-pasted from somewhere else without any copyediting. Tensors aren't mentioned outside the section on BDM.
Setting that aside, we're trying to approximate K(X|Y) using BDM(X|Y). Assume a fixed "partitioning strategy". The paper never explains what this is, but I read other sources (which I'll talk about later on) which informed me that a "partitioning strategy" is simply a method of splitting up a string into (possibly overlapping) substrings which are already in our CTM database. What BDM tries to do is then devise a "pairing strategy" which minimizes a quantity. The paper doesn't state what a "pairing strategy" is either, and no other source I read clarifies. It only says the following;

- A pairing strategy generates a set P of pairs of the form ((rx, nx), (ry, ny)).
- rx and ry are partitions of X and Y made by our partitioning strategy.
- Any rx occurring in P must only occur once, though there is no similar restriction on ry. This just means that P, treated as a relation, is (non-totally) functional.
- nx and ny are the occurrence counts of rx and ry within the partitionings of X and Y, respectively.

That's all it says on pairing strategies. As far as I can tell from this, a pairing strategy that pairs nothing and is just empty is valid, but I'm pretty sure it's not supposed to be.
Assuming we have an understanding of what additional constraints a pairing strategy should have, we want to find the pairing strategy which minimizes a certain quantity involving the conditional CTM values of the paired blocks and the logs of the counts nx. The minimal value of this quantity will be BDM(X|Y).

This quantity will always be nonnegative, and we can always minimize it to zero by making P empty. This is obviously not intended. It also doesn't make much sense to me that we're taking the log of nx if nx is just the count of rx's rather than something involving the length of rx. And shouldn't that log term scale with the difference between nx and ny in some way? The paper offers no real intuition.

Maybe looking at the original BDM paper can offer clarification. It gives a nice example which I'll reproduce here. Let's say we're applying BDM to the string

010101010101010101
We have a choice of partitioning strategy, but it gives the example of splitting the string into substrings of length 12 which may overlap by, at most, 11 digits. When we do this, we get 3 copies of 101010101010 and 4 copies of 010101010101. According to CTM, both strings have a complexity of 26.99 bits (assuming we're using only 2-state binary Turing machines). This would indicate that the smallest program generating the 12-digit string is, at most, 26 digits long. Also, there are no Turing-complete 2-state binary Turing machines, so this choice seems doubly weird. We then calculate the BDM value as

26.99 + log₂(3) + 26.99 + log₂(4) ≈ 57.57 bits

Okay, but shouldn't it be way smaller? The original string wasn't even twice as long as its partitions, and it should be almost as easy to generate as the partitions. I thought this might be an idiosyncrasy of the specific kind of Turing machine which the paper uses, but the complexity calculator website says almost the same thing, giving the "BDM algorithmic complexity estimation" as 57.5664 bits when we select a block size of 12 with an overlap of 11.
Let's digress a bit and think of K in the lambda calculus. Firstly, we need a way to represent binary strings. We'll just encode these as lists of bits, with the type of bits defined in the obvious way. Strings should have an appropriate elimination rule stating that, for any type family or predicate P over binary strings, we can eliminate any string into P given a case for each constructor. This is essentially the induction rule for binary strings. One form of it, anyway. We can replace that predicate with a polymorphic variable to get our representation. For any particular string, S, we can realize the induction principle by folding over S's constructors.

I'll talk about alternate representations later on, but I think this is among the most natural representations given the mathematical structure of the data type. "Most natural" doesn't necessarily mean "best", though. Unary natural numbers are more natural than binary representations, since pretty much every simple representation of the universal property of ℕ suggests a unary representation. However, they're horribly inefficient.

Using this representation, the original 18-bit string works out to about 192.747 bits of information. That might seem like a lot, but the lambda calculus represents trees of any kind as efficiently as it can represent lists. In this case, the string is a list of bits, which is about as small as can be expected.
However, it can be compressed to a smaller term with about 83.93 bits of info. One of the 12-character sub-partitions can be compressed similarly; for the other, just swap 0 and 1. This has about 83.93 bits; the exact same, in fact, as the full string. The reason why these are the same is that there are 9 repetitions of 01 in the full string and 9 = 3^2, while in the substrings there are 6 repetitions of 01 or 10, and 6 = 2 * 3. The information in multiplication is the same as the information in exponentiation, so the representations end up having the same amount of info. Of course, I don't know if these are actually the smallest possible representations; these are just the smallest I could come up with. They do illustrate my point, however. The two strings should have about the same information; maybe the original should have a little more. It seems extremely suspicious to me that the two strings have such dramatically different bit-counts according to BDM.

This isn't the only way to represent binary strings; we can write the original string using a different encoding instead.
To justify this representation, we need to prove that its type is isomorphic to something satisfying the universal property of binary strings. Here, 1 and 0 are expected to take a function X → X and return another function of the same type; this determines the type of our new representation, and we can rewrite that type using a bit of type algebra. Note that any type of the form ∀ X . (F(X) → X) → X is the (weakly) initial algebra over the endofunctor F. Binary strings are the initial algebra over the endofunctor X ↦ 2 × X + 1, a special case of initiality for lists in general. Continuing the calculation gets us back to the representation we started with. This justifies that the new representation is isomorphic to the old one, and we can comfortably use it interchangeably.
Our new representation has about 90.49 bits of information, while the sub-strings have about 62.63 bits. We can compress our larger string further, down to about 59.858 bits of information, nearly what BDM states the Turing machine representation should have. And the lambda calculus has to represent every tree-like datatype! What's the Turing machine representation doing with all that space below 57 bits if it can't even fit 9 repetitions of 01? My point from before still stands; both strings have similar amounts of algorithmic information. It's suspicious that BDM would say otherwise...
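To make the "nine repetitions of 01" point tangible, here's the same compression trick mimicked with Python lambdas rather than lambda-calculus terms (a loose illustration only; the bit counts quoted above refer to the actual lambda terms, not to this):

```python
# Bit "constructors": prepend a bit to a list of bits.
one  = lambda s: [1] + s
zero = lambda s: [0] + s

compose = lambda f, g: (lambda x: f(g(x)))

def repeat(n, f):
    # Church-numeral-style iteration: apply f n times.
    out = lambda x: x
    for _ in range(n):
        out = compose(f, out)
    return out

# Nine copies of "01", built from a description much smaller than the string itself.
full_string = repeat(9, compose(zero, one))([])
print(full_string)   # [0, 1, 0, 1, ...] — eighteen bits
```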
Okay, I think I figured it out. I was confused about the partitioning strategy. It's up to us to find a good selection of a block size and overlap. Going back to the complexity calculator, if I set the block size to 2 and the overlap to 1, it estimates the complexity to be 12.82 bits. Doing the calculation myself, we have 9 repetitions of 01 and 8 repetitions of 10. The calculation would then be

CTM(01) + log₂(9) + CTM(10) + log₂(8) ≈ 3.33 + 3.17 + 3.33 + 3 ≈ 12.82 bits

Going through all the options in the complexity calculator, the minimal is a block size of 2 with an overlap of 0. This partition only has 9 01s, and nothing else. This gives the calculation as

CTM(01) + log₂(9) ≈ 3.33 + 3.17 ≈ 6.5 bits
The BDM paper states that

BDM(X) = Σ_{(r, n) ∈ Adj(X)} CTM(r) + log₂(n)

where Adj(X) is the set of pairs of substrings with their occurrence counts. The block size of 2 with an overlap of 0 gives an |Adj(X)| of exactly 1, since it only has one partition; Adj(X) = {(01, 9)}. This should, presumably, be as close to the actual Kolmogorov complexity as BDM can get. I suppose that means we're trying to find the partition strategy which minimizes the number of substrings that are covered by the CTM database.

The appendix of the paper has a section on "The Impact of the Partition Strategy". It says the following;
as the number of elements... what? Was this paper not copyedited!
I assume that this doesn't actually rely on our data being tensors (or 2-dimensional arrays).
Hmm... this seems like a big gap in the whole approach.
So this clears up some things about the partitioning strategy, at any rate. But I wasn't worried about the partitioning strategy anyway; I wanted to know what a "pairing strategy" is! The original paper on BDM isn't any help since it doesn't describe conditional BDM at all.
Going back to the topic paper of this post, it does describe a "coarse conditional BDM of X with respect to the tensor Y". Again, tensors are not explained at all in the paper, and it's unclear if Y actually needs to be a tensor in any mathematical sense. As I stated before, I think the authors just mean a 2-dimensional array when they say "tensor", and it seems obvious that the construction doesn't rely on dimensionality at all. It defines BDM(X|Y) in a way that isolates the unique information in X while issuing additional penalties if the information shared between X and Y appears more or less often in X than in Y. I'm not sure if this makes sense. The paper offers a justification, but, intuitively, it seems to me like the sum should only add a penalty if nx > ny; because, otherwise, we're penalizing the conditional complexity of X for content that's in Y but not in X. I'll have to think about this a bit more.

The "coarse" BDM is, I guess, less accurate than the "strong" BDM that I first looked at; but, at least, it makes sense. It's weaker since it doesn't utilize conditional CTM. But without additional clarification on what a "pairing strategy" is, I just can't understand how the strong version works.
I've thought a lot about it and, while I'm not confident, I think I've figured out the two most reasonable fixes.
- If the pairing strategies must cover X, then that solves the specific problem I pointed out.
- Alternatively, P may be supposed to be totally functional over the partitions of X.

Neither of these conditions is hinted at in the paper, but it's the best I've got. The paper does say;
So, at the very least, the pairing strategy is supposed to be determined by some algorithm that isn't described in any detail. I'm frustrated by this whole thing.
Spit-balling ways to improve BDM and CTM
One of the thoughts I had was that there may be program patterns in the computational model that always evaluate uniformly. For example, it may always be the case that 100101X10101Y0, for any strings X and Y, evaluates to 1100101, or maybe to 0Y1010. In either case, we can make the replacement to get a smaller program without running the program. Or maybe it reduces to 01Y0Y01; depending on the length of Y, this may or may not be a reduction in size. And there may be variations on this involving variable substrings being riffled and combined in nontrivial ways.

The CTM database only keeps track of string reductions, but it may be possible to search for patterns and perform block replacement via unification/pattern matching instead. This description was given in terms of first-order unification, but there may be a close link with higher-order unification; unification modulo computation.
This also seems similar to finding smaller extensionally equivalent expressions to certain programs. That is, finding programs that don't normalize to the same term but do normalize to the same term when given the same inputs. Programs that are not the same but are behaviourally indistinguishable. I wrote a program a while ago in Mathematica which enumerated SKI expressions, applied 30 or so variable arguments to them, and collected them into pairs which normalized to the same expression after application. In this way, I built up a database which I used to make expressions smaller by replacing expressions with smaller extensionally equivalent ones. In practice, this ran into some issues. Namely, two extensionally equivalent expressions are not guaranteed to have the same behavior since one may start evaluating after two arguments while the other will evaluate after three. For example, the following two expressions are extensionally equivalent, but they only normalize to the same term after two arguments are applied despite the first fully normalizing after only one application.
This only ceases to be a problem if you have some sort of typing discipline which can allow you to infer the number of arguments an expression expects. You can then assess extensional equivalence up to that number of arguments while also guaranteeing the preservation of expected behavior up to the type of the full expression you're compressing. This, of course, doesn't work on extensional equivalences which require some elimination principle to justify; e.g. the equivalence of merge sort with quicksort. This may be particularly relevant to incremental compression.
Generating programs with specific, even simple, types is highly nontrivial. It's just the proof synthesis problem where you want to enumerate all proofs (encoding programs) of a given theorem (encoding a type). Restricting to certain typing disciplines, such as simple types without intersection types, certain polymorphic typing disciplines, some refinement type disciplines, some disciplines heavily leaning on algebraic data types, and some others, can be searched fairly efficiently, however. The following papers seem particularly relevant;
This may be leverageable for some applications. In fact, I'd guess most applications could leverage this.
It's also worth noting that this whole problem is just the inductive programming problem plus a desire to minimize the induced program's size. Inductive programming is a whole field unto itself. There are algorithms which can exhaustively produce programs which have a particular output sequence.
These approaches can be used as compression methods. They're not guaranteed to approach the Kolmogorov complexity, but they should generally do a much, much better job than statistical compression methods while being much more efficient than methods attempting to approximate K directly.

Consider the possibility of an algorithm, C(X), and a Shannon-optimal compressor, S(X), such that the sum over all X of C(X) - S(X) is negatively infinite. That is to say, C(X) tends to perform infinitely much better than a Shannon-optimal compressor while being no slower. BDM already does this, but is there an algorithm that can do this without keeping track of a large database which takes exponential effort beforehand to calculate?

It's actually fairly easy to outperform Shannon-optimal compressors by infinitely much. Have a compressor do the following: if it detects a string encoding 0, 1, 2, 3, 4, 5, ..., replace it with a flag indicating such and a number indicating how far the sequence counts. For all other strings, replace them with the output of a Shannon-optimal compressor. Such a scheme would generally perform no worse than a Shannon-optimal compressor while performing better by infinitely much; though the benefits clearly only apply to a small number of patterns overall, even if there are infinitely many such patterns. This means that CTM will generally be able to improve by infinitely much by adding entries to its database, but expanding this database takes exponential effort. Is there a way to do better? Is there a way to characterize how far this can go without exponential effort? Even if it doesn't cover as much of program space asymptotically, is there a way to grow the database forever using only, say, linear effort? Or quadratic? Or logarithmic? And could such things be efficiently sewn together so that we discover the easy things first and foremost, for even very large strings, and the hard things later? Plenty of strings are easy to compress algorithmically while being strings on which I wouldn't expect BDM to do much better than statistical compression. I do believe that such things should be efficiently discoverable anyway, but not by CTM or BDM as it stands.
I think we may need to start thinking about the amount of effort we'd be willing to expend during compression. K and its alternatives seem somewhat backward in their definition. K is the minimal size of a program that can become our target if we're willing to expend arbitrarily large effort during decompression. Levin complexity is just K plus the log of the runtime of the smallest program; essentially, Levin complexity is the minimal size of a program that can become our target if we're willing to expend only an exponential amount of effort on decompression. But shouldn't it be the other way around? Shouldn't we care more about the amount of effort we want to put in during compression?

For the purposes of ML, we don't care about decompression at all. What would be better to know is how small a program can be if we're only willing to spend exponential, quadratic, linear, etc. effort with respect to the length of the string we're compressing. Are these problems solvable? I have a suspicion that feedforward NNs are essentially solving a subset of the linear-effort case.
This is an area which has been explored quite a bit; by the paper on Universal almost optimal compression I mentioned earlier, for instance. I would like to explore this area in the future, and I believe it will be of tremendous importance for making compression practical for machine learning.
Here's another idea. This one is original to me, but I wouldn't be surprised if someone came up with something similar. Some models of computation can be run backward, nondeterministically; specifically, models where every state can be reached in one step by only a finite number of transitions. This can't be done effectively with the lambda calculus. If we were in a state λ x . λ f . f x x, that could have been reached in one step by, for example, (λ y . λ x . λ f . f x x) M for any term M whatsoever, and infinitely many other things. This means that running a lambda expression backward implies enumerating infinite possibilities at each step. That doesn't mean running expressions backward is impossible, but it limits the utility of such an approach, since we'd basically be enumerating every lambda expression an infinite number of times at each backward step. The same applies to combinator logic.
Many models of computation, however, don't have this property. Anything where a fixed amount of work is done at each step doesn't; that includes Turing machines, interaction nets, the linear lambda calculus, and most abstract machines. These can all be run backward, as a result. We can then enumerate all the programs which normalize to a particular output by doing the following, assuming we're using an appropriate Turing machine;

- Start with the desired output written on the tape, with the machine in a halting configuration.
- Enumerate, breadth-first, every configuration that could have led to the current one in a single step by running the transition rules backward.
- Whenever a valid starting configuration is reached, record the program written on its tape.

This procedure will look for and find only those programs which normalize to our desired output, ordered by running time. We can keep this going for as long as we want, remembering only the smallest program found so far. The longer we go, the closer our approximation is to the actual shortest program and therefore the actual Kolmogorov complexity. There are also probably heuristics we could apply to prune the search of paths which won't get us anything smaller than what we already have. I don't know how efficient this could be made, but it seems to me that it would do better than BDM on large strings.
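Here's a toy version of that backward search. Instead of a Turing machine, it uses a deliberately simple model of my own choosing: a "program" is a sequence of +1 and ×2 operations applied to 0, and the "output" is the final number. Every state has finitely many predecessors, so we can breadth-first search backward from the output and return the shortest program found (in this toy, program length and running time coincide, which isn't true for real machines):

```python
from collections import deque

def shortest_program(target: int, max_steps: int = 64):
    # Run the model backward from the output: a state n could have been reached
    # by "+1" from n-1, or by "*2" from n/2 when n is even. BFS visits
    # predecessors in order of running time, so the first start state found
    # gives a shortest program.
    queue = deque([(target, [])])
    seen = {target}
    while queue:
        n, trace = queue.popleft()
        if n == 0:
            return list(reversed(trace))      # reached a valid starting configuration
        if len(trace) >= max_steps:
            continue
        for prev, op in [(n - 1, "+1")] + ([(n // 2, "*2")] if n % 2 == 0 else []):
            if prev >= 0 and prev not in seen:
                seen.add(prev)
                queue.append((prev, trace + [op]))
    return None

print(shortest_program(100))   # a shortest program (nine operations for 100)
```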
For parallel models of computation, such as interaction nets, we can optimize this further by treating the (backward) transformations of different segments of the program independently and only combine the timelines when they start interacting.
A further optimization, which would make finding small programs easier but would make finding values verifiably close to K harder, is to iteratively compress programs. We run a program backward only until it shrinks. We then abandon all other search branches and start over with the new, smaller program. Doing this over and over may allow one to effectively find and run recursive programs that generate an output backward much more efficiently than an unpruned search.

Here's another idea. In CTM, we're enumerating all programs and seeing what they output. It may, instead, be better to enumerate programs and then filter them for randomness. Essentially, we'd build up a database by looking at each program in algorithmic order. We'd try building that program from the programs already in the database. If we can't build it using the existing elements in the database in a way that's smaller than what we're trying to build, then the thing we're looking at is algorithmically random and we add it to the database as a random "atom". This should significantly cut down on the combinatorial explosion, though I'm pretty sure it would still be exponential.
Ultimately, I think this whole problem should eat itself. If we're using a universal learning algorithm, then, certainly, it can learn to be better, somehow. Bootstrapping should be possible in this domain.
Algorithmic Loss
Hey, wasn't this post supposed to be about machine learning? Oh ya! Let's finally talk about that!
For any standard ML technique, we need a definition of Loss which we want to minimize. The exact way this will be defined will depend on what our task is. Generally, we'll use some form of conditional Kolmogorov complexity directly as our loss. Similar measures are already used in some applications. Cross-entropy is the most directly related loss, and using algorithmic complexity can be thought of as a more robust form of entropy.
Generally, our loss will try to capture how much of the data isn't captured by our model. We want our loss to answer the question, "given our model, how much of the data still needs to be explained?" To that end, our loss will generally be a function of the training data conditioned on our predictions. Given a data point y and a prediction Y, our loss on that point will be K(y|Y). To get a total loss, we can just add all these measures together.
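A sketch of that loss, with a loud caveat: the conditional-complexity estimate below is just a compressed-size difference using zlib, i.e. exactly the kind of statistical stand-in the earlier sections argue is too weak. A real system would plug in CTM/BDM or some other K approximation at that spot:

```python
import zlib

def k_conditional(y: bytes, prediction: bytes) -> float:
    # Crude stand-in for K(y | prediction): the extra compressed bytes needed
    # for y once the prediction is available.
    return len(zlib.compress(prediction + y, 9)) - len(zlib.compress(prediction, 9))

def algorithmic_loss(data, predictions) -> float:
    # Total loss: the sum of the per-point conditional complexities.
    return sum(k_conditional(y, p) for y, p in zip(data, predictions))

print(algorithmic_loss([b"hello world"], [b"hello worms"]))
```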
But the paper suggests adding the squared losses together. Why should we do this? Why do we do this for normal ML? Well, we can't normally just add the losses together, since they can often be negative. K can never give negative values, so that's not a problem here. Why would we use the mean squared error rather than the mean absolute error? There are two explanations I've seen. Firstly, I've seen some say MSE is easier to differentiate than MAE. This isn't true outside of very simple models where you want to find the optimum in a single step, such as linear regression, and we aren't going to be differentiating K anyway, so this doesn't matter. The other reason comes from an assumption that we're modeling things sampled from a gaussian distribution.

This has always irked me. No matter how many statisticians say "it's reasonable to assume everything is being sampled from a gaussian", that doesn't make it true. If you do any bayesian ML, you'll find that a significant part of the effort spent on standard techniques goes into distribution engineering. If what you're modeling can't be negative, is multi-modal, follows a power law, or a litany of other things, then you're not looking at data sampled from a gaussian and you'll have to use a different distribution instead.
Anyway, let's do the derivation;
Firstly, what we always really want to do is maximize the likelihood. Our model is going to make predictions about the probability of various data. The likelihood of our model is just the product of all the probabilities of each training point as predicted by our model.
In practice, this will usually end up multiplying a bunch of numbers below 1 together, getting a vanishingly small likelihood for models trained on a lot of data. Because of this, we usually use the negative log-likelihood, which is

-Σᵢ log P(xᵢ)

where P(xᵢ) is the probability our model assigns to the i-th data point. This makes all our too-small numbers large without losing any information, so this is usually what real algorithms try to minimize. On a historical note, this trick was often used to make carrying out tough calculations easier; logarithm tables were a hot commodity back before calculators became commonplace. Anyway, we often also divide by the total number of data points to turn this log-likelihood into a mean log-likelihood, that way our loss doesn't become huge just because we're working with a lot of data points.
The equation of a gaussian is;

P(x) = (1/(σ√(2π))) · exp(-(x - μ)²/(2σ²))

The "prediction" made by a Gaussian model will be the mean, μ, and the likelihood of a particular piece of data x will be that data fed into the PDF of a gaussian with the predicted mean. Substituting those changes into our negative log-likelihood, it becomes

Σᵢ (xᵢ - μᵢ)²/(2σ²) + log(σ√(2π))

which is exactly the squared error, modulo some constants we don't care about. And getting the average by dividing by the number of data points will get us the mean squared error, MSE. This should also illustrate that if you don't think it's reasonable to assume your data were randomly sampled from a gaussian distribution, then you should also not think it's reasonable to use the squared error without a similar derivation.
Okay, so, what's the justification for squaring K? Let's think about this: what are the probabilities in our likelihood? Well, they'll be the algorithmic probabilities; the probability that a random program will output the datapoint when given our model's prediction as an input. The coding theorem says exactly that (within an additive constant) the Kolmogorov complexity of a string is the negative logarithm of its algorithmic probability, meaning the appropriate negative log-likelihood is exactly the sum of Ks.

But, wait, I was looking for justification for squaring K. That's what the paper does. Does it say why? Oh, so there is no reason. To be clear, there is a clearly right thing to do here: use the sum of Ks, not squared Ks. We may also want to divide by our number of data points to get the mean K error rather than just the total error. I don't think the authors thought very hard about what the loss should be. For much of the paper this odd, clearly wrong loss function will be used.

The paper goes on to talk about categorical loss. The obvious thing, to me, is to do basically the same thing, and that's what the paper recommends. Assume our model is outputting some object which is being used to predict the class. In classical ML, this would be like the class probabilities before assigning a class. The loss will be K(Y|M(X)), where Y is the actual class and X is our input data. This signifies how much information is in the real class but not in the prediction of our model. If we were using our model for prediction, then the class C which minimizes K(C|M(X)) would be our prediction.

The paper relates this to clustering by algorithmic distance. I don't see the connection, but it does reference another generic version of a standard ML technique; specifically, it points out the paper Clustering by Compression.
Algorithmic Optimization
Great! So, we know how to assess a model; how do we actually do optimization? Prepare to be disappointed, because the answer offered in the paper is "brute search"! Well, the paper phrases it in a more, umm, appealing way. The optimization algorithm is defined as follows;

- Keep track of the minimal cost found so far, minCost, which is initially set to infinity.
- Keep track of the best parameters found so far, param, which is, I guess, initially set to null, or something.
- Enumerate candidate parameters in order of increasing algorithmic complexity (approximated by K).
- For each candidate, calculate the cost; if it's less than minCost, then set minCost to this value and set param to the parameters being used.

That's it. However, there are some complications that make me not entirely sure if this is right. It defines the cost function (the sum of squared Ks) to be J_a(ˆX, M), where ˆX is our dataset and M is the model we're assessing. However, the actual description of the algorithm tells us to use J_a(ˆM, σ_i), where σ_i is the ith parameter in our parameter stream and ˆM is never defined. Presumably, ˆM has something to do with our model, but it can't possibly be a replacement for ˆX, since ˆX is just a collection of input-output pairs and our model is, obviously, not. Either there's an entirely separate loss function over models and parameters which is never defined, or there was a typo and the algorithm should have said J_a(ˆX, M{σ_i}), or something like that. The version I wrote seems pretty intuitive (if overly simplistic), so I'm leaning toward the latter.

The paper states that this algorithm "can ['minimize K(M) and minimize the cost function'] in an efficient amount of time". But, uh, no it doesn't. It does as badly as a brute-force search because it is a brute-force search. It goes on to say... Ya think? I'm not confident that this would work on anything but toy problems, but, who knows, maybe I'm wrong and this actually works surprisingly well on real-world data. I doubt it, though. The algorithm doesn't even try taking advantage of what it knows about good parameters.
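For what it's worth, here's the brute search as I've reconstructed it above, in runnable form. The parameter enumeration and the cost function are both stand-ins of mine (integer pairs ordered by total magnitude as a crude complexity proxy, and an absolute-error cost for a toy linear model); the paper would use its K-ordered parameter stream and the algorithmic loss J_a instead:

```python
import itertools

def parameter_stream():
    # Stand-in enumeration "from least to most complex": integer pairs (a, b)
    # ordered by |a| + |b|, as a crude proxy for ordering by K.
    for total in itertools.count(0):
        for a in range(-total, total + 1):
            b_abs = total - abs(a)
            for b in {b_abs, -b_abs}:
                yield (a, b)

def cost(params, data) -> float:
    # Placeholder cost for a toy linear model y = a*x + b.
    a, b = params
    return sum(abs(y - (a * x + b)) for x, y in data)

def brute_search(data, budget=10000):
    min_cost, best = float("inf"), None
    for params in itertools.islice(parameter_stream(), budget):
        c = cost(params, data)
        if c < min_cost:
            min_cost, best = c, params
    return best, min_cost

data = [(0, 1), (1, 3), (2, 5)]    # points on y = 2x + 1
print(brute_search(data))          # ((2, 1), 0)
```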
As an aside, the paper mentions that evolutionary methods can also be used for this kind of optimization, giving a few citations. This seems more reasonable to me, as such methods aren't just brute-force searching.
This will be a bit of a digression, but if you read this far you probably don't care about that. The first example it uses is a regression problem on two variables. It says the following on the ability to enumerate the parameter space;
This entire paragraph is rather head-scratching. Computers certainly can fully represent a real number. We can figure out how by following the same basic procedure I used before to figure out how to encode binary strings. You just state a universal property of the mathematical object you want to represent and derive a type of realizers. This is a bit squirrelly with the real numbers, as the exact universal properties diverge in constructive settings; Dedekind reals and Cauchy reals aren't isomorphic anymore, for instance. There are also practical questions about how to make calculating with them as easy as possible. That being said, the simplest universal property for any kind of real number I'm aware of is the following: the (nonnegative) real numbers are the final coalgebra of the endofunctor X ↦ ℕ × X. There are a few places that say this in various guises; the most direct is On coalgebra of real numbers, which is all about this observation. This basically says that a real number is an infinite stream of natural numbers. There are a few ways of viewing what such a stream represents, and that will largely determine whether you see each number as representing a nonnegative real or something isomorphic, like something in [0, 1). For the latter, we can read each natural number as describing how many 1s we encounter before encountering a 0 in the binary expansion of the number. Of course, we can map this back onto the non-negative reals by interpreting each number x as 1/(1-x) - 1 instead. Alternatively, we can imagine each entry in a sequence [a0, a1, a2, ...] as representing a number as a simple continued fraction of the form

a0 + 1/(a1 + 1/(a2 + 1/(a3 + ...)))

This can represent any nonnegative real number. Whichever interpretation we use will determine how we define, for instance, addition, multiplication, etc. Note that there are some complications with representing real numbers as continued fractions in this way; notably, some numbers don't have unique representations. While I understand these caveats, I don't understand how to solve them, though I've been told that such solutions are "various".
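Here's a small check of the continued-fraction reading: a real is a stream of naturals, given below as a Python function from index to term, and we evaluate a finite truncation of a0 + 1/(a1 + 1/(a2 + ...)). The identity stream [0, 1, 2, 3, ...] gives the constant that comes up again further down:

```python
from fractions import Fraction

def cf_value(term, depth: int = 25) -> Fraction:
    # Evaluate a0 + 1/(a1 + 1/(a2 + ...)), truncated after `depth` terms,
    # where term(i) is the i-th natural number in the stream.
    value = Fraction(term(depth - 1))
    for i in range(depth - 2, -1, -1):
        tail = Fraction(1, 1) / value if value else Fraction(0)
        value = term(i) + tail
    return value

print(float(cf_value(lambda i: i)))   # ≈ 0.6978, the [0; 1, 2, 3, ...] continued fraction
print(float(cf_value(lambda i: 0)))   # 0.0: the constant-zero stream
```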
Following Recursive types for free!, the (weakly) final coalgebra of X ↦ ℕ × X can simply be defined as ∃ X . (X → ℕ × X) × X. We can construct a real number by fixing a type, X, giving an X as a seed, and then defining a method of generating new digits and new seeds from an old seed. For example, we can construct an infinite stream of zeros by setting X to be ⊤, the unit type, giving •, the only inhabitant of ⊤, as our seed, and defining our generator as λ x . (0, •). Writing this out in full, with all our encodings made explicit, the real number zero ends up having, at most, about 32.511 bits of complexity; surprisingly small for something which is supposedly infinitely large.
The usual reason we'd want to use floating-point numbers over exact real numbers is efficiency; floating-point numbers are much faster to compute with since our computers have hardware dedicated to computing with them. But this approach is representing the parameters as raw outputs of some virtual computer anyway, so that doesn't apply here. To use floating points, we'd have to convert them to some encoding of floats in our computational model. We get no efficiency in using floats here.
We can make our representation a bit more efficient. Following the Coinductive function spaces page, we can use a few isomorphisms to change this representation. Notably, it's generally the case that the final coalgebra of X ↦ A × X (the type of streams of As) is isomorphic to (μ X . 1 + X) → A for any A. Since ℕ is the initial algebra over the endofunctor X ↦ 1 + X, the above can be rewritten as ℕ → A. So we can redefine the nonnegative reals as just functions from ℕ → ℕ. Neat! Following this, we can define zero instead as λ n . 0, the constant function which just returns 0 for any input. This has only about 6.9 bits! Not bad for representing infinitely many digits. It would actually be MORE complex if we truncated it to finitely many digits. We may even notice that the least complex real number will simply be encoded by the identity function. This will be the number whose continued fraction is [0, 1, 2, 3, 4, ...]. As it turns out, this number is I₁(2)/I₀(2), where the Is are Bessel I functions. This is called the "Continued Fraction Constant". Or, if we were interpreting the number as representing a binary expansion, it would be a constant expressible in terms of an elliptic theta function ϑ; I don't know if this constant has a name, and I couldn't find it anywhere. When using our previous map to turn this into the full positive reals, this becomes ≈ 0.558524.
Anyway, my whole point with this exercise was to show we can represent real numbers, and many other mathematical structures besides, just fine on computers. We don't, and, in fact, shouldn't use floating-point numbers if we're going to take algorithmic complexity seriously. The original BDM paper mentions π a few times, so the authors know that π (and presumably other real numbers), in all of its infinite digits, can be represented by an algorithm. But in this paper, they insist on using floating-point numbers. Why? The paper just says that the parameter space becomes enumerable, but we can effectively enumerate the (constructive) reals by enumerating all the inhabitants of the type ℕ → ℕ; the output of such a procedure, if it were done in some breadth-first manner, would give encodings of real numbers in essentially algorithmic order.
I think the authors need a crash course in higher-type computability. There's a whole wide world that you're missing out on if you really believe computers can only represent discrete data types.
Final Thoughts
The rest of the paper just goes through some usage examples. I don't feel the need to summarize them, but you may find them interesting to look at. The cellular automata classifier was a particularly good illustration. In the "hybrid machine learning" section, an interesting suggestion is to use K as a regularization method on top of another loss function. The same section also suggests giving training weights to simpler samples from a given space so that a model prioritizes the most plausible samples. I don't know how effective these would be, but they're interesting suggestions which may be useful as additional tools in the ML toolbox.

The conclusion section refers to the whole framework as a "symbolic inference engine" being integrated into traditional ML. I... wouldn't phrase it that way. There's not much inference and even less symbology. That being said, a type-sensitive version of these ideas might be better.
I'm not satisfied with the methods presented in this paper. Nothing in it connected back to that idea of "whichever addition increases the compressed size the least should be considered the likeliest prediction" I mentioned at the beginning of this post. All the paper said, really, was "when tuning parameters, just look through them in order from least to most complex. Also, use Kolmogorov complexity, not Shannon entropy, to measure complexity." I'll keep that in mind, but I think I'd want to look at other methods. I think something more carefully designed will need to be made for practical ML.
I took a closer look at those evolutionary methods mentioned. Specifically, I read;
I cannot recommend that first source. It's a very fluffy soft pop-sci book. It does contain the basic idea of algorithmic evolution, though. Chaitin calls it "metabiology", but I don't like that term, so I'll call it "algorithmic evolution". It's basically the following: evolve programs by applying random mutations, where the mutations are sampled according to their algorithmic probability, so that simpler mutations are proportionally more likely.

That's literally it. The idea, I suppose, is to cut down on algorithmically unlikely mutations; biasing the search in a way which is helpful to tasks which are not, themselves, essentially random. That, of course, would be just about any task in practice. According to Chaitin, that, alone, is enough to speed up evolutionary convergence to a mere quadratic search. The book offers no reasoning for this, it just asserts it, and I don't know where I'd need to go to find such justification. Chaitin tries offering intuition for it, but it's bonkers nonsense. He points out that an oracle would be needed to avoid non-terminating programs. He then states that each usage of this oracle would add one bit of "creativity" to our search, and that's where the power comes from. Utter drivel.
If this idea is valid, then some kind of type-driven mutator synthesis could be done to speed up the process. We'd essentially specify the type of our genome as, likely, an AST describing some program of a fixed type. The mutators would be endofunctions over this type of ASTs. A type-driven synthesizer would be able to fully search mutation space while avoiding either nonsensical or non-terminating mutators. Generating Constrained Test Data using Datatype Generic Programming would be particularly relevant.
The paper implements a strange version of this idea, where a scan is made over a data structure and a CTM database is used to estimate the algorithmic probability of any given mutation. Mutations are then sampled according to these probabilities and then applied. Here's the thing: we don't need to know the algorithmic probabilities of outcomes to sample them according to those probabilities. We can just randomly generate programs and the sampling will automatically be done according to the algorithmic probability. It's not clear why they didn't just do that, but, whatever. The paper does demonstrate a clear speedup visible by simply looking at the graph results on the tests they did; so that's something.
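The "just generate random programs" point is easy to demonstrate: sampling bit-string mutations by the same 0/1/stop process described earlier automatically gives shorter (algorithmically more probable) mutations exponentially more weight, with no CTM lookups at all. How a sampled bit string gets applied as a mutation is left abstract here:

```python
import random
from collections import Counter

def random_program() -> str:
    # At each step choose 0, 1, or stop, uniformly. Short strings come out
    # exponentially more often, which is the algorithmic-probability bias
    # we want for mutations; no CTM database is needed to sample this way.
    bits = []
    while True:
        choice = random.choice(["0", "1", "stop"])
        if choice == "stop":
            return "".join(bits)
        bits.append(choice)

samples = Counter(random_program() for _ in range(100_000))
print(samples.most_common(5))   # the empty string and the single bits dominate
```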
I have a strong suspicion that some ideas in stochastic superoptimization could also be combined with algorithmic evolution. In particular, the methods used to justify how densely explored some space was.
I also suspect that some variation of a bandit algorithm may benefit from these ideas. The paper Information-gain computation and its followups described a hypothetical system which uses a contextual bandit for exploring a logic-program-like search space. The expert that the bandit uses simply calculates the Kullback–Leibler divergence, a statistical entropy metric, of the history for each branch to give recommendations. The idea is that desirable histories should maximize the information gained about our goal as we go down them. A better system might use a K approximation rather than an entropy measure. Though, the paper also suggests using an RNN to do the history compression, which I'd expect to give a lower value than the entropy, since it would be able to exploit the temporal structure of the history for compression in a way that statistical compression wouldn't be able to do...
THE END