The intersection of machine learning and generalizations of Laver tables seems to have a lot of potential, so your question about this is extremely interesting to me.
Machine learning models from generalized Laver tables?
I have not been able to find any machine learning models that are based on generalized Laver tables and that are used for solving problems unrelated to Laver tables. I currently do not directly see any new kinds of deep learning models that one can produce from generalizations of Laver tables, but I would not be surprised if there were a non-standard machine learning model based on Laver-like algebras that we could use for something like NLP.
But I have to be skeptical about applying machine learning to generalized Laver tables. Generalizations of Laver tables have intricate structure and complexity, but this complexity is ordered. Furthermore, there is a very large supply of Laver-like algebras (even with 2 generators) for potentially any situation. Laver-like algebras are also very easy to compute. While backtracking may potentially take an exponential amount of time, when I have computed Laver-like algebras, the backtracking algorithm seems to always terminate quickly unless it is producing a large quantity of Laver-like algebras or larger Laver-like algebras. But with all this being said, Laver-like algebras seem to lack the steerability that we need to apply these algebras to solving real-world problems. For example, try interpreting (in a model-theoretic sense) modular arithmetic modulo 13 in a Laver-like algebra. That is quite hard to do because Laver-like algebras are not built for that sort of thing.
Here are a couple of avenues that I see as most promising for producing new machine learning algorithms from Laver-like algebras and other structures.
Let $X$ be a set, and let $*$ be a binary operation on $X$ that satisfies the self-distributivity identity $x*(y*z)=(x*y)*(x*z)$. Define the right powers $x^{[n]}$ for $n\geq 1$ inductively by setting $x^{[1]}=x$ and $x^{[n+1]}=x*x^{[n]}$. We say that $(X,*,1)$ is a reduced nilpotent self-distributive algebra if $(X,*)$ satisfies the identities $x*1=1$ and $1*x=x$ for $x\in X$ and if for all $x\in X$ there is an $n$ with $x^{[n]}=1$. A reduced Laver-like algebra is a reduced nilpotent self-distributive algebra $(X,*,1)$ where, whenever $x_n\in X$ for each $n\in\omega$, there is some $N$ with $x_0*\dots*x_N=1$. Here we make the convention to put the implied parentheses on the left so that $x_0*\dots*x_N=(\dots((x_0*x_1)*x_2)*\dots)*x_N$.
Reduced nilpotent self-distributive algebras have most of the things that one would expect. Reduced nilpotent self-distributive algebras are often equipped with a composition operation. Reduced nilpotent self-distributive algebras have a notion of a critical point, and if our reduced nilpotent self-distributive algebra is endowed with a composition operation, the set of all critical points in the algebra forms a Heyting algebra. If $(X,*)$ is a nilpotent self-distributive algebra, then we can define iterated applications $x^{n}*y$ (with $x^{0}*y=y$ and $x^{n+1}*y=x*(x^{n}*y)$) and the limit $x^{\infty}*y=\lim_{n\rightarrow\infty}x^{n}*y$ in the discrete topology (this limit always exists for nilpotent self-distributive algebras). We define $\mathrm{crit}(x)\leq\mathrm{crit}(y)$ precisely when $x^{\infty}*y=1$, and $\mathrm{crit}(x)=\mathrm{crit}(y)$ precisely when $\mathrm{crit}(x)\leq\mathrm{crit}(y)$ and $\mathrm{crit}(y)\leq\mathrm{crit}(x)$. We can define $\mathrm{crit}(x)$ to be the equivalence class of $x$ with respect to this equivalence relation, and the operations on the set of critical points are the ones induced by $*$ and by the composition operation whenever we have our composition operation. One can use computer calculations to add another critical point to a reduced nilpotent self-distributive algebra and obtain a new reduced nilpotent self-distributive algebra with the same number of generators but with another critical point on top (and this new self-distributive algebra will be subdirectly irreducible). Therefore, by taking subdirect products and adding new critical points, we have an endless supply of reduced nilpotent self-distributive algebras with only 2 generators. I also know how to expand reduced nilpotent self-distributive algebras horizontally in the sense that, given a finite reduced nilpotent self-distributive algebra $X$ that is not a Laver-like algebra, we can obtain another finite reduced nilpotent self-distributive algebra $Y$ where $X$ and $Y$ are both generated by the same number of elements and have the same implication algebra of critical points, but where there is a surjective homomorphism from $Y$ onto $X$ that is not injective. The point is that we have techniques for producing new nilpotent self-distributive algebras from old ones, and for going deep into those new nilpotent self-distributive algebras.
Since reduced nilpotent self-distributive algebras are closed under finite subdirect products, subalgebras, and quotients, and since we have techniques for producing many nilpotent self-distributive algebras, perhaps one can make an ML model from these algebraic structures. On the other hand, reduced nilpotent self-distributive algebras seem to have less of a connection to large cardinals and non-commutative polynomials than Laver-like algebras do, so perhaps it is sufficient to stick with Laver-like algebras as a platform for constructing ML models.
From Laver-like algebras, we can also produce non-commutative polynomials. If $X$ is a Laver-like algebra and $A$ is a generating set for $X$, then for each non-maximal critical point $\alpha$, we can define a non-commutative polynomial $p_\alpha$ to be the sum of all non-commutative monomials of the form $x_{a_1}\cdots x_{a_n}$ where $a_1,\dots,a_n\in A$ and the critical point of $a_1*\dots*a_n$ lies above $\alpha$, but where the critical point of $a_1*\dots*a_k$ does not lie above $\alpha$ for $k<n$. These non-commutative polynomials capture all the information behind the Laver-like algebras since one can reconstruct the entire Laver-like algebra up to critical equivalence from these non-commutative polynomials. These non-commutative polynomials don't work as well for nilpotent self-distributive algebras; for nilpotent self-distributive algebras, these non-commutative polynomials will instead be rational expressions, and one will not be able to recover the entire nilpotent self-distributive algebra from these rational expressions.
I have used these non-commutative polynomials to construct an infinite product formula where the polynomials are obtained from different non-trivial rank-into-rank embeddings $j:V_\lambda\rightarrow V_\lambda$. This means that the non-commutative ring operations in the rings of non-commutative polynomials are meaningful for Laver-like algebras.
If I wanted to interpret and make sense of a Laver-like algebra, I would use these non-commutative polynomials. Since non-commutative polynomials are closely related to strings (for NLP) and since the variables in these non-commutative polynomials can be substituted with matrices, I would not be surprised if one can turn Laver-like algebras into a full ML model using these non-commutative polynomials in order to solve a problem unrelated to Laver-like algebras.
Perhaps, one can also use strong large cardinal hypotheses and logic to produce ML models from Laver-like algebras. Since the existence of a non-trivial rank-into-rank embedding is among the strongest of all large cardinal hypotheses, and since one can produce useful theorems about natural looking finite algebraic structures from these large cardinal hypotheses, perhaps the interplay between the logic of large cardinal hypotheses and finite algebraic structures may be able to produce ML models? For example, one can automate the process of using elementarity to prove the existence of finite algebraic structures. After one has proven the existence of these structures using large cardinal hypotheses, one can do a search using backtracking in order to actually find candidates for these algebraic structures (or, if the backtracking algorithm is exhaustive and turns up nothing, then we have a contradiction in the large cardinal hierarchy or a programming error; mathematicians would need to expend a lot of effort to find an inconsistency in the large cardinal hierarchy this way, and I am confident that they won't find such an inconsistency but will instead find plenty of near misses). We can automate the process of producing examples of Laver-like algebras that satisfy conditions that were first established using large cardinal hypotheses, and we can perhaps completely describe these Laver-like algebras (up to critical equivalence) using our exhaustive backtracking process. This approach can be used to produce a lot of data about rank-into-rank embeddings, but I do not see a clear path that allows us to apply this data outside of set theory and Laver-like algebras.
Using deep learning to investigate generalized Laver tables
We can certainly apply existing deep learning and machine learning techniques to classical and multigenic Laver tables.
The $n$-th classical Laver table is the unique self-distributive algebraic structure $A_n=(\{1,\dots,2^n\},*_n)$ where $x*_n 1=x+1\bmod 2^n$ whenever $x\in\{1,\dots,2^n\}$. The classical Laver tables are up-to-isomorphism the only nilpotent self-distributive algebras generated by a single element.
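To make this concrete, here is a minimal sketch (my own code, not anything from the original computations) that builds $A_n$ directly from this characterization, using the standard facts that the row of $2^n$ is the identity and that $x*_n y>x$ whenever $x<2^n$:

```python
def classical_laver_table(n):
    """Compute the n-th classical Laver table A_n on {1, ..., 2**n}."""
    N = 2 ** n
    table = [[0] * (N + 1) for _ in range(N + 1)]   # 1-indexed: table[x][y] = x * y
    for y in range(1, N + 1):
        table[N][y] = y                              # the last row is 1, 2, ..., 2**n
    for x in range(N - 1, 0, -1):
        table[x][1] = x + 1                          # x * 1 = x + 1
        for y in range(1, N):
            # x * (y + 1) = (x * y) * (x + 1); the row of x * y (> x) is already known
            table[x][y + 1] = table[table[x][y]][x + 1]
    return table

A2 = classical_laver_table(2)
assert [A2[1][y] for y in range(1, 5)] == [2, 4, 2, 4]   # the row of 1 in A_2
```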
Problem: Compute the $n$-th classical Laver table $A_n$ for $n$ as large as we can achieve.
Solution: Randall Dougherty in his 1995 paper Critical points in an algebra of elementary embeddings II gave an algorithm for computing the 48-th classical Laver table $A_{48}$, and with machine learning, I have personally gotten further than that (I think I have a couple of errors, but those are not too significant), and I have part of the computation of the next classical Laver table as well.
Dougherty showed that one can easily recover the entire multiplication table of a classical Laver table from a much smaller restriction of it together with an associated threshold function, so the classical Laver table can be recovered from that smaller table and the threshold function. To go beyond Dougherty's calculation of $A_{48}$, we exploit a couple of empirical observations about the threshold function: in most cases it takes its default value, and in most of the remaining cases it takes one of a few special forms. One can then easily compute the classical Laver table from this data by using Dougherty's algorithm, so the only thing that we need to do is compute the threshold data in the first place. It turns out that this computation is quite easy except for inputs of a couple of exceptional forms. For those exceptional inputs, we initially set a candidate value and then repeatedly modify it until we can no longer find any contradiction. This computation essentially amounts to finding new elements of a certain set, and we find new elements of that set using some form of machine learning.
In my calculation, I used operations like bitwise AND, OR, and the bitwise majority operation to combine old elements of this set in order to find new elements of it, and I also employed other techniques similar to this. I did not use any neural networks, though using neural networks seems to be a reasonable strategy, provided that we use the right kinds of neural networks.
My idea is to use something similar to a CNN where the linear layers are not as general as possible, but are instead structured in a way that reduces the number of weights and is compatible with the structure in the data. The elements of our set belong to $\{0,1\}^N$ for some $N$ that is a power of two, say $N=2^m$, and $\mathbb{R}^N\cong(\mathbb{R}^2)^{\otimes m}$ has a natural tensor product structure. One can therefore consider each element of our set to be an element of $(\mathbb{R}^2)^{\otimes m}$, and a tuple of such elements can be considered as an element of a larger tensor product. This means that the linear layers from $(\mathbb{R}^2)^{\otimes m}$ to $(\mathbb{R}^2)^{\otimes m}$ should be Kronecker products or Kronecker sums (it seems like Kronecker sums with residual layers would work better than Kronecker products). Recall that the Kronecker product $A\otimes B$ and the Kronecker sum $A\oplus B$ of square matrices are defined by setting $(A\otimes B)(u\otimes v)=(Au)\otimes(Bv)$ and $A\oplus B=A\otimes I+I\otimes B$ respectively. Perhaps neural networks can be used to generate new elements of our set. These generative neural networks do not have to be too large nor do they have to be too accurate. A neural network that is 50 percent accurate will be better than a neural network that is 90 percent accurate but takes 10 times as long to compute and is much harder to retrain once most of the elements have already been obtained and the neural network has trouble finding new elements. I also favor smaller neural networks because one can already get quite far without using complicated techniques in the first place. I still need to figure out the rest of the details for the generative neural network that finds new elements, since I want it to be simple and easy to compute so that it is competitive with things like bitwise operations.
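As a rough illustration of the kind of structured linear layer I have in mind (this is my own sketch, not the network I actually used), the following applies a Kronecker-sum map $M_1\oplus\dots\oplus M_m$ to a vector of length $2^m$ without ever forming the full $2^m\times 2^m$ matrix:

```python
import numpy as np

def kronecker_sum_layer(x, factor_mats):
    """Apply M_1 (+) M_2 (+) ... (+) M_m to a vector of length 2**m.

    Each 2x2 matrix acts on its own tensor factor of (R^2)^(x m) and the
    results are summed, so the layer has only 4*m weights instead of 4**m.
    """
    m = len(factor_mats)
    t = x.reshape((2,) * m)           # view the input as an m-fold tensor
    out = np.zeros_like(t)
    for k, M in enumerate(factor_mats):
        # apply M to the k-th tensor factor only
        out += np.moveaxis(np.tensordot(M, t, axes=([1], [k])), 0, k)
    return out.reshape(-1)

rng = np.random.default_rng(0)
m = 10                                # vectors of length 2**10 = 1024
mats = [rng.standard_normal((2, 2)) for _ in range(m)]
x = rng.standard_normal(2 ** m)
y = kronecker_sum_layer(x, mats)
```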
Problem: Translate Laver-like algebras into data (like sequences of vectors or matrices) in a form that machine learning models can work with.
Solution: The non-commutative polynomials are already in a form that is easier for machine learning algorithms to use. I was able to apply an algorithm that I originally developed for analyzing block ciphers in order to turn the non-commutative polynomials into unique sequences of real matrices (these real matrices do not seem to depend on the initialization since the gradient ascent always seems to converge to the same local maximum). One can then feed this sequence of real matrices into a transformer to solve whatever problem one wants to solve about Laver-like algebras. One can also represent a Laver-like algebra as a 2-dimensional image and then pass that image through a CNN to learn about the Laver-like algebra. This technique may only be used for Laver-like algebras that are small enough for CNNs to fully see, so I do not recommend CNNs for Laver-like algebras, but this is at least a proof-of-concept.
Problem: Estimate the number of small enough Laver-like algebras (up-to-critical equivalence) that satisfy certain properties.
Solution: The set of all multigenic Laver tables over a fixed alphabet forms a rooted tree. We can therefore apply the technique for estimating the number of nodes in a rooted tree described in the 1975 paper Estimating the Efficiency of Backtrack Programs by Donald Knuth. This estimator is unbiased (the mean of the estimated value is actually the number of nodes in the tree), but it will have a very high variance for estimating the number of multigenic Laver tables. In order to reduce the variance of this estimator, we need to assign better probabilities when finding a random path from the root to the leaves, and we should be able to use transformers to assign these probabilities.
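Here is a hedged sketch of Knuth's estimator, with a pluggable probability assignment (the hypothetical `weights_fn`) standing in for the transformer that would assign better branching probabilities; the estimator stays unbiased for any choice of positive probabilities, and better probabilities only reduce the variance:

```python
import random

def knuth_estimate(root, children, weights_fn=None, trials=1000):
    """Knuth's unbiased estimator of the number of nodes in a rooted tree."""
    total = 0.0
    for _ in range(trials):
        node, weight, estimate = root, 1.0, 1.0
        while True:
            kids = children(node)
            if not kids:
                break
            if weights_fn is None:
                probs = [1.0 / len(kids)] * len(kids)      # uniform branching
            else:
                w = weights_fn(node, kids)                  # e.g. a learned model
                s = sum(w)
                probs = [wi / s for wi in w]
            i = random.choices(range(len(kids)), weights=probs)[0]
            weight /= probs[i]          # importance weight of the chosen branch
            estimate += weight          # contributes the estimated node count at this level
            node = kids[i]
        total += estimate
    return total / trials

# Example: a complete binary tree of depth 10 has 2**11 - 1 = 2047 nodes.
def children(node):
    depth, index = node
    return [] if depth == 10 else [(depth + 1, 2 * index), (depth + 1, 2 * index + 1)]

print(knuth_estimate((0, 0), children))   # should hover around 2047
```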
I am pretty sure that there are plenty of other ways to use machine learning to investigate Laver-like algebras.
Now, nilpotent self-distributive algebras may be applicable in cryptography (they seem like suitable platforms for the Kalka-Teicher key exchange but nobody has researched this). Perhaps a transformer that learns about Laver-like algebras would do better on some unrelated task than a transformer that was never trained using Laver-like algebras. But I do not see any other practical applications of nilpotent Laver-like algebras at the moment. This means that Laver-like algebras are currently a relatively safe problem for investigating machine learning algorithms, so Laver-like algebras are perhaps applicable to AI safety since I currently do not see any way of misaligning an AI for solving a problem related to Laver-like algebras.
This post gives an example of some calculations that I did using my own machine learning algorithm. These calculations work out nicely which indicates that the machine learning algorithm I am using is interpretable (and it is much more interpretable than any neural network would be). These calculations show that one can begin with old mathematical structures and produce new mathematical structures, and it seems feasible to completely automate this process to continue to produce more mathematical structures. The machine learning models that I use are linear, but it seems like we can get highly non-trivial results simply by iterating the procedure of obtaining new structures from old using machine learning.
I made a similar post to this one about 7 months ago, but I decided to revisit this experiment with more general algorithms and I have obtained experimental results which I think look nice.
To illustrate how this works, we start off with the octonions. The octonions consist of an 8-dimensional real inner product space $\mathbb{O}$ together with a bilinear operation $*$ and a unit $1$ where $1*x=x*1=x$ for all $x\in\mathbb{O}$ and where $\|x*y\|=\|x\|\cdot\|y\|$ for all $x,y\in\mathbb{O}$. The octonions are uniquely determined up to isomorphism from these properties. The operation $*$ is non-associative, but the algebra $\mathbb{O}$ is closely related to the quaternions and complex numbers. If we take a single element $x$ of $\mathbb{O}$ that is not a real multiple of $1$, then $x$ generates a subalgebra of $\mathbb{O}$ isomorphic to the field of complex numbers, and if $x$ and $y$ are imaginary octonions that are linearly independent, then $\{1,x,y,x*y\}$ spans a subalgebra of $\mathbb{O}$ isomorphic to the division ring of quaternions. For this reason, one commonly thinks of the octonions as the best way to extend the division ring of quaternions to a larger algebraic structure in the same way that the quaternions extend the field of complex numbers. But since the octonions are non-associative, they cannot be represented as an algebra of matrices, so they are not as well known as the quaternions (and the construction of the octonions is more complicated too).
Suppose now that $e_0,\dots,e_7$ is an orthonormal basis for the division algebra of octonions with $e_0=1$. Then define $8\times 8$ matrices $A_0,\dots,A_7$ by letting $A_j$ be the matrix of left multiplication by $e_j$, so that $A_jv=e_j*v$ for all $v\in\mathbb{O}$. Our goal is to transform $(A_0,\dots,A_7)$ into other tuples of matrices that satisfy similar properties.
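To make this concrete, here is a sketch (my own code) that builds the octonion product via the Cayley-Dickson construction and forms the eight matrices as left multiplications by the basis vectors; the sign convention, basis ordering, and choice of left rather than right multiplication are assumptions on my part:

```python
import numpy as np

def conj(x):
    # Cayley-Dickson conjugation: negate all but the first coordinate
    y = -x.copy()
    y[0] = x[0]
    return y

def cd_mult(x, y):
    # Cayley-Dickson multiplication, valid for dimensions 1, 2, 4, 8
    n = len(x)
    if n == 1:
        return x * y
    a, b = x[: n // 2], x[n // 2:]
    c, d = y[: n // 2], y[n // 2:]
    # (a, b) * (c, d) = (a*c - conj(d)*b, d*a + b*conj(c))
    return np.concatenate([cd_mult(a, c) - cd_mult(conj(d), b),
                           cd_mult(d, a) + cd_mult(b, conj(c))])

E = np.eye(8)
A = [np.column_stack([cd_mult(E[j], E[k]) for k in range(8)]) for j in range(8)]

# Left multiplication by a unit octonion preserves the norm, so each A_j is orthogonal.
assert all(np.allclose(M @ M.T, np.eye(8)) for M in A)
```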
If $A_1,\dots,A_r$ and $B_1,\dots,B_r$ are matrices, then define the $L_2$-spectral radius similarity between $(A_1,\dots,A_r)$ and $(B_1,\dots,B_r)$ as
$$\frac{\rho(A_1\otimes\overline{B_1}+\dots+A_r\otimes\overline{B_r})}{\rho(A_1\otimes\overline{A_1}+\dots+A_r\otimes\overline{A_r})^{1/2}\rho(B_1\otimes\overline{B_1}+\dots+B_r\otimes\overline{B_r})^{1/2}}$$
where $\rho$ denotes the spectral radius, $\otimes$ is the tensor product, and $\overline{B_j}$ is the complex conjugate of $B_j$ applied elementwise.
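For a concrete check of this definition, here is a small numerical sketch (my own code) of the $L_2$-spectral radius similarity as written above:

```python
import numpy as np

def spectral_radius(M):
    return np.max(np.abs(np.linalg.eigvals(M)))

def gamma(As, Bs):
    # sum_k A_k tensor conj(B_k)
    return sum(np.kron(A, np.conj(B)) for A, B in zip(As, Bs))

def l2_spectral_similarity(As, Bs):
    num = spectral_radius(gamma(As, Bs))
    den = np.sqrt(spectral_radius(gamma(As, As)) * spectral_radius(gamma(Bs, Bs)))
    return num / den

rng = np.random.default_rng(1)
As = [rng.standard_normal((4, 4)) for _ in range(3)]
print(l2_spectral_similarity(As, As))   # the similarity of a tuple with itself is 1
```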
Consider now the maximum value of the fitness level (the $L_2$-spectral radius similarity with $(A_0,\dots,A_7)$) subject to the constraint that each matrix in the new tuple is a complex anti-symmetric matrix, a complex symmetric matrix, or a complex Hermitian matrix respectively.
The following calculations were obtained through gradient ascent, so I have no mathematical proof that the values obtained are actually correct.
[Table of computed maximum fitness values for the anti-symmetric, symmetric, and Hermitian constraints.]
Observe that, with at most one exception, all of these values are algebraic half integers. This indicates that the fitness function that we maximize behaves mathematically and can be used to produce new tuples of matrices from old ones. Furthermore, an AI can determine whether something notable is going on with a new tuple in several ways. For example, if the value of the fitness function at the local maximum has low algebraic degree, then the new tuple is likely notable and likely behaves mathematically (and is probably quite interpretable too).
The good behavior of these values demonstrates that the octonions are compatible with the $L_2$-spectral radius similarity. The operators $A_0,\dots,A_7$ are all orthogonal, and one can use the tuple $(A_0,\dots,A_7)$ to define a mixed unitary quantum channel $\rho\mapsto\frac{1}{8}(A_0\rho A_0^*+\dots+A_7\rho A_7^*)$ that is very similar to the completely depolarizing channel. The completely depolarizing channel completely mixes every quantum state while the mixture of orthogonal mappings completely mixes every real state. The $L_2$-spectral radius similarity works very well with the completely depolarizing channel, so one should expect the $L_2$-spectral radius similarity to also behave well with the octonions.
It is time for us to interpret some linear machine learning models that I have been working on. These models are linear, but I can generalize these algorithms to produce multilinear models which have stronger capabilities while still behaving mathematically. Since one can stack the layers to make non-linear models, these types of machine learning algorithms seem to have enough performance to be more relevant for AI safety.
Our goal is to transform a list $(A_1,\dots,A_r)$ of $n\times n$-matrices into a new and simplified list $(B_1,\dots,B_r)$ of $d\times d$-matrices. There are several ways in which we would like to simplify the matrices. For example, we would sometimes simply like $d<n$, but in other cases, we would like the matrices $B_1,\dots,B_r$ to all be real symmetric, complex symmetric, real Hermitian, complex Hermitian, complex anti-symmetric, etc.
We measure similarity between tuples of matrices using spectral radii. Suppose that $A_1,\dots,A_r$ are $n\times n$-matrices and $B_1,\dots,B_r$ are $d\times d$-matrices. Then define an operator $\Gamma(A_1,\dots,A_r;B_1,\dots,B_r)$ mapping $n\times d$-matrices to $n\times d$-matrices by setting $\Gamma(A_1,\dots,A_r;B_1,\dots,B_r)(X)=A_1XB_1^*+\dots+A_rXB_r^*$. Then define $\Phi(A_1,\dots,A_r)=\Gamma(A_1,\dots,A_r;A_1,\dots,A_r)$. Define the similarity between $(A_1,\dots,A_r)$ and $(B_1,\dots,B_r)$ by setting
$$\frac{\rho(\Gamma(A_1,\dots,A_r;B_1,\dots,B_r))}{\rho(\Phi(A_1,\dots,A_r))^{1/2}\rho(\Phi(B_1,\dots,B_r))^{1/2}}$$
where $\rho$ denotes the spectral radius. Here, this similarity should be thought of as a generalization of the cosine similarity to tuples of matrices, and it is always a real number in the interval $[0,1]$, so this is a sensible notion of similarity.
Suppose that $K$ is either the field of real numbers or the field of complex numbers. Let $M_d(K)$ denote the set of $d\times d$-matrices over $K$.
Let $n,d$ be positive integers. Let $P:M_d(\mathbb{C})\rightarrow M_d(\mathbb{C})$ denote a projection operator. Here, $P$ is a real-linear operator, but if the range of $P$ is not a complex subspace of $M_d(\mathbb{C})$, then $P$ is not necessarily complex linear. Here are a few examples of such projection operators that work:
$P(X)=\frac{X+X^T}{2}$ (Complex symmetric)
$P(X)=\frac{X-X^T}{2}$ (Complex anti-symmetric)
$P(X)=\frac{X+X^*}{2}$ (Complex Hermitian)
$P(X)=\Re(X)$ (real, the real part taken elementwise)
$P(X)=\frac{\Re(X)+\Re(X)^T}{2}$ (Real symmetric)
$P(X)=\frac{\Re(X)-\Re(X)^T}{2}$ (Real anti-symmetric)
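For reference, here are straightforward implementations of the projection operators listed above (my own sketches); each one is a real-linear idempotent map on the space of matrices:

```python
import numpy as np

def complex_symmetric(X):      return (X + X.T) / 2
def complex_antisymmetric(X):  return (X - X.T) / 2
def complex_hermitian(X):      return (X + X.conj().T) / 2
def real_part(X):              return X.real.astype(complex)
def real_symmetric(X):         return (X.real + X.real.T).astype(complex) / 2
def real_antisymmetric(X):     return (X.real - X.real.T).astype(complex) / 2

# Each map is idempotent: P(P(X)) == P(X).
rng = np.random.default_rng(2)
X = rng.standard_normal((3, 3)) + 1j * rng.standard_normal((3, 3))
for P in (complex_symmetric, complex_antisymmetric, complex_hermitian,
          real_part, real_symmetric, real_antisymmetric):
    assert np.allclose(P(P(X)), P(X))
```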
Caution: These are special projection operators on spaces of matrices. The following algorithms do not behave well for general projection operators; they mainly behave well for projections like the ones listed above along with a few other operators that I have forgotten about.
We are now ready to describe our machine learning algorithm's input and objective.
Input: $n\times n$-matrices $A_1,\dots,A_r$.
Objective: Our goal is to obtain $d\times d$-matrices $B_1,\dots,B_r$ where $P(B_j)=B_j$ for all $j$ but where the tuple $(B_1,\dots,B_r)$ locally maximizes the similarity with $(A_1,\dots,A_r)$.
In this case, we shall call $(B_1,\dots,B_r)$ an $L_{2,d}$-spectral radius dimensionality reduction (LSRDR) of $(A_1,\dots,A_r)$ along the subspace of matrices fixed by $P$.
LSRDRs along subspaces often perform tricks and are very well-behaved.
If two tuples are LSRDRs of $(A_1,\dots,A_r)$ along subspaces, then the two tuples are typically related to each other by a change of basis. Furthermore, if $(B_1,\dots,B_r)$ is an LSRDR of $(A_1,\dots,A_r)$ along a subspace, then we can typically find some matrices $R,S$ where $B_j=RA_jS$ for all $j$.
The model simplifies $(A_1,\dots,A_r)$ since it is encoded into the matrices $R$ and $S$, but this also means that the model is a linear model. I have only just made these observations about LSRDRs along subspaces, but they seem to behave mathematically enough for me, especially since the matrices $R$ and $S$ tend to have mathematical properties that I can't explain and am still exploring.
Universities are altogether unprofessional, so it is probably best for everyone to shame them and regard the degrees from these universities as completely worthless. Universities promote violence, and they refuse to apologize or acknowledge that there is any problem whatsoever.
Since AI interpretability is a big issue for AI safety, let's completely interpret the results of evolutionary computation.
Disclaimer: This interpretation of the results of AI does not generalize to interpreting deep neural networks. This is a result for interpreting a solution to a very specific problem that is far less complicated than deep learning, and by interpreting, I mean that we iterate a mathematical operation hundreds of times to get an object that is simpler than our original object, so don't get your hopes up too much.
A basis matroid is a pair $(E,M)$ where $E$ is a finite set and $M\subseteq P(E)$ (where $P(E)$ denotes the power set of $E$) that satisfies the following two properties:
i: $M$ is non-empty.
ii: (Basis exchange) If $A,B\in M$ and $a\in A\setminus B$, then there is some $b\in B\setminus A$ where $(A\setminus\{a\})\cup\{b\}\in M$.
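As a quick illustration, the basis-exchange property above can be checked by brute force, so one can verify whether a family of sets is the collection of bases of a matroid (a sketch of my own, practical only for small ground sets):

```python
from itertools import combinations

def is_basis_matroid(bases):
    """Return True if the given family of sets satisfies the basis-matroid axioms."""
    bases = [frozenset(B) for B in bases]
    if not bases:
        return False
    for A in bases:
        for B in bases:
            for a in A - B:
                # basis exchange: replace a by some b in B \ A and stay in the family
                if not any((A - {a}) | {b} in bases for b in B - A):
                    return False
    return True

# Bases of the uniform matroid U_{2,4}: every 2-element subset of {1, 2, 3, 4}.
print(is_basis_matroid(list(combinations(range(1, 5), 2))))   # True
```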
I ran a computer experiment where, through evolutionary computation, I obtained a basis matroid $(E,M)$ with $E=\{1,\dots,9\}$ and where each element of $M$ has the same fixed size, but the population size was kept so low that this evolutionary computation mimicked hill climbing algorithms. Now we need to interpret the matroid $(E,M)$.
The notion of a matroid has many dualities. Our strategy is to apply one of these dualities to the matroid $(E,M)$ so that the dual object is much smaller than the original object. One may formulate the notion of a matroid in terms of closure systems (flats), hyperplanes, closure operators, lattices, a rank function, independent sets, bases, and circuits. If these seem too complicated, note that many of these dualities are special cases of other dualities common with ordered sets. For example, the duality between closure systems, closure operators, and ordered sets applies to contexts unrelated to matroids, such as general and point-free topology. And the duality between the bases, the circuits, and the hyperplanes may be characterized in terms of rowmotion.
If $(P,\leq)$ is a partially ordered set, then a subset $A\subseteq P$ is said to be an antichain if whenever $x,y\in A$ and $x\leq y$, then $x=y$. In other words, an antichain is a subset $A$ of $P$ where the restriction of $\leq$ to $A$ is equality. We say that a subset $D$ of $P$ is downwards closed if whenever $x\leq y$ and $y\in D$, then $x\in D$ as well. If $A\subseteq P$, then let $\downarrow A$ denote the smallest downwards closed subset of $P$ containing $A$. Suppose that $P$ is a finite poset. If $A$ is an antichain in $P$, then let $R(A)$ denote the set of all minimal elements in $P\setminus\downarrow A$. Then $R(A)$ is an antichain as well, and the mapping $R$ (called rowmotion) is a bijection from the set of all antichains in $P$ to itself. This means that if $A$ is an antichain, then we may define $R^n(A)$ for all integers $n$ by setting $R^0(A)=A$ and $R^{n+1}(A)=R(R^n(A))$ for all $n\in\mathbb{Z}$.
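Here is a small sketch of rowmotion for antichains in the power set $P(E)$ ordered by inclusion, which is the poset used for matroids below (my own code; it is only practical for small ground sets):

```python
from itertools import chain, combinations

def downward_closure(A):
    """All subsets of the members of the antichain A (as frozensets)."""
    return {S for B in A for S in map(frozenset, chain.from_iterable(
        combinations(B, k) for k in range(len(B) + 1)))}

def rowmotion(A, ground):
    """Minimal elements of the complement of the downward closure of A in P(ground)."""
    A = {frozenset(B) for B in A}
    down = downward_closure(A)
    complement = {frozenset(S) for k in range(len(ground) + 1)
                  for S in combinations(ground, k) if frozenset(S) not in down}
    # a set is minimal if no proper subset of it also lies in the complement
    return {S for S in complement if not any(T < S for T in complement)}

bases = [{1, 2}, {1, 3}, {2, 3}]               # bases of the uniform matroid U_{2,3}
print(rowmotion(bases, ground={1, 2, 3}))      # yields the circuits {{1, 2, 3}}
```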
If $(E,M)$ is a basis matroid, then $M$ is an antichain in the poset $P(E)$ ordered by inclusion, so we may apply rowmotion to it, and we say that $(E,R^n(M))$ is an $n$-matroid. In this case, the $1$-matroids are the circuit matroids while the $-1$-matroids are the hyperplane matroids. Unfortunately, the $n$-matroids have not been characterized for other values of $n$. We say that the rowmotion order of $(E,M)$ is the least positive integer $N$ where $R^N(M)=M$. My computer experiments indicate that the rowmotion order of a matroid behaves in an orderly manner, which lends support to the idea that the rowmotion of a matroid is a sensible mathematical notion that may be studied mathematically. The notion of rowmotion of a matroid also appears to be a sensible mathematical notion for other reasons: if we iteratively apply a bijective operation (such as a reversible cellular automaton) to a finite object, then that bijective operation will often increase the entropy, in the sense that if the object has low entropy, then its iterated images will typically have a high amount of entropy and look like noise. But this is not the case with matroids, as $n$-matroids do not appear substantially more complicated than basis matroids. Unless and until there is a mundane explanation for this behavior of the rowmotion of matroids, I must consider the notion of rowmotion of matroids to be a mathematically interesting notion even though it is currently not understood by anyone.
With the matroid $(E,M)$ obtained from evolutionary computation, I found that $M$ has a rowmotion order that factorizes into several prime factors. By applying rowmotion to this matroid, I found the antichain $M'$ = {{1, 8, 9},{2, 3, 6, 8},{2, 3, 7, 9},{4, 5},{4, 6, 9},{4, 7, 8},{5, 6, 9},{5, 7, 8}}. Since rowmotion is a bijection on antichains, the set $M'$ is associated with a unique basis matroid. This is the smallest way to represent $M$ in terms of rowmotion, since the other rowmotion iterates of $M$ all have at least as many elements.
I consider this a somewhat satisfactory interpretation of the matroid $(E,M)$ that I have obtained through evolutionary computation, but there is still work to do because nobody has researched the rowmotion operation on matroids and because it would be better to simplify a matroid without needing to go through hundreds of layers of rowmotion. And even if we were to understand matroid rowmotion better, this would not help us too much with AI safety, since this interpretability of the result of evolutionary computation does not generalize to other AIs and it certainly does not apply to deep neural networks.
I made a video here where one may see the rowmotion of this matroid and that video is only slightly interpretable.
Deep matroid duality visualization: Rowmotion of a matroid
It turns out that evolutionary computation is not even necessary to construct matroids, since Donald Knuth has produced an algorithm that can be used to construct an arbitrary matroid in his 1975 paper on random matroids. I applied rowmotion to the matroid in his paper, and the resulting 10835-matroid is {{1, 2, 4, 5},{1, 2, 6, 10},{1, 3, 4, 6},{1, 3, 4, 7, 9},{1, 3, 6, 7, 9},{1, 4, 6, 7},{1, 4, 6, 9},{1, 4, 8, 10},{2, 3, 4, 5, 6, 7, 8, 9, 10}}. It looks like the rowmotion operation is good for simplifying matroids in general. We can uniquely recover the basis matroid from the 10835-matroid since the $n$-matroid is not a basis matroid whenever $0<n\leq 10835$.
I originally developed a machine learning notion which I call an LSRDR ($L_{2,d}$-spectral radius dimensionality reduction), and LSRDRs (and similar machine learning models) behave mathematically and have a high level of interpretability, which should be good for AI safety. Here, I am giving an example of how LSRDRs behave mathematically and how one can get the most out of interpreting an LSRDR.
Suppose that $n$ is a natural number. Let $E_n$ denote the quantum channel that takes an $n$-qubit quantum state, selects one of those qubits at random, and sends that qubit through the completely depolarizing channel (the completely depolarizing channel takes a state as input and returns the completely mixed state as output).
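For concreteness, here is one standard Kraus representation of this channel (a sketch under my own conventions: completely depolarizing a single qubit has Kraus operators $I/2,X/2,Y/2,Z/2$, and choosing one of the $n$ qubits uniformly at random contributes a factor $1/\sqrt{n}$):

```python
import numpy as np

I2 = np.eye(2, dtype=complex)
X = np.array([[0, 1], [1, 0]], dtype=complex)
Y = np.array([[0, -1j], [1j, 0]])
Z = np.array([[1, 0], [0, -1]], dtype=complex)

def kraus_ops(n):
    """4n Kraus operators of the 'depolarize one random qubit' channel on n qubits."""
    ops = []
    for k in range(n):                      # which qubit gets depolarized
        for P in (I2, X, Y, Z):
            factors = [I2] * n
            factors[k] = P
            K = np.array([[1.0 + 0j]])
            for F in factors:
                K = np.kron(K, F)
            ops.append(K / (2 * np.sqrt(n)))
    return ops

A = kraus_ops(3)
# The channel is trace preserving: sum_k A_k^dagger A_k = I.
assert np.allclose(sum(K.conj().T @ K for K in A), np.eye(2 ** 3))
```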
If $A_1,\dots,A_r$ and $B_1,\dots,B_r$ are complex matrices, then define superoperators $\Phi(A_1,\dots,A_r)$ and $\Gamma(A_1,\dots,A_r;B_1,\dots,B_r)$ by setting $\Phi(A_1,\dots,A_r)(X)=A_1XA_1^*+\dots+A_rXA_r^*$ and $\Gamma(A_1,\dots,A_r;B_1,\dots,B_r)(X)=A_1XB_1^*+\dots+A_rXB_r^*$ for all $X$.
Given tuples of matrices $(A_1,\dots,A_r)$ and $(B_1,\dots,B_r)$, define the $L_2$-spectral radius similarity between these tuples of matrices by setting
$$\frac{\rho(\Gamma(A_1,\dots,A_r;B_1,\dots,B_r))}{\rho(\Phi(A_1,\dots,A_r))^{1/2}\rho(\Phi(B_1,\dots,B_r))^{1/2}}.$$
Suppose now that $A_1,\dots,A_r$ are matrices where $E_n=\Phi(A_1,\dots,A_r)$ (that is, $A_1,\dots,A_r$ is a Kraus representation of $E_n$). Let $d$ be a positive integer. We say that a tuple $(B_1,\dots,B_r)$ of complex $d\times d$ matrices is an LSRDR of $(A_1,\dots,A_r)$ if the $L_2$-spectral radius similarity between $(A_1,\dots,A_r)$ and $(B_1,\dots,B_r)$ is locally maximized.
Suppose now that $B_1,\dots,B_r$ are complex $d\times d$-matrices and $(B_1,\dots,B_r)$ is an LSRDR of $(A_1,\dots,A_r)$. Then my computer experiments indicate that there will be some constant $\lambda$ where, up to the factor $\lambda$, the associated operator is similar to a positive semidefinite operator whose distinct eigenvalues have multiplicities given by binomial coefficients. I have not had a chance to try to mathematically prove this. Hooray. We have interpreted the LSRDR of $(A_1,\dots,A_r)$, and I have plenty of other examples of interpreted LSRDRs.
We also have a similar pattern for the spectrum of the corresponding superoperator: my computer experiments indicate that there is some constant for which the rescaled spectrum again has its eigenvalue multiplicities given by binomial coefficients.
In this note, I will continue to demonstrate not only the ways in which LSRDRs ($L_{2,d}$-spectral radius dimensionality reductions) are mathematical but also how one can get the most out of LSRDRs. LSRDRs are one of the types of machine learning models that I have been working on, and LSRDRs have characteristics that tell us that LSRDRs are often inherently interpretable, which should be good for AI safety.
Suppose that $E_6$ is the quantum channel that maps a 6-qubit state to a 6-qubit state by selecting one of the 6 qubits at random and sending it through the completely depolarizing channel (the completely depolarizing channel takes a state as an input and returns the completely mixed state as an output). Suppose that $A_1,\dots,A_r$ are $64\times 64$ matrices where $E_6$ has the Kraus representation $E_6(X)=A_1XA_1^*+\dots+A_rXA_r^*$.
The objective is to locally maximize the fitness level $\rho(z_1A_1+\dots+z_rA_r)/\|(z_1,\dots,z_r)\|$ where the norm in question is the Euclidean norm and where $\rho$ denotes the spectral radius. This is a 1-dimensional case of an LSRDR of the channel $E_6$.
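Here is a rough numerical sketch of this 1-dimensional LSRDR (my own code, not the original computation); it rebuilds the Kraus operators from the earlier sketch, uses a smaller 3-qubit channel so that it runs quickly, restricts to real coefficients for simplicity, maximizes the fitness with a generic optimizer, and then inspects the spectrum of the resulting matrix:

```python
import numpy as np
from scipy.optimize import minimize

paulis = [np.eye(2, dtype=complex),
          np.array([[0, 1], [1, 0]], dtype=complex),
          np.array([[0, -1j], [1j, 0]]),
          np.array([[1, 0], [0, -1]], dtype=complex)]

def kraus_ops(n):
    # Kraus operators of the 'depolarize one random qubit' channel (see earlier sketch)
    ops = []
    for k in range(n):
        for P in paulis:
            factors = [np.eye(2, dtype=complex)] * n
            factors[k] = P
            K = np.array([[1.0 + 0j]])
            for F in factors:
                K = np.kron(K, F)
            ops.append(K / (2 * np.sqrt(n)))
    return ops

def fitness(z, A):
    # rho(z_1 A_1 + ... + z_r A_r) / ||z||
    M = sum(zk * Ak for zk, Ak in zip(z, A))
    return np.max(np.abs(np.linalg.eigvals(M))) / np.linalg.norm(z)

A = kraus_ops(3)
rng = np.random.default_rng(0)
res = minimize(lambda z: -fitness(z, A), rng.standard_normal(len(A)),
               method="Nelder-Mead", options={"maxiter": 50000, "maxfev": 50000})
z = res.x / np.linalg.norm(res.x)
P = sum(zk * Ak for zk, Ak in zip(z, A))
print(fitness(res.x, A))
print(np.round(np.sort(np.linalg.eigvals(P).real), 3))   # inspect the spectrum at the optimum
```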
Let $P=z_1A_1+\dots+z_rA_r$ when $(z_1,\dots,z_r)$ is selected to locally maximize the fitness level. Then my empirical calculations show that there is some constant $\lambda$ where $\lambda P$ is positive semidefinite and where each of its eigenvalues has a multiplicity which is a binomial coefficient. But these are empirical calculations for select numbers of qubits; I have not been able to mathematically prove that this is always the case for all local maxima of the fitness level (I have not tried to come up with a proof).
Here, we have obtained a complete characterization of the matrix $P$ up to unitary equivalence thanks to the spectral theorem, so we are quite close to completely interpreting the local maximum of our fitness function.
I made a few YouTube videos showcasing the process of maximizing the fitness level here.
Spectra of 1 dimensional LSRDRs of 6 qubit noise channel during training
Spectra of 1 dimensional LSRDRs of 7 qubit noise channel during training
Spectra of 1 dimensional LSRDRs of 8 qubit noise channel during training
I will make another post soon about more LSRDRs of a higher dimension of the same channel $E_6$.