Joseph Van Name

Wiki Contributions

Comments

Sorted by

The intersection of machine learning and generalizations of Laver tables seems to have a lot of potential, so your question about this is extremely interesting to me.

Machine learning models from generalized Laver tables?

I have not been able to find any machine learning models that are based on generalized Laver tables that are used for solving problems unrelated to Laver tables. I currently do not directly see any new kinds of deep learning models that one can produce from generalizations of Laver tables, but I would not be surprised if there were a non-standard machine learning model from Laver like algebras that we could use for something like NLP. 

But I have to be skeptical about applying machine learning to generalized Laver tables. Generalizations of Laver tables have intricate structure and complexity but this complexity is ordered. Furthermore, there is a very large supply of Laver-like algebras (even with 2 generators) for potentially any situation. Laver-like algebras are also very easy to compute. While backtracking may potentially take an exponential amount of time, when I have computed Laver-like algebras, the backtracking algorithm seems to always terminate quickly unless it is producing a large quantity or Laver-like algebras or larger Laver-like algebras. But with all this being said, Laver-like algebras seem to lack the steerability that we need to apply these algebras to solving real-world problems. For example, try interpreting (in a model theoretic sense) modular arithmetic modulo 13 in a Laver-like algebra. That is quite hard to do because Laver-like algebras are not built for that sort of thing.

Here are a couple of avenues that I see as most promising for producing new machine learning algorithms from Laver-like algebras and other structures.

Let  be a set, and let  be a binary operation on  that satisfies the self-distributivity identity . Define the right powers  for  inductively by setting  and . We say that  is a reduced nilpotent self-distributive algebra if  satisfies the identities  for  and if for all  there is an  with . A reduced Laver-like algebra is a reduced nilpotent self-distributive algebra  where if  for each , then  for some . Here we make the convention to put the implied parentheses on the left so that .

Reduced nilpotent self-distributive algebras have most of the things that one would expect. Reduced nilpotent self-distributive algebras are often equipped with a composition operation. Reduced nilpotent self-distributive algebras have a notion of a critical point, and if our reduced nilpotent self-distributive algebra is endowed with a composition operation, the set of all critical points in the algebra forms a Heyting algebra. If  is a self-distributive algebra, then we can define  and in the discrete topology (this limit always exists for nilpotent self-distributive algebras). We define  precisely when  and  precisely when . We can define  to be the equivalence class of  with respect to  and . The operations on  are and  whenever we have our composition operation . One can use computer calculations to add another critical point to a reduced nilpotent self-distributive algebra and obtain a new reduced nilpotent self-distributive algebra with the same number of generators but with another critical point on top (and this new self-distributive algebra will be sub-directly irreducible). Therefore, by taking subdirect products and adding new critical points, we have an endless supply of reduced nilpotent self-distributive algebras with only 2 generators. I also know how to expand reduced nilpotent self-distributive algebras horizontally in the sense that given a finite reduced nilpotent self-distributive algebra  that is not a Laver-like algebra, we can obtain another finite reduced nilpotent self-distributive algebra  where  are both generated by the same number of elements and these algebras both have the same implication algebra of critical points but  but there is a surjective homomorphism . The point is that we have techniques for producing new nilpotent self-distributive algebras from old ones, and going deep into those new nilpotent self-distributive algebras.

Since reduced nilpotent self-distributive algebras are closed under finite subdirect products, subalgebras, and quotients, and since we have techniques for producing many nilpotent self-distributive algebras, perhaps one can make a ML model from these algebraic structures. On the other hand, there seems to be less of a connection between reduced nilpotent self-distributive algebras and large cardinals and non-commutative polynomials than with Laver-like algebras, so perhaps it is sufficient to stick to Laver-like algebras as a platform for constructing ML models.

From Laver-like algebras, we can also produce non-commutative polynomials. If  is a Laver-like algebra, and  is a generating set for , then for each non-maximal critical point , we can define a non-commutative polynomial  to be the sum of all non-commutative monomials of the form  where  and  but where  for . These non-commutative polynomials capture all the information behind the Laver-like algebras since one can reconstruct the entire Laver-like algebra up to critical equivalence from these non-commutative polynomials. These non-commutative polynomials don't work as well for nilpotent self-distributive algebras; for nilpotent self-distributive algebras, these non-commutative polynomials will instead be rational expressions and one will not be able to recover the entire nilpotent self-distributive algebra from these rational expressions. 

I have used these non-commutative polynomials to construct the infinite product formula

 where the polynomials  are obtained from  different non-trivial rank-into-rank embeddings . This means that the non-commutative ring operations  in the rings of non-commutative polynomials are meaningful for Laver-like algebras.

If I wanted to interpret and make sense of a Laver-like algebra, I would use these non-commutative polynomials. Since non-commutative polynomials are closely related to strings (for NLP) and since the variables in these non-commutative polynomials can be substituted with matrices, I would not be surprised if one can turn Laver-like algebras into a full ML model using these non-commutative polynomials in order to solve a problem unrelated to Laver-like algebras.

Perhaps, one can also use strong large cardinal hypotheses and logic to produce ML models from Laver-like algebras. Since the existence of a non-trivial rank-into-rank embedding is among the strongest of all large cardinal hypotheses, and since one can produce useful theorems about natural looking finite algebraic structures from these large cardinal hypotheses, perhaps the interplay between the logic using large cardinal hypotheses and finite algebraic structures may be able to produce ML models? For example, one can automate the process of using elementarity to prove the existence of finite algebraic structures. After one has proven the existence of these structures using large cardinal hypotheses, one can do a search using backtracking in order to actually find candidates for these algebraic structures (or if the backtracking algorithm is exhaustive and turns up nothing, then we have a contradiction in the large cardinal hierarchy or a programming error. Mathematicians need to expend a lot of effort into finding an inconsistency in the large cardinal hierarchy this way because I am confident that they won't find such an inconsistency they will instead find plenty of near misses.). We can automate the process of producing examples of Laver-like algebras that satisfy conditions that were first established using large cardinal hypotheses, and we can perhaps completely describe these Laver-like algebras (up-to-critical equivalence) using our exhaustive backtracking process. This approach can be used to produce a lot of data about rank-into-rank embeddings, but do not see a clear path that allows us to apply this data outside of set theory and Laver-like algebras.

Using deep learning to investigate generalized Laver tables

We can certainly apply existing deep learning and machine learning techniques to classical and multigenic Laver tables.

The n-th classical Laver table is the unique self-distributive algebraic structure  where whenever . The classical Laver tables are up-to-isomorphism the only nilpotent self-distributive algebras generated by a single element.

Problem: Compute the -th classical Laver table for as large  as we can achieve.

Solution: Randall Dougherty in his 1995 paper Critical points in an algebra of elementary embeddings II gave an algorithm for computing the 48-th classical Laver table  and with machine learning, I have personally gotten up to  (I think I have a couple of errors, but those are not too significant), and I have some of the computation of .

Suppose now that . Then Dougherty has shown that one can easily recover the entire function  from the restriction  where  for all . Suppose now that . Then let  denote the least natural number  such that . Then there is a number  called the threshold of  at  such that  and  and such that for every  with , we have  whenever  and for . The classical Laver table  can be easily recovered from the table  and the threshold function . To go beyond Dougherty's calculation of  to  and beyond, we exploit a couple of observations about the threshold function :

i: in most cases, , and

ii: in most cases,  for some .

Let , and let  where  is the least natural number where  and . Then one can easily compute  from  and the data  by using Dougherty's algorithm. Now the only thing that we need to do is compute  in the first place. It turns out that computing  is quite easy except when  is of the form  or  for some . In the case, when  or , we initially set  We then repeatedly modify  until we can no longer find any contradiction in the statement . Computing  essentially amounts to finding new elements in the set  and we find new elements in  using some form of machine learning.

In my calculation of , I used operations like bitwise AND, OR, and the bitwise majority operation to combine old elements in  to find new elements in , and I also employed other techniques similar to this. I did not use any neural networks though, but using neural networks seems to be a reasonable strategy, but we need to use the right kinds of neural networks.

My idea is to use something similar to a CNN where the linear layers are not as general as possible, but the linear layers are structured in a way that reduces the number of weights and so that the structure of the linear layers is compatible with the structure in the data. The elements in  belong to  and  has a natural tensor product structure. One can therefore consider  to be a subset of . Furthermore, a tuple of  elements in  can be considered as a subset of . This means that the linear layers from  to  should be Kronecker products  or Kronecker sums  (it seems like Kronecker sums with residual layers would work better than Kronecker products). Recall that the Kronecker product and Kronecker sum of matrices  are defined by setting  and  respectively. Perhaps neural networks can be used to generate new elements in . These generative neural networks do not have to be too large nor do they have to be too accurate. A neural network that is 50 percent accurate will be better than a neural network that is 90 percent accurate but takes 10 times as long to compute and is much harder to retrain once most of the elements in  have already been obtained and when the neural network has trouble finding new elements. I also favor smaller neural networks because one can compute  without using complicated techniques in the first place. I still need to figure out the rest of the details for the generative neural network that finds new elements of  since I want it to be simple and easy to compute so that it is competitive with things like bitwise operations.

Problem: Translate Laver-like algebras into data (like sequences of vectors or matrices) into a form that machine learning models can work with.

Solution: The non-commutative polynomials  are already in a form that is easier for machine learning algorithms to use. I was able to apply an algorithm that I originally developed for analyzing block ciphers in order to turn the non-commutative polynomials  into unique sequences of real matrices  (these real matrices do not seem to depend on the initialization since the gradient ascent always seems to converge to the same local maximum). One can then feed this sequence of real matrices  into a transformer to solve whatever problem one wants to solve about Laver-like algebras. One can also represent a Laver-like algebra as a 2-dimensional image and then pass that image through a CNN to learn about the Laver-like algebra. This technique may only be used for Laver-like algebras that are small enough for CNNs to fully see, so I do not recommend CNNs for Laver-like algebras, but this is at least a proof-of-concept.

Problem: Estimate the number of small enough Laver-like algebras (up-to-critical equivalence) that satisfy certain properties.

Solution: The set of all multigenic Laver tables over an alphabet  forms a rooted tree. We can therefore apply the technique for estimating the number of nodes in a rooted tree described in the 1975 paper Estimating the Efficiency of Backtrack Programs by Donald Knuth. This estimator is unbiased (the mean of the estimated value is actually the number of nodes in the tree), but it will have a very high variance for estimating the number of multigenic Laver tables. In order to reduce the variance for this estimator, we need to assign better probabilities when finding a random path from the root to the leaves, and we should be able to use transformers to assign these probabilities.

I am pretty sure that there are plenty of other ways to use machine learning to investigate Laver-like algebras.

Now, nilpotent self-distributive algebras may be applicable in cryptography (they seem like suitable platforms for the Kalka-Teicher key exchange but nobody has researched this). Perhaps a transformer that learns about Laver-like algebras would do better on some unrelated task than a transformer that was never trained using Laver-like algebras. But I do not see any other practical applications of nilpotent Laver-like algebras at the moment. This means that Laver-like algebras are currently a relatively safe problem for investigating machine learning algorithms, so Laver-like algebras are perhaps applicable to AI safety since I currently do not see any way of misaligning an AI for solving a problem related to Laver-like algebras.

I have originally developed a machine learning notion which I call an LSRDR (

-spectral radius dimensionality reduction), and LSRDRs (and similar machine learning models) behave mathematically and they have a high level of interpretability which should be good for AI safety. Here, I am giving an example of how LSRDRs behave mathematically and how one can get the most out of interpreting an LSRDR.

Suppose that  is a natural number. Let  denote the quantum channel that takes an  qubit quantum state and selects one of those qubits at random and send that qubit through the completely depolarizing channel (the completely depolarizing channel takes a state as input and returns the completely mixed state as an output).

If  are complex matrices, then define superoperators  and  by setting

 and  for all 

Given tuples of matrices , define the L_2-spectral radius similarity between these tuples of matrices by setting

.

Suppose now that  are matrices where . Let . We say that a tuple of complex  by  matrices  is an LSRDR of  if the quantity  is locally maximized.

Suppose now that  are complex -matrices and  is an LSRDR of . Then my computer experiments indicate that there will be some constant  where  is similar to a positive semidefinite operator with eigenvalues  and where the eigenvalue  has multiplicity  where  denotes the binomial coefficient. I have not had a chance to try to mathematically prove this. Hooray. We have interpreted the LSRDR  of , and I have plenty of other examples of interpreted LSRDRs.

We also have a similar pattern for the spectrum of . My computer experiments indicate that there is some constant  where  has spectrum  where the eigenvalue  has multiplicity .

In this note, I will continue to demonstrate not only the ways in which LSRDRs (-spectral radius dimensionality reduction) are mathematical but also how one can get the most out of LSRDRs. LSRDRs are one of the types of machine learning that I have been working on, and LSRDRs have characteristics that tell us that LSRDRs are often inherently interpretable which should be good for AI safety.

Suppose that  is the quantum channel that maps a  qubit state to a  qubit state where we select one of the 6 qubits at random and send it through the completely depolarizing channel (the completely depolarizing channel takes a state as an input and returns the completely mixed state as an output). Suppose that  are  by  matrices where  has the Kraus representation 

The objective is to locally maximize the fitness level  where the norm in question is the Euclidean norm and where  denotes the spectral radius. This is a 1 dimensional case of an LSRDR of the channel .

Let  when  is selected to locally maximize the fitness level. Then my empirical calculations show that there is some  where is positive semidefinite with eigenvalues  and where the eigenvalue  has multiplicity  which is the binomial coefficient. But these are empirical calculations for select values ; I have not been able to mathematically prove that this is always the case for all local maxima for the fitness level (I have not tried to come up with a proof).

Here, we have obtained a complete characterization of  up-to-unitary equivalence due to the spectral theorem, so we are quite close to completely interpreting the local maximum for our fitness function. 

I made a few YouTube videos showcasing the process of maximizing the fitness level here.

Spectra of 1 dimensional LSRDRs of 6 qubit noise channel during training

Spectra of 1 dimensional LSRDRs of 7 qubit noise channel during training

Spectra of 1 dimensional LSRDRs of 8 qubit noise channel during training

I will make another post soon about more LSRDRs of a higher dimension of the same channel .

I personally like my machine learning algorithms to behave mathematically especially when I give them mathematical data. For example, a fitness function with apparently one local maximum value is a mathematical fitness function. It is even more mathematical if one can prove mathematical theorems about such a fitness function or if one can completely describe the local maxima of such a fitness function. It seems like fitness functions that satisfy these mathematical properties are more interpretable than the fitness functions which do not, so people should investigate such functions for AI safety purposes.

My notion of an LSRDR is a notion that satisfies these mathematical properties. To demonstrate the mathematical behavior of LSRDRs, let's see what happens when we take an LSRDR of the octonions.

Let  denote either the field of real numbers or the field of complex numbers (

 could also be the division ring of quaternions, but for simplicity, let's not go there). If  are -matrices over , then an LSRDR (-spectral radius dimensionality reduction) of  is a collection  of -matrices that locally maximizes the fitness level

 denotes the spectral radius function while  denotes the tensor product and  denotes the matrix obtained from  by replacing each entry with its complex conjugate. We shall call the maximum fitness level the -spectral radius of  over the field , and we shall wrote  for this spectral radius.

Define the linear superoperator  by setting 

 and set . Then the fitness level of  is .

Suppose that  is an -dimensional real inner product space. Then the octonionic multiplication operation is the unique up-to-isomorphism bilinear binary operation  on  together with a unit  such that and  for all x. If we drop the condition that the octonions have a unit, then we do not quite have this uniqueness result. 

We say that an octonion-like algbera is a -dimensional real inner product space  together with a unique up-to-isomorphism bilinear operation  such that  for all .

Let  be a specific octonion-like algebra.

Suppose now that  is an orthonormal basis for  (this does not need to be the standard basis). Then for each , let  be the linear operator from  to  defined by setting  for all vectors . All non-zero linear combinations of  are conformal mappings (this means that they preserve angles). Now that we have turned the octonion-like algebra into matrices, we can take an LSRDR of the octonion-like algebras, but when taking the LSRDR of octonion-like algebras, we should not worry about the choice of orthonormal basis  since I could formulate everything in a coordinate-free manner.

Empirical Observation from computer calculations: Suppose that  and  is the field of real numbers. Then the following are equivalent.

  1. The  matrices  are a LSRDR of  over  where  has a unique real dominant eigenvalue.
  2. There exists matrices  where  for all  and where  is an orthonormal projection matrix.

In this case,  and this fitness level is reached by the matrices  in the above equivalent statements. Observe that the superoperator  is similar to a direct sum of   and a zero matrix. But the projection matrix  is a dominant eigenvector of  and of as well. 

I have no mathematical proof of the above fact though.

Now suppose that . Then my computer calculations yield the following complex -spectral radii:  

Each time that I have trained a complex LSRDR of , I was able to find a fitness level that is not just a local optimum but also a global optimum.

In the case of the real LSRDRs, I have a complete description of the LSRDRs of . This demonstrates that the octonion-like algebras are elegant mathematical structures and that LSRDRs behave mathematically in a manner that is compatible with the structure of the octonion-like algebras.

I have made a few YouTube videos that animate the process of gradient ascent to maximize the fitness level.

 

Edit: I have made some corrections to this post on 9/22/2024.

 

Fitness levels of complex LSRDRs of the octonions (youtube.com)

 

Here is an example of what might happen. Suppose that for each , we select a orthonormal basis  of unit vectors for . Let . Then

Then for each quantum channel , by the concavity of the logarithm function (which is the arithmetic-geometric mean inequality), we have 

. Here, equality is reached if and only if 

 for each , but this equality can be achieved by the channel

defined by  which is known as the completely depolarizing channel. This is the channel that always takes a quantum state and returns the completely mixed state. On the other hand, the channel  has maximum Choi rank since the Choi representation of  is just the identity function divided by the rank. This example is not unexpected since for each input of  the possible outputs span the entire space  evenly, so one does not have any information about the output from any particular input except that we know that the output could be anything. This example shows that the channels that locally minimize the loss function  are the channels that give us a sort of linear regression of  but where this linear regression takes into consideration uncertainty in the output so the regression of a output of a state is a mixed state rather than a pure state.

The notion of the linear regression is an interesting machine learning algorithm in the sense that it can be studied mathematically, but the notion of a linear regression is a quite limited machine learning algorithm as most relations are non-linear. In particular, the linear regression does not give us any notion of any uncertainty in the output.

One way to extend the notion of the linear regression to encapsulate uncertainty in the outputs is to regress a function not to a linear transformation mapping vectors to vectors, but to regress the function to a transformation that maps vectors to mixed states. And the notion of a quantum channel is an appropriate transformation that maps vectors to mixed states. One can optimize this quantum channel using gradient ascent.

For this post, I will only go through some basic facts about quantum information theory. The reader is referred to the book The Theory of Quantum Information by John Watrous for all the missing details.

Let  be a complex Euclidean space. Let  denote the vector space of linear operators from  to . Given complex Euclidean spaces , we say that a linear operator  from  to  is a trace preserving if 

 for all , and we say that  is completely positive if there are linear transformations  where  for all ; the value  is known as the Choi rank of . A completely positive trace preserving operator is known as a quantum channel.

The collection of quantum channels from  to  is compact and convex. 

If  is a complex Euclidean space, then let 

 denote the collection of pure states in 

 can be defined either as the set of unit vector in  modulo linear dependence, or 

 can be also defined as the collection of positive semidefinite rank- operators on  with trace .

Given complex Euclidean spaces  and a (multi) set of  distinct ordered pairs of unit vectors , and given a quantum channel

, we define the fitness level  and the loss level .

We may locally optimize  to minimize its loss level using gradient descent, but there is a slight problem. The set of quantum channels spans the  which has dimension . Due to the large dimension, any locally optimal  will contain  many parameters, and this is a large quantity of parameters for what is supposed to be just a glorified version of a linear regression. Fortunately, instead of taking all quantum channels into consideration, we can limit the scope the quantum channels of limited Choi rank.

Empirical Observation: Suppose that  are complex Euclidean spaces,  is finite and  is a positive integer. Then computer experiments indicate that there is typically only one quantum channel  of Choi rank at most  where  is locally minimized. More formally, if we run the experiment twice and produce two quantum channels  where  is locally minimized for , then we would typically have . We therefore say that when  is minimized,  is the best Choi rank  quantum channel approximation to .

Suppose now that  is a multiset. Then we would ideally like to approximate the function  better by alternating between the best Choi rank r quantum channel approximation and a non-linear mapping. An ideal choice of a non-linear but partial mapping is the function  that maps a positive semidefinite matrix  to its (equivalence class of) unit dominant eigenvector.

Empirical observation: If  and  is the best Choi rank  quantum channel approximation to , then let  for all , and define . Let  be a small open neighborhood of . Let . Then we typically have . More generally, the best Choi rank  quantum channel approximation to  is typically the identity function.

From the above observation, we see that the vector  is an approximation of  that cannot be improved upon. The mapping  is therefore a trainable approximation to the mapping  and since  are not even linear spaces (these are complex projective spaces with non-trivial homology groups), the mapping  is a non-linear model for the function to .

I have been investigating machine learning models similar to  for cryptocurrency research and development as these sorts of machine learning models seem to be useful for evaluating the cryptographic security of some proof-of-work problems and other cryptographic functions like block ciphers and hash functions. I have seen other machine learning models that behave about as mathematically as 

I admit that machine learning models like  are currently far from being as powerful as deep neural networks, but since  behaves mathematically, the model  should be considered as a safer and interpretable AI model. The goal is to therefore develop models that are mathematical like  but which can perform more and more machine learning tasks.

(Edited 8/14/2024)

There are some cases where we have a complete description for the local optima for an optimization problem. This is a case of such an optimization problem. 

Such optimization problems are useful for AI safety since a loss/fitness function where we have a complete description of all local or global optima is a highly interpretable loss/fitness function, and so one should consider using these loss/fitness functions to construct AI algorithms.

Theorem: Suppose that  is a real,complex, or quaternionic -matrix that minimizes the quantity . Then  is unitary.

Proof: The real case is a special case of a complex case, and by representing each -quaternionic matrix as a complex -matrix, we may assume that  is a complex matrix.

By the Schur decomposition, we know that  where  is a unitary matrix and  is upper triangular. But we know that . Furthermore, , so . Let  denote the diagonal matrix whose diagonal entries are the same as . Then  and . Furthermore,  iff T is diagonal and  iff  is diagonal. Therefore, since  and  is minimized, we can conclude that , so  is a diagonal matrix. Suppose that  has diagonal entries . By the arithmetic-geometric mean equality and the Cauchy-Schwarz inequality, we know that 

Here, the equalities hold if and only if  for all , but this implies that  is unitary. Q.E.D.

I do not care to share much more of my reasoning because I have shared enough and also because there is a reason that I have vowed to no longer discuss except possibly with lots of obfuscation. This discussion that we are having is just convincing me more that the entities here are not the entities I want to have around me at all. It does not do much good to say that the community here is acting well or to question my judgment about this community. It will do good for the people here to act better so that I will naturally have a positive judgment about this community.

You are judging my reasoning without knowing all that went into my reasoning. That is not good.

I will work with whatever data I have, and I will make a value judgment based on the information that I have. The fact that Karma relies on very small amounts of information is a testament to a fault of Karma, and that is further evidence of how the people on this site do not want to deal with mathematics.  And the information that I have indicates that there are many people here who are likely to fall for more scams like FTX. Not all of the people here are so bad, but I am making a judgment based on the general atmosphere here. If you do not like my judgment, then the best thing would be to try to do better. If this site has made a mediocre impression on me, then I am not at fault for the mediocrity here.

Load More