This is a special post for quick takes by Joseph Van Name. Only they can create top-level comments. Comments here also appear on the Quick Takes page and All Posts page.

Every entry in a matrix counts for the -spectral radius similarity. Suppose that  are real -matrices. Set . Define the -spectral radius similarity between  and  to be the number

. Then the -spectral radius similarity is always a real number in the interval , so one can think of the -spectral radius similarity as a generalization of the value  where  are real or complex vectors. It turns out experimentally that if  are random real matrices, and each  is obtained from  by replacing each entry in  with  with probability , then the -spectral radius similarity between  and  will be about . If , then observe that  as well.

Suppose now that  are random real  matrices and  are the  submatrices of  respectively obtained by only looking at the first  rows and columns of . Then the -spectral radius similarity between  and  will be about . We can therefore conclude that in some sense  is a simplified version of  that more efficiently captures the behavior of  than  does.

If  are independent random matrices with standard Gaussian entries, then the -spectral radius similarity between  and  will be about  with small variance. If  are random Gaussian vectors of length , then  will on average be about  for some constant , but  will have a high variance.

These are some simple observations that I have made about the spectral radius during my research for evaluating cryptographic functions for cryptocurrency technologies.

Your notation is confusing me. If r is the size of the list of matrices, then how can you have a probability of 1-r for r>=2? Maybe you mean 1-1/r and sqrt{1/r} instead of 1-r and sqrt{r} respectively?

Thanks for pointing that out. I have corrected the typo. I simply used the symbol  for two different quantities, but now the probability is denoted by the symbol .

In this post, I will post some observations that I have made about the octonions that demonstrate that the machine learning algorithms that I have been looking at recently behave mathematically and such machine learning algorithms seem to be highly interpretable. The good behavior of these machine learning algorithms is in part due to the mathematical nature of the octonions and also the compatibility with the octonions and the machine learning algorithm. To be specific, one should think of the octonions as encoding a mixed unitary quantum channel that looks very close to the completely depolarizing channel, but my machine learning algorithms work well with those sorts of quantum channels and similar objects.

Suppose that  is either the field of real numbers, complex numbers, or quaternions.

If  are matrices, then define a superoperator 

 by setting

 (the domain and range of ) and define . Define the L_2-spectral radius similarity  by setting

 where  denotes the spectral radius.

Recall that the octonions are the unique (up to isomorphism) 8-dimensional real inner product space  together with a bilinear binary operation  such that and  for all .

Suppose that  is an orthonormal basis for . Define operators  by setting . Now, define operators  up to reordering by setting 

Let  be a positive integer. Then the goal is to find complex symmetric -matrices  where  is locally maximized. We achieve this goal through gradient ascent optimization. Since we are using gradient ascent, I consider this to be a machine learning algorithm, but the function mapping  to  is a linear transformation, so we are training linear models here. (We can generalize this fitness function to one that trains non-linear models, but that takes a lot of work if we want the generalized fitness functions to still behave mathematically.)

Experimental Observation: If , then we can easily find complex symmetric matrices  where  is locally maximized and where 

If , then we can easily find complex symmetric matrices  where  is locally maximized and where

.

It is time for us to interpret some linear machine learning models that I have been working on. These models are linear, but I can generalize these algorithms to produce multilinear models which have stronger capabilities while still behaving mathematically. Since one can stack the layers to make non-linear models, these types of machine learning algorithms seem to have enough performance to be more relevant for AI safety.

Our goal is to transform a list of -matrices  into a new and simplified list of -matrices . There are several ways in which we would like to simplify the matrices. For example, we would sometimes simply like for , but in other cases, we would like the matrices  to all be real symmetric, complex symmetric, real Hermitian, complex Hermitian, complex anti-symmetric, etc. 

We measure similarity between tuples of matrices using spectral radii. Suppose that  are -matrices and  are -matrices. Then define an operator  mapping  matrices to -matrices by setting . Then define . Define the similarity between  and  by setting

where  denotes the spectral radius. Here,  should be thought of as a generalization of the cosine similarity to tuples of matrices. And  is always a real number in , so this is a sensible notion of similarity.

Suppose that  is either the field of real or complex numbers. Let  denote the set of  by  matrices over 

Let  be positive integers. Let  denote a projection operator. Here,  is a real-linear operator, but if  is not complex, then  is not necessarily complex linear. Here are a few examples of such linear operators  that work:

  (Complex symmetric)

 (Complex anti-symmetric)

 (Complex Hermitian)

 (real, the real part taken elementwise).

 (Real symmetric)

 (Real anti-symmetric)

 (real symmetric)

 (real anti-symmetric)

Caution: These are special projection operators on spaces of matrices. The following algorithms do not behave well for general projection operators; they mainly behave well for  along with operators that I have forgotten about.

We are now ready to describe our machine learning algorithm's input and objective.

Input: -matrices 

Objective: Our goal is to obtain matrices   where  for all  but which locally maximizes the similarity.

In this case, we shall call  an -spectral radius dimensionality reduction (LSRDR) along the subspace 

LSRDRs along subspaces often perform tricks and are very well-behaved.

If  are LSRDRs along subspaces, then there are typically some  where  for all . Furthermore, if  is an LSRDR along a subspace, then we can typically find some matrices  where for all 

The model  simplifies since it is encoded into the matrices , but this also means that the model  is a linear model. I have just made these observations about the LSRDRs along subspaces, but they seem to behave mathematically enough for me especially since the matrices  tend to have mathematical properties that I can't explain and am still exploring.

I am going to share an algorithm that I came up with that tends to produce the same result when we run it multiple times with different initializations. The iteration is not even guaranteed to converge since we are not using gradient ascent, but it typically converges as long as the algorithm is given a reasonable input. This suggests that the algorithm behaves mathematically and may be useful for things such as quantum error correction. After analyzing the algorithm, I shall use the algorithm to solve a computational problem.

We say that an algorithm is pseudodeterministic if it tends to return the same output even if the computation leading to that output is non-deterministic (due to a random initialization). I believe that we should focus a lot more on pseudodeterministic machine learning algorithms for AI safety and interpretability since pseudodeterministic algorithms are inherently interpretable.

Define  for all complex numbers . Then , and there are neighborhoods  of  respectively where if , then  quickly and if , then  quickly. Set . The function  serves as error correction for projection matrices since if  is nearly a projection matrix, then  will be a projection matrix.
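The formula for the error-correcting function is not visible in the paragraph above, so the following is only a sketch under an assumption: it uses the McWeeny purification polynomial f(z) = 3z^2 - 2z^3, a well-known polynomial that fixes 0 and 1 and contracts quickly near both points. The names `purify_step` and `purify` are hypothetical.

```python
import numpy as np

def purify_step(P):
    """One step of f(P) = 3 P^2 - 2 P^3 (assumed choice of f, McWeeny purification)."""
    P2 = P @ P
    return 3 * P2 - 2 * P2 @ P

def purify(P, steps=30):
    """Iterate f so that a nearly-projection matrix snaps to an exact projection."""
    for _ in range(steps):
        P = purify_step(P)
    return P

rng = np.random.default_rng(0)
Q, _ = np.linalg.qr(rng.standard_normal((6, 2)))        # orthonormal columns
P_exact = Q @ Q.T                                       # rank-2 orthogonal projection
P_noisy = P_exact + 0.01 * rng.standard_normal((6, 6))  # nearly a projection
P_fixed = purify(P_noisy)                               # idempotent up to rounding
```

The corrected matrix is a projection of the same rank as `P_exact`, but it need not equal `P_exact` and need not be an orthogonal projection, since the noise here is not symmetric.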

Suppose that  is either the field of real numbers, complex numbers or quaternions. Let  denote the center of . In particular, 

If  are -matrices, then define  by setting . Then we say that an operator of the form  is completely positive. We say that a -linear operator  is Hermitian preserving if  is Hermitian whenever  is Hermitian. Every completely positive operator is Hermitian preserving.
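As a small illustration of the last claim, here is a sketch for the complex case, assuming the standard form X ↦ Σ_j A_j X A_j^* for a completely positive operator. The helper name `cp_map` is hypothetical.

```python
import numpy as np

def cp_map(kraus, X):
    """Apply X -> sum_j A_j X A_j^* (assumed standard completely positive form)."""
    return sum(A @ X @ A.conj().T for A in kraus)

rng = np.random.default_rng(1)
n, r = 4, 3
kraus = [rng.standard_normal((n, n)) + 1j * rng.standard_normal((n, n)) for _ in range(r)]

G = rng.standard_normal((n, n)) + 1j * rng.standard_normal((n, n))
H = G + G.conj().T        # a Hermitian input
PSD = G @ G.conj().T      # a positive semidefinite input

out_H = cp_map(kraus, H)      # should again be Hermitian
out_PSD = cp_map(kraus, PSD)  # should again be positive semidefinite
```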

Suppose that  is -linear. Let . Let  be a random orthogonal projection matrix of rank . Set  for all . Then if everything goes well, the sequence  will converge to a projection matrix  of rank , and the projection matrix  will typically be unique in the sense that if we run the experiment again, we will typically obtain the exact same projection matrix . If  is Hermitian preserving, then the projection matrix  will typically be an orthogonal projection. This experiment performs well especially when  is completely positive or at least Hermitian preserving or nearly so. The projection matrix  will satisfy the equation .

In the case when  is a quantum channel, we can easily explain what the projection  does. The operator  is a projection onto a subspace of complex Euclidean space that is particularly well preserved by the channel . In particular, the image  is spanned by the top  eigenvectors of . This means that if we send the completely mixed state  through the quantum channel  and we measure the state  with respect to the projective measurement , then there is an unusually high probability that this measurement will land on  instead of .

Let us now use the algorithm that obtains  from  to solve a problem in many cases.

If  is a vector, then let  denote the diagonal matrix where  is the vector of diagonal entries, and if  is a square matrix, then let  denote the diagonal of . If  is a length  vector, then  is an -matrix, and if  is an -matrix, then  is a length  vector.
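In NumPy, both conventions happen to be covered by `np.diag`, which builds a diagonal matrix from a vector and extracts the diagonal from a matrix. The wrapper names `Diag` and `diag_of` below are hypothetical and only mirror the two notations.

```python
import numpy as np

def Diag(v):
    """Vector -> diagonal matrix with v on the diagonal."""
    return np.diag(np.asarray(v))

def diag_of(X):
    """Square matrix -> vector of its diagonal entries."""
    return np.diag(np.asarray(X))

v = np.array([3.0, 1.0, 2.0])
M = Diag(v)            # 3 x 3 matrix
back = diag_of(M)      # length-3 vector
```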

Problem Input: An -square matrix  with non-negative real entries and a natural number  with .

Objective: Find a subset  with  and where if , then the  largest entries in  are the values  for .

Algorithm: Let  be the completely positive operator defined by setting . Then we run the iteration using  to produce an orthogonal projection  with rank . In this case, the projection  will be a diagonal projection matrix with rank  where  and where  is our desired subset of .

While the operator  is just a linear operator, the pseudodeterminism of the algorithm that produces the operator  generalizes to other pseudodeterministic algorithms that return models that are more like deep neural networks.

This post gives an example of some calculations that I did using my own machine learning algorithm. These calculations work out nicely which indicates that the machine learning algorithm I am using is interpretable (and it is much more interpretable than any neural network would be). These calculations show that one can begin with old mathematical structures and produce new mathematical structures, and it seems feasible to completely automate this process to continue to produce more mathematical structures. The machine learning models that I use are linear, but it seems like we can get highly non-trivial results simply by iterating the procedure of obtaining new structures from old using machine learning.

I made a similar post to this one about 7 months ago, but I decided to revisit this experiment with more general algorithms and I have obtained experimental results which I think look nice.

To illustrate how this works, we start off with the octonions. The octonions consist of an 8-dimensional inner product space  together with a bilinear operation  and a unit  where  for all  and where  for all . The octonions are uniquely determined up to isomorphism by these properties. The operation  is non-associative, but the  is closely related to the quaternions and complex numbers. If we take a single element in , then  generates a subalgebra of  isomorphic to the field of complex numbers, and if  and  are linearly independent, then  spans a subalgebra of  isomorphic to the division ring of quaternions. For this reason, one commonly thinks of the octonions as the best way to extend the division ring of quaternions to a larger algebraic structure in the same way that the quaternions extend the field of complex numbers. But since the octonions are non-associative, they cannot be used to construct matrices, so they are not as well-known as the quaternions (and the construction of the octonions is more complicated too).

Suppose now that  is an orthonormal basis for the division ring of octonions with . Then define matrices  by setting  for all . Our goal is to transform  into other tuples of matrices that satisfy similar properties.

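If $A_1,\dots,A_r$ and $B_1,\dots,B_r$ are matrices, then define the $L_2$-spectral radius similarity between $(A_1,\dots,A_r)$ and $(B_1,\dots,B_r)$ as

$$\frac{\rho(A_1\otimes\overline{B_1}+\dots+A_r\otimes\overline{B_r})}{\rho(A_1\otimes\overline{A_1}+\dots+A_r\otimes\overline{A_r})^{1/2}\,\rho(B_1\otimes\overline{B_1}+\dots+B_r\otimes\overline{B_r})^{1/2}}$$

where $\rho$ denotes the spectral radius, $\otimes$ is the tensor product, and $\overline{Z}$ is the complex conjugate of $Z$ applied elementwise.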
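A minimal numerical sketch of this similarity follows. It assumes the formula is the spectral radius of Σ_j A_j ⊗ conj(B_j), divided by the square roots of the corresponding quantities for (A, A) and (B, B). The function names are hypothetical.

```python
import numpy as np

def spectral_radius(M):
    """Largest absolute value of an eigenvalue of M."""
    return np.max(np.abs(np.linalg.eigvals(M)))

def kron_sum(As, Bs):
    """sum_j A_j (x) conj(B_j), with the conjugate taken elementwise."""
    return sum(np.kron(A, np.conj(B)) for A, B in zip(As, Bs))

def l2_similarity(As, Bs):
    """Assumed form of the L_2-spectral radius similarity between two tuples."""
    num = spectral_radius(kron_sum(As, Bs))
    den = np.sqrt(spectral_radius(kron_sum(As, As)) * spectral_radius(kron_sum(Bs, Bs)))
    return num / den

rng = np.random.default_rng(0)
As = [rng.standard_normal((3, 3)) for _ in range(4)]
Bs = [rng.standard_normal((2, 2)) for _ in range(4)]   # the sizes may differ
sim_AB = l2_similarity(As, Bs)
sim_AA = l2_similarity(As, As)

# For tuples of 1 x 1 matrices the similarity reduces to |<u, v>| / (|u| |v|).
u, v = rng.standard_normal(5), rng.standard_normal(5)
sim_uv = l2_similarity([np.array([[x]]) for x in u], [np.array([[y]]) for y in v])
cos_uv = abs(u @ v) / (np.linalg.norm(u) * np.linalg.norm(v))
```

The 1 x 1 case is included because it shows in what sense the similarity generalizes the absolute cosine similarity of two vectors.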

Let , and let  denote the maximum value of the fitness level  such that each  is a complex  anti-symmetric matrix (), a complex  symmetric matrix (), and a complex -Hermitian matrix () respectively.

The following calculations were obtained through gradient ascent, so I have no mathematical proof that the values obtained are actually correct.

                                    ,                                  

                                    ,                                 

 ,          ,                              

 ,                      ,                                 

,                              

 ,                      ,                                 

 ,                      

 ,                      ,                                 

Observe that with at most one exception, all of these values  are algebraic half integers. This indicates that the fitness function that we maximize to produce  behaves mathematically and can be used to produce new tuples  from old ones . Furthermore, an AI can determine whether something notable is going on with the new tuple  in several ways. For example, if  has low algebraic degree at the local maximum, then  is likely notable and likely behaves mathematically (and is probably quite interpretable too).

The good behavior of  demonstrates that the octonions are compatible with the -spectral radius similarity. The operators  are all orthogonal, and one can take the tuple  as a mixed unitary quantum channel that is very similar to the completely depolarizing channel. The completely depolarizing channel completely mixes every quantum state while the mixture of orthogonal mappings  completely mixes every real state. The -spectral radius similarity works very well with the completely depolarizing channel, so one should expect for the -spectral radius similarity to also behave well with the octonions.

Since AI interpretability is a big issue for AI safety, let's completely interpret the results of evolutionary computation. 

Disclaimer: This interpretation of the results of AI does not generalize to interpreting deep neural networks. This is a result for interpreting a solution to a very specific problem that is far less complicated than deep learning, and by interpreting, I mean that we iterate a mathematical operation hundreds of times to get an object that is simpler than our original object, so don't get your hopes up too much.

A basis matroid is a pair  where  is a finite set, and  where  denotes the power set of  that satisfies the following two properties:

  1. If , then .
  2. If , then there is some  with  (the basis exchange property).
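The two properties can be checked by brute force on small examples. The sketch below assumes the standard formulation: the collection of bases is nonempty, and for bases A, B and any a in A \ B there is some b in B \ A such that (A \ {a}) ∪ {b} is again a basis. The function name `is_basis_matroid` is hypothetical.

```python
from itertools import combinations

def is_basis_matroid(bases):
    """Brute-force check of nonemptiness and the basis exchange property."""
    bases = {frozenset(b) for b in bases}
    if not bases:
        return False
    for A in bases:
        for B in bases:
            for a in A - B:
                # Need some b in B \ A so that swapping a for b gives a basis.
                if not any((A - {a}) | {b} in bases for b in B - A):
                    return False
    return True

uniform_2_4 = [set(c) for c in combinations(range(1, 5), 2)]  # all 2-subsets of {1,2,3,4}
not_matroid = [{1, 2}, {3, 4}]                                # exchange fails here
```

The check runs in time roughly quadratic in the number of bases, so it is only meant for small experiments.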

I ran a computer experiment where I obtained a matroid  where     and where each element in  has size  through evolutionary computation, but the population size was kept so low that this evolutionary computation mimicked hill climbing algorithms. Now we need to interpret the matroid .

The notion of a matroid has many dualities. Our strategy is to apply one of these dualities to the matroid  so that the dual object is much smaller than the original object . One may formulate the notion of a matroid in terms of closure systems (flats), hyperplanes, closure operators, lattices, a rank function, independent sets, bases, and circuits. If these seem too complicated, many of these dualities are special cases of other dualities common with ordered sets. For example, the duality between closure systems, closure operators, and ordered sets applies to contexts unrelated to matroids such as in general and point-free topology. And the duality between the bases, the circuits, and the hyperplanes may be characterized in terms of rowmotion.

If  is a partially ordered set, then a subset  is said to be an antichain if whenever , then . In other words, an antichain is a subset  of  where the restriction of  to  is equality. We say that a subset  of  is downwards closed if whenever  and , then  as well. If , then let  denote the smallest downwards closed subset of  containing . Suppose that  is a finite poset. If  is an antichain in , then let  denote the set of all minimal elements in . Then  is an antichain as well, and the mapping  is a bijection from the set of all antichains in  to itself. This means that if  is an antichain, then we may define  for all integers  by setting .
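Rowmotion can be sketched by brute force on the power set of a small ground set ordered by inclusion. The sketch assumes the usual definition: the image of an antichain is the set of minimal elements of the complement of its downward closure. All function names are hypothetical, and the cost is exponential in the size of the ground set.

```python
from itertools import combinations

def powerset(ground):
    """All subsets of the ground set, as frozensets."""
    items = sorted(ground)
    return [frozenset(c) for k in range(len(items) + 1) for c in combinations(items, k)]

def down_closure(antichain, poset):
    """Smallest downwards closed subset of the poset containing the antichain."""
    return {x for x in poset if any(x <= a for a in antichain)}

def rowmotion(antichain, ground):
    """Minimal elements of the complement of the downward closure (assumed definition)."""
    poset = powerset(ground)
    rest = set(poset) - down_closure(antichain, poset)
    return {x for x in rest if not any(y < x for y in rest)}

def rowmotion_order(antichain, ground):
    """Least positive n such that applying rowmotion n times returns the antichain."""
    current, steps = rowmotion(antichain, ground), 1
    while current != antichain:
        current, steps = rowmotion(current, ground), steps + 1
    return steps

ground = {1, 2, 3}
bases = {frozenset(b) for b in combinations(sorted(ground), 2)}  # uniform matroid U(2,3)
circuits = rowmotion(bases, ground)       # minimal dependent sets
order = rowmotion_order(bases, ground)
```

For the uniform matroid on three points with bases of size two, one step of rowmotion yields the single circuit {1, 2, 3}, because the downward closure of the bases is the family of independent sets.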

If  is a basis matroid, then  is an antichain in , so we may apply rowmotion, so we say that  is an -matroid. In this case, the -matroids are the circuit matroids while the -matroids are the hyperplane matroids. Unfortunately, the -matroids have not been characterized for . We say that the rowmotion order of  is the least positive integer  where . If  is a matroid of order , then my computer experiments indicate that  which lends support to the idea that the rowmotion of a matroid is a sensible mathematical notion that may be studied mathematically. The notion of rowmotion of a matroid also appears to be a sensible mathematical notion for other reasons; if we iteratively apply a bijective operation  (such as a reversible cellular automaton) to a finite object , then that bijective operation will often increase the entropy in the sense that if  has low entropy, then  will typically have a high amount of entropy and look like noise. But this is not the case with matroids as -matroids do not appear substantially more complicated than basis matroids. Unless and until there is a mundane explanation for this behavior of the rowmotion of matroids, I must consider the notion of rowmotion of matroids to be a mathematically interesting notion even though it is currently not understood by anyone.

With the matroid  obtained from evolutionary computation, I found that  has order  which factorizes as . Set . By applying rowmotion to this matroid, I found that ={{1, 8, 9},{2, 3, 6, 8},{2, 3, 7, 9},{4, 5},{4, 6, 9},{4, 7, 8},{5, 6, 9},{5, 7, 8}}. If  is a basis matroid, then , so the set  is associated with a unique basis matroid. This is the smallest way to represent  in terms of rowmotion since if , then .

I consider this a somewhat satisfactory interpretation of the matroid  that I have obtained through evolutionary computation, but there is still work to do because nobody has researched the rowmotion operation on matroids and because it would be better to simplify a matroid without needing to go through hundreds of layers of rowmotion. And even if we were to understand matroid rowmotion better, this would not help us too much with AI safety since this interpretability of the result of evolutionary computation does not generalize to other AIs and it certainly does not apply to deep neural networks.

I made a video here where one may see the rowmotion of this matroid and that video is only slightly interpretable.

 Deep matroid duality visualization: Rowmotion of a matroid 

It turns out that evolutionary computation is not even necessary to construct matroids since Donald Knuth has produced an algorithm that can be used to construct an arbitrary matroid in his 1975 paper on random matroids. But I applied the rowmotion to the matroid in his paper and the resulting 10835-matroid ={{1, 2, 4, 5},{1, 2, 6, 10},{1, 3, 4, 6},{1, 3, 4, 7, 9},{1, 3, 6, 7, 9},{1, 4, 6, 7},{1, 4, 6, 9},{1, 4, 8, 10},{2, 3, 4, 5, 6, 7, 8, 9, 10}}. It looks like the rowmotion operation is good for simplifying matroids in general. We can uniquely recover the basis matroid from the 10835 matroid since  is not a basis matroid whenever .

I originally developed a machine learning notion which I call an LSRDR (L_{2,d}-spectral radius dimensionality reduction), and LSRDRs (and similar machine learning models) behave mathematically and they have a high level of interpretability which should be good for AI safety. Here, I am giving an example of how LSRDRs behave mathematically and how one can get the most out of interpreting an LSRDR.

Suppose that  is a natural number. Let  denote the quantum channel that takes an  qubit quantum state, selects one of those qubits at random, and sends that qubit through the completely depolarizing channel (the completely depolarizing channel takes a state as input and returns the completely mixed state as an output).
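One way to write this channel down concretely is through Kraus operators. The sketch below assumes the qubit is chosen uniformly at random and uses the standard fact that the completely depolarizing qubit channel has Kraus operators P/2 for the four Pauli matrices P. The names `kraus_operators` and `apply_channel` are hypothetical.

```python
import numpy as np
from functools import reduce

I2 = np.eye(2, dtype=complex)
PAULIS = [
    I2,
    np.array([[0, 1], [1, 0]], dtype=complex),
    np.array([[0, -1j], [1j, 0]], dtype=complex),
    np.array([[1, 0], [0, -1]], dtype=complex),
]

def kraus_operators(n):
    """Kraus operators: a Pauli on one qubit, identity elsewhere, scaled by 1/(2 sqrt(n))."""
    ops = []
    for q in range(n):
        for P in PAULIS:
            factors = [P if k == q else I2 for k in range(n)]
            ops.append(reduce(np.kron, factors) / (2 * np.sqrt(n)))
    return ops

def apply_channel(ops, rho):
    """rho -> sum_j K_j rho K_j^*."""
    return sum(K @ rho @ K.conj().T for K in ops)

n = 2
ops = kraus_operators(n)
rho = np.zeros((4, 4), dtype=complex)
rho[0, 0] = 1.0                                   # the pure state |00><00|
out = apply_channel(ops, rho)
completeness = sum(K.conj().T @ K for K in ops)   # identity for a trace preserving map
```

For two qubits and the input |00>, the output is the equal mixture of "first qubit depolarized" and "second qubit depolarized", which has diagonal (1/2, 1/4, 1/4, 0).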

If  are complex matrices, then define superoperators  and  by setting

 and  for all 

Given tuples of matrices , define the L_2-spectral radius similarity between these tuples of matrices by setting

.

Suppose now that  are matrices where . Let . We say that a tuple of complex  by  matrices  is an LSRDR of  if the quantity  is locally maximized.

Suppose now that  are complex -matrices and  is an LSRDR of . Then my computer experiments indicate that there will be some constant  where  is similar to a positive semidefinite operator with eigenvalues  and where the eigenvalue  has multiplicity  where  denotes the binomial coefficient. I have not had a chance to try to mathematically prove this. Hooray. We have interpreted the LSRDR  of , and I have plenty of other examples of interpreted LSRDRs.

We also have a similar pattern for the spectrum of . My computer experiments indicate that there is some constant  where  has spectrum  where the eigenvalue  has multiplicity .

Linear regression is an interesting machine learning algorithm in the sense that it can be studied mathematically, but it is quite limited since most relations are non-linear. In particular, linear regression does not give us any notion of uncertainty in the output.

One way to extend the notion of the linear regression to encapsulate uncertainty in the outputs is to regress a function not to a linear transformation mapping vectors to vectors, but to regress the function to a transformation that maps vectors to mixed states. And the notion of a quantum channel is an appropriate transformation that maps vectors to mixed states. One can optimize this quantum channel using gradient ascent.

For this post, I will only go through some basic facts about quantum information theory. The reader is referred to the book The Theory of Quantum Information by John Watrous for all the missing details.

Let  be a complex Euclidean space. Let  denote the vector space of linear operators from  to . Given complex Euclidean spaces , we say that a linear operator  from  to  is trace preserving if  for all , and we say that  is completely positive if there are linear transformations  where  for all ; the value  is known as the Choi rank of . A completely positive trace preserving operator is known as a quantum channel.

The collection of quantum channels from  to  is compact and convex. 

If  is a complex Euclidean space, then let  denote the collection of pure states in . The set  can be defined either as the set of unit vectors in  modulo linear dependence, or as the collection of positive semidefinite rank-1 operators on  with trace 1.

Given complex Euclidean spaces  and a (multi) set of  distinct ordered pairs of unit vectors , and given a quantum channel , we define the fitness level  and the loss level .

We may locally optimize  to minimize its loss level using gradient descent, but there is a slight problem. The set of quantum channels spans the  which has dimension . Due to the large dimension, any locally optimal  will contain  many parameters, and this is a large quantity of parameters for what is supposed to be just a glorified version of a linear regression. Fortunately, instead of taking all quantum channels into consideration, we can limit the scope to the quantum channels of limited Choi rank.

Empirical Observation: Suppose that  are complex Euclidean spaces,  is finite and  is a positive integer. Then computer experiments indicate that there is typically only one quantum channel  of Choi rank at most  where  is locally minimized. More formally, if we run the experiment twice and produce two quantum channels  where  is locally minimized for , then we would typically have . We therefore say that when  is minimized,  is the best Choi rank  quantum channel approximation to .

Suppose now that  is a multiset. Then we would ideally like to approximate the function  better by alternating between the best Choi rank r quantum channel approximation and a non-linear mapping. An ideal choice of a non-linear but partial mapping is the function  that maps a positive semidefinite matrix  to its (equivalence class of) unit dominant eigenvector.
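The non-linear step can be sketched with a Hermitian eigendecomposition. Since a pure state is only defined up to a phase, the sketch fixes a representative by making the largest-magnitude entry real and positive; that convention and the name `dominant_eigenvector` are assumptions for illustration. The mapping is partial because it is not well defined when the top eigenvalue is degenerate.

```python
import numpy as np

def dominant_eigenvector(P):
    """Unit eigenvector for the largest eigenvalue of a Hermitian PSD matrix P."""
    eigenvalues, eigenvectors = np.linalg.eigh(P)   # eigenvalues in ascending order
    v = eigenvectors[:, -1]
    k = np.argmax(np.abs(v))
    # Fix the phase so that entry k is real and positive (an arbitrary convention).
    return v * (np.conj(v[k]) / np.abs(v[k]))

rng = np.random.default_rng(0)
u = rng.standard_normal(4) + 1j * rng.standard_normal(4)
u /= np.linalg.norm(u)

# A mixed state that is mostly the pure state |u><u| plus a little uniform noise.
P = 0.9 * np.outer(u, u.conj()) + 0.1 * np.eye(4) / 4
v = dominant_eigenvector(P)
overlap = abs(np.vdot(u, v))   # 1 when v and u represent the same pure state
```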

Empirical observation: If  and  is the best Choi rank  quantum channel approximation to , then let  for all , and define . Let  be a small open neighborhood of . Let . Then we typically have . More generally, the best Choi rank  quantum channel approximation to  is typically the identity function.

From the above observation, we see that the vector  is an approximation of  that cannot be improved upon. The mapping  is therefore a trainable approximation to the mapping  and since  are not even linear spaces (these are complex projective spaces with non-trivial homology groups), the mapping  is a non-linear model for the function to .

I have been investigating machine learning models similar to  for cryptocurrency research and development as these sorts of machine learning models seem to be useful for evaluating the cryptographic security of some proof-of-work problems and other cryptographic functions like block ciphers and hash functions. I have seen other machine learning models that behave about as mathematically as 

I admit that machine learning models like  are currently far from being as powerful as deep neural networks, but since  behaves mathematically, the model  should be considered as a safer and interpretable AI model. The goal is to therefore develop models that are mathematical like  but which can perform more and more machine learning tasks.

(Edited 8/14/2024)

Here is an example of what might happen. Suppose that for each , we select an orthonormal basis  of unit vectors for . Let . Then

Then for each quantum channel , by the concavity of the logarithm function (which is the arithmetic-geometric mean inequality), we have 

. Here, equality is reached if and only if 

 for each , but this equality can be achieved by the channel

defined by  which is known as the completely depolarizing channel. This is the channel that always takes a quantum state and returns the completely mixed state. On the other hand, the channel  has maximum Choi rank since the Choi representation of  is just the identity function divided by the rank. This example is not unexpected since for each input of  the possible outputs span the entire space  evenly, so one does not have any information about the output from any particular input except that we know that the output could be anything. This example shows that the channels that locally minimize the loss function  are the channels that give us a sort of linear regression of  but where this linear regression takes into consideration uncertainty in the output so the regression of an output of a state is a mixed state rather than a pure state.
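The claim about the Choi rank can be checked numerically for a single qubit. The sketch assumes the unnormalized convention in which the Choi matrix is Σ_j vec(K_j) vec(K_j)^* over Kraus operators K_j, so the identity appears scaled by 1/2 here; other normalizations differ only by a constant factor. The name `choi_matrix` is hypothetical.

```python
import numpy as np

def choi_matrix(kraus):
    """Choi matrix sum_j vec(K_j) vec(K_j)^* (unnormalized convention, an assumption)."""
    vecs = [K.reshape(-1, 1) for K in kraus]
    return sum(v @ v.conj().T for v in vecs)

paulis = [
    np.eye(2, dtype=complex),
    np.array([[0, 1], [1, 0]], dtype=complex),
    np.array([[0, -1j], [1j, 0]], dtype=complex),
    np.array([[1, 0], [0, -1]], dtype=complex),
]
kraus = [P / 2 for P in paulis]   # completely depolarizing qubit channel

J = choi_matrix(kraus)
choi_rank = np.linalg.matrix_rank(J)
```

The Choi matrix comes out proportional to the 4 x 4 identity, so the rank is 4, the largest possible for a qubit channel.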

We can use the spectral radius similarity to measure more complicated similarities between data sets.

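Suppose that $A_1,\dots,A_r$ are real $n\times n$-matrices and $B_1,\dots,B_r$ are real $m\times m$-matrices. Let $\rho(A)$ denote the spectral radius of $A$ and let $A\otimes B$ denote the tensor product of $A$ with $B$. Define the $L_2$-spectral radius by setting $\rho_2(A_1,\dots,A_r)=\rho(A_1\otimes A_1+\dots+A_r\otimes A_r)^{1/2}$. Define the $L_2$-spectral radius similarity between $(A_1,\dots,A_r)$ and $(B_1,\dots,B_r)$ as

$$\frac{\rho(A_1\otimes B_1+\dots+A_r\otimes B_r)}{\rho_2(A_1,\dots,A_r)\,\rho_2(B_1,\dots,B_r)}.$$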

We observe that if  is invertible and  is a constant, then

Therefore, the -spectral radius is able to detect and measure symmetry that is normally hidden.
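This hidden symmetry can be checked numerically. The identity itself is not visible above, so the sketch assumes it says that the similarity between (A_1, ..., A_r) and (λ C A_1 C^{-1}, ..., λ C A_r C^{-1}) equals 1, and it assumes the similarity is the spectral radius of Σ_j A_j ⊗ B_j divided by the product of the two L_2-spectral radii. Function names are hypothetical.

```python
import numpy as np

def spectral_radius(M):
    """Largest absolute value of an eigenvalue of M."""
    return np.max(np.abs(np.linalg.eigvals(M)))

def l2_similarity(As, Bs):
    """Assumed similarity for real tuples: rho(sum A_j (x) B_j) / (rho_2(A) rho_2(B))."""
    cross = spectral_radius(sum(np.kron(A, B) for A, B in zip(As, Bs)))
    rho2_A = np.sqrt(spectral_radius(sum(np.kron(A, A) for A in As)))
    rho2_B = np.sqrt(spectral_radius(sum(np.kron(B, B) for B in Bs)))
    return cross / (rho2_A * rho2_B)

rng = np.random.default_rng(0)
As = [rng.standard_normal((3, 3)) for _ in range(4)]

C = rng.standard_normal((3, 3)) + 3 * np.eye(3)   # an invertible change of basis
C_inv = np.linalg.inv(C)
lam = -2.5                                        # an arbitrary nonzero constant
Bs = [lam * C @ A @ C_inv for A in As]            # the same tuple, disguised

sim_hidden = l2_similarity(As, Bs)                # close to 1 despite the disguise
sim_random = l2_similarity(As, [rng.standard_normal((3, 3)) for _ in range(4)])
```

An entrywise comparison of `As` and `Bs` would see little in common, while the similarity reports that the two tuples agree up to a change of basis and a scalar.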

Example: Suppose that  are vectors of possibly different dimensions. Suppose that we would like to determine how close we are to obtaining an affine transformation  with  for all  (or a slightly different notion of similarity). We first of all should normalize these vectors to obtain vectors  with mean zero and where the covariance matrix is the identity matrix (we may not need to do this depending on our notion of similarity). Then  is a measure of how close we are to obtaining such an affine transformation . We may be able to apply this notion to determining the distance between machine learning models. For example, suppose that  are both the first few layers in a (typically different) neural network. Suppose that  is a set of data points. Then if  and , then  is a measure of the similarity between  and .
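The normalization step in this example (mean zero, identity covariance) is ordinary whitening. Here is a minimal sketch, assuming the data points are the rows of an array and the sample covariance has full rank. The function name `whiten` is hypothetical.

```python
import numpy as np

def whiten(X):
    """Rows of X are data points; returns data with mean zero and identity covariance."""
    Xc = X - X.mean(axis=0)
    cov = Xc.T @ Xc / len(Xc)
    w, V = np.linalg.eigh(cov)                 # cov = V diag(w) V^T with w > 0 assumed
    W = V @ np.diag(w ** -0.5) @ V.T           # symmetric inverse square root of cov
    return Xc @ W

rng = np.random.default_rng(0)
mix = np.array([[2.0, 0.3, 0.0], [0.0, 1.0, 0.5], [0.0, 0.0, 0.7]])
X = rng.standard_normal((500, 3)) @ mix + np.array([1.0, -2.0, 3.0])
Z = whiten(X)
Z_cov = Z.T @ Z / len(Z)
```

Using the symmetric inverse square root is one choice among many; any whitening matrix differs from it by an orthogonal factor.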

I have actually used this example to see if there is any similarity between two different neural networks trained on the same data set. For my experiment, I chose a random collection of  of ordered pairs and I trained the neural networks  to minimize the expected losses . In my experiment, each  was a random vector of length 32 whose entries were 0's and 1's. In my experiment, the similarity  was worse than if  were just random vectors.

This simple experiment suggests that trained neural networks retain too much random or pseudorandom data and are far too messy for anyone to develop a good understanding or interpretation of these networks. In my personal opinion, neural networks should be avoided in favor of other AI systems, but we need to develop these alternative AI systems so that they eventually outperform neural networks. I have personally used the L_2-spectral radius similarity to develop such non-messy AI systems including LSRDRs, but these non-neural non-messy AI systems currently do not perform as well as neural networks for most tasks. For example, I currently cannot train LSRDR-like structures to do any more NLP than just a word embedding, but I can train LSRDRs to do tasks that I have not seen neural networks perform (such as a tensor dimensionality reduction).

So in my research into machine learning algorithms that I can use to evaluate small block ciphers for cryptocurrency technologies, I have just stumbled upon a dimensionality reduction for tensors in tensor products of inner product spaces that according to my computer experiments exists, is unique, and which reduces a real tensor to another real tensor even when the underlying field is the field of complex numbers. I would not be too surprised if someone else came up with this tensor dimensionality reduction before since it has a rather simple description and it is in a sense a canonical tensor dimensionality reduction when we consider tensors as homogeneous non-commutative polynomials. But even if this tensor dimensionality reduction is not new, this dimensionality reduction algorithm belongs to a broader class of new algorithms that I have been studying recently such as LSRDRs.

Suppose that  is either the field of real numbers or the field of complex numbers. Let  be finite dimensional inner product spaces over  with dimensions  respectively. Suppose that  has basis . Given , we would sometimes want to approximate the tensor  with a tensor that has fewer parameters. Suppose that  is a sequence of natural numbers with . Suppose that  is a  matrix over the field  for  and . From the system of matrices , we obtain a tensor . If the system of matrices  locally minimizes the distance , then the tensor  is a dimensionality reduction of  which we shall denote by .

Intuition: One can associate the tensor product  with the set of all degree  homogeneous non-commutative polynomials that consist of linear combinations of the monomials of the form . Given our matrices , we can define a linear functional  by setting . But by the Riesz representation theorem, the linear functional  is dual to some tensor in . More specifically,  is dual to . The tensors of the form  are therefore the

Advantages: 

  1. In my computer experiments, the reduced dimension tensor  is often (but not always) unique in the sense that if we calculate the tensor  twice, then we will get the same tensor. At least, the distribution of reduced dimension tensors  will have low Renyi entropy. I personally consider the partial uniqueness of the reduced dimension tensor to be advantageous over total uniqueness since this partial uniqueness signals whether one should use this tensor dimensionality reduction in the first place. If the reduced tensor is far from being unique, then one should not use this tensor dimensionality reduction algorithm. If the reduced tensor is unique or at least has low Renyi entropy, then this dimensionality reduction works well for the tensor .
  2. This dimensionality reduction does not depend on the choice of orthonormal basis . If we chose a different basis for each , then the resulting tensor  of reduced dimensionality will remain the same (the proof is given below).
  3. If  is the field of complex numbers, but all the entries in the tensor  happen to be real numbers, then all the entries in the tensor  will also be real numbers.
  4. This dimensionality reduction algorithm is intuitive when tensors are considered as homogeneous non-commutative polynomials.

Disadvantages: 

  1. This dimensionality reduction depends on a canonical cyclic ordering of the inner product spaces .
  2. Other notions of dimensionality reduction for tensors such as the CP tensor dimensionality reduction and the Tucker decompositions are more well-established, and they are obviously attempted generalizations of the singular value decomposition to higher dimensions, so they may be more intuitive to some.
  3. The tensors of reduced dimensionality  have a more complicated description than the tensors in the CP tensor dimensionality reduction.

Proposition: The set of tensors of the form  does not depend on the choice of bases .

Proof: For each , let  be an alternative basis for . Then suppose that  for each . Then

. Q.E.D.

A failed generalization: An astute reader may have observed that if we drop the requirement that , then we get a linear functional defined by letting

. This is indeed a linear functional, and we can try to approximate  using the dual to , but this approach does not work as well.

In this post, we shall describe 3 related fitness functions with discrete domains where the process of maximizing these functions is pseudodeterministic in the sense that if we locally maximize the fitness function multiple times, then we typically attain the same local maximum; this appears to be an important aspect of AI safety. These fitness functions are my own. While these functions are far from deep neural networks, I think they are still related to AI safety since they are closely related to other fitness functions that are locally maximized pseudodeterministically that more closely resemble deep neural networks.

Let  denote a finite dimensional algebra over the field of real numbers together with an adjoint operation  (the operation  is a linear involution with ). For example,  could be the field of real numbers, complex numbers, quaternions, or a matrix ring over the reals, complex, or quaternions. We can extend the adjoint  to the matrix ring  by setting .

Let  be a natural number. If , then define

 by setting .

Suppose now that . Then let  be the set of all -diagonal matrices with  many 's on the diagonal. We observe that each element in  is an orthogonal projection. Define fitness functions  by setting

,

, and

. Here,  denotes the spectral radius.

 is typically slightly larger than , so these three fitness functions are closely related.

If , then we say that  is in the neighborhood of  if  differs from  by at most 2 entries. If  is a fitness function with domain , then we say that  is a local maximum of the function  if  whenever  is in the neighborhood of .

The path from initialization to a local maximum  for  will be a sequence  where  is always in the neighborhood of , where  for all , where the length of the path is , and where  is generated uniformly at random.
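The neighborhood search described above can be sketched as a simple hill climb. Since the three fitness functions themselves were lost from the text, the code below takes the fitness as an arbitrary callable and uses a stand-in (the spectral radius of the compressed matrix P A P) purely for illustration; `local_maximize` and `standin_fitness` are hypothetical names, not the author's functions.

```python
import numpy as np

def local_maximize(fitness, n, k, rng):
    """Hill-climb over 0/1 diagonals of length n with exactly k ones.

    With k fixed, a neighbor differs in exactly 2 diagonal entries:
    one 1 is swapped with one 0.
    """
    diag = np.zeros(n, dtype=int)
    diag[rng.choice(n, size=k, replace=False)] = 1  # uniformly random start
    best = fitness(diag)
    path_length = 0
    improved = True
    while improved:
        improved = False
        for i in np.flatnonzero(diag == 1):
            for j in np.flatnonzero(diag == 0):
                cand = diag.copy()
                cand[i], cand[j] = 0, 1
                val = fitness(cand)
                if val > best + 1e-12:  # accept strict improvements only
                    diag, best = cand, val
                    path_length += 1
                    improved = True
                    break
            if improved:
                break
    return diag, best, path_length

rng = np.random.default_rng(0)
A = rng.standard_normal((8, 8))

def standin_fitness(diag):
    # Stand-in only: spectral radius of P A P for the diagonal projection P.
    P = np.diag(diag).astype(float)
    return np.max(np.abs(np.linalg.eigvals(P @ A @ P)))

diag, best, path_length = local_maximize(standin_fitness, n=8, k=3, rng=rng)
print(diag, best, path_length)
```

Running the search from several random initializations and comparing the returned diagonals is one way to probe the pseudodeterminism described in the empirical observation.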

Empirical observation: Suppose that . If we compute a path from initialization to local maximum for , then such a path will typically have length less than . Furthermore, if we locally maximize  multiple times, we will typically obtain the same local maximum each time. Moreover, if  are the computed local maxima of  respectively, then  will either be identical or differ by relatively few diagonal entries.

I have not done the experiments yet, but one should be able to generalize the above empirical observation to matroids. Suppose that  is a basis matroid with underlying set  and where  for each . Then one should be able to make the same observation about the fitness functions  as well. 

We observe that the problems of maximizing  are all NP-hard since the clique problem can be reduced to special cases of maximizing . This means that the problems of maximizing  can be sophisticated problems, but this also means that we should not expect it to be easy to find the global maxima for  in some cases.

This is a post about some of the machine learning algorithms that I have been doing experiments with. These machine learning models behave quite mathematically which seems to be very helpful for AI interpretability and AI safety.

Sequences of matrices generally cannot be approximated by sequences of Hermitian matrices.

Suppose that  are -complex matrices and  are -complex matrices. Then define a mapping  by   for all . Define

. Define the -spectral radius by setting . Define the -spectral radius similarity between  and  by

.

The -spectral radius similarity is always in the interval . If  generates the algebra of -complex matrices, and  also generates the algebra of -complex matrices, then  if and only if there are  with  for all .
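The defining formulas were stripped from the text, so the following is a sketch under an assumption: the similarity is taken to be the spectral radius of the sum of Kronecker products of each A_j with the entrywise conjugate of B_j, divided by the geometric mean of the same quantity computed for the A's alone and the B's alone. The function names and the test matrices are hypothetical. The example checks the two properties stated above: the value lies in the unit interval, and it equals 1 when each B_j is a scalar multiple of a simultaneous conjugate of A_j.

```python
import numpy as np

def spectral_radius(M):
    return np.max(np.abs(np.linalg.eigvals(M)))

def spectral_radius_similarity(As, Bs):
    """Assumed formula: rho(sum A_j (x) conj(B_j)) normalized by the
    geometric mean of rho(sum A_j (x) conj(A_j)) and rho(sum B_j (x) conj(B_j))."""
    cross = sum(np.kron(A, B.conj()) for A, B in zip(As, Bs))
    aa = sum(np.kron(A, A.conj()) for A in As)
    bb = sum(np.kron(B, B.conj()) for B in Bs)
    return spectral_radius(cross) / np.sqrt(spectral_radius(aa) * spectral_radius(bb))

rng = np.random.default_rng(1)

def random_complex(shape):
    return rng.standard_normal(shape) + 1j * rng.standard_normal(shape)

As = [random_complex((3, 3)) for _ in range(4)]

# B_j = lam * C A_j C^{-1}: expected similarity 1 (up to rounding).
C = random_complex((3, 3))
lam = 2.0 - 1.0j
Bs_conjugate = [lam * C @ A @ np.linalg.inv(C) for A in As]
sim_conjugate = spectral_radius_similarity(As, Bs_conjugate)

# Independent matrices (here of a different size): some value in [0, 1].
Bs_random = [random_complex((2, 2)) for _ in range(4)]
sim_random = spectral_radius_similarity(As, Bs_random)

print(sim_conjugate, sim_random)
```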

Define  to be the supremum of

 where  are -Hermitian matrices.

One can get lower bounds for  simply by locally maximizing  using gradient ascent, and if one locally maximizes this quantity twice, one typically gets the same fitness level.

Empirical observation/conjecture: If  are -complex matrices, then  whenever .

The above observation means that sequences of -matrices  are fundamentally non-Hermitian. In this case, we cannot get better models of  using Hermitian matrices larger than the matrices  themselves; I kind of want the behavior to be more complex instead of doing the same thing whenever , but the purpose of modeling  as Hermitian matrices is generally to use smaller matrices and not larger matrices.

This means that the function  behaves mathematically.

Now, the model  is a linear model of  since the mapping  is the restriction of a linear mapping, so such a linear model should be good for a limited number of tasks, but the mathematical behavior of the model  generalizes to multi-layered machine learning models.

Here are some observations about the kind of fitness functions that I have been running experiments on for AI interpretability. The phenomena that I state in this post are determined experimentally without a rigorous mathematical proof and they only occur some of the time.

Suppose that  is a continuous fitness function. In an ideal universe, we would like for the function  to have just one local maximum. If  has just one local maximum, we say that  is maximized pseudodeterministically (or simply pseudodeterministic). At the very least, we would like for there to be just one real number of the form  for local maximum . In this case, all local maxima will typically be related by some sort of symmetry. Pseudodeterministic fitness functions seem to be quite interpretable to me. If there are many local maximum values and the local maximum value that we attain after training depends on things such as the initialization, then the local maximum will contain random/pseudorandom information independent of the training data, and the local maximum will be difficult to interpret. A fitness function with a single local maximum value behaves more mathematically than a fitness function with many local maximum values, and such mathematical behavior should help with interpretability; the only reason I have been able to interpret pseudodeterministic fitness functions before is that they behave mathematically and have a unique local maximum value.

Set . If the set  is disconnected (in a topological sense) and if  behaves differently on each of the components of , then we have literally shattered the possibility of having a unique local maximum, but in this post, we shall explore a case where each component of  still has a unique local maximum value.

Let  be positive integers with  and where . Let  be other natural numbers. The set  is the collection of all tuples  where each  is a real -matrix and where the indices range from  and where  is not identically zero for all .

The training data is a set  that consists of input/label pairs  where  and where  such that each  is a subset of  for all  (i.e.  is a binary classifier where  is the encoded network input and  is the label).

Define . Now, we define our fitness level by setting

 where the expected value is with respect to selecting an element  uniformly at random. Here,  is a Schatten -norm which is just the -norm of the singular values of the matrix. Observe that the fitness function  only depends on the list , so  does not depend on the training data labels.
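As a small illustration of the norm used in the fitness level, the Schatten norm can be computed directly from the singular values. This is a minimal sketch; `schatten_norm` is a hypothetical helper name, and the fitness function itself is not reproduced because its formula is missing from the text.

```python
import numpy as np

def schatten_norm(M, p):
    """Schatten p-norm: the ordinary p-norm of the vector of singular values."""
    s = np.linalg.svd(M, compute_uv=False)
    return np.sum(s ** p) ** (1.0 / p)

rng = np.random.default_rng(2)
M = rng.standard_normal((4, 3))

# p = 2 recovers the Frobenius norm and p = 1 the nuclear norm.
print(schatten_norm(M, 2), np.linalg.norm(M, 'fro'))
print(schatten_norm(M, 1), np.linalg.norm(M, 'nuc'))
```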

Observe that  which is a disconnected open set. Define a function  by setting . Observe that if  belong to the same component of , then .

While the fitness function  has many local maximum values, the function  seems to typically have at most one local maximum value per component. More specifically, for each , the set  seems to typically be a connected open set where  has just one local maximum value (maybe the other local maxima are hard to find, but if they are hard to find, they are irrelevant).

Let . Then  is a (possibly empty) open subset of , and there tends to be a unique (up-to-symmetry)  where  is locally maximized. This unique  is the machine learning model that we obtain when training on the data set . To obtain , we first perform an optimization that works well enough to get inside the open set . For example, to get inside , we could try to maximize the fitness function . We then maximize  inside the open set  to obtain our local maximum.

After training, we obtain a function  defined by . Observe that the function  is a multi-linear function. The function  is highly regularized, so if we want better performance, we should tone down the amount of regularization, but this can be done without compromising pseudodeterminism. The function  has been trained so that  for each  but also so that  is large compared to what we might expect whenever . In other words,  is helpful in determining whether  belongs to  or not since one can examine the magnitude and sign of .

In order to maximize AI safety, I want to produce inherently interpretable AI algorithms that perform well on difficult tasks. Right now, the function  (and other functions that I have designed) can do some machine learning tasks, but they are not ready to replace neural networks. I do have a few ideas about how to improve my AI algorithms' performance without compromising pseudodeterminism. I do not believe that pseudodeterministic machine learning will increase AI risks too much because when designing these pseudodeterministic algorithms, we are trading some (but hopefully not too much) performance for increased interpretability, and this tradeoff is good for safety because it increases interpretability without increasing performance.

In this note, I will continue to demonstrate not only the ways in which LSRDRs (-spectral radius dimensionality reduction) are mathematical but also how one can get the most out of LSRDRs. LSRDRs are one of the types of machine learning that I have been working on, and LSRDRs have characteristics that tell us that LSRDRs are often inherently interpretable which should be good for AI safety.

Suppose that  is the quantum channel that maps a  qubit state to a  qubit state where we select one of the 6 qubits at random and send it through the completely depolarizing channel (the completely depolarizing channel takes a state as an input and returns the completely mixed state as an output). Suppose that  are  by  matrices where  has the Kraus representation 

The objective is to locally maximize the fitness level  where the norm in question is the Euclidean norm and where  denotes the spectral radius. This is a 1 dimensional case of an LSRDR of the channel .
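A sketch of how the Kraus operators of this channel could be built: completely depolarizing a single qubit is the uniform mixture of the four Pauli conjugations, so choosing one of n qubits uniformly gives 4n Kraus operators, each a Pauli matrix on one qubit scaled by 1/(2 sqrt(n)). The fitness function below is only one reading of the (missing) formula, namely the spectral radius of a linear combination of the Kraus operators divided by the Euclidean norm of the coefficients; the function names are hypothetical, and n = 3 is used instead of 6 to keep the example small.

```python
import numpy as np
from functools import reduce

PAULIS = [
    np.eye(2, dtype=complex),
    np.array([[0, 1], [1, 0]], dtype=complex),
    np.array([[0, -1j], [1j, 0]], dtype=complex),
    np.array([[1, 0], [0, -1]], dtype=complex),
]

def kraus_operators(n):
    """Kraus operators for: pick one of n qubits uniformly, fully depolarize it."""
    ops = []
    for q in range(n):
        for P in PAULIS:
            factors = [np.eye(2, dtype=complex)] * n
            factors[q] = P
            ops.append(reduce(np.kron, factors) / (2.0 * np.sqrt(n)))
    return ops

def apply_channel(rho, ops):
    return sum(K @ rho @ K.conj().T for K in ops)

def fitness(z, ops):
    # Assumed 1-dimensional fitness: rho(sum conj(z_j) A_j) / ||z||_2.
    M = sum(np.conj(zj) * K for zj, K in zip(z, ops))
    return np.max(np.abs(np.linalg.eigvals(M))) / np.linalg.norm(z)

ops = kraus_operators(3)
completeness = sum(K.conj().T @ K for K in ops)  # should be the identity

rng = np.random.default_rng(3)
z = rng.standard_normal(len(ops)) + 1j * rng.standard_normal(len(ops))
print(len(ops), fitness(z, ops))
```

Any gradient-ascent routine over the coefficient vector z could then be used to look for the local maxima discussed here; that step is not shown.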

Let  when  is selected to locally maximize the fitness level. Then my empirical calculations show that there is some  where is positive semidefinite with eigenvalues  and where the eigenvalue  has multiplicity  which is the binomial coefficient. But these are empirical calculations for select values ; I have not been able to mathematically prove that this is always the case for all local maxima for the fitness level (I have not tried to come up with a proof).

Here, we have obtained a complete characterization of  up-to-unitary equivalence due to the spectral theorem, so we are quite close to completely interpreting the local maximum for our fitness function. 

I made a few YouTube videos showcasing the process of maximizing the fitness level here.

Spectra of 1 dimensional LSRDRs of 6 qubit noise channel during training

Spectra of 1 dimensional LSRDRs of 7 qubit noise channel during training

Spectra of 1 dimensional LSRDRs of 8 qubit noise channel during training

I will make another post soon about more LSRDRs of a higher dimension of the same channel .

I personally like my machine learning algorithms to behave mathematically especially when I give them mathematical data. For example, a fitness function with apparently one local maximum value is a mathematical fitness function. It is even more mathematical if one can prove mathematical theorems about such a fitness function or if one can completely describe the local maxima of such a fitness function. It seems like fitness functions that satisfy these mathematical properties are more interpretable than the fitness functions which do not, so people should investigate such functions for AI safety purposes.

My notion of an LSRDR is a notion that satisfies these mathematical properties. To demonstrate the mathematical behavior of LSRDRs, let's see what happens when we take an LSRDR of the octonions.

Let  denote either the field of real numbers or the field of complex numbers ( could also be the division ring of quaternions, but for simplicity, let's not go there). If  are -matrices over , then an LSRDR (-spectral radius dimensionality reduction) of  is a collection  of -matrices that locally maximizes the fitness level

 denotes the spectral radius function while  denotes the tensor product and  denotes the matrix obtained from  by replacing each entry with its complex conjugate. We shall call the maximum fitness level the -spectral radius of  over the field , and we shall write  for this spectral radius.

Define the linear superoperator  by setting 

 and set . Then the fitness level of  is .

Suppose that  is an -dimensional real inner product space. Then the octonionic multiplication operation is the unique up-to-isomorphism bilinear binary operation  on  together with a unit  such that and  for all x. If we drop the condition that the octonions have a unit, then we do not quite have this uniqueness result. 

We say that an octonion-like algebra is a -dimensional real inner product space  together with a unique up-to-isomorphism bilinear operation  such that  for all .

Let  be a specific octonion-like algebra.

Suppose now that  is an orthonormal basis for  (this does not need to be the standard basis). Then for each , let  be the linear operator from  to  defined by setting  for all vectors . All non-zero linear combinations of  are conformal mappings (this means that they preserve angles). Now that we have turned the octonion-like algebra into matrices, we can take an LSRDR of the octonion-like algebras, but when taking the LSRDR of octonion-like algebras, we should not worry about the choice of orthonormal basis  since I could formulate everything in a coordinate-free manner.
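To make the passage from the algebra to matrices concrete, here is a minimal sketch that uses the ordinary octonions as one instance of an octonion-like algebra, built with the Cayley-Dickson doubling formula (a, b)(c, d) = (ac - d*b, da + bc*). The choice of this convention, the standard basis, and the helper names are assumptions made for illustration. The example builds the eight left-multiplication matrices and checks that a linear combination L of them satisfies L^T L = |c|^2 I, which is the conformality property mentioned above.

```python
import numpy as np

def conj(x):
    """Cayley-Dickson conjugate of a vector of length 2^k."""
    if len(x) == 1:
        return x.copy()
    h = len(x) // 2
    return np.concatenate([conj(x[:h]), -x[h:]])

def mul(x, y):
    """Cayley-Dickson product: (a, b)(c, d) = (ac - d*b, da + bc*)."""
    if len(x) == 1:
        return x * y
    h = len(x) // 2
    a, b, c, d = x[:h], x[h:], y[:h], y[h:]
    return np.concatenate([mul(a, c) - mul(conj(d), b),
                           mul(d, a) + mul(b, conj(c))])

basis = np.eye(8)
# left[i] is the matrix of v -> e_i * v; its columns are e_i * e_j.
left = [np.column_stack([mul(e, f) for f in basis]) for e in basis]

rng = np.random.default_rng(4)
coeffs = rng.standard_normal(8)
L = sum(c * A for c, A in zip(coeffs, left))

# Conformal: L scales every vector by the same factor |coeffs|.
print(np.allclose(L.T @ L, (coeffs @ coeffs) * np.eye(8)))
```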

Empirical Observation from computer calculations: Suppose that  and  is the field of real numbers. Then the following are equivalent.

  1. The  matrices  are an LSRDR of  over  where  has a unique real dominant eigenvalue.
  2. There exist matrices  where  for all  and where  is an orthogonal projection matrix.

In this case,  and this fitness level is reached by the matrices  in the above equivalent statements. Observe that the superoperator  is similar to a direct sum of   and a zero matrix. But the projection matrix  is a dominant eigenvector of  and of as well. 

I have no mathematical proof of the above fact though.

Now suppose that . Then my computer calculations yield the following complex -spectral radii:  

Each time that I have trained a complex LSRDR of , I was able to find a fitness level that is not just a local optimum but also a global optimum.

In the case of the real LSRDRs, I have a complete description of the LSRDRs of . This demonstrates that the octonion-like algebras are elegant mathematical structures and that LSRDRs behave mathematically in a manner that is compatible with the structure of the octonion-like algebras.

I have made a few YouTube videos that animate the process of gradient ascent to maximize the fitness level.

 

Edit: I have made some corrections to this post on 9/22/2024.

 

Fitness levels of complex LSRDRs of the octonions (youtube.com)

 

There are some cases where we have a complete description for the local optima for an optimization problem. This is a case of such an optimization problem. 

Such optimization problems are useful for AI safety since a loss/fitness function where we have a complete description of all local or global optima is a highly interpretable loss/fitness function, and so one should consider using these loss/fitness functions to construct AI algorithms.

Theorem: Suppose that  is a real, complex, or quaternionic -matrix that minimizes the quantity . Then  is unitary.

Proof: The real case is a special case of a complex case, and by representing each -quaternionic matrix as a complex -matrix, we may assume that  is a complex matrix.

By the Schur decomposition, we know that  where  is a unitary matrix and  is upper triangular. But we know that . Furthermore, , so . Let  denote the diagonal matrix whose diagonal entries are the same as . Then  and . Furthermore,  iff T is diagonal and  iff  is diagonal. Therefore, since  and  is minimized, we can conclude that , so  is a diagonal matrix. Suppose that  has diagonal entries . By the arithmetic-geometric mean inequality and the Cauchy-Schwarz inequality, we know that 

Here, the equalities hold if and only if  for all , but this implies that  is unitary. Q.E.D.

The -spectral radius similarity is not transitive. Suppose that  are -matrices and  are real -matrices. Then define . Then the generalized Cauchy-Schwarz inequality is satisfied:

.

We therefore define the -spectral radius similarity between  and  as . One should think of the -spectral radius similarity as a generalization of the cosine similarity  between vectors . I have been using the -spectral radius similarity to develop AI systems that seem to be very interpretable. The -spectral radius similarity is not transitive.

 and

, but  can take any value in the interval .

We should therefore think of the -spectral radius similarity as a sort of least upper bound of -valued equivalence relations rather than a -valued equivalence relation. We need to consider this as a least upper bound because matrices have multiple dimensions.

Notation:  is the spectral radius. The spectral radius  is the largest magnitude of an eigenvalue of the matrix . Here the norm does not matter because we are taking the limit.  is the direct sum of matrices while  denotes the Kronecker product of matrices.
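The notation can be checked numerically with a short sketch. It compares the eigenvalue definition of the spectral radius with the limit of the n-th root of the norm of the n-th power for several norms, and also checks the standard facts that the spectral radius of a direct sum is the maximum of the two radii and that of a Kronecker product is their product. The helper names and the example matrices are illustrative choices.

```python
import numpy as np

def spectral_radius(A):
    return np.max(np.abs(np.linalg.eigvals(A)))

def gelfand_estimate(A, n, norm_ord):
    """n-th root of ||A^n|| for the chosen matrix norm."""
    return np.linalg.norm(np.linalg.matrix_power(A, n), norm_ord) ** (1.0 / n)

def direct_sum(A, B):
    out = np.zeros((A.shape[0] + B.shape[0], A.shape[1] + B.shape[1]))
    out[:A.shape[0], :A.shape[1]] = A
    out[A.shape[0]:, A.shape[1]:] = B
    return out

A = np.array([[2.0, 1.0], [0.0, 1.0]])   # eigenvalues 2 and 1
B = np.array([[0.0, 3.0], [1.0, 0.0]])   # eigenvalues +sqrt(3) and -sqrt(3)

for norm_ord in ('fro', 1, 2):
    # Each estimate approaches spectral_radius(A) = 2 as n grows.
    print(norm_ord, gelfand_estimate(A, 60, norm_ord))

print(spectral_radius(direct_sum(A, B)), spectral_radius(np.kron(A, B)))
```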

Let's compute some inner products and gradients.

Set up: Let  denote either the field of real or the field of complex numbers. Suppose that  are positive integers. Let  be a sequence of positive integers with . Suppose that  is an -matrix whenever . Then from the matrices , we can define a -tensor . I have been doing computer experiments where I use this tensor to approximate other tensors by minimizing the -distance. I have not seen this tensor approximation algorithm elsewhere, but perhaps someone else has produced this tensor approximation construction before. In previous shortform posts on this site, I have given evidence that the tensor dimensionality reduction behaves well, and in this post, we will focus on ways to compute with the tensors , namely the inner product of such tensors and the gradient of the inner product with respect to the matrices .

Notation: If  are matrices, then let  denote the superoperator defined by letting . Let .

Inner product: Here is the computation of the inner product of our tensors.

.

In particular, .
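The displayed formula was lost in extraction, so the following is a numerical sketch of the identity as I read it: if the tensor entries are traces of products of the table matrices (an assumption), and the superoperator X -> sum_j A_j X B_j^* is represented by the matrix sum_j A_j (x) conj(B_j), then the inner product of two such tensors equals the trace of the product of these matrices over the n positions. All function names are hypothetical.

```python
import numpy as np

def table_to_tensor(table):
    # Assumed construction: T[i1,...,in] = Tr(X_1[i1] @ ... @ X_n[in]).
    partial = table[0]
    for X in table[1:]:
        partial = np.einsum('...ab,jbc->...jac', partial, X)
    return np.einsum('...aa->...', partial)

def gamma_matrix(As, Bs):
    # Matrix of X -> sum_j A_j X B_j^* under row-major vectorization.
    return sum(np.kron(A, B.conj()) for A, B in zip(As, Bs))

def inner_product_via_trace(table_a, table_b):
    size = table_a[0].shape[1] * table_b[0].shape[1]
    M = np.eye(size, dtype=complex)
    for As, Bs in zip(table_a, table_b):
        M = M @ gamma_matrix(As, Bs)
    return np.trace(M)

rng = np.random.default_rng(5)

def random_table(dims, sizes):
    return [rng.standard_normal((m, dims[k], dims[k + 1]))
            + 1j * rng.standard_normal((m, dims[k], dims[k + 1]))
            for k, m in enumerate(sizes)]

sizes = [3, 4, 2]
table_a = random_table([2, 3, 2, 2], sizes)
table_b = random_table([3, 2, 2, 3], sizes)   # different inner dimensions are fine

TA, TB = table_to_tensor(table_a), table_to_tensor(table_b)
brute = np.vdot(TB, TA)                       # sum of TA * conj(TB)
fast = inner_product_via_trace(table_a, table_b)
print(brute, fast)
```

The trace form never materializes the full tensor, so its cost grows with the matrix sizes rather than with the number of tensor entries.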

Gradient: Observe that . We will see shortly that the cyclicity of the trace is useful for calculating the gradient. And here is my manual calculation of the gradient of the inner product of our tensors.

.

So in my research into machine learning algorithms, I have stumbled upon a dimensionality reduction algorithm for tensors, and my computer experiments have so far yielded interesting results. I am not sure that this dimensionality reduction is new, but I plan on generalizing this dimensionality reduction to more complicated constructions that I am pretty sure are new and am confident would work well.

Suppose that  is either the field of real numbers or the field of complex numbers. Suppose that  are positive integers and  is a sequence of positive integers with . Suppose that  is an -matrix whenever . Then define a tensor 

If , and  is a system of matrices that minimizes the value , then  is a dimensionality reduction of , and we shall let  denote the tensor of reduced dimension . We shall call  a matrix table to tensor dimensionality reduction of type .

Observation 1: (Sparsity) If  is sparse in the sense that most entries in the tensor  are zero, then the tensor  will tend to have plenty of zero entries, but as expected,  will be less sparse than .

Observation 2: (Repeated entries) If  is sparse and  and the set  has small cardinality, then the tensor  will contain plenty of repeated non-zero entries.

Observation 3: (Tensor decomposition) Let  be a tensor. Then we can often find a matrix table to tensor dimensionality reduction  of type  so that  is its own matrix table to tensor dimensionality reduction.

Observation 4: (Rational reduction) Suppose that  is sparse and the entries in  are all integers. Then the value  is often a positive integer in both the case when  has only integer entries and in the case when  has non-integer entries.

Observation 5: (Multiple lines) Let  be a fixed positive even number. Suppose that  is sparse and the entries in  are all of the form  for some integer  and . Then the entries in  are often exclusively of the form  as well.

Observation 6: (Rational reductions) I have observed a sparse tensor  all of whose entries are integers along with matrix table to tensor dimensionality reductions  of  where .

This is not an exhaustive list of all the observations that I have made about the matrix table to tensor dimensionality reduction.

From these observations, one should conclude that the matrix table to tensor dimensionality reduction is a well-behaved machine learning algorithm. I hope and expect this machine learning algorithm and many similar ones to be used to both interpret the AI models that we have and will have and also to construct more interpretable and safer AI models in the future.

Suppose that  are natural numbers. Let . Let  be a complex number whenever . Let  be the fitness function defined by letting . Here,  denotes the spectral radius of a matrix  while  denotes the Schatten -norm of .

Now suppose that  is a tuple that maximizes . Let  be the fitness function defined by letting . Then suppose that  is a tuple that maximizes . Then we will likely be able to find an  and a non-zero complex number  where 

In this case,  represents the training data while the matrices  are our learned machine learning model. In this case, we are able to recover some original data values from the learned machine learning model  without any distortion to the data values.

I have just made this observation, so I am still exploring the implications of this observation. But this is an example of how mathematical spectral machine learning algorithms can behave, and more mathematical machine learning models are more likely to be interpretable and they are more likely to have a robust mathematical/empirical theory behind them.

I think that all that happened here was the matrices  just ended up being diagonal matrices. This means that this is probably an uninteresting observation in this case, but I need to do more tests before commenting any further.
