Joseph Van Name - LessWrong

The intersection of machine learning and generalizations of Laver tables seems to have a lot of potential, so your question about this is extremely interesting to me.

Machine learning models from generalized Laver tables?

I have not been able to find any machine learning models that are based on generalized Laver tables that are used for solving problems unrelated to Laver tables. I currently do not directly see any new kinds of deep learning models that one can produce from generalizations of Laver tables, but I would not be surprised if there were a non-standard machine learning model from Laver like algebras that we could use for something like NLP.

But I have to be skeptical about applying machine learning to generalized Laver tables. Generalizations of Laver tables have intricate structure and complexity but this complexity is ordered. Furthermore, there is a very large supply of Laver-like algebras (even with 2 generators) for potentially any situation. Laver-like algebras are also very easy to compute. While backtracking may potentially take an exponential amount of time, when I have computed Laver-like algebras, the backtracking algorithm seems to always terminate quickly unless it is producing a large quantity or Laver-like algebras or larger Laver-like algebras. But with all this being said, Laver-like algebras seem to lack the steerability that we need to apply these algebras to solving real-world problems. For example, try interpreting (in a model theoretic sense) modular arithmetic modulo 13 in a Laver-like algebra. That is quite hard to do because Laver-like algebras are not built for that sort of thing.

Here are a couple of avenues that I see as most promising for producing new machine learning algorithms from Laver-like algebras and other structures.

Let $X$ be a set, and let $*$ be a binary operation on $X$ that satisfies the self-distributivity identity $x * (y * z) = (x * y) * (x * z)$ . Define the right powers $x^{[n]}$ for $n \geq 1$ inductively by setting $x^{[1]} = x$ and $x^{[n + 1]} = x * x^{[n]}$ . We say that $(X, *, 1)$ is a reduced nilpotent self-distributive algebra if $1$ satisfies the identities $1 * x = x, x * 1 = 1$ for $x \in X$ and if for all $x \in X$ there is an $n \geq 1$ with $x^{[n]} = 1$ . A reduced Laver-like algebra is a reduced nilpotent self-distributive algebra $(X, *)$ where if $x_{n} \in X$ for each $n \geq 0$ , then $x_{0} * \dots * x_{N} = 1$ for some $N$ . Here we make the convention to put the implied parentheses on the left so that $a * b * c * d = ((a * b) * c) * d$ .

Reduced nilpotent self-distributive algebras have most of the things that one would expect. Reduced nilpotent self-distributive algebras are often equipped with a composition operation. Reduced nilpotent self-distributive algebras have a notion of a critical point, and if our reduced nilpotent self-distributive algebra is endowed with a composition operation, the set of all critical points in the algebra forms a Heyting algebra. If $(X, *)$ is a self-distributive algebra, then we can define $x *_{0} y = y, x *_{n + 1} y = x *_{n} (x * y) = x * (x *_{n} y)$ and $x *_{\infty} y = {lim}_{n \to \infty} x *_{n} y$ in the discrete topology (this limit always exists for nilpotent self-distributive algebras). We define $x ⪯ y$ precisely when $x *_{\infty} y = 1$ and $x ≃ y$ precisely when $x ⪯ y ⪯ x$ . We can define $crit (x)$ to be the equivalence class of $x$ with respect to $≃$ and $crit [X] = X / ≃$ . The operations on $crit [X]$ are $crit (x) \to crit (y) = crit (x *_{\infty} y)$ and $crit (x) \land crit (y) = crit (x \circ y)$ whenever we have our composition operation $\circ$ . One can use computer calculations to add another critical point to a reduced nilpotent self-distributive algebra and obtain a new reduced nilpotent self-distributive algebra with the same number of generators but with another critical point on top (and this new self-distributive algebra will be sub-directly irreducible). Therefore, by taking subdirect products and adding new critical points, we have an endless supply of reduced nilpotent self-distributive algebras with only 2 generators. I also know how to expand reduced nilpotent self-distributive algebras horizontally in the sense that given a finite reduced nilpotent self-distributive algebra $(X, *)$ that is not a Laver-like algebra, we can obtain another finite reduced nilpotent self-distributive algebra $(Y, *)$ where $X, Y$ are both generated by the same number of elements and these algebras both have the same implication algebra of critical points but $| Y | >> | X |$ but there is a surjective homomorphism $ϕ : Y \to X$ . The point is that we have techniques for producing new nilpotent self-distributive algebras from old ones, and going deep into those new nilpotent self-distributive algebras.

Since reduced nilpotent self-distributive algebras are closed under finite subdirect products, subalgebras, and quotients, and since we have techniques for producing many nilpotent self-distributive algebras, perhaps one can make a ML model from these algebraic structures. On the other hand, there seems to be less of a connection between reduced nilpotent self-distributive algebras and large cardinals and non-commutative polynomials than with Laver-like algebras, so perhaps it is sufficient to stick to Laver-like algebras as a platform for constructing ML models.

From Laver-like algebras, we can also produce non-commutative polynomials. If $(X, *)$ is a Laver-like algebra, and $(x_{a})_{a \in A}$ is a generating set for $X$ , then for each non-maximal critical point $α \in crit [X]$ , we can define a non-commutative polynomial $p_{α} ((x_{a})_{a \in A})$ to be the sum of all non-commutative monomials of the form $x_{a_{1}} \dots x_{a_{n}}$ where $a_{1} \dots a_{n} \in A^{*}$ and $crit (a_{1} * \dots * a_{n}) = α$ but where $crit (a_{1} * \dots * a_{m}) < α$ for $1 \leq m < n$ . These non-commutative polynomials capture all the information behind the Laver-like algebras since one can reconstruct the entire Laver-like algebra up to critical equivalence from these non-commutative polynomials. These non-commutative polynomials don't work as well for nilpotent self-distributive algebras; for nilpotent self-distributive algebras, these non-commutative polynomials will instead be rational expressions and one will not be able to recover the entire nilpotent self-distributive algebra from these rational expressions.

I have used these non-commutative polynomials to construct the infinite product formula

$\dots p_{r} (x_{1}, \dots, x_{n}) \dots p_{0} (x_{1}, \dots, x_{n}) = \frac{1}{1 - (x_{1} + \dots + x_{n})}$ where the polynomials $p_{n} (x_{1}, \dots, x_{n})$ are obtained from $n$ different non-trivial rank-into-rank embeddings $j_{1}, \dots, j_{n} : V_{λ} \to V_{λ}$ . This means that the non-commutative ring operations $+, \cdot$ in the rings of non-commutative polynomials are meaningful for Laver-like algebras.

If I wanted to interpret and make sense of a Laver-like algebra, I would use these non-commutative polynomials. Since non-commutative polynomials are closely related to strings (for NLP) and since the variables in these non-commutative polynomials can be substituted with matrices, I would not be surprised if one can turn Laver-like algebras into a full ML model using these non-commutative polynomials in order to solve a problem unrelated to Laver-like algebras.

Perhaps, one can also use strong large cardinal hypotheses and logic to produce ML models from Laver-like algebras. Since the existence of a non-trivial rank-into-rank embedding is among the strongest of all large cardinal hypotheses, and since one can produce useful theorems about natural looking finite algebraic structures from these large cardinal hypotheses, perhaps the interplay between the logic using large cardinal hypotheses and finite algebraic structures may be able to produce ML models? For example, one can automate the process of using elementarity to prove the existence of finite algebraic structures. After one has proven the existence of these structures using large cardinal hypotheses, one can do a search using backtracking in order to actually find candidates for these algebraic structures (or if the backtracking algorithm is exhaustive and turns up nothing, then we have a contradiction in the large cardinal hierarchy or a programming error. Mathematicians need to expend a lot of effort into finding an inconsistency in the large cardinal hierarchy this way because I am confident that they won't find such an inconsistency they will instead find plenty of near misses.). We can automate the process of producing examples of Laver-like algebras that satisfy conditions that were first established using large cardinal hypotheses, and we can perhaps completely describe these Laver-like algebras (up-to-critical equivalence) using our exhaustive backtracking process. This approach can be used to produce a lot of data about rank-into-rank embeddings, but do not see a clear path that allows us to apply this data outside of set theory and Laver-like algebras.

Using deep learning to investigate generalized Laver tables

We can certainly apply existing deep learning and machine learning techniques to classical and multigenic Laver tables.

The n-th classical Laver table is the unique self-distributive algebraic structure $A_{n} = ({1, \dots, 2^{n}}, *_{n})$ where $x *_{n} 1 = x + 1 mod 2^{n}$ whenever $x \in {1, \dots, 2^{n}}$ . The classical Laver tables are up-to-isomorphism the only nilpotent self-distributive algebras generated by a single element.

Problem: Compute the $n$ -th classical Laver table for as large $n$ as we can achieve.

Solution: Randall Dougherty in his 1995 paper Critical points in an algebra of elementary embeddings II gave an algorithm for computing the 48-th classical Laver table $A_{48}$ and with machine learning, I have personally gotten up to $A_{768}$ (I think I have a couple of errors, but those are not too significant), and I have some of the computation of $A_{1024}$ .

Suppose now that $2^{N} \leq n \leq 3 \cdot 2^{N}$ . Then Dougherty has shown that one can easily recover the entire function $*_{n}$ from the restriction $L_{N, n} : {1, \dots, 2^{2^{N}}} \times {1, \dots, 2^{n}} \to {1, \dots, 2^{n}}$ where $L_{N, n} (x, y) = x *_{n} y$ for all $x, y$ . Suppose now that $1 \leq x \leq 2^{n}$ . Then let $o_{n} (x)$ denote the least natural number $m$ such that $x *_{n} 2^{m} = 2^{n}$ . Then there is a number $θ_{n + 1} (x)$ called the threshold of $x$ at $A_{n + 1}$ such that $0 \leq θ_{n + 1} (x) \leq 2^{o_{n} (x)}$ and $θ_{n + 1} (2^{n}) = 0$ and such that for every $y$ with $1 \leq y \leq 2^{n}$ , we have $x *_{n + 1} y = x *_{n} y$ whenever $1 \leq y \leq θ_{n + 1} (x)$ and $x *_{n + 1} y = (x *_{n} y) + 2^{n}$ for $θ_{n + 1} (x) < y \leq 2^{o_{n} (x)}$ . The classical Laver table $A_{n + 1}$ can be easily recovered from the table $A_{n}$ and the threshold function $θ_{n + 1}$ . To go beyond Dougherty's calculation of $A_{48}$ to $A_{768}$ and beyond, we exploit a couple of observations about the threshold function $θ_{n + 1}$ :

i: in most cases, $θ_{n + 2} (x) = θ_{n + 1} (x)$ , and

ii: in most cases, $θ_{n + 1} (x) = 2^{i} - 2^{j}$ for some $i, j$ .

Let $E (n) = {x \in {1, \dots, 2^{n - 2}} : θ_{n - 1} (x) \neq θ_{n} (x)}$ , and let $E^{♯} (n) = E (n) \cap {1, \dots, 2^{N} - 2}$ where $N$ is the least natural number where $n \leq 3 \cdot 2^{N}$ and $T_{n} = {(x, θ_{n} (x)) : x \in E^{♯} (n)}$ . Then one can easily compute $A_{n}$ from $A_{n - 1}$ and the data $T_{n}$ by using Dougherty's algorithm. Now the only thing that we need to do is compute $T_{n}$ in the first place. It turns out that computing $T_{n}$ is quite easy except when $n$ is of the form $2^{r} + 1$ or $3 \cdot 2^{r} + 1$ for some $r$ . In the case, when $n = 2^{r} + 1$ or $n = 3 \cdot 2^{r} + 1$ , we initially set $T_{n}^{temp} = \emptyset .$ We then repeatedly modify $T_{n}^{temp}$ until we can no longer find any contradiction in the statement $T_{n} = T_{n}^{temp}$ . Computing $T_{n}$ essentially amounts to finding new elements in the set $E^{♯} (n)$ and we find new elements in $E^{♯} (n)$ using some form of machine learning.

In my calculation of $A_{768}$ , I used operations like bitwise AND, OR, and the bitwise majority operation to combine old elements in $E^{♯} (n)$ to find new elements in $E^{♯} (n)$ , and I also employed other techniques similar to this. I did not use any neural networks though, but using neural networks seems to be a reasonable strategy, but we need to use the right kinds of neural networks.

My idea is to use something similar to a CNN where the linear layers are not as general as possible, but the linear layers are structured in a way that reduces the number of weights and so that the structure of the linear layers is compatible with the structure in the data. The elements in $E^{♯} (n)$ belong to ${1, \dots, 2^{N}}$ and ${1, \dots, 2^{N}}$ has a natural tensor product structure. One can therefore consider $E^{♯} (n)$ to be a subset of $(R^{2})^{\otimes n}$ . Furthermore, a tuple of $r$ elements in $E^{♯} (n)$ can be considered as a subset of $(R^{2}))^{\otimes n} \otimes R^{r}$ . This means that the linear layers from $(R^{2}))^{\otimes n} \otimes R^{r}$ to $(R^{2}))^{\otimes n} \otimes R^{r}$ should be Kronecker products $A_{1} \otimes \dots \otimes A_{n + 1}$ or Kronecker sums $A_{1} \oplus \dots \oplus A_{n + 1}$ (it seems like Kronecker sums with residual layers would work better than Kronecker products). Recall that the Kronecker product and Kronecker sum of matrices $A, B$ are defined by setting $(A \otimes B) (u \otimes v) = A u \otimes B v$ and $A \oplus B = (A \otimes 1) + (1 \otimes B)$ respectively. Perhaps neural networks can be used to generate new elements in $E^{♯} (n)$ . These generative neural networks do not have to be too large nor do they have to be too accurate. A neural network that is 50 percent accurate will be better than a neural network that is 90 percent accurate but takes 10 times as long to compute and is much harder to retrain once most of the elements in $E^{♯} (n)$ have already been obtained and when the neural network has trouble finding new elements. I also favor smaller neural networks because one can compute $A_{768}$ without using complicated techniques in the first place. I still need to figure out the rest of the details for the generative neural network that finds new elements of $E^{♯} (n)$ since I want it to be simple and easy to compute so that it is competitive with things like bitwise operations.

Problem: Translate Laver-like algebras into data (like sequences of vectors or matrices) into a form that machine learning models can work with.

Solution: The non-commutative polynomials $p_{0}, \dots, p_{n}$ are already in a form that is easier for machine learning algorithms to use. I was able to apply an algorithm that I originally developed for analyzing block ciphers in order to turn the non-commutative polynomials $p_{0}, \dots, p_{n}$ into unique sequences of real matrices $A_{0}, \dots, A_{n}$ (these real matrices do not seem to depend on the initialization since the gradient ascent always seems to converge to the same local maximum). One can then feed this sequence of real matrices $A_{0}, \dots, A_{n}$ into a transformer to solve whatever problem one wants to solve about Laver-like algebras. One can also represent a Laver-like algebra as a 2-dimensional image and then pass that image through a CNN to learn about the Laver-like algebra. This technique may only be used for Laver-like algebras that are small enough for CNNs to fully see, so I do not recommend CNNs for Laver-like algebras, but this is at least a proof-of-concept.

Problem: Estimate the number of small enough Laver-like algebras (up-to-critical equivalence) that satisfy certain properties.

Solution: The set of all multigenic Laver tables over an alphabet $A$ forms a rooted tree. We can therefore apply the technique for estimating the number of nodes in a rooted tree described in the 1975 paper Estimating the Efficiency of Backtrack Programs by Donald Knuth. This estimator is unbiased (the mean of the estimated value is actually the number of nodes in the tree), but it will have a very high variance for estimating the number of multigenic Laver tables. In order to reduce the variance for this estimator, we need to assign better probabilities when finding a random path from the root to the leaves, and we should be able to use transformers to assign these probabilities.

I am pretty sure that there are plenty of other ways to use machine learning to investigate Laver-like algebras.

Now, nilpotent self-distributive algebras may be applicable in cryptography (they seem like suitable platforms for the Kalka-Teicher key exchange but nobody has researched this). Perhaps a transformer that learns about Laver-like algebras would do better on some unrelated task than a transformer that was never trained using Laver-like algebras. But I do not see any other practical applications of nilpotent Laver-like algebras at the moment. This means that Laver-like algebras are currently a relatively safe problem for investigating machine learning algorithms, so Laver-like algebras are perhaps applicable to AI safety since I currently do not see any way of misaligning an AI for solving a problem related to Laver-like algebras.

Joseph Van Name's Shortform

Joseph Van Name1mo30

I am going to share an algorithm that I came up with that tends to produce the same result when we run it multiple times with a different initialization. The iteration is not even guaranteed convergence since we are not using gradient ascent, but it typically converges as long as the algorithm is given a reasonable input. This suggests that the algorithm behaves mathematically and may be useful for things such as quantum error correction. After analyzing the algorithm, I shall use the algorithm to solve a computational problem.

We say that an algorithm is pseudodeterministic if it tends to return the same output even if the computation leading to that output is non-deterministic (due to a random initialization). I believe that we should focus a lot more on pseudodetermistic machine learning algorithms for AI safety and interpretability since pseudodeterministic algorithms are inherently interpretable.

Define for all complex numbers $z$ . Then $f (0) = 0, f (1) = 1, f^{'} (0) = f^{'} (1) = 0$ , and there are neighborhoods $U, V$ of $0, 1$ respectively where if $x \in U$ , then $f^{N} (x) \to 0$ quickly and if $y \in V$ , then $f^{N} (y) \to 1$ quickly. Set $f^{\infty} = {lim}_{N \to \infty} f^{N}$ . The function $f^{\infty}$ serves as error correction for projection matrices since if $Q$ is nearly a projection matrix, then $f^{\infty} (Q)$ will be a projection matrix.

Suppose that $K$ is either the field of real numbers, complex numbers or quaternions. Let $Z (K)$ denote the center of $K$ . In particular, $Z (R) = R, Z (C) = C, Z (H) = R$ .

If $A_{1}, \dots, A_{r}$ are $m \times n$ -matrices, then define $Φ (A_{1}, \dots, A_{r}) : M_{n} (K) \to M_{m} (K)$ by setting $Φ (A_{1}, \dots, A_{r}) = \sum_{k = 1}^{r} A_{k} X A_{k}^{*}$ . Then we say that an operator of the form $Φ (A_{1}, \dots, A_{r})$ is completely positive. We say that a $Z (K)$ -linear operator $E : M_{n} (K) \to M_{m} (K)$ is Hermitian preserving if $E (X)$ is Hermitian whenever $X$ is Hermitian. Every completely positive operator is Hermitian preserving.

Suppose that $E : M_{n} (K) \to M_{n} (K)$ is $Z (K)$ -linear. Let $t > 0$ . Let $P_{0} \in M_{n} (K)$ be a random orthogonal projection matrix of rank $d$ . Set $P_{N + 1} = f^{\infty} (P_{N} + t \cdot E (P_{N}))$ for all $N$ . Then if everything goes well, the sequence $(P_{N})_{N}$ will converge to a projection matrix $P$ of rank $d$ , and the projection matrix $P$ will typically be unique in the sense that if we run the experiment again, we will typically obtain the exact same projection matrix $P$ . If $E$ is Hermitian preserving, then the projection matrix $P$ will typically be an orthogonal projection. This experiment performs well especially when $E$ is completely positive or at least Hermitian preserving or nearly so. The projection matrix $P$ will satisfy the equation $P \cdot E (P) = E (P) \cdot P = P \cdot E (P) \cdot P$ .

In the case when $E$ is a quantum channel, we can easily explain what the projection $P$ does. The operator $P$ is a projection onto a subspace of complex Euclidean space that is particularly well preserved by the channel $E$ . In particular, the image $Im (P)$ is spanned by the top $d$ eigenvectors of $E (P)$ . This means that if we send the completely mixed state $P / d$ through the quantum channel $E$ and we measure the state $E (P / d)$ with respect to the projective measurement $(P, I - P)$ , then there is an unusually high probability that this measurement will land on $P$ instead of $I - P$ .

Let us now use the algorithm that obtains $P$ from $E$ to solve a problem in many cases.

If $x$ is a vector, then let $Diag (x)$ denote the diagonal matrix where $x$ is the vector of diagonal entries, and if $X$ is a square matrix, then let $Diag (X)$ denote the diagonal of $X$ . If $x$ is a length $n$ vector, then $Diag (x)$ is an $n \times n$ -matrix, and if $X$ is an $n \times n$ -matrix, then $Diag (X)$ is a length $n$ vector.

Problem Input: An $n \times n$ -square matrix $A$ with non-negative real entries and a natural number $d$ with $1 \leq d < n$ .

Objective: Find a subset $B \subseteq {1, \dots, n}$ with $| B | = d$ and where if $x = A \cdot χ_{B}$ , then the $d$ largest entries in $x$ are the values $x [b]$ for $b \in B$ .

Algorithm: Let $E$ be the completely positive operator defined by setting $E (X) = Diag (A \cdot Diag (X))$ . Then we run the iteration using $E$ to produce an orthogonal projection $P$ with rank $d$ . In this case, the projection $P$ will be a diagonal projection matrix with rank $d$ where $diag (P) = χ_{B}$ and where $B$ is our desired subset of ${1, \dots, n}$ .

While the operator $P$ is just a linear operator, the pseudodeterminism of the algorithm that produces the operator $P$ generalizes to other pseudodeterministic algorithms that return models that are more like deep neural networks.

Spectral radii dimensionality reduction computed without gradient calculations

Joseph Van Name1mo*10

I would have thought that a fitness function that is maximized using something other than gradient ascent and which can solve NP-complete problems at least in the average case would be worth reading since that means that it can perform well on some tasks but it also behaves mathematically in a way that is needed for interpretability. The quality of the content is inversely proportional to the number of views since people don't think the same way as I do.

Wheels on the Bus | @CoComelon Nursery Rhymes & Kids Songs

Stuff that is popular is usually garbage.

But here is my post about the word embedding.

Interpreting a matrix-valued word embedding with a mathematically proven characterization of all optima — LessWrong

And I really do not want to collaborate with people who are not willing to read the post. This is especially true of people in academia since universities promote violence and refuse to acknowledge any wrongdoing. Universities are the absolute worst.

Instead of engaging with the actual topic, people tend to just criticize stupid stuff simply because they only want to read about what they already know or what is recommended by their buddies; that is a very good way not to learn anything new or insightful. For this reason, even the simplest concepts are lost on most people.

Spectral radii dimensionality reduction computed without gradient calculations

Joseph Van Name1mo*80

In this post, the existence of a non-gradient based algorithm for computing LSRDRs is a sign that LSRDRs behave mathematically and are quite interpretable. Gradient ascent is a general purpose optimization algorithm that works in the case when there is no other way to solve the optimization problem, but when there are multiple ways of obtaining a solution to an optimization problem, the optimization problem is behaving in a way that should be appealing to mathematicians.

LSRDRs and similar algorithms are pseudodeterministic in the sense that if we train the model multiple times on the same data, we typically get identical models. Pseudodeterminism is a signal of interpretability for several reasons that I will go into more detail in a future post:

Pseudodeterministic models do not contain any extra random or even pseudorandom information that is not contained in the training data already. This means that when interpreting these models, one does not have to interpret random information.
Pseudodeterministic models inherit the symmetry of their training data. For example, if we train a real LSRDR using real symmetric matrices, then the projection will itself by a symmetric matrix.
In mathematics, a well-posed problem is a problem where there exists a unique solution to the problem. Well-posed problems behave better than ill-posed problems in the sense that it is easier to prove results about well-posed problems than it is to prove results about ill-posed problems.

In addition to pseudodeterminism, in my experience, LSRDRs are quite interpretable since I have interpreted LSRDRs already in a few posts:

Interpreting a dimensionality reduction of a collection of matrices as two positive semidefinite block diagonal matrices — LessWrong

When performing a dimensionality reduction on tensors, the trace is often zero. — LessWrong

I have Generalized LSRDRs so that they are starting to behave like deeper neural networks. I am trying to expand the capabilities of generalized LSRDRs so they behave more like deep neural networks, but I still have some work to expand their capabilities while retaining pseudodeterminism. In the meantime, generalized LSRDRs may still function as narrow AI for specific problems and also as layers in AI.

Of course, if we want to compare capabilities, we should also compare NNs to LSRDRs at tasks such as evaluating the cryptographic security of block ciphers, solving NP-complete problems in the average case, etc.

As for the difficulty of this post, it seems like that is the result of the post being mathematical. But going through this kind of mathematics so that we obtain inherently interpretable AI should be the easier portion of AI interpretability. I would much rather communicate about the actual mathematics than about how difficult the mathematics is.

Joseph Van Name's Shortform

Joseph Van Name2mo10

In this post, we shall describe 3 related fitness functions with discrete domains where the process of maximizing these functions is pseudodeterministic in the sense that if we locally maximize the fitness function multiple times, then we typically attain the same local maximum; this appears to be an important aspect of AI safety. These fitness functions are my own. While these functions are far from deep neural networks, I think they are still related to AI safety since they are closely related to other fitness functions that are locally maximized pseudodeterministically that more closely resemble deep neural networks.

Let denote a finite dimensional algebra over the field of real numbers together with an adjoint operation $*$ (the operation $*$ is a linear involution with $(x y)^{*} = y^{*} x^{*}$ ). For example, $K$ could be the field of real numbers, complex numbers, quaternions, or a matrix ring over the reals, complex, or quaternions. We can extend the adjoint $*$ to the matrix ring $M_{r} (K)$ by setting $(x_{i, j})_{i, j}^{*} = (x_{j, i}^{*})_{i, j}$ .

Let $n$ be a natural number. If $A_{1}, \dots, A_{r} \in M_{n} (K), X_{1}, \dots, X_{r} \in M_{d} (K)$ , then define

$Γ (A_{1}, \dots, A_{r}; X_{1}, \dots, X_{r}) : M_{n, d} (K) \to M_{n, d} (K)$ by setting $Γ (A_{1}, \dots, A_{r}; X_{1}, \dots, X_{r}) (X) = A_{1} X X_{1}^{*} + \dots + A_{r} X X_{r}^{*}$ .

Suppose now that $1 \leq d < n$ . Then let $S_{d} \subseteq M_{n, n} (K)$ be the set of all $0, 1$ -diagonal matrices with $d$ many $1$ 's on the diagonal. We observe that each element in $S_{d}$ is an orthogonal projection. Define fitness functions $F_{d}, G_{d}, H_{d} : S_{d} \to R$ by setting

$F_{d} (P) = ρ (Γ (A_{1}, \dots, A_{r}; P A_{1} P, \dots, P A_{r} P))$ ,

$G_{d} (P) = ρ (Γ (P A_{1} P, \dots, P A_{r} P; P A_{1} P, \dots, P A_{r} P))$ , and

$H_{d} (P) = \frac{F_{d} (P)^{2}}{G_{d} (P)}$ . Here, $ρ$ denotes the spectral radius.

$F_{d} (P)$ is typically slightly larger than $G_{d} (P)$ , so these three fitness functions are closely related.

If $P, Q \in S_{d}$ , then we say that $Q$ is in the neighborhood of $P$ if $Q$ differs from $P$ by at most 2 entries. If $F$ is a fitness function with domain $S_{d}$ , then we say that $(P, F (P))$ is a local maximum of the function $F$ if $F (P) \geq F (Q)$ whenever $Q$ is in the neighborhood of $P$ .

The path from initialization to a local maximum $(P_{s}, F (P_{s}))$ for will be a sequence $(P_{0}, \dots, P_{s})$ where $P_{j}$ is always in the neighborhood of $P_{j - 1}$ and where $F (P_{j}) \geq F (P_{j - 1})$ for all $j$ and the length of the path will be $s$ and where $P_{0}$ is generated uniformly randomly.

Empirical observation: Suppose that $F \in {F_{d}, G_{d}, H_{d}}$ . If we compute a path from initialization to local maximum for $F$ , then such a path will typically have length less than $n$ . Furthermore, if we locally maximize $F$ multiple times, we will typically obtain the same local maximum each time. Moreover, if $P_{F}, P_{G}, P_{H}$ are the computed local maxima of $F_{d}, G_{d}, H_{d}$ respectively, then $P_{F}, P_{G}, P_{H}$ will either be identical or differ by relatively few diagonal entries.

I have not done the experiments yet, but one should be able to generalize the above empirical observation to matroids. Suppose that $M$ is a basis matroid with underlying set ${1, \dots, n}$ and where $| A | = d$ for each $A \in M$ . Then one should be able to make the same observation about the fitness functions $F_{d} |_{M}, G_{d} |_{M}, H_{d} |_{M}$ as well.

We observe that the problems of maximizing $F_{d}, G_{d}, H_{d}$ are all NP-complete problems since the clique problems can be reduced to special cases of maximizing $F_{d}, G_{d}, H_{d}$ . This means that the problems of maximizing $F_{d}, G_{d}, H_{d}$ can be sophisticated problems, but this also means that we should not expect it to be easy to find the global maxima for $F_{d}, G_{d}, H_{d}$ in some cases.

Joseph Van Name's Shortform

Joseph Van Name2mo10

This is a post about some of the machine learning algorithms that I have been doing experiments with. These machine learning models behave quite mathematically which seems to be very helpful for AI interpretability and AI safety.

Sequences of matrices generally cannot be approximated by sequences of Hermitian matrices.

Suppose that are $n \times n$ -complex matrices and $X_{1}, \dots, X_{r}$ are $d \times d$ -complex matrices. Then define a mapping $Γ (A_{1}, \dots, A_{r}; X_{1}, \dots, X_{r}) : M_{n, d} (C) \to M_{n, d} (C)$ by $Γ (A_{1}, \dots, A_{r}; X_{1}, \dots, X_{r}) (X) = A_{1} X X_{1}^{*} + \dots + A_{r} X X_{r}^{*}$ for all $X$ . Define

$Φ (A_{1}, \dots, A_{r}) = Γ (A_{1}, \dots, A_{r}; A_{1}, \dots, A_{r})$ . Define the $L_{2}$

-spectral radius by setting $ρ_{2} (A_{1}, \dots, A_{r}) = ρ (Φ (A_{1}, \dots, A_{r}))^{1 / 2}$ . Define the $L_{2}$ -spectral radius similarity between $(A_{1}, \dots, A_{r})$ and $(X_{1}, \dots, X_{r})$ by

$∥ (A_{1}, \dots, A_{r}) ≃ (X_{1}, \dots, X_{r}) ∥_{2}$

$= \frac{ρ (Γ (A_{1}, \dots, A_{r}; X_{1}, \dots, X_{r}))}{ρ_{2} (A_{1}, \dots, A_{r}) ρ_{2} (X_{1}, \dots, X_{r})}$ .

The $L_{2}$ -spectral radius similarity is always in the interval $[0, 1]$ . if $n = d$ , $A_{1}, \dots, A_{r}$ generates the algebra of $n \times n$ -complex matrices, and $X_{1}, \dots, X_{r}$ also generates the algebra of $n \times n$ -complex matrices, then $∥ (A_{1}, \dots, A_{r}) ≃ (X_{1}, \dots, X_{r}) ∥_{2} = 1$ if and only if there are $C, λ$ with $A_{j} = λ C X_{j} C^{- 1}$ for all $j$ .

Define $ρ_{2, d}^{H} (A_{1}, \dots, A_{r})$ to be the supremum of

$ρ_{2} (A_{1}, \dots, A_{r}) \cdot ∥ (A_{1}, \dots, A_{r}) ≃ (X_{1}, \dots, X_{r}) ∥_{2}$

where $X_{1}, \dots, X_{r}$ are $d \times d$ -Hermitian matrices.

One can get lower bounds for $ρ_{2, d}^{H} (A_{1}, \dots, A_{r})$ simply by locally maximizing $∥ (A_{1}, \dots, A_{r}) ≃ (X_{1}, \dots, X_{r}) ∥_{2}$ using gradient ascent, but if one locally maximizes this quantity twice, one typically gets the same fitness level.

Empirical observation/conjecture: If $(A_{1}, \dots, A_{r})$ are $n \times n$ -complex matrices, then $ρ_{2, n}^{H} (A_{1}, \dots, A_{r}) = ρ_{2, d}^{H} (A_{1}, \dots, A_{r})$ whenever $d \geq n$ .

The above observation means that sequences of $n \times n$ -matrices $(A_{1}, \dots, A_{r})$ are fundamentally non-Hermitian. In this case, we cannot get better models of $(A_{1}, \dots, A_{r})$ using Hermitian matrices larger than the matrices $(A_{1}, \dots, A_{r})$ themselves; I kind of want the behavior to be more complex instead of doing the same thing whenever $d \geq n$

, but the purpose of modeling $(A_{1}, \dots, A_{r})$ as Hermitian matrices is generally to use smaller matrices and not larger matrices.

This means that the function $ρ_{2, d}^{H}$ behaves mathematically.

Now, the model $(X_{1}, \dots, X_{r})$ is a linear model of $(A_{1}, \dots, A_{r})$ since the mapping $A_{j} \mapsto X_{j}$ is the restriction of a linear mapping, so such a linear model should be good for a limited number of tasks, but the mathematical behavior of the model $(X_{1}, \dots, X_{r})$ generalizes to multi-layered machine learning models.

Joseph Van Name's Shortform

Joseph Van Name2mo90

In this post, I will post some observations that I have made about the octonions that demonstrate that the machine learning algorithms that I have been looking at recently behave mathematically and such machine learning algorithms seem to be highly interpretable. The good behavior of these machine learning algorithms is in part due to the mathematical nature of the octonions and also the compatibility with the octonions and the machine learning algorithm. To be specific, one should think of the octonions as encoding a mixed unitary quantum channel that looks very close to the completely depolarizing channel, but my machine learning algorithms work well with those sorts of quantum channels and similar objects.

Suppose that is either the field of real numbers, complex numbers, or quaternions.

If $A_{1}, \dots, A_{r} \in M_{m} (K), B_{1}, \dots, B_{r} \in M_{n} (K)$ are matrices, then define an superoperator $Γ (A_{1}, \dots, A_{r}; B_{1}, \dots, B_{r}) : M_{m, n} (K) \to M_{m, n} (K)$

by setting $Γ (A_{1}, \dots, A_{r}; B_{1}, \dots, B_{r}) (X) = A_{1} X B_{1}^{*} + \dots + A_{r} X B_{r}^{*}$

(the domain and range of )and define $Φ (A_{1}, \dots, A_{r}) = Γ (A_{1}, \dots, A_{r}; A_{1}, \dots, A_{r})$ . Define the L_2-spectral radius similarity $∥ (A_{1}, \dots, A_{r}) ≃ (B_{1}, \dots, B_{r}) ∥_{2}$ by setting

$∥ (A_{1}, \dots, A_{r}) ≃ (B_{1}, \dots, B_{r}) ∥_{2}$

$= \frac{ρ (Γ (A_{1}, \dots, A_{r}; B_{1}, \dots, B_{r}))}{ρ (Φ (A_{1}, \dots, A_{r}))^{1 / 2} ρ (Φ (B_{1}, \dots, B_{r}))^{1 / 2}}$ where $ρ$ denotes the spectral radius.

Recall that the octonions are the unique (up-to-isomorphism) 8 dimensional real inner product space $V$ together with a bilinear binary operation $*$ such that $∥ x * y ∥ = ∥ x ∥ \cdot ∥ y ∥$ and $1 * x = x * 1 = x$ for all $x, y \in V$ .

Suppose that $e_{1}, \dots, e_{8}$ is an orthonormal basis for $V$ . Define operators $(A_{1}, \dots, A_{8})$ by setting $A_{i} v = e_{j} * v$ . Now, define operators $(B_{1}, \dots, B_{64})$ up to reordering by setting ${B_{1}, \dots, B_{64}} = {A_{i} \otimes A_{j} : i, j \in {1, \dots, 8}}$ .

Let $d$ be a positive integer. Then the goal is to find complex symmetric $d \times d$ -matrices $(X_{1}, \dots, X_{64})$ where $∥ (A_{1}, \dots, A_{64}) ≃ (X_{1}, \dots, X_{64}) ∥_{2}$ is locally maximized. We achieve this goal through gradient ascent optimization. Since we are using gradient ascent, I consider this to be a machine learning algorithm, but the function mapping $A_{j}$ to $X_{j}$ is a linear transformation, so we are training linear models here (we can generalize this fitness function to one where we train non-linear models though, but that takes a lot of work if we want the generalized fitness functions to still behave mathematically).

Experimental Observation: If $1 \leq d \leq 8$ , then we can easily find complex symmetric matrices $(X_{1}, \dots, X_{64})$ where $∥ (A_{1}, \dots, A_{64}) ≃ (X_{1}, \dots, X_{64}) ∥_{2}$ is locally maximized and where $∥ (A_{1}, \dots, A_{64}) ≃ (X_{1}, \dots, X_{64}) ∥_{2}^{2} = (2 d + 6) / 64 = (d + 3) / 32.$

If $7 \leq d \leq 16$ , then we can easily find complex symmetric matrices $(X_{1}, \dots, X_{64})$ where $∥ (A_{1}, \dots, A_{64}) ≃ (X_{1}, \dots, X_{64}) ∥_{2}$ is locally maximized and where $∥ (A_{1}, \dots, A_{64}) ≃ (X_{1}, \dots, X_{64}) ∥_{2}^{2} = (2 d + 4) / 64 = (d + 2) / 32.$

Joseph Van Name's Shortform

Joseph Van Name2mo10

Here are some observations about the kind of fitness functions that I have been running experiments on for AI interpretability. The phenomena that I state in this post are determined experimentally without a rigorous mathematical proof and they only occur some of the time.

Suppose that is a continuous fitness function. In an ideal universe, we would like for the function $F$ to have just one local maximum. If $F$ has just one local maximum, we say that $F$ is maximized pseudodeterministically (or simply pseudodeterministic). At the very least, we would like for there to be just one real number of the form $F (x)$ for local maximum $(x, F (x))$ . In this case, all local maxima will typically be related by some sort of symmetry. Pseudodeterministic fitness function seem to be quite interpretable to me. If there are many local maximum values and the local maximum value that we attain after training depends on things such as the initialization, then the local maximum will contain random/pseudorandom information independent of the training data, and the local maximum will be difficult to interpret. A fitness function with a single local maximum value behaves more mathematically than a fitness function with many local maximum values, and such mathematical behavior should help with interpretability; the only reason I have been able to interpret pseudodeterminisitic fitness functions before is that they behave mathematically and have a unique local maximum value.

Set $O = F^{- 1} [(- \infty, \infty)] = X ∖ F^{- 1} [{- \infty}]$ . If the set $O$ is disconnected (in a topological sense) and if $L$ behaves differently on each of the components of $L$ , then we have literally shattered the possibility of having a unique local maximum, but in this post, we shall explore a case where each component of $O$ still has a unique local maximum value.

Let $m_{0}, \dots, m_{n}$ be positive integers with $m_{0} = m_{n} = 1$ and where $m_{1} \geq 1, \dots, m_{n - 1} \geq 1$ . Let $r_{0}, \dots, r_{n - 1}$ be other natural numbers. The set $X$ is the collection of all tuples $A = (A_{i, j})_{i, j}$ where each $A_{i, j}$ is a real $m_{i + 1} \times m_{i}$ -matrix and where the indices range from $i \in {0, \dots, n - 1}, j \in {1, \dots, r_{i}}$ and where $(A_{i, j})_{j}$ is not identically zero for all $i$ .

The training data is a set $Σ$ that consists of input/label pairs $(u, v)$ where $v \in {- 1, 1}$ and where $u = (u_{0}, \dots, u_{n - 1})$ such that each $u_{i}$ is a subset of ${1, \dots, r_{i}}$ for all $i$ (i.e. $Σ$ is a binary classifier where $u$ is the encoded network input and $v$ is the label).

Define $W (u, A) = (\sum_{j \in u_{n - 1}} A_{n - 1, j}) \dots (\sum_{j \in u_{0}} A_{0, j})$ . Now, we define our fitness level by setting

$F (A) = \sum_{(u, v) \in Σ} log (| W (u, A) |) / | Σ | - \sum_{i} log (∥ \sum_{j} A_{i, j} A_{i, j}^{*} ∥_{p}) / 2$

$= E (log (| W (u, A) |)) - \sum_{i} log (∥ \sum_{j} A_{i, j} A_{i, j}^{*} ∥_{p}) / 2$ where the expected value is with respect to selecting an element $(u, v) \in Σ$ uniformly at random. Here, $∥ * ∥_{p}$ is a Schatten $p$ -norm which is just the $ℓ_{p}$ -norm of the singular values of the matrix. Observe that the fitness function $F$ only depends on the list $(u : (u, v) \in Σ)$ , so $F$ does not depend on the training data labels.

Observe that $O = X ∖ ⋃_{u \in Σ} {A \in X : W (u, A) = 0}$ which is a disconnected open set. Define a function $f : O \to {- 1, 1}^{Σ}$ by setting $f (A) = (W (u, A) / | W (u, A) |)_{(u, v) \in Σ}$ . Observe that if $x, y$ belong to the same component of $O$ , then $f (x) = f (y)$ .

While the fitness function $F$ has many local maximum values, the function $F$ seems to typically have at most one local maximum value per component. More specifically, for each $(α_{i})_{i \in Σ}$ , the set $f^{- 1} [{(α_{i})_{i \in Σ}}]$ seems to typically be a connected open set where $F$ has just one local maximum value (maybe the other local maxima are hard to find, but if thye are hard to find, they are irrelevant).

Let $Ω = f^{- 1} [{(v)_{(u, v) \in Σ}]$ . Then $Ω$ is a (possibly empty) open subset of $O$ , and there tends to be a unique (up-to-symmetry) $A_{0} \in Ω$ where $F (A_{0})$ is locally maximized. This unique $A_{0}$ is the machine learning model that we obtain when training on the data set $Σ$ . To obtain $A_{0}$ , we first perform an optimization that works well enough to get inside the open set $Ω$ . For example, to get inside $Ω$ , we could try to maximize the fitness function $\sum_{(u, v) \in Σ} arctan (v \cdot W (u, A))$ . We then maximize $F$ inside the open set $Ω$ to obtain our local maximum.

After training, we obtain a function $f$ defined by $f (u) = W (u, A_{0})$ . Observe that the function $f$ is a multi-linear function. The function $f$ is highly regularized, so if we want better performance, we should tone down the amount of regularization, but this can be done without compromising pseudodeterminism. The function $f$ has been trained so that $f (u) / | f (u) | = v$ for each $(u, v) \in Σ$ but also so that $| f (u) |$ is large compared to what we might expect whenever $(u, v) \in Σ$ . In other words, $f$ is helpful in determining whether $(u, v)$ belongs to $Σ$ or not since one can examine the magnitude and sign of $f (u)$ .

In order to maximize AI safety, I want to produce inherently interpretable AI algorithms that perform well on difficult tasks. Right now, the function $f$ (and other functions that I have designed) can do some machine learning tasks, but they are not ready to replace neural networks, but I have a few ideas about how to improve my AI algorithms performance without compromising pseudodeterminism. I do not believe that pseudodeterministic machine learning will increase AI risks too much because when designing these pseudodeterministic algorithms, we are trading some (but hopefully not too much) performance for increased interpretability, but this tradeoff is good for safety by increasing interpretability without increasing performance.

Joseph Van Name's Shortform

Joseph Van Name3mo30

This post gives an example of some calculations that I did using my own machine learning algorithm. These calculations work out nicely which indicates that the machine learning algorithm I am using is interpretable (and it is much more interpretable than any neural network would be). These calculations show that one can begin with old mathematical structures and produce new mathematical structures, and it seems feasible to completely automate this process to continue to produce more mathematical structures. The machine learning models that I use are linear, but it seems like we can get highly non-trivial results simply by iterating the procedure of obtaining new structures from old using machine learning.

I made a similar post to this one about 7 months ago, but I decided to revisit this experiment with more general algorithms and I have obtained experimental results which I think look nice.

To illustrate how this works, we start off with the octonions. The octonions consists of an 8-dimensional inner product space together with a bilinear operation $*$ and a unit $1 \in V$ where $1 * v = v * 1 = v$ for all $v \in V$ and where $∥ u * v ∥ = ∥ u ∥ \cdot ∥ v ∥$ for all $u, v \in V$ . The octonions are uniquely determined up to isomorphism from these properties. The operation $*$ is non-associative, but the $*$ is closely related to the quaternions and complex numbers. If we take a single element in $v \in V ∖ Span (1)$ , then ${1, v}$ generates a subalgebra of $(V, *)$ isomorphic to the field of complex numbers, and if $u, v \in V$ and ${1, u, v}$ are linearly independent, then ${1, u, v, u * v}$ spans a subalgebra of $V$ isomorphic to the division ring of quaternions. For this reason, one commonly thinks of the octonions as the best way to extend the division ring of quaternions to a larger algebraic structure in the same way that the quaternions extend the field of complex numbers. But since the octonions are non-associative, they cannot be used to construct matrices, so they are not as well-known as the quaternions (and the construction of the octonions is more complicated too)

Suppose now that ${e_{0}, e_{1}, \dots, e_{7}}$ is an orthonormal basis for the division ring of octonions with $e_{0} = 1$ . Then define matrices $A_{0}, \dots, A_{7} : V \to V$ by setting $A_{j} v = e_{j} * v$ for all $j$ . Our goal is to transform $(A_{0}, \dots, A_{7})$ into other tuples of matrices that satisfy similar properties.

If $(A_{1}, \dots, A_{r}), (B_{1}, \dots, B_{r})$ are matrices, then define the $L_{2}$

-spectral radius similarity between $(A_{1}, \dots, A_{r})$ and $(B_{1}, \dots, B_{r})$ as

$∥ (A_{1}, \dots, A_{r}) ≃ (B_{1}, \dots, B_{r}) ∥_{2} =$

$\frac{ρ (A_{1} \otimes_{1} + \dots + A_{r} \otimes_{r})}{ρ (A_{1} \otimes_{1} + \dots + A_{r} \otimes_{r})^{1 / 2} \cdot ρ (B_{1} \otimes_{1} + \dots + B_{r} \otimes_{r})^{1 / 2}}$

where $ρ$ denotes the spectral radius, $\otimes$ is the tensor product, and $¯ ¯¯¯ ¯ X$ is the complex conjugate of $X$ applied elementwise.

Let $d \in {1, \dots, 8}$ , and let $F_{d}, G_{d}, H_{d}$ denote the maximum value of the fitness level $8 \cdot ∥ (A_{0}, \dots, A_{7}) ≃ (X_{0}, \dots, X_{7}) ∥^{2}$ such that each $X_{j}$ is a complex $d \times d$ anti-symmetric matrix ( $X = - X^{T}$ ), a complex $d \times d$ symmetric matrix ( $X = X^{T}$ ), and a complex $d \times d$ -Hermitian matrix ( $X = X^{*}$ ) respectively.

The following calculations were obtained through gradient ascent, so I have no mathematical proof that the values obtained are actually correct.

$G_{1} = 2$ , $H_{1} = 1$

$G_{2} = 3$ , $H_{2} = 3$

$F_{3} = 1 + \sqrt{3}$ , $G_{3} = 3.5$ , $H_{3} = 1 + 2 \sqrt{2}$

$F_{4} = 4$ , $G_{4} = 4$ , $H_{4} = 1 + 3 \sqrt{2}$

$F_{5} = (5 + \sqrt{13}) / 2$ , $G_{5} = 4.5$ , $H_{5} \approx 5.27155841$

$F_{6} = 5$ , $G_{6} = 5$ , $H_{6} = 3 + 2 \sqrt{2}$

$F_{7} = 6$ , $G_{7} = 2 + 2 \sqrt{3} \approx 5.4641$ , $H_{7} = 1 + 2 \sqrt{7}$

$F_{8} = 7$ , $G_{8} = 6$ , $H_{8} = 7$

Observe that with at most one exception, all of these values $F_{d}, G_{d}, H_{d}$ are algebraic half integers. This indicates that the fitness function that we maximize to produce $F_{d}, G_{d}, H_{d}$ behaves mathematically and can be used to produce new tuples $(X_{1}, \dots, X_{r})$ from old ones $(A_{1}, \dots, A_{r})$ . Furthermore, an AI can determine whether something notable is going on with the new tuple $(X_{1}, \dots, X_{r})$ in several ways. For example, if $∥ (A_{1}, \dots, A_{r}) ≃ (X_{1}, \dots, X_{r}) ∥^{2}$ has low algebraic degree at the local maximum, then $(X_{1}, \dots, X_{r})$ is likely notable and likely behaves mathematically (and is probably quite interpretable too).

The good behavior of $F_{d}, G_{d}, H_{d}$ demonstrates that the octonions are compatible with the $L_{2}$ -spectral radius similarity. The operators $(A_{0}, \dots, A_{7})$ are all orthogonal, and one can take the tuple $(A_{0}, \dots, A_{7})$ as a mixed unitary quantum channel that is very similar to the completely depolarizing channel. The completely depolarizing channel completely mixes every quantum state while the mixture of orthogonal mappings $(A_{0}, \dots, A_{7})$ completely mixes every real state. The $L_{2}$ -spectral radius similarity works very well with the completely depolarizing channel, so one should expect for the $L_{2}$ -spectral radius similarity to also behave well with the octonions.

LESSWRONG
LW

Posts

Wikitag Contributions

Comments