It seems like you're telling us to do the thing that's reliably not helped your career. There's got to be some happy medium between getting things done, and letting people know what you're getting done.
This reads like a bit of a rant against bad management. But that's also universal in your stories, so you have to wonder if there's a principle in place that makes managers misunderstand how productive people really are. I think there is. It's really hard as a manager to tell who's reliably solving hard problems, because you often don't have time to understand all of the problems and see which are surprisingly hard and which are surprisingly easy.
I don't actually hear an argument here against strategizing. A good strategy includes when to stop strategizing and start going brr. Distance running is a remarkably brr-oriented activity. Service industries are in between, but clearly don't benefit as much from strategy as novel and complex projects.
"Just take a love-minus-hate activation and add that to the prompt activation" sounds like an absolute newb idea. I like that idea but if I were trying to find the expert in a room then that statement would've disqualified them.
Very true.
`transformer_lens` doesn't work on convolutional networks, and so I had to come up with my own ideas.

I think that many (not all) of your above examples boil down to optimizing for legibility rather than optimizing for goodness. People who hobnob instead of working quietly will get along with their bosses better than their quieter counterparts, yes. But a company of brown nosers will be less productive than a competitor company of quiet hardworking employees! So there's a cooperate/defect-dilemma here.
What that suggests, I think, is that you generally shouldn't immediately defect as hard as possible, with regard to optimizing for appearances. Play the prevailing local balance between optimizing-for-appearances and optimizing-for-outcomes that everyone around does, and try to not incrementally lower the level of org-wide cooperation. Try to eke that level of cooperation up, and set up incentives accordingly.
Good point. I am concerned that adding even a dash of legibility screws the work over completely and immediately and invisibly rather than incrementally. I may have over-analyzed my data so I should probably return to the field to collect more samples.
It doesn't seem correct to me that adding even a dash of legibility "screws the work over" in the general case. I do agree there are certainly situations where the right solution is illegible to all (except the person implementing it). But both in that case and in general, talking to and getting along with the boss both makes things more legible, and will tend to increase quality. I expect that in the cases of you working well and not getting rewarded much, spending a little time interacting with your boss would both improve your outcomes, and importantly, also make your output even better than it already was.
ReLU activation actually comes from real mathematics. People simply don't recognize neural networks with ReLU activation as mathematical, since ReLU belongs to a different, research-level area of mathematics that is usually not taught. Another reason that ReLU seems non-mathematical is that people seem to have a bias towards mathematical structures that resemble the field of real numbers, the integers, or at least rings of matrices. Mathematicians have much less of such a bias, but they still have a bias; I am a mathematician who has studied Laver tables, and nearly nobody has studied Laver tables recently, in part because they are non-commutative, non-associative, and arise from very large cardinals, yet are studied using computer calculations.
In mathematics, we want the algebraic structures not to be too simple or too complex, but we want them to work in the right way so that we can prove theorems about them. This means that the mathematical structures often have a moderate amount of complexity: rich enough to be interesting, but simple enough to reason about.
The logistic and tanh activation functions do not seem to give very much interesting mathematics when we combine these specific functions with linearity and when we compose these functions with themselves (by the universal approximation theorem, neural networks with tanh activation can approximate interesting functions, but the neural networks themselves are just approximations for what people truly find interesting). On the other hand, we can compose ReLU with linear functions and more copies of ReLU and we will always have piecewise linear functions, and mathematicians can prove theorems about piecewise linear functions. By contrast, if you use tanh activation, you get functions such as $\tanh(a\tanh(bx+c)+d)$ which are quite unappealing to mathematicians since you cannot prove theorems about them that well.
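A quick numerical illustration of this point (a sketch with arbitrary random weights, not any particular trained network): a composition of ReLU with affine maps is exactly piecewise linear, so its discrete second differences vanish away from finitely many kinks, while the tanh version is curved almost everywhere.

```python
import numpy as np

rng = np.random.default_rng(0)
W1, b1 = rng.normal(size=(8, 1)), rng.normal(size=8)
W2, b2 = rng.normal(size=(1, 8)), rng.normal(size=1)

def net(x, act):
    # one hidden layer: act(W1 x + b1), then an affine readout
    h = act(W1 * x + b1[:, None])
    return (W2 @ h + b2[:, None]).ravel()

xs = np.linspace(-3, 3, 2001)
for name, act in [("relu", lambda z: np.maximum(z, 0)), ("tanh", np.tanh)]:
    ys = net(xs, act)
    curved = int((np.abs(np.diff(ys, 2)) > 1e-8).sum())
    print(name, "grid points with nonzero curvature:", curved)
```

The ReLU network registers curvature only at the handful of kinks contributed by the hidden units; the tanh network registers it almost everywhere.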
Give $\mathbb{R}\cup\{\infty\}$ (or if you want you can make the domain $\mathbb{R}$) two binary operations $\oplus,\otimes$ where $x\oplus y=\min(x,y)$ and $x\otimes y=x+y$. Then the operations $\oplus,\otimes$ form a semiring. This means that $(x\oplus y)\oplus z=x\oplus(y\oplus z)$, $(x\otimes y)\otimes z=x\otimes(y\otimes z)$ (associativity), $x\oplus y=y\oplus x$, $x\otimes y=y\otimes x$ (commutativity), $x\otimes(y\oplus z)=(x\otimes y)\oplus(x\otimes z)$ (distributivity), etc. The (max) tropical semiring is the semiring $(\mathbb{R}\cup\{-\infty\},\max,+)$.
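These laws are easy to spot-check numerically (a minimal sketch using the min-plus convention above, on a few sample values):

```python
import itertools
import math

oplus = min                      # tropical addition
otimes = lambda x, y: x + y      # tropical multiplication

vals = [-2.0, 0.0, 1.5, 3.0, math.inf]  # inf is the additive identity for min
for x, y, z in itertools.product(vals, repeat=3):
    assert oplus(oplus(x, y), z) == oplus(x, oplus(y, z))               # assoc. of ⊕
    assert otimes(otimes(x, y), z) == otimes(x, otimes(y, z))           # assoc. of ⊗
    assert oplus(x, y) == oplus(y, x)                                   # commutativity
    assert otimes(x, oplus(y, z)) == oplus(otimes(x, y), otimes(x, z))  # distributivity
    assert oplus(x, math.inf) == x                                      # identity for ⊕
    assert otimes(x, 0.0) == x                                          # identity for ⊗
```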
We can define polynomials over tropical semirings.
A tropical polynomial function is a function of the form $p(x)=(a_1\otimes x^{\otimes n_1})\oplus\dots\oplus(a_k\otimes x^{\otimes n_k})=\min(a_1+n_1x,\dots,a_k+n_kx)$.
Define the tropical division operation $\oslash$ by $x\oslash y=x-y$.
A tropical rational function is a function of the form $p\oslash q$ where $p,q$ are tropical polynomial functions. Every tropical polynomial is a tropical rational function. The collection of all tropical rational functions forms a semi-field. One should think of tropical geometry as a kind of algebraic geometry where we replace the field of complex numbers with the tropical semiring.
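As a sketch (min-plus convention, with arbitrarily chosen coefficients): a tropical polynomial evaluates as a minimum of affine functions, a tropical rational function as a difference of two such minima, and ReLU itself is a tropical rational function.

```python
import math

def trop_poly(coeffs, x):
    # p(x) = (a_0 ⊗ x^{⊗0}) ⊕ ... ⊕ (a_k ⊗ x^{⊗k}) = min_i (a_i + i*x)
    return min(a + n * x for n, a in enumerate(coeffs))

def trop_rational(p_coeffs, q_coeffs, x):
    # (p ⊘ q)(x): tropical division is ordinary subtraction
    return trop_poly(p_coeffs, x) - trop_poly(q_coeffs, x)

# relu(x) = x ⊘ (x ⊕ 0) = x - min(x, 0) = max(x, 0)
relu = lambda x: trop_rational([math.inf, 0.0], [0.0, 0.0], x)

print(trop_poly([1.0, 0.5], 2.0))  # min(1.0, 0.5 + 2.0) = 1.0
```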
Theorem: Let $f:\mathbb{R}^n\rightarrow\mathbb{R}$ be a function. Then the following are equivalent:

1. $f$ is a tropical rational function.

2. $f$ is a continuous piecewise linear function with integer coefficients.

3. $f$ can be computed by a ReLU neural network whose weight matrices have integer entries.
I personally do not see the fact that the coefficients have to be integers as a significant restriction for neural networks. After all, we can just multiply using traditional multiplication any rational-valued ReLU neural network by an integer and make it integer valued.
Corollary: Let $f:\mathbb{R}^n\rightarrow\mathbb{R}$ be a function. Then the following are equivalent:

1. $f$ is a continuous piecewise linear function.

2. $f$ can be computed by a ReLU neural network.
One can learn more about how tropical geometry relates to neural networks in the paper Tropical Geometry of Deep Neural Networks by Liwen Zhang, Gregory Naitzat, and Lek-Heng Lim. The paper is not too hard.
Why don't we see deep neural networks that are built using tropical matrix multiplication?
There's no shortage of people wanting to apply ideas from their own discipline to deep learning. Physicists want to use field theory, logicians want to use formal logic... The fact that certain architectures can be described in the language of tropical math does not in itself guarantee that this leads to real insight. The paper you mention says that it does, and it has over 100 citations, so, maybe something will turn up.
By the way, a friend and I ran across your paper on generalized Laver tables a few years ago (while we were brainstorming a number theory problem). Could that be applied to deep learning? I don't see it for ordinary deep learning, but who knows, maybe consciousness is based on "nonabelions" and a Laver-like algebra describes their entanglement structure. :-)
The intersection of machine learning and generalizations of Laver tables seems to have a lot of potential, so your question about this is extremely interesting to me.
Machine learning models from generalized Laver tables?
I have not been able to find any machine learning models that are based on generalized Laver tables that are used for solving problems unrelated to Laver tables. I currently do not directly see any new kinds of deep learning models that one can produce from generalizations of Laver tables, but I would not be surprised if there were a non-standard machine learning model from Laver-like algebras that we could use for something like NLP.
But I have to be skeptical about applying generalized Laver tables to machine learning. Generalizations of Laver tables have intricate structure and complexity, but this complexity is ordered. Furthermore, there is a very large supply of Laver-like algebras (even with 2 generators) for potentially any situation. Laver-like algebras are also very easy to compute. While backtracking may potentially take an exponential amount of time, when I have computed Laver-like algebras, the backtracking algorithm seems to always terminate quickly unless it is producing a large quantity of Laver-like algebras or larger Laver-like algebras. But with all this being said, Laver-like algebras seem to lack the steerability that we need to apply these algebras to solving real-world problems. For example, try interpreting (in a model-theoretic sense) modular arithmetic modulo 13 in a Laver-like algebra. That is quite hard to do because Laver-like algebras are not built for that sort of thing.
Here are a couple of avenues that I see as most promising for producing new machine learning algorithms from Laver-like algebras and other structures.
Let $X$ be a set, and let $*$ be a binary operation on $X$ that satisfies the self-distributivity identity $x*(y*z)=(x*y)*(x*z)$. Define the right powers $x^{[n]}$ for $n\geq 1$ inductively by setting $x^{[1]}=x$ and $x^{[n+1]}=x^{[n]}*x$. We say that $(X,*,1)$ is a reduced nilpotent self-distributive algebra if it satisfies the identities $x*1=1$ and $1*x=x$ for all $x\in X$, and if for all $x\in X$ there is an $n$ with $x^{[n]}=1$. A reduced Laver-like algebra is a reduced nilpotent self-distributive algebra $(X,*,1)$ where, whenever $x_n\in X$ for each $n$, we have $x_0*\dots*x_n=1$ for some $n$. Here we make the convention to put the implied parentheses on the left so that $x_0*\dots*x_n=(\dots((x_0*x_1)*x_2)*\dots)*x_n$.
Reduced nilpotent self-distributive algebras have most of the things that one would expect. They are often equipped with a composition operation. They have a notion of a critical point, and if our reduced nilpotent self-distributive algebra is endowed with a composition operation, the set of all critical points in the algebra forms a Heyting algebra. If $(X,*)$ is a self-distributive algebra, then we can define $x^{[\infty]}*y=\lim_{n\to\infty}x^{[n]}*y$ in the discrete topology (this limit always exists for nilpotent self-distributive algebras). We define $x\preceq y$ precisely when $x^{[\infty]}*y=1$, and $x\simeq y$ precisely when $x\preceq y$ and $y\preceq x$. We can define $\mathrm{crit}(x)$ to be the equivalence class of $x$ with respect to $\simeq$, and $\mathrm{crit}[X]=\{\mathrm{crit}(x):x\in X\}$. The operations on $\mathrm{crit}[X]$ are $\mathrm{crit}(x)\rightarrow\mathrm{crit}(y)=\mathrm{crit}(x*y)$ and $\mathrm{crit}(x)\wedge\mathrm{crit}(y)=\mathrm{crit}(x\circ y)$ whenever we have our composition operation $\circ$. One can use computer calculations to add another critical point to a reduced nilpotent self-distributive algebra and obtain a new reduced nilpotent self-distributive algebra with the same number of generators but with another critical point on top (and this new self-distributive algebra will be subdirectly irreducible). Therefore, by taking subdirect products and adding new critical points, we have an endless supply of reduced nilpotent self-distributive algebras with only 2 generators. I also know how to expand reduced nilpotent self-distributive algebras horizontally, in the sense that given a finite reduced nilpotent self-distributive algebra $X$ that is not a Laver-like algebra, we can obtain another finite reduced nilpotent self-distributive algebra $Y$ where $X$ and $Y$ are both generated by the same number of elements and both have the same implication algebra of critical points, but where there is a surjective homomorphism from $Y$ onto $X$. The point is that we have techniques for producing new nilpotent self-distributive algebras from old ones, and for going deep into those new nilpotent self-distributive algebras.
Since reduced nilpotent self-distributive algebras are closed under finite subdirect products, subalgebras, and quotients, and since we have techniques for producing many nilpotent self-distributive algebras, perhaps one can make an ML model from these algebraic structures. On the other hand, there seems to be less of a connection between reduced nilpotent self-distributive algebras and large cardinals and non-commutative polynomials than with Laver-like algebras, so perhaps it is sufficient to stick to Laver-like algebras as a platform for constructing ML models.
From Laver-like algebras, we can also produce non-commutative polynomials. If $X$ is a Laver-like algebra and $A$ is a generating set for $X$, then for each non-maximal critical point $\alpha\in\mathrm{crit}[X]$, we can define a non-commutative polynomial $p_\alpha$ to be the sum of all non-commutative monomials of the form $x_{a_1}\cdots x_{a_n}$ where $a_1,\dots,a_n\in A$ and $\mathrm{crit}(a_1*\dots*a_n)\geq\alpha$ but where $\mathrm{crit}(a_1*\dots*a_m)<\alpha$ for $m<n$. These non-commutative polynomials capture all the information behind the Laver-like algebras, since one can reconstruct the entire Laver-like algebra up to critical equivalence from these non-commutative polynomials. These non-commutative polynomials don't work as well for nilpotent self-distributive algebras; for nilpotent self-distributive algebras, they would instead be rational expressions, and one would not be able to recover the entire nilpotent self-distributive algebra from these rational expressions.
I have used these non-commutative polynomials to construct an infinite product formula in which the polynomials are obtained from different non-trivial rank-into-rank embeddings $j:V_\lambda\rightarrow V_\lambda$. This means that the non-commutative ring operations in the rings of non-commutative polynomials are meaningful for Laver-like algebras.
If I wanted to interpret and make sense of a Laver-like algebra, I would use these non-commutative polynomials. Since non-commutative polynomials are closely related to strings (for NLP) and since the variables in these non-commutative polynomials can be substituted with matrices, I would not be surprised if one can turn Laver-like algebras into a full ML model using these non-commutative polynomials in order to solve a problem unrelated to Laver-like algebras.
Perhaps one can also use strong large cardinal hypotheses and logic to produce ML models from Laver-like algebras. Since the existence of a non-trivial rank-into-rank embedding is among the strongest of all large cardinal hypotheses, and since one can produce useful theorems about natural-looking finite algebraic structures from these large cardinal hypotheses, perhaps the interplay between the logic using large cardinal hypotheses and finite algebraic structures may be able to produce ML models. For example, one can automate the process of using elementarity to prove the existence of finite algebraic structures. After one has proven the existence of these structures using large cardinal hypotheses, one can do a search using backtracking in order to actually find candidates for these algebraic structures (or, if the backtracking algorithm is exhaustive and turns up nothing, then we have a contradiction in the large cardinal hierarchy or a programming error; mathematicians would need to expend a lot of effort to find an inconsistency in the large cardinal hierarchy this way, and I am confident that they won't find such an inconsistency but will instead find plenty of near misses). We can automate the process of producing examples of Laver-like algebras that satisfy conditions that were first established using large cardinal hypotheses, and we can perhaps completely describe these Laver-like algebras (up to critical equivalence) using our exhaustive backtracking process. This approach can be used to produce a lot of data about rank-into-rank embeddings, but I do not see a clear path that allows us to apply this data outside of set theory and Laver-like algebras.
Using deep learning to investigate generalized Laver tables
We can certainly apply existing deep learning and machine learning techniques to classical and multigenic Laver tables.
The $n$-th classical Laver table is the unique self-distributive algebraic structure $A_n=(\{1,\dots,2^n\},*_n)$ where $x*_n1=x+1\bmod 2^n$ whenever $x\in\{1,\dots,2^n\}$. The classical Laver tables are up-to-isomorphism the only nilpotent self-distributive algebras generated by a single element.
Problem: Compute the $n$-th classical Laver table for $n$ as large as we can achieve.
Solution: Randall Dougherty in his 1995 paper Critical points in an algebra of elementary embeddings II gave an algorithm for computing the 48th classical Laver table, and with machine learning I have personally gotten somewhat further than that (I think I have a couple of errors, but those are not too significant), with part of the next computation done as well.
Dougherty has shown that one can easily recover the entire function $*_n$ from a restricted portion of the table together with a threshold function: for each $x$ there is a number called the threshold of $*_n$ at $x$, past which the row of $x$ behaves periodically, and the classical Laver table $A_n$ can be easily recovered from the restricted table and the threshold function. To go beyond Dougherty's calculation of $A_{48}$, we exploit a couple of observations about the threshold function:

i: in most cases, the threshold takes a default value that is easy to predict, and

ii: in most cases, the exceptional values have a simple special form.

One can then easily compute the full table from the restricted table and this threshold data by using Dougherty's algorithm. Now the only thing that we need to do is compute the threshold data in the first place. It turns out that computing the thresholds is quite easy except at arguments of certain exceptional forms. At those exceptional arguments, we start from an initial candidate and then repeatedly modify it until we can no longer find any contradiction. Computing the thresholds essentially amounts to finding new elements in a certain set of bitstrings, and we find new elements of that set using some form of machine learning.
In my calculation of the later tables, I used operations like bitwise AND, OR, and the bitwise majority operation to combine old elements of the set to find new elements, and I also employed other techniques in a similar spirit. I did not use any neural networks, but using neural networks seems to be a reasonable strategy, provided we use the right kinds of neural networks.
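To illustrate the flavor of that search (a toy sketch — `is_valid` here is a stand-in predicate, not the real membership test, which is specific to the Laver table computation):

```python
def bitwise_candidates(known):
    """Propose new bitmasks by combining known ones with AND, OR, and majority."""
    known = list(known)
    out = set()
    for a in known:
        for b in known:
            out.add(a & b)
            out.add(a | b)
            for c in known:
                out.add((a & b) | (a & c) | (b & c))  # bitwise majority vote
    return out - set(known)

# toy stand-in for the real membership test: accept masks with <= 2 bits set
is_valid = lambda m: bin(m).count("1") <= 2
seed = {0b1100, 0b0110, 0b0101}
new = {m for m in bitwise_candidates(seed) if is_valid(m)}
print(sorted(new))  # → [4]
```

In the real computation, accepted candidates would be fed back into the pool and the combine-and-filter loop repeated until no new elements turn up.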
My idea is to use something similar to a CNN where the linear layers are not as general as possible, but are instead structured in a way that reduces the number of weights and is compatible with the structure in the data. The elements of the set are bitstrings in $\{0,1\}^{2^k}$, and $\mathbb{R}^{2^k}\cong(\mathbb{R}^2)^{\otimes k}$ has a natural tensor product structure, so one can consider these elements to be vectors in $(\mathbb{R}^2)^{\otimes k}$, and a tuple of such elements as a point in a product of such tensor powers. This means that the linear layers from $(\mathbb{R}^2)^{\otimes k}$ to $(\mathbb{R}^2)^{\otimes k}$ should be Kronecker products or Kronecker sums (it seems like Kronecker sums with residual layers would work better than Kronecker products). Recall that the Kronecker product and Kronecker sum of matrices $A$ and $B$ are defined by setting $(A\otimes B)(u\otimes v)=(Au)\otimes(Bv)$ and $A\oplus B=(A\otimes I)+(I\otimes B)$ respectively. Perhaps neural networks can be used to generate new elements of the set. These generative neural networks do not have to be too large, nor do they have to be too accurate. A neural network that is 50 percent accurate will be better than one that is 90 percent accurate but takes 10 times as long to compute and is much harder to retrain once most of the elements have already been obtained and the network has trouble finding new ones. I also favor smaller neural networks because one can already get quite far without using complicated techniques in the first place. I still need to figure out the rest of the details for the generative neural network that finds new elements, since I want it to be simple and cheap to compute so that it is competitive with things like bitwise operations.
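A sketch of what such a structured layer could look like (the shapes and the residual wiring here are my guesses, not a worked-out design): a Kronecker-sum linear map on $\mathbb{R}^4\cong\mathbb{R}^2\otimes\mathbb{R}^2$ needs only two small factor matrices instead of a full $4\times 4$ weight matrix.

```python
import numpy as np

def kron_sum(A, B):
    # Kronecker sum: A ⊕ B = (A ⊗ I) + (I ⊗ B)
    return np.kron(A, np.eye(B.shape[0])) + np.kron(np.eye(A.shape[0]), B)

rng = np.random.default_rng(1)
A, B = rng.normal(size=(2, 2)), rng.normal(size=(2, 2))
W = kron_sum(A, B)        # a 4x4 map built from two 2x2 factors (8 parameters)

x = rng.normal(size=4)
y = W @ x + x             # with the residual connection suggested above
```

The parameter saving grows quickly: on $(\mathbb{R}^2)^{\otimes k}$, a Kronecker-sum layer needs $4k$ weights instead of $4^k$.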
Problem: Translate Laver-like algebras into data (like sequences of vectors or matrices) in a form that machine learning models can work with.
Solution: The non-commutative polynomials are already in a form that is easier for machine learning algorithms to use. I was able to apply an algorithm that I originally developed for analyzing block ciphers in order to turn the non-commutative polynomials into unique sequences of real matrices (these real matrices do not seem to depend on the initialization since the gradient ascent always seems to converge to the same local maximum). One can then feed this sequence of real matrices into a transformer to solve whatever problem one wants to solve about Laver-like algebras. One can also represent a Laver-like algebra as a 2-dimensional image and then pass that image through a CNN to learn about the Laver-like algebra. This technique may only be used for Laver-like algebras that are small enough for CNNs to fully see, so I do not recommend CNNs for Laver-like algebras, but this is at least a proof-of-concept.
Problem: Estimate the number of small enough Laver-like algebras (up-to-critical equivalence) that satisfy certain properties.
Solution: The set of all multigenic Laver tables over an alphabet forms a rooted tree. We can therefore apply the technique for estimating the number of nodes in a rooted tree described in the 1975 paper Estimating the Efficiency of Backtrack Programs by Donald Knuth. This estimator is unbiased (the mean of the estimated value is actually the number of nodes in the tree), but it will have a very high variance for estimating the number of multigenic Laver tables. In order to reduce the variance for this estimator, we need to assign better probabilities when finding a random path from the root to the leaves, and we should be able to use transformers to assign these probabilities.
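Knuth's estimator is simple to state in code: walk one random root-to-leaf path, multiplying the branching factors of the choices made; the accumulated weights sum to an unbiased estimate of the number of nodes. A generic sketch (the tree here is a toy complete binary tree, not a multigenic Laver table tree):

```python
import random

def knuth_estimate(children, root, rng):
    """One sample of Knuth's unbiased estimator for the number of nodes in a tree."""
    estimate, weight, node = 0.0, 1.0, root
    while True:
        estimate += weight          # this node is counted with the path's weight
        kids = children(node)
        if not kids:
            return estimate
        weight *= len(kids)         # uniform choice => multiply by branching factor
        node = rng.choice(kids)

# toy tree: complete binary tree of depth 3, nodes encoded as (depth, index)
def children(node):
    depth, i = node
    return [] if depth == 3 else [(depth + 1, 2 * i), (depth + 1, 2 * i + 1)]

rng = random.Random(0)
samples = [knuth_estimate(children, (0, 0), rng) for _ in range(2000)]
print(sum(samples) / len(samples))  # → 15.0 (the true node count, 2**4 - 1)
```

On a perfectly regular tree every path returns the exact count, so the variance is zero; on a lopsided tree like the multigenic Laver table tree the variance is large, which is exactly what better path probabilities (e.g. transformer-assigned ones) would reduce.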
I am pretty sure that there are plenty of other ways to use machine learning to investigate Laver-like algebras.
Now, nilpotent self-distributive algebras may be applicable in cryptography (they seem like suitable platforms for the Kalka-Teicher key exchange but nobody has researched this). Perhaps a transformer that learns about Laver-like algebras would do better on some unrelated task than a transformer that was never trained using Laver-like algebras. But I do not see any other practical applications of nilpotent Laver-like algebras at the moment. This means that Laver-like algebras are currently a relatively safe problem for investigating machine learning algorithms, so Laver-like algebras are perhaps applicable to AI safety since I currently do not see any way of misaligning an AI for solving a problem related to Laver-like algebras.
I was thinking of constructing a deep neural network by interlacing ordinary linear layers with tropical layers $x\mapsto(A\otimes x)\oplus b$, where $\otimes,\oplus$ refer to (max-plus) tropical matrix multiplication and addition. One can think of this as replacing the ReLU activation with the more complicated expression $((A\otimes x)\oplus b)_i=\max(b_i,\max_j(a_{i,j}+x_j))$. Someone has probably thought about this already, and they may have already run experiments using this approach.
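Concretely, a max-plus layer looks like this (a sketch; how to initialize the $-\infty$ entries and whether to train the tropical bias are open design choices). ReLU is recovered as the special case where $A$ is 0 on the diagonal, $-\infty$ elsewhere, and $b=0$:

```python
import numpy as np

def tropical_matmul(A, x):
    # (A ⊗ x)_i = max_j (A[i, j] + x[j])  (max-plus matrix-vector product)
    return np.max(A + x[None, :], axis=1)

def tropical_layer(A, b, x):
    # x ↦ (A ⊗ x) ⊕ b, the max-plus analogue of an affine layer
    return np.maximum(tropical_matmul(A, x), b)

# ReLU as a special case: A = 0 on the diagonal, -inf elsewhere, bias b = 0
n = 4
A = np.full((n, n), -np.inf)
np.fill_diagonal(A, 0.0)
x = np.array([-2.0, -0.5, 0.0, 3.0])
print(tropical_layer(A, np.zeros(n), x))  # → [0. 0. 0. 3.]
```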
One problem with this approach is that for an input $x$, there will typically be a unique $j$ where $((A\otimes x)\oplus b)_i=a_{i,j}+x_j$, or else $((A\otimes x)\oplus b)_i=b_i$. This means that only one of the summands will contribute to each output coordinate. Once trained, this is not a bad thing, because it simply means that the tropical operation is sparse, and sparse matrices are good for saving space. But during training we may want to pump up the non-maximal summands so that several of them stay close to the maximum and receive gradient, and at the end of training we can allow that number to fall again so that we can reduce the number of weights in the tropical portion of the matrix.
Another possible issue with tropical matrices is that they may reduce the dimension of the data. Consider the tropical layer $x\mapsto A\otimes x$ where $A$ is an $m\times n$-matrix. Then for almost all $x$, there will be a selection function $j$ with $(A\otimes x)_i=a_{i,j(i)}+x_{j(i)}$, so the Jacobian $J(x)$ is a 0-1 matrix with a single 1 in each row. If the image of $j$ is too small, then the tropical layer may throw away too much information from the vector $x$.
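That selection pattern is easy to inspect numerically (a sketch with random values): at a generic input, each output coordinate of $x\mapsto A\otimes x$ depends only on its argmax column, so counting the distinct selected columns bounds the local rank of the layer.

```python
import numpy as np

def tropical_jacobian(A, x):
    # at a generic x, output i depends only on the argmax column sel[i]
    sel = np.argmax(A + x[None, :], axis=1)
    J = np.zeros_like(A)
    J[np.arange(A.shape[0]), sel] = 1.0
    return J, sel

rng = np.random.default_rng(2)
A = rng.normal(size=(6, 4))
x = rng.normal(size=4)
J, sel = tropical_jacobian(A, x)
local_rank = len(set(sel.tolist()))  # at most 4, and possibly much smaller
```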
One may also need to apply a couple of other tricks to make sure that the tropical layers work right. For example, one can start the training just by using ReLU activation, and then, after everything is going right, replace the ReLU activation with the more general tropical layers. By starting off with ReLU, we can ensure that the Jacobians retain enough rank so that the tropical layers do not destroy too much information. Further training with tropical layers can then only improve the network.
Any ReLU MLP can be turned into a composition of ordinary linear layers with tropical linear layers. Is there any reason why this would not work? I think I should run a couple of experiments with tropical matrix operations in place of ReLU to see if it works, and what should be done to optimize neural networks formed by interlacing tropical matrix operations with ordinary matrix operations. In an MLP, most of the parameters are the entries of matrices that linearly transform one vector into another, but for some reason we do not load the activation layer with parameters.
I do not think that I have much more knowledge of tropical geometry than the average mathematician, so I do not think I am too biased in favor of tropical geometry.
In my corporate jobs, simplicity was prized due to the high informational load. In fact, many colleagues appeared to absorb only sharp soundbites.
On the other hand, even if you had the answer, implementing it without 'aligning stakeholders' first would be deadly.
Haha, funny post because it's so relatable.
I think of operating productively without drama as playing the long game. It's a lot easier to control critical factors without other people stepping in because they think something's wrong. Eventually, when people see the results, the attention you missed out on during the process pays off.
I think strategizing over how to present the end result so that it's valued is an important key to reaping the benefits of productivity.
Like it. Seems like another way of saying that sometimes what you really need is more dakka. Tagged the post as such to reflect that.
The grumpy vs talkative example and the Alice vs Bob example remind me of the knowledgeability vs competence debates that I used to have before and after conducting interviews. When I first entered the corporate world I thought knowledge was paramount; I learned very quickly that likeability and team synergy are valued much more. The best line workers who couldn't break into the management team even after years of hard work often wondered what they were doing wrong: their metrics and productivity were consistently the highest, and they got along with everyone on their teams. Why weren't they advancing? My advice to them was to focus less on productivity and more on impressing the boss.
Corporate real estate is what I call it when I want to sound fancy. Really, it was a call center for a relocation company which was a subsidiary of a large real estate company.
Our department was like a dispatch service: we took calls from customers (of companies we had contracts with) and, after a short exposition-heavy conversation, we'd refer them to the real estate firms that were under the parent company's umbrella.
A real estate agent would be automatically assigned and receive our referral. It was free and if they closed with our agent they’d get a kickback (from the referral fee that we took out of the agent’s commission).
I was a supervisor and quit 2 years ago but recently learned that the department was downsized and merged with another because I think they realized 10 people could do the work they had ~60 people doing: 1 director, 3 managers, 7 supervisors, ~40 line workers and ~5 administrative support workers (these last 2 numbers would often fluctuate).
ReLU activation is the stupidest ML idea I've ever heard; everyone knows sigmoid um somehow feels optimal you know it is a real function from like real math. (ReLU only survived because it got a ridiculous acronym word thing and sounds complicated so you feel smart.)
No, ReLU is great, because it induces semantically meaningful sparseness (for the same geometric reason which causes L1-regularization to induce sparseness)!
It's a nice compromise between the original perceptron stepfunction (which is incompatible with gradient methods) and the sigmoids which have tons of problems (saturate unpleasantly on the ends and don't want to move from there).
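Both points are easy to see numerically (a quick sketch): a saturated sigmoid unit passes back almost no gradient, while ReLU zeroes out roughly half the units on centered inputs — the sparseness mentioned above — and passes gradient 1 through the rest.

```python
import numpy as np

sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))
sigmoid_grad = lambda z: sigmoid(z) * (1.0 - sigmoid(z))  # <= 0.25, ~0 when saturated
relu_grad = lambda z: (z > 0).astype(float)               # exactly 0 or 1

print(sigmoid_grad(np.array([0.0, 5.0, 10.0])))  # gradient dies in the tails

rng = np.random.default_rng(0)
z = rng.normal(size=10_000)
sparsity = float((np.maximum(z, 0) == 0).mean())  # ~0.5 on centered inputs
```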
What's dumb is that instead of discovering the goodness of ReLU in the early 1970s (the natural timeline, given that ReLU had been introduced in the late 1960s and, in any case, is very natural, being the integral of the step function), people only discovered the sparseness-inducing properties of ReLU in 2000, published that in Nature of all places, and it was still ignored completely for another decade. Only after people published 3 papers of a more applied flavor in 2009-2011 was it adopted, and by 2015 it had overtaken sigmoids as the most popular activation function in use, because it worked so much better. (See https://en.wikipedia.org/wiki/Rectifier_(neural_networks) for references.)
It's quite likely that without ReLU AlexNet would not be able to improve the SOTA as spectacularly as it did, triggering the "first deep learning revolution".
That being said, it is better to use them in pairs (relu(x), relu(-x)); this way you always get signal (e.g. TensorFlow has a crelu function which is exactly this pair of relu's).
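In code, the pairing is just a concatenation (a minimal sketch; `tf.nn.crelu` does the equivalent thing in TensorFlow):

```python
import numpy as np

def crelu(x, axis=-1):
    """Concatenated ReLU: (relu(x), relu(-x)) along the given axis."""
    return np.concatenate([np.maximum(x, 0), np.maximum(-x, 0)], axis=axis)

x = np.array([-1.5, 0.0, 2.0])
y = crelu(x)  # [0., 0., 2., 1.5, 0., 0.]
```

Note the pair is lossless: x can always be recovered as relu(x) - relu(-x), so no signal is destroyed.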
Of course ReLU is great!! I was trying to say that if I were a 2009 ANN researcher (unaware of prior ReLU uses, like most people probably were at the time) and someone (who had not otherwise demonstrated expertise) came in and asked why we use this particular woosh instead of a bent line or something, then I would've thoroughly explained the thought out of them. It's possible that I would've realized how it works, but very unlikely IMO. But a dumbworker would be more likely to say "Go do it. Now. Go. Do it now. Leave. Do it." as I see it.
Is the 12th virtue incompatible with everything else?
Working smarter is about prioritizing, planning, allocating time & energy, strategizing, and results. See eg Humans are not automatically strategic
Working dumber is going brrr
Costs
All my good work was bad for my career and all my bad work was good for my career. Five different jobs I had with five different bosses:
Yes the most important thing is often obvious and doing it makes you look simple unless you smartify that simplicity somehow. This ties into a broader challenge:
Extraneous complexity attractors
A simple problem-framing with a simple solution has some emotional and social disadvantages against a complex problem with a detailed, piecemeal solution (familiar things):
This all makes extraneous complexity a strong attractor — not only in discussions and papers and posts but also one's private attention and ruminations. This can be overcome by going brr. Strategizing makes it worse.
Conclusion
Top performers dull as hell? I love listening to interviews but damn I cannot stand athlete interviews they sound practically empty to me. Most CEO interviews too. Made it through an Adam Mosseri (Instagram CEO) interview and it was very boring. Honestly he sounded stupid. Musk was an exception but in exchange now he's fallen a bit in the twitter hole. (More boring CEOs were immune to this.)
Most of my life when I'm performing better I'm ignored/boring and when I'm attended-to it's because I'm shooting the shit or whatever. I mean you gotta shoot the shit sometimes. Seems that strategizing is a good strategy sometimes but rarely.
I make no claim to originality. This might even be obvious to everyone else but me. Couldn't find an old quote about doing vs understanding but would like to find one.
I could maybe make this argument stronger by calling out particularly important and bad cases of smart-not-dumb research. Let me know in a comment if this would be helpful.
Anyway go forth and work dumbly go brrr