I might be misunderstanding, but it looks to me like your proposed extension is essentially just the Elo model with some degrees of freedom that don't yet appear to matter?
The dot product has the property that <theta_A-theta_B,w> = <theta_A,w> - <theta_B,w>, so the only thing that matters is the <theta_P,w> for each player P, which is just a single scalar. So we are on a one-dimensional scale again where predictions are based on taking a sigmoid of the difference between a single scalar associated with each player.
As far as I can tell, the way that such a model could still be a nontrivial extension of Elo would be if you posited w could vary between games, whether randomly from some distribution or whether via additional parameters associated to players that influence what w is in the games they are involved in, or other things like that. But it seems you would need something like that, or else some source of nonlinearity, because if w is constant then every dimension orthogonal to that fixed w can never have any effect on predictions by the model.
I assume you're familiar with the case of the parallel postulate in classical geometry as being independent of other axioms? Where that independence corresponds with the existence of spherical/hyperbolic geometries (i.e. actual models in which the axiom is false) versus normal flat Euclidean geometry (i.e. actual models in which it is true).
To me, this is a clear example of there being no such thing as an "objective" truth about the the validity of the parallel postulate - you are entirely free to assume either it or incompatible alternatives. You end up with equally valid theories, it's just those theories are applicable to different models, and those models are each useful in different situations, so the only thing it comes down to is which models you happen to be wanting to use or explore or prove things about on a given day.
Similarly for the huge variety of different algebraic or topological structures (groups, ordered fields, manifolds, etc) - it is extremely common to have statements that are independent of the axioms, e.g. in a ring it is independent of the axioms whether multiplication is commutative or not. And both choices are valid. We have commutative rings, and we have noncommutative rings, and both are self-consistent mathematical structures that one might wish to study.
Loosely analogous to how one can write a compiler/interpreter for a programming language within other programming languages, some theories can easily simulate other theories. Set theories are particularly good and convenient for simulating other theories, but one can also simulate set theories within other seemingly more "primitive" theories (e.g. simulating it in theories of basic arithmetic via Godel numbering). This might be analogous to e.g. someone writing a C compiler in Brainfuck. Just like how it's meaningless to talk about whether a programming language or a given sub-version or feature extension of a programming language is more "objectively true" than another, there are many who take the position that the same holds for different set theories.
When you say you're "leaning towards a view that maintains objective mathematical truth" with respect to certain axioms, is there some fundamental principle by which you're discriminating the axioms that you want to assign objective truth from axioms like the parallel postulate or the commutativity of rings, which obviously have no objective truth? Or do you think that even in these latter cases there is still an objective truth?
This thread analyzes what is going on under the hood with the chess transformer. It is a stronger player than the Stockfish version it was distilling, at the cost of more compute but only by a fixed multiplier, it remains O(1).
I found this claim suspect because this basically is not a thing that happens in board games. In complex strategy board games like Chess, practical amounts of search on top of a good prior policy and/or eval function (which Stockfish has), almost always outperforms any pure forward pass policy model that doesn't do explicit search, even when that pure policy model is quite large and extensively trained. With any reasonable settings, it's very unlikely that the distillation of Stockfish into a pure policy model produces a better player than Stockfish.
I skimmed the paper (https://arxiv.org/pdf/2402.04494), and had trouble finding such a claim, and indeed it seems the original poster of that thread later retracted that claim as due to their own mistake in interpreting the data table of the paper. The post where they acknowledge the mistake is much less prominent than the original post, link here: https://x.com/sytelus/status/1848239379753717874 . The chess transformer remains quite a bit weaker than the Stockfish it tries to predict/imitate.
Do you think a vision transformer trained on 2-dimensional images of the board state would also come up with a bag of heuristics or would it naturally learn a translation invariant algorithm taking advantage of the uniform way the architecture could process the board? (Let's say that there are 64 1 pixel by 1 pixel patches, perfectly aligned with the 64 board locations of an 8x8 pixel image, to make it maximally "easy" for both the model and for interpretability work.)
And would it differ based on whether one used an explicit 2D positional embedding, or a learned embedding, or a 1D positional embedding that ordered the patches from top to bottom, right to left?
I know that of course giving a vision transformer the actual board state like this shortcircuits the cool part where OthelloGPT tries to learn its own representation of the board. But I'm wondering if even in this supposedly easy setting it still would end up imperfect with a tiny error rate and a bag-of-heuristics-like way of computing legal moves.
And brainstorming a bit here: a slightly more interesting setting that might not shortcircuit the cool part would be if the input to the vision transformer was a 3D "video" of the moves on the board. E.g. the input[t][x][y] is 1 if on turn t, a move was made at (x,y), and 0 otherwise. Self-attention would presumably be causally-masked on the t dimension but not on x and y. Would we get a bag of heuristics here in the computation of the board state and the legal moves from that state?
(KataGo dev here, I also provided a bit of feedback with the authors on an earlier draft.)
@gwern - The "atari" attack is still a cyclic group attack, and the ViT attack is also still a cyclic group attack. I suspect it's not so meaningful to put much weight on the particular variant that one specific adversary happens to converge to.
This is because the space of "kinds" of different cyclic group fighting situations is combinatorically large and it's sort of arbitrary what local minimum the adversary ends it because it doesn't have much pressure to find more once it finds one that works. Even among just the things that are easily put into words without needing a diagram - how big is the cycle? Does the cyclic group have a big eye (>= 4 points behaves tactically distinctly) or a small eye (<=3 points), or no eye? Is the eye a two-headed-dragon-style eye, or not? Does it have more loose connections or is it solid? Is the group inside locally dead/unsettled/alive? Is the cycle group racing against an outside group for liberties or only making eyes of its own, or both? How many liberties do all the various groups each have? Are there ko-liberties? Are there approach-liberties? Is there a cycle inside the cycle? etc.
This is the same as how in Go the space of different capturing race situations in general is combinatorically large, with enough complexity that many situations are difficult even for pro players who have studied them for a lifetime.
The tricky bit here is that there seems to not be (enough) generalization between the exponentially large space of large group race situations in Go more broadly and the space of situations with cyclic groups. So whereas the situations in "normal Go" get decently explored by self-play, cyclic groups are rare in self-play so there isn't enough data to learn them well, leaving tons of flaws, even for some cases humans consider "simple". A reasonable mental model is that any particular adversary will probably find one or two of them somewhat arbitrarily, and then rapidly converge to exploit that, without discovering the numerous others.
The "gift" attack is distinct and very interesting. There isn't a horizon effect involved, it's just a straightforward 2-3 move blind spot of both the policy and value heads. Being only 2-3 moves this is why it also gets fixed more easily by search than cyclic groups. As for why it happens, as a bit of context I think these are true:
I've taken a look at the raw neural net outputs and it's also clear that the neural net has no idea that the superko rule matters - predictions don't vary as you change the superko rule in these positions. So my best guess is that that the neural net perhaps "overgeneralizes" and doesn't easily learn that in this one specific shape with this specific order of moves, the superko rule, which almost never matters, suddenly does matter and flips the result.
Apparently not a writeup (yet?), but there appears to be a twitter post here from LC0 with an comparison plot of accuracy on tactics puzzles: https://x.com/LeelaChessZero/status/1757502430495859103?s=20
Yes, rather than resolving the surprise of "the exact sequence HHTHTTHTTH" by declaring that it shouldn't be part of the set of events, I would prefer to resolve it via something like:
Here's my intuition-driving example/derivation.
Fix a reference frame and suppose you are on a frictionless surface standing next to a heavy box equal to your own mass, and you and the box always start at rest relative to one another. In every example, you will push the box leftward, adding 1 m/s leftward velocity to the box, and adding 1 m/s rightward velocity to yourself.
Let's suppose we didn't know what "kinetic energy" is, but let's suppose such a concept exists, and that whatever it is, an object of your mass has 0 units of it when at rest, and it is a continuous monotonic function of the absolute value of that object's velocity. Let's also take as an assumption that when you perform such a push like the above, you are always adding precisely 1 unit of this thing called "kinetic energy" to you and the box combined.
Okay so suppose the box and you are at rest and you perform this push, and start moving at 1m/s left and right, respectively. You and the box started with 0 units of kinetic energy, and you added 1 unit total. Since you and the box have the same absolute value of velocity, your energies are equal, so you each must have gotten 1/2 of a unit. Great, therefore we derive 1 m/s is 1/2 unit of kinetic energy.
Now suppose you and the box start out at a velocity of 1m/s rightward, so you have 1/2 unit of energy each, for a total of 1 unit. You perform the same push, bringing the total kinetic energy to 2 units. The box ends up at 0 m/s, so it has 0 units of energy now. You end up going 2m/s rightward, with all the energy. Great, therefore we derive 2 m/s is 2 units of kinetic energy.
Now suppose you and the box start out at a velocity of 2m/s rightward, so you have 2 units of energy each, for a total of 4 units. You perform the same push, bringing the total kinetic energy to 5 units. The box ends up at 1 m/s, so it has 1/2 unit of energy now, since we derived earlier that 1m/s is 1/2 unit of energy. You end up going 3m/s rightward. So you must have the other 4.5 units of energy. Therefore we derive 3 m/s is 4.5 units of kinetic energy.
We can continue this indefinitely, without running into any inconsistencies or contradictions. This "kinetic energy" thing so far seems to be a self-consistent concept given these assumptions! In general, we derive that an object of our mass moving at velocity v is has a kinetic energy of 1/2 v^2 units.
And I hope this makes it clearer why kinetic energy has to behave quadratically. A quadratic function f is precisely the kind of function such the quantity f(x+c) + f(x-c) - 2f(x) is constant with respect to x. It's the only function that satisfies the property that a fixed "amount of push" of the propellant you are carrying away from you always adds the same total energy into the combined system of you + propellant.
And it also gives some intuition for why you end up with more energy when you fire the propellant while moving faster. When you pushed the box while initially at 0 m/s, your kinetic energy went from 0 units to 0.5 units (+0.5), but when you pushed the box while initially at 1 m/s, your kinetic energy went from 0.5 units to 2 units (+1.5), and when you pushed the box while initially at 2 m/s, your kinetic energy went from 2 units to 4.5 units (+2.5) and in all cases you only added 1 unit of energy yourself. Where does the extra energy come from? From slowing down the box's rightward motion, and/or from not speeding up the box to go leftward from rest.
> lack of sufficient evidence.
Perhaps more specifically, evidence that is independent from the person that is to be trusted or not. Presumably when trusting someone else that something is true, often one does so due to believing that the other person is being honest and reliable enough such that that their word is sufficient evidence to then take some action. It's just that there isn't sufficient evidence without that person's word.
Circling back to this with a thing I was thinking about - suppose one wanted to figure out just one additional degree of freedom to the Elo rating a player had (at a given point in time, if you also allow evolution over time) that would add as much improvement as possible. Almost certainly you need more dimensions than that to properly fit real idiosyncratic nonlinearities/nontransitivities (i.e. if you had a playing population with specific pairs of players that were especially strong/weak only against specific other players, or cycles of players where A beats B beats C beats A, etc), but if you just wanted to work out what the "second principal component" might be, what's a plausible guess?
First, you can essentially reproduce the Elo model by rather than each player having a rating and the winning chance being a function of the difference between their ratings, instead you posit that each player has a rating and when they play a game, they each indepedently sample a random value from a fixed probability distribution centered around their own rating, and the player with the larger sample wins.
I think that you exactly reproduce the Elo model up to scaling if this distribution is a Gumbel distribution, because the difference of two Gumbels is apparently equivalent to a draw from a logistic distribution, and the CDF of the logistic distribution is precisely the sigmoid that the Elo model posits. But in practice, you should end up with almost the same thing if you choose any other reasonable distribution so long as it has the right heaviness of tail.
In particular, I'd expect having linearly-exponential tails is good rather than quadratically-exponential tails like the normal distribution has, because linearly-exponential tails tend to be desirable for real-world ratings models due to being much more outlier-resistant and in the real world you have issues like forfeits, sandbaggers, internet disconnection/timeouts, etc. (If you have a quadratically exponential tail, then a ratings model can put so low probability on an outlier that subject to seeing the outlier, the ratings model is forced to make a too-large update to accommodate it, this should be intuitive from a Bayesian perspective). Outliers and noise and the realities of real world ratings data I'd expect introduces far bigger variation in ratings quality anyways than any minor distribution-shape differences would.
So for example, you could also say each player draws from a logistic distribution, rather than only a Gumbel. The difference of two logistics is not quite a logistic distribution but up to rescaling it should be pretty close so this is nearly the Elo model again.
Anyways, with any reformulation like this, there is a very natural candidate now for a second dimension - that of the variance of the distribution that a player draws their sample from. Rather than each player drawing from a fixed distribution centered around their rating before seeing who has the higher value and wins, we now add a second parameter that allows the variance of that distribution to vary by player. So the ratings model now becomes able to express things like "this player is more variable in performance between games, or prone to blunders uncharacteristic of their skill level than this other player". This parameter might also improve the rating system's ability to "explain away" things like sandbagger players by assigning them a high variance, thereby reducing their distortionary impact on other players' ratings even before manual intervention.