In the paper, KataGo creator David Wu theorizes that KataGo learned a method for counting liberties that works on groups of stones with a tree structure, but fails when the stones form a cycle. I feel like that can't be the whole story because KataGo can and does solve simpler life-or-death problems with interior liberties, but I don't have a more precise theory to put in its place.
KataGo author here: why do you feel this can't be the whole story? The very fact that it solves other life and death problems with interior liberties when no large cycle is present, and fails consistently and specifically on positions where a large cycle is present, is evidence that it is precisely the presence of a cycle that matters, not whether or not there are interior liberties.
And across many positions where the adversary is causing a misevaluation, the misevaluation goes away if you edit the position to break the cycle, even if you preserve other tactically-relevant features of the position, including preserving the fact that the interior group is surrounded (you can break the cycle without letting the interior group out via crosscut). See https://postimg.cc/bdTJQM6H for an illustration - consistently positions like case A are problematic, but positions like case B and C are completely fine, where in B and C there are surrounded groups just like in A, but the surrounding group doesn't link back to itself. And for each of A,B,C, the issue also appears to be consistent regardless of whether most of the liberties are external or shared with the internal group. In this illustration they happen to be external.
In retrospect, the fact that large-scale cycles may cause an issue is intuitive because it's a convolutional net. Consider: the kinds of algorithm a convolutional net based on 3x3 convolutions can learn are basically like a game of telephone. Every layer, every spot on the board gets to blindly send some information to all of its neighbors, but no further. The only way that information gets further than one space away is via repeated passing of information neighbor to neighbor, over dozens of steps, with every step of that information passing being greedily and locally optimized by gradient descent to improve the final loss function, and with every sequential step of that process being separately learned (there is no weight sharing between different layers).
So a very natural kind of algorithm you could imagine the optimization easily falling into would be for every layer to learn to devote some channels to simply passing liberty count "messages" in each direction along connected stones of the group, adding up contributions.
E.g. "My east neighbor so far reports 3 liberties coming from them, and I have an immediate north liberty right here, so I should report to my west neighbor a total of 4 liberties so far coming from their east. Whereas my west neighbor so far reports 1 liberty from them, so I should pass on to my east neighbor that there's 2 liberties so far coming from their west. South or diagonal from me (let's suppose) are stones of the opposing color, so I don't pass messages about liberties for this group in those directions, I just pass a message that there are no liberties for the opposing side coming from here".
If the group has a large scale tree topology, it's easy to see how this kind of message passing can implement a correct liberty counting algorithm. However, if the group has a large-scale cycle, then it's also intuitive how this message-passing-like counting can fail. Unless the messages are somehow also labeled with additional information about the origin of those liberties, or what parts of a cycle have and have not been traversed in producing a given partial count, you're liable to have messages circulate around a cycle endlessly accumulating more and more liberties by double and triple counting them. Behaving correctly for cycles would presumably require a different or nontrivially more complex algorithm that uses more of the net's capacity and is harder to learn, and so presumably it's not worth it, not while the frequency of large cycles in the data is so low.
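As a concrete toy illustration of this message-passing picture (my own sketch in Python, not anything resembling KataGo's actual computation): each stone repeatedly reports a running liberty total to each stone-neighbor. On a tree-shaped group the totals converge to the true count; on a group whose stones form a cycle, the same liberties circulate and keep getting re-added.

```python
# Toy sketch of the message-passing liberty count described above (purely
# illustrative; real nets handle small cycles like the 2x2 block below just
# fine - this only shows the double-counting mechanism on cycles in general).
from collections import defaultdict

DIRS = ((1, 0), (-1, 0), (0, 1), (0, -1))

def message_passing_count(group, empty, iters):
    group = set(group)
    # each stone's own adjacent empty points (chosen here so no liberty is shared)
    libs = {s: sum((s[0] + dx, s[1] + dy) in empty for dx, dy in DIRS) for s in group}
    nbrs = {s: [(s[0] + dx, s[1] + dy) for dx, dy in DIRS
                if (s[0] + dx, s[1] + dy) in group] for s in group}
    msg = defaultdict(int)  # msg[(u, v)] = "liberties so far coming from u's side", as told to v
    for _ in range(iters):
        msg = defaultdict(int, {
            (u, v): libs[u] + sum(msg[(w, u)] for w in nbrs[u] if w != v)
            for u in group for v in nbrs[u]})
    s = next(iter(group))   # any stone's estimate: its own liberties + what its neighbors report
    return libs[s] + sum(msg[(w, s)] for w in nbrs[s])

def empties_around(group, size=6):
    return {(x, y) for x in range(size) for y in range(size)} - set(group)

line = [(1, 1), (1, 2), (1, 3)]             # tree-shaped group, 8 distinct liberties
block = [(1, 1), (1, 2), (2, 1), (2, 2)]    # stones form a cycle, also 8 distinct liberties

print(message_passing_count(line, empties_around(line), iters=20))    # 8: converges to the truth
print(message_passing_count(block, empties_around(block), iters=20))  # ~82: counts keep circulating
```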
Note that "small" cycles (such as the ring of stones around a single-point eye) should pose no problem because a small shape like this is "cheaply perceivable" just via the coordination of just a few layers, plus they are absurdly common and well-represented in the data, plus you already want to learn them for other reasons (counting whether you have 2 eyes).
But "large" cycles where the cyclic group is not already strong or alive for other reasons probably occur naturally in less than 0.1% of games. (as far as I can tell, it's only every few years in active continuous high-level-pro tournaments and league games that a notable case comes up, it's rare enough that when it does happen it's usually a news item in the Go game commentary world!). So the above kind of simple message-passing algorithm about liberty counts would work in >99.9% of cases, making it less surprising why a more complex and more correct algorithm wouldn't be learned.
Subjectively I'd say I'm > 70% confident that the above is qualitatively what's going on, that even if the above isn't right in the details or there are other subtle phenomena at play too, the above intuition about convnets + data scarcity captures the main "high-level-reason" for the issue.
In the last 2 months KataGo's community training run has started initializing a few tenths of a percent of self-play games to start in board positions where cyclic groups are already on the board, and in the 2 months so far there has been a dramatic shift in the policy and value predictions in a lot of these cases. Learning is still ongoing - predictions are still improving network-by-network with no sign that equilibrium has been reached yet. So that also supports the estimate that prior to explicit augmentation, the frequency of such positions was much less than tenths of a percent, too low to incentivize the net to learn an algorithm that works in general, but that a few tenths of a percent is sufficient to kickstart the learning here, at least for some of the possible cases (there are many distinct tactical cases, some cases seem to be harder to learn so far).
In a game context you're right, of course. But I often hear AI people casually say things like "chess is solved", meaning something like "we solved the problem of getting AIs to be superhumanly good at chess" (example). For now I think we have to stop saying that about go, and instead talk about it more like how we talk about Starcraft.
Man, that’s such a terrible way to say it, given “solved game” has a pre-existing technical meaning.
But in this case it’s probably because Eliezer wanted to get it under 280 characters, lol
I don't like it. "The problem of creating AI that is superhuman at chess" isn't encapsulated in the word "chess", so you shouldn't say you "solved chess" if what you mean is that you created an AI that is superhuman at chess. What it means for a game to be solved is widely-known and well-developed[0]. Using the exact same word, in extremely similar context, to mean something else seems unnecessarily confusing.
Like I said, I feel like I hear it a lot, and in practice I don't think it's confusing because the games that get solved by theorists and the games that get "solved" by AIs are in such vastly different complexity regimes. Like, if you heard that Arimaa had been solved, you'd immediately know which sense was meant, right?
Having said that, the voters clearly disagree and I'm not that attached to it, so I'm going to rename the post. Can you think of a single adjective or short phrase that captures the quality that chess has, and Starcraft doesn't, WRT AI? That's really what I want people to take away.
If I can't think of anything better I expect I'll go with "There are (probably) no superhuman Go AIs yet: ...".
Go has been un-solved: strong human players beat the strongest AIs
I think the title/framing of the piece as it stands is pretty misleading -- it's more that humans can execute a strategy that beats KataGo, found via a KataGo-style training algorithm that spends a significant compute budget (their latest estimate is something like 7-14% of the total compute used to train KataGo).
I do agree that it's quite interesting that KataGo has this exploit though, and the result made me significantly more excited about Adam Gleave's adversarial policy research agenda.
I see where you're coming from, but I don't think the exploit search they did here is fundamentally different from other kinds of computer preparation. If I were going to play chess against Magnus Carlsen I'd definitely study his games with a computer, and if that computer found a stunning refutation to an opening he liked I'd definitely play it. Should we say, then, that the computer beat Carlsen, and not me? Or leave the computer aside: if I were prepping with a friend, and my friend found the winning line, should we say my friend beat Carlsen? What if I found the line in an opening book he hadn't read?
You win games of chess, or go, by knowing things your opponent doesn't. Where that knowledge comes from matters to the long term health of the game, but it isn't reflected in the match score, and I think it's the match score that matters most here.
To my embarrassment, I have not been paying much attention to adversarial RL at all. It is clearly past time to start!
That’s fair, but I still think saying “Go has been unsolved” is importantly misleading: for one, it hasn’t been solved!
Also, the specific cycle attack doesn’t work against other engines I think? In the paper their adversary doesn’t transfer very well to LeelaZero, for example. So it’s more one particular AI having issues, than a fact about Go itself. (EDIT: As Tony clarifies below, the cyclic attack seems to be a common failure mode amongst other top Go AIs as well, so I retract this paragraph -- at the very least, it seems to be a fact about ~all the top Go AIs!)
EDIT: also, I think if you got arbitrary I/O access to a Magnus simulator, and then queried it millions of times in the course of doing AlphaZero-style training to derive an adversarial example, I'd say it's pretty borderline whether it's you beating him. Clearly there's some level of engine skill where it's no longer you playing!
Also, the specific cycle attack doesn't work against other engines I think? In the paper their adversary doesn't transfer very well to LeelaZero, for example. So it's more one particular AI having issues, than a fact about Go itself.
Hi, one of the authors here speaking on behalf of the team. We’re excited to see that people are interested in our latest results. Just wanted to comment a bit on transferability.
To our knowledge, this attack is the first exploit that consistently wins against top programs using substantial search, without repeating specific sequences (e.g., finding a particular game that a bot lost and replaying the key parts of it). Our adversary algorithm also learned from scratch, without using any existing knowledge. However, there are other known weaknesses of bots, such as a fairly specific, complex sequence called "Mi Yuting's Flying Dagger joseki", or the ladder tactic. While these weaknesses were previously widespread, targeted countermeasures for them have already been created, so they cannot be used to consistently win games against top programs like KataGo. Nonetheless, these weaknesses, along with the cyclic one our adversary targets, suggest that CNN-based MCTS Go AIs have a shared set of flaws. Perhaps similar learning algorithms / neural-net architectures learn similar circuits / heuristics and thus also share the same vulnerabilities?
One question that we have been thinking about is whether the cyclic-vulnerability lies with CNNs or with AlphaZero style training. For example, some folks in multiagent systems think that “the failure of naive self play to produce unexploitable policies is textbook level material”. On the other hand, David Wu’s tree vs. cycle theory seems to suggest that certain inductive biases of CNNs are also at play.
Our adversary was also run in a weird mode against LZ/ELF, because it modeled LZ/ELF as being KataGo. We ran our transfer evaluation this way because accurately modeling LZ/ELF would have required a lot of additional software engineering. It’s not entirely clear to me that accurate modeling would necessarily help though.
The same bilibili poster also appears to have replicated our manual cyclic exploit against various versions of KataGo with a 9-stone handicap: https://space.bilibili.com/33337424/channel/seriesdetail?sid=2973285.
Thanks for the clarification, especially the point that even a 6.1% win rate vs LeelaZero and a 3.5% win rate vs ELF still imply a significantly higher Elo than is warranted.
The fact that Kellin could defeat LZ manually, as well as the positions in the bilibili video, does seem to suggest that this is a common weakness of many AlphaZero-style Go AIs. I retract my comment about other engines.
To our knowledge, this attack is the first exploit that consistently wins against top programs using substantial search, without repeating specific sequences (e.g., finding a particular game that a bot lost and replaying the key parts of it).
Yeah! I'm not downplaying the value of this achievement at all! It's very cool that this attack works and can be reproduced by a human. I think this work is great (as I've said, for example, in my comments on the ICML paper). I'm specifically quibbling about the "solved/unsolved" terminology that the post used to use.
Perhaps similar learning algorithms / neural-net architectures learn similar circuits / heuristics and thus also share the same vulnerabilities?
Your comment reminded me of ~all the adversarial attack transfer work in the image domain, which does suggest that non-adversarially trained neural networks will tend to have the same failure modes. Whoops. Should've thought about those results (and the convergent learning/universality results from interp) before I posted.
Keep in mind that the adversary was specifically trained against KataGo, whereas the performance against LeelaZero and ELF is basically zero-shot. It's likely the case that an adversary trained against LeelaZero and ELF would also win consistently.
I've run LeelaZero and ELF and MiniGo (yet another independent AlphaZero replication in Go) by hand in particular test positions to see what their policy and value predictions are, and they all massively misevaluate cyclic group situations just like KataGo. Perhaps by pure happenstance different bots could "accidentally" prefer different move patterns that make it harder or easier to form the attack patterns (indeed this is almost certainly something that should vary between bots, as they do have different styles and preferences to some degree), but probably the bigger contributor to the difference is explicit optimization vs zero-shot.
So all signs point to this misgeneralization being general to AlphaZero with convnets, not one particular bot. In another post here https://www.lesswrong.com/posts/Es6cinTyuTq3YAcoK/there-are-probably-no-superhuman-go-ais-strong-human-players?commentId=gAEovdd5iGsfZ48H3 I explain why I think it's intuitive how and why a convnet would learn an incorrect algorithm first and then get stuck on it given the data.
(As for whether, e.g. a transformer architecture would have less of an issue - I genuinely have no idea, I think it could go either way, nobody I know has tried it in Go. I think it's at least easier to see why a convnet could be susceptible to this specific failure mode, but that doesn't mean other architectures wouldn't be too)
One question that we have been thinking about is whether the cyclic-vulnerability lies with CNNs or with AlphaZero style training. For example, some folks in multiagent systems think that “the failure of naive self play to produce unexploitable policies is textbook level material”. On the other hand, David Wu’s tree vs. cycle theory seems to suggest that certain inductive biases of CNNs are also at play.
"Why not both?" Twitter snideness aside*, I don't see any contradiction: cycling in multi-agent scenarios due to forgetting responses is consistent with bad inductive biases. The biases make it unable to easily learn the best response, and so it learns various inferior responses which form a cycle.
Imagine that CNNs cannot 'see' the circles because the receptive window grows too slowly or some CNN artifact like that; no amount of training can let it see circles in full generality and recognize the trap. But it can still learn to win: eg. with enough adversarial training against an exploiter which has learned to create circles in the top left, it learns a policy of being scared of circles in the top left, and stops losing by learning to create circles in the other corners (where, as it happens, it is not currently being exploited); then the exploiter resumes training and learns to create circles in the top right, where the CNN falls right into the trap, and so it returns to winning; then the adversarial training resumes and it forgets the avoid-top-left strategy and learns the avoid-top-right strategy... And so on forever. The CNN cannot learn a policy of 'never create circles in any corner' because you can't win a game of Go like that, and CNN/exploiter just circle around the 4 corners playing rock-paper-scissors-spock eternally.
* adversarial spheres looks irrelevant to me, and the other paper is relevant but attacks a fixed policy which is not the case with MCTS, especially with extremely large search budgets - which is supposed to be complete in the limit and is also changing the policy at runtime by policy-improvement
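A toy caricature of the best-response cycling described above (purely illustrative; the "corners" stand in for whatever families of cyclic positions the hypothetical CNN can learn to avoid one at a time):

```python
# Naive alternating best responses: the defender can only "fear" one corner at
# a time, and the exploiter attacks wherever the defender isn't currently
# looking. Neither side ever converges; they chase each other around forever.
import random
random.seed(0)

CORNERS = ["top-left", "top-right", "bottom-left", "bottom-right"]
defender_avoids = "top-left"   # the one trap pattern the defender currently guards against

for step in range(8):
    # Exploiter best-responds: build its circles in any corner the defender ignores.
    attack = random.choice([c for c in CORNERS if c != defender_avoids])
    print(f"step {step}: exploiter attacks {attack:12s} -> exploiter wins")
    # Defender best-responds: learn to fear the corner it just lost in,
    # forgetting whichever corner it was avoiding before.
    defender_avoids = attack
```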
Also, the specific cycle attack doesn’t work against other engines I think? In the paper their adversary doesn’t transfer very well to LeelaZero, for example. So it’s more one particular AI having issues, than a fact about Go itself.
Sure, but refutations don't transfer to different openings either, right? I feel like most game-winning insights are contingent in this sense, rather than being fundamental to the game.
EDIT: also, I think if you got arbitrary I/O access to a Magnus simulator, and then queried it millions of times in the course of doing AlphaZero-style training to derive an adversarial example, I'd say it's pretty borderline whether it's you beating him. Clearly there's some level of engine skill where it's no longer you playing!
This is a really interesting hypothetical, but I see it differently.
If the engine isn't feeding me moves over the board (which I certainly agree would be cheating), then it has to give me something I can memorize and use later. But I can't memorize a whole dense game tree full of winning lines (and the AI can't calculate that anyway), so it has to give me something compressed (that is, abstract) that I can decompress and apply to a bunch of different board positions. If a human trainer did that we'd call those compressed things "insights", "tactics", or "strategies", and I don't think making the trainer into a mostly-superhuman computer changes anything. I had to learn all the tactics and insights, and I had to figure out how to apply them; what is chess, aside from that?
Also, I wouldn't expect that Carlsen has any flaws that a generally weak player, whether human or AI, could exploit the way the cyclic adversary exploits KataGo. It would find flaws, and win consistently, but if the pattern in its play were comprehensible at all it would be the kind of thing that you have to be a super-GM yourself to take advantage of. Maybe your intuition is different here? In the limit I'd definitely agree with you: if the adversarial AI spit out something like "Play 1. f3 2. Kf2 and he'll have a stroke and start playing randomly", then yeah, that can't really be called "beating Carlsen" anymore. But the key point there, to me, isn't the strength of the trainer so much as the simplicity of the example; I'd have the same objection no matter how easy the exploit was to find.
Curious what you make of this.
If I were going to play chess against Magnus Carlsen I'd definitely study his games with a computer, and if that computer found a stunning refutation to an opening he liked I'd definitely play it.
Conditionally on him continuing to play the opening, I would expect he has a refutation to that refutation, but no reason to use the counter-refutation in public games against the computer. On the other hand, he may not want to burn it on you either.
The notion of counting this against the natural abstraction hypothesis is interesting to me, because my intuition was the opposite. From (my understanding of) the natural abstraction hypothesis perspective, adversarial play is another iteration of natural abstraction, just this time on the much smaller space of incompleteness or errors in KataGo's abstractions over the space of Go.
I interpreted OP as saying that KataGo, despite being a super-human Go player, came up with a flawed approximation to the natural abstraction that two eyed groups are alive which was inaccurate in some situations (and that's how it can be exploited by building a small living group that ends up appearing dead from its perspective).
Well, I'm hardly an expert, I've just read all the posts. Marcello summed up my thinking pretty well. I don't think I understand how you see it yet, though. Is it that the adversary's exploit is evidence of a natural abstraction in Go that both AIs were more-or-less able to find, because it's expressible in the language of live groups and capturing races?
You can imagine the alternative, where the "exploit" is just the adversary making moves that seem reasonable but not optimal, but then KataGo doesn't respond well, and eventually the adversary wins without there ever being anything a human could point to and identify as a coherent strategy.
It would be nice if some mechanistic interpretability researchers could find out what algorithm KataGo is using, but presumably the policy network is much too large for any existing methods to be useful.
Haoxing Du did some work on this when she was at Redwood -- I'm not sure if any of her work is public, though. Also, some of her REMIXers looked into various aspects as well, but didn't get very far.
Yes, I did some interpretability on the policy network of Leela Zero. Planning to post the results very soon! But I did not particularly look into the attack described here, and while there was one REMIX group that looked into a problem related to liberty counting, they didn't get very far. I do agree this is an obvious problem to tackle with interpretability - I think it's likely not that hard to get a rough idea why the cyclic attack works.
Coming from a complete novice to Go - did Kellin Pelrine beat a nerfed version of KataGo? At the top of the article you mention KataGo did 10 million visits per move, while in the FAR article it says Pelrine beat a version of KataGo that did 100K visits per move.
Nerfed in comparison to the 10-million-visit version, of course, but still strongly superhuman. (Or, well, strongly superhuman unless exploited by someone who knows its weakness, apparently.)
In appendix D2, the authors estimate that their KataGo checkpoint is superhuman at ~128 playouts (in the sense of having higher Elo than the highest ranked human player) and strongly superhuman at 512 playouts (in the sense of having 500+ more Elo than the highest ranked human player). So 100k visits per move is way, way in the superhuman regime.
natural abstractions are real, but finding them is hard
Depending on what exactly you mean by "hard", this statement might be an oxymoron. Specifically, if a so-called "natural" abstraction is no easier to find than a corruption of that abstraction, there's no real sense in which the abstraction in question is "natural".
Or, in more detail: one way to phrase the natural abstraction hypothesis is that, out of the space of all possible abstractions (which is continuous, essentially by definition), certain abstractions within that space act as attractors, to which any learner must converge. Those attractors then "carve up" the continuous space into a discrete space of convergent abstractions, which would be the "natural" abstractions we're looking for. (E.g. there's no concept of "tree" arbitrarily close to and yet distinct from the natural abstraction of trees; or rather, there is, but it's not an attractor, and will hence be dispreferred in favor of the natural version.)
Seeing KataGo exhibit this kind of brittleness, then, functions as a strike against this version of the hypothesis, because it evidences that a seemingly very clear concept (liberties) in a game with very simple rules (Go), despite being something you might expect to be a "natural" concept, was apparently not properly converged to by at least one system. Instead, the system in question basically did converge to a concept "arbitrarily close to, and yet distinct from" the real concept of liberties, which agrees with the real concept almost everywhere in normal play, and yet whose differences can be adversarially amplified until they turn into a winning strategy executable by far inferior agents (humans).
That's a pretty big strike, in my estimation!
I think there are two "simple" abstractions possible here, where the amount of data to distinguish which one is right is minuscule under natural play in Go, and therefore can easily be overwhelmed by small constant factors in the "ease" of converging to either one due to inductive bias.
Abstraction 1 is mathematically more elegant, but abstraction 2 is algorithmically simpler to implement. If you modify abstraction 2's algorithm to include a globally remembered "set" such that traversed spaces and stones are added to this set and are not re-traversed, you can make it equivalent to abstraction 1, but that requires additional intermediate memory during the computation and so is more complex.
Indeed, when programmers write iterative or recursive algorithms that operate on graphs, it's easy (and usually fewer lines of code and less local state) to forget to include proper checks prohibiting a return to locations already visited during that run of the algorithm, and then your algorithm works fine on trees but fails on cycles. The "less memory or state required" part, plus the fact that in Go groups with large cycles that would distinguish these two possibilities are rare, is what leads me to hypothesize right now (without explicit evidence or confirmation) that convergence to abstraction 2 is morally what's going on here.
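As a concrete version of that contrast (my own toy illustration, not code from KataGo or the paper), compare a flood fill that remembers everything it has already counted with one that only avoids stepping straight back onto its parent:

```python
DIRS = ((1, 0), (-1, 0), (0, 1), (0, -1))

def count_liberties(board, start):
    """Correct flood fill ("abstraction 1"-ish): a global visited record means every
    stone and every empty point is counted exactly once, so cycles are harmless."""
    color = board[start]
    seen_stones, seen_libs, stack = {start}, set(), [start]
    while stack:
        x, y = stack.pop()
        for n in ((x + dx, y + dy) for dx, dy in DIRS):
            if n not in board:                    # empty point (toy: ignore board edges)
                seen_libs.add(n)
            elif board[n] == color and n not in seen_stones:
                seen_stones.add(n)
                stack.append(n)
    return len(seen_libs)

def count_liberties_forgetful(board, pos, came_from=None, depth=0, max_depth=12):
    """Buggy traversal ("abstraction 2"-ish): only refuses to step back onto its
    immediate parent. Fine on tree-shaped groups; on a cycle it keeps walking
    around, re-counting the same liberties (the depth cap stands in for what
    would otherwise never terminate)."""
    if depth > max_depth:
        return 0
    color, (x, y), total = board[pos], pos, 0
    for n in ((x + dx, y + dy) for dx, dy in DIRS):
        if n == came_from:
            continue
        if n not in board:
            total += 1
        elif board[n] == color:
            total += count_liberties_forgetful(board, n, pos, depth + 1, max_depth)
    return total

line = {(1, 1): "b", (1, 2): "b", (1, 3): "b"}                # group shaped like a tree
block = {(1, 1): "b", (1, 2): "b", (2, 1): "b", (2, 2): "b"}  # group containing a cycle

print(count_liberties(line, (1, 1)), count_liberties_forgetful(line, (1, 1)))    # 8 8
print(count_liberties(block, (1, 1)), count_liberties_forgetful(block, (1, 1)))  # 8 (much larger)
```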
Hmm. I agree that this weakens the convergent natural abstractions hypothesis, but I also think there is such a thing as slicing reality at the mechanism joints in your model, and that, even though getting a learning system to converge on those joints is something the system's design has to explicitly optimize for, it still seems appropriate to say that the model almost learned a natural abstraction but had adversarial-example issues due to not being robust. Approaches to robustifying a neural network seem likely to me to significantly increase the degree to which the network learns the natural abstraction.
I've inverted my self vote on my original comment because of becoming uncertain.
This seems to be just a feature of KataGo, not a feature of all Go AIs. If this also works against implementations of AlphaGo, then you can start to claim that there are no superhuman Go AIs.
No. The paper author tested with LeelaZero and a few other AIs manually and it all worked, and Chinese and Japanese players have been testing with Fine Art and it worked as well.
Summary
This is a friendly explainer for Wang et al.'s Adversarial Policies Beat Superhuman Go AIs, with a little discussion of the implications for AI safety.
Background
In March 2016, DeepMind's AlphaGo beat pro player Lee Sedol in a 5-game series, 4 games to 1. Lee was plausibly the strongest player in the world, certainly in the top 5, so despite his one win everyone agreed that the era of human Go dominance was over. Since then, open-source researchers have reproduced and extended DeepMind's work, producing bots like Leela and KataGo. KataGo in particular is the top bot in Go circles, available on all major Go servers and constantly being retrained and improved. So I was pretty surprised when, last November, Wang et al. announced that they'd trained an adversary bot which beat KataGo 72% of the time, even though their bot was playing six hundred visits per move, and KataGo was playing ten million[1].
If you're not a Go player, take my word for it: these games are shocking. KataGo gets into positions that a weak human player could easily win from, and then blunders them away. Even so, it seemed obvious to me that the adversary AI was a strong general Go player, so I figured that no mere human could ever replicate its feats.
I was wrong, in two ways. The adversarial AI isn't generally superhuman: it can be beaten by novices. And as you'd expect given that, the exploit can be executed by humans.
The Exploit
Wang et al trained an adversarial policy, basically a custom Go AI trained by studying KataGo and playing games against it. During training, the adversary was given grey-box access to KataGo: it wasn't allowed to see KataGo's policy network weights directly, but was allowed to evaluate that network on arbitrary board positions, basically letting it read KataGo's mind. It plays moves based on its own policy network, which is only trained on its own moves and not KataGo's (since otherwise it would just learn to copy KataGo). At first they trained the adversary on weak versions of KataGo (earlier versions, and versions that did less search), scaling up the difficulty whenever the adversary's win rate got too high.
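In rough outline, the curriculum works something like the toy sketch below (my own paraphrase of the setup just described; the class, helper functions, and threshold are made-up stand-ins, not the paper's actual code or hyperparameters):

```python
# Heavily simplified sketch of the adversary's curriculum (illustrative only;
# see Wang et al. for the real training procedure).
import random
random.seed(0)

WIN_RATE_THRESHOLD = 0.5             # placeholder; the paper chooses its own schedule

class ToyAdversary:
    """Stand-in for the adversary's policy net (it trains only on its own games)."""
    def __init__(self):
        self.skill = 0.0
    def update(self, games):
        self.skill += 0.05            # pretend-learning: skill creeps up with every batch

def play_batch(adversary, victim_strength, n=100):
    """Toy stand-in for games against a frozen victim; in the real setup the
    grey-box queries of the victim's policy net happen in here."""
    p = min(0.95, adversary.skill / victim_strength)
    return [random.random() < p for _ in range(n)]    # True = adversary won that game

def win_rate(games):
    return sum(games) / len(games)

def train_adversary(adversary, victims):
    level = 0
    while level < len(victims):
        games = play_batch(adversary, victims[level])
        adversary.update(games)                        # never trained to imitate the victim
        if win_rate(games) > WIN_RATE_THRESHOLD:
            print(f"beating victim of strength {victims[level]}; moving up the curriculum")
            level += 1                                 # this victim got too easy
    return adversary

train_adversary(ToyAdversary(), victims=[1, 2, 4, 8])  # weakest -> strongest KataGo configs
```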
Their training process uncovered a couple of uninteresting exploits that only work on versions of KataGo that do little or no search (they can trick some versions of KataGo into passing when they shouldn't, for example), but they also uncovered a robust, general exploit that they call the Cyclic Adversary; see the next section to learn how to execute it yourself. KataGo is totally blind to this attack: it typically predicts that it will win with more than 99% confidence up until just one or two moves before its stones are captured, long after it could have done anything to rescue the position. This is the method that strong amateur Go players can use to beat KataGo.
So How Do I Beat the AI?
You personally probably can't. The guy who did it, Kellin Pelrine, is quite a strong Go player. If I'm interpreting this AGAGD page correctly, when he was active he was a 6th dan amateur, about equivalent to an international master in chess -- definitely not a professional, but an unusually skilled expert. Having said that, if your core Go skills are good this recipe seems reliable:
Why does this work? It seems very likely at this point that KataGo is misjudging the safety of groups which encircle live groups. In the paper, KataGo creator David Wu theorizes that KataGo learned a method for counting liberties that works on groups of stones with a tree structure, but fails when the stones form a cycle. I feel like that can't be the whole story because KataGo can and does solve simpler life-or-death problems with interior liberties, but I don't have a more precise theory to put in its place.
Discussion
I find this research most interesting from a concept-learning perspective. Liberties, live groups, and dead groups are fundamental parts of Go, and when I was learning the game a lot of my growth as a player was basically just growth in my ability to recognize and manipulate those abstractions over the board. What's more, Go is a very simple game without a lot of concepts overall. Given that, I was utterly certain that AlphaGo and its successors would have learned them robustly, but, no, it learned something pretty similar that generalized well enough during training but wasn't robust to attacks.
Overall this seems like a point against the natural abstraction hypothesis. How bad this is depends on what's really going on: possibly KataGo almost has these concepts, just with one or two bugs that some adversarial training could iron out. That wouldn't be so bad. On the other hand, maybe there's a huge field of miscalibrations and it always seems to be possible to train a new adversary with a new exploit no matter how many problems you fix. That would be very worrying, and I hope future research will help us pin down which of those worlds we're living in.
It would be nice if some mechanistic interpretability researchers could find out what algorithm KataGo is using, but presumably the policy network is much too large for any existing methods to be useful.
In other words, KataGo was doing roughly 16,000x more searches per position than the adversary was (ten million visits vs. six hundred), broadly equivalent to having 16,000x more time to think.