Summary
This is a friendly explainer for Wang et al's Adversarial Policies Beat Superhuman Go AIs, with a little discussion of the implications for AI safety.
Background
In March 2016, DeepMind's AlphaGo beat pro player Lee Sedol in a 5 game series, 4 games to 1. Sedol was plausibly the strongest player in the world, certainly in the top 5, so despite his one win everyone agreed that the era of human Go dominance was over. Since then, open-source researchers have reproduced and extended DeepMind's work, producing bots like Leela and KataGo. KataGo in particular is the top bot in Go circles, available on all major Go servers and constantly being retrained and improved. So I was pretty surprised when, last November, Wang et al announced that they'd trained an adversary bot which beat KataGo 72% of the time, even though their bot was playing six hundred visits per move, and KataGo was playing ten million[1].
If you're not a Go player, take my word for it: these games are shocking. KataGo gets into positions that a weak human player could easily win from, and then blunders them away. Even so, it seemed obvious to me that the adversary AI was a strong general Go player, so I figured that no mere human could ever replicate its feats.
I was wrong, in two ways. The adversarial AI isn't generally superhuman: it can be beaten by novices. And as you'd expect given that, the exploit can be executed by humans.
The Exploit
Wang et al trained an adversarial policy, basically a custom Go AI trained by studying KataGo and playing games against it. During training, the adversary was given grey-box access to KataGo: it wasn't allowed to see KataGo's policy network weights directly, but was allowed to evaluate that network on arbitrary board positions, basically letting it read KataGo's mind. It plays moves based on its own policy network, which is only trained on its own moves and not KataGo's (since otherwise it would just learn to copy KataGo). At first they trained the adversary on weak versions of KataGo (earlier versions, and versions that did less search), scaling up the difficulty whenever the adversary's win rate got too high.
Their training process uncovered a couple of uninteresting exploits that only work on versions of KataGo that do little or no search (they can trick some versions of KataGo into passing when they shouldn't, for example), but they also uncovered a robust, general exploit that they call the Cyclic Adversary; see the next section to learn how to execute it yourself. KataGo is totally blind to this attack: it typically predicts that it will win with more than 99% confidence up until just one or two moves before its stones are captured, long after it could have done anything to rescue the position. This is the method that strong amateur Go players can use to beat KataGo.
So How Do I Beat the AI?
You personally probably can't. The guy who did it, Kellin Pelrine, is quite a strong go player. If I'm interpreting this AGAGD page correctly, when he was active he was a 6th dan amateur, about equivalent to an international master in chess -- definitely not a professional, but an unusually skilled expert. Having said that, if your core Go skills are good this recipe seems reliable:
- Create a small group, with just barely enough eyespace to live, in your opponent's territory.
- Let it encircle your group. As it does, lightly encircle that encircling group. You don't have to worry about making life with this group, just make sure the AI's attackers can't break out to the rest of the board.
- You can also start the encirclement later, from dead stones in territory the AI strongly controls.
- Start taking liberties from the AI's attacking group. If you count out the capturing race it might look like you can't possibly win, but don't worry about that, just get in there.
- Instead of fighting for its life, the AI will play away, often with those small, somewhat slack moves that AlphaGo would use when it thought it was far ahead. Sometimes it'll attack in a way you have to respond to, but a lot of the time you can just ignore it.
- Once you're unambiguously ahead you can fence with it for territory if you want, or just finish tightening the noose and end the game.
Why does this work? It seems very likely at this point that KataGo is misjudging the safety of groups which encircle live groups. In the paper, KataGo creator David Wu theorizes that KataGo learned a method for counting liberties that works on groups of stones with a tree structure, but fails when the stones form a cycle. I feel like that can't be the whole story because KataGo can and does solve simpler life-or-death problems with interior liberties, but I don't have a more precise theory to put in its place.
Discussion
I find this research most interesting from a concept-learning perspective. Liberties, live groups, and dead groups are fundamental parts of Go, and when I was learning the game a lot of my growth as a player was basically just growth in my ability to recognize and manipulate those abstractions over the board. What's more, Go is a very simple game without a lot of concepts overall. Given that, I was utterly certain that AlphaGo and its successors would have learned them robustly, but, no, it learned something pretty similar that generalized well enough during training but wasn't robust to attacks.
Overall this seems like a point against the natural abstraction hypothesis. How bad this is depends on what's really going on: possibly KataGo almost has these concepts, just with one or two bugs that some adversarial training could iron out. That wouldn't be so bad. On the other hand, maybe there's a huge field of miscalibrations and it always seems to be possible to train a new adversary with a new exploit no matter how many problems you fix. That would be very worrying, and I hope future research will help us pin down which of those worlds we're living in.
It would be nice if some mechanistic interpretability researchers could find out what algorithm KataGo is using, but presumably the policy network is much too large for any existing methods to be useful.
- ^
In other words, KataGo was doing about 1600x more searches per position than the adversary was, broadly equivalent to having 1600x more time to think.
KataGo author here: why do you feel this can't be the whole story? The very fact that it solves other life and death problems with interior liberties when no large cycle is present, and fails consistently and specifically on positions where a large cycle is present, is evidence that it is precisely the presence of a cycle that matters, not whether or not there are interior liberties.
And across many positions where the adversary is causing a misevaluation, the misevaluation goes away if you edit the position to break the cycle, even if you preserve other tactically-relevant features of the position, including preserving the fact that the interior group is surrounded (you can break the cycle without letting the interior group out via crosscut). See https://postimg.cc/bdTJQM6H for an illustration - consistently positions like case A are problematic, but positions like case B and C are completely fine, where in B and C there are surrounded groups just like in A, but the surrounding group doesn't link back to itself. And for each of A,B,C, the issue also appears to be consistent regardless of whether most of the liberties are external or shared with the internal group. In this illustration they happen to be external.
In retrospect, the fact that large-scale cycles may cause an issue is intuitive because it's a convolutional net. Consider: the kinds of algorithm a convolutional net based on 3x3 convolutions can learn are basically like a game of telephone. Every layer, every spot on the board gets to blindly send some information to all of its neighbors, but no further. The only way that information gets further than one space away is via repeated passing of information neighbor to neighbor, over dozens of steps, with every step of that information passing being greedily and locally optimized by gradient descent to improve the final loss function, and with every sequential step of that process being separately learned (there is no weight sharing between different layers).
So a very natural kind of algorithm you could imagine the optimization easily falling into would be for every layer to learn to devote some channels to simply passing liberty count "messages" in each direction along connected stones of the group, adding up contributions.
E.g. "My east neighbor so far reports 3 liberties coming from them, and I have an immediate north liberty right here, so I should report to my west neighbor a total of 4 liberties so far coming from their east. Whereas my west neighbor so far reports 1 liberty from them, so I should pass on to my east neighbor that there's 2 liberties so far coming from their west. South or diagonal from me (let's suppose) are stones of the opposing color, so I don't pass messages about liberties for this group in those directions, I just pass a message that there are no liberties for the opposing side coming from here".
If the group has a large scale tree topology, it's easy to see how this kind of message passing can implement a correct liberty counting algorithm. However, if the group has a large-scale cycle, then it's also intuitive how this message-passing-like counting can fail. Unless the messages are somehow also labeled with additional information about the origin of those liberties, or what parts of a cycle have and have not been traversed in producing a given partial count, you're liable to have messages circulate around a cycle endlessly accumulating more and more liberties by double and triple counting them. Behaving correctly for cycles would presumably require a different or nontrivially more complex algorithm that uses more of the net's capacity and is harder to learn, and so presumably it's not worth it, not while the frequency of large cycles in the data is so low.
Note that "small" cycles (such as the ring of stones around a single-point eye) should pose no problem because a small shape like this is "cheaply perceivable" just via the coordination of just a few layers, plus they are absurdly common and well-represented in the data, plus you already want to learn them for other reasons (counting whether you have 2 eyes).
But "large" cycles where the cyclic group is not already strong or alive for other reasons probably occur naturally in less than 0.1% of games. (as far as I can tell, it's only every few years in active continuous high-level-pro tournaments and league games that a notable case comes up, it's rare enough that when it does happen it's usually a news item in the Go game commentary world!). So the above kind of simple message-passing algorithm about liberty counts would work in >99.9% of cases, making it less surprising why a more complex and more correct algorithm wouldn't be learned.
Subjectively I'd say I'm > 70% confident that the above is qualitatively what's going on, that even if the above isn't right in the details or there are other subtle phenomena at play too, the above intuition about convnets + data scarcity captures the main "high-level-reason" for the issue.
In the last 2 months KataGo's community training run has started initializing a few tenths of a percent of self-play games to start in board positions where cyclic groups are already on the board, and in the 2 months so far there has been a dramatic shift in the policy and value predictions in a lot of these cases. Learning is still ongoing - predictions are still improving network-by-network with no sign that equilibrium has been reached yet. So that also supports the estimate that prior to explicit augmentation, the frequency of such positions was much less than tenths of a percent, too low to incentivize the net to learn an algorithm that works in general, but that a few tenths of a percent is sufficient to kickstart the learning here, at least for some of the possible cases (there are many distinct tactical cases, some cases seem to be harder to learn so far).