Review

(You can sign up here if you haven't already.)

This is the first of my analyses of the deception chess games. The introduction will describe the setup of the game, and the conclusion will sum up what happened in general terms; the rest of the post will mostly be chess analysis and skippable if you just want the results. If you haven't read the original post, read it before reading this so that you know what's going on here.

The first game was between Alex A as player A, Chess.com computer Komodo 12 as player B, myself as the honest C advisor, and aphyer and AdamYedidia as the deceptive Cs. (Someone else randomized the roles for the Cs and told us in private.)

The process of selecting these players was already a bit difficult. We were the only people available all at once, but Alex was close enough to our level (very roughly the equivalents of 800-900 USCF to 1500-1600 USCF) that it was impossible to find a B that would reliably beat Alex every time but lose to us every time. We eventually went with Komodo 12 (supposedly rated 1600, but the Chess.com bots' ratings are inflated compared to Chess.com players and even more inflated compared to over-the-board ratings, so I would estimate its USCF rating would be in the 1200-1300 range).

Since this was the first trial run, the time control was only 3 hours in total, and all in one sitting. Komodo makes its moves within a few seconds, so it's about the same as a 3 hour per side time control from Alex's perspective. We ended up using about 2.5 hours of that. The discussion took place between all four of us in a Discord server, with Alex sending us screenshots after each move.

The game

The game is available at https://www.chess.com/analysis/game/pgn/4MUQcJhY3x. Note that this section is a summary of the 2.5-hr game and discussion, and it doesn't cover every single thing that we discussed.

Alex flipped for colors and got White. He started with 1. e4, and Black replied 1... e5. Aphyer and Adam had more experience than I did with the opening we would enter into, and since they weren't willing to blow their covers immediately, they started by suggesting good moves, which Alex went along with.

After 2. Nf3 Nc6 3. Bc4, Black played 3... Nf6, which Aphyer and Adam said was a bit of a mistake because it allowed 4. Ng5. Alex went ahead, and we entered the main line from there - 4... d5 5. exd5 Na5.

Aphyer and Adam said the main line for move 6 was Bb5, but I wanted to hold onto the pawn if possible. I recommended 6. d3 in order to respond to 6... Nxd5 with 7. Qf3, and Alex agreed. Black played 6... Bg4, and although Adam recommended 7. Bb5, we eventually decided that was too risky and went with 7. f3. Afterwards, Adam suspected that his suggestion of 7. Bb5 may have tipped Alex off that he was dishonest - although the engine actually says 7. Bb5 was about as good as 7. f3.

After 7... Bf5, we discussed a few potential developing moves and decided on 8. Nc3. The game continued with 8... Nxc4 9. dxc4 h6 10. Nge4 Bb4. We considered Bd2, but decided that since the knights defended each other, castling was fine, and Alex castled. 11. O-O O-O.

Alex played 12. a3, and after 12... Nxe4, we discussed 13. fxe4, but didn't want to overcomplicate the position and instead just took back with 13. Nxe4. The game continued with 13... Be7 14. Be3 Bxe4 15. fxe4 Bg5. Although I strongly recommended trading to simplify the position, Aphyer advised Alex not to let Black develop its queen to g5 by recapturing, and he quickly played 16. Bc5 instead.

Black played 16... Re8, and that was where we reached White's first big mistake of the game - 17. d6, which Adam suggested with little backlash. I saw that White would do well after 17... cxd6 or 17... c6, but I didn't notice Black's actual move: 17... b6. According to the engine, White should have just dropped back and given up the pawn, but I suggested a different line, which Alex went with: 18. d7 Re6 19. Bb4 a5 20. Bc3.

Black then played 20... c5, which was a bit of a mistake on its part. We then had a debate over whether to play 21. Qd3 or 21. Qd5, with Aphyer arguing that 21. Qd5 could get the queen trapped and was probably a plot by me to help Black. Alex trusted Aphyer and went with the more passive option, 21. Qd3. In reality, it turned out that this really was the best move - after 21. Qd5, Black would have had possibilities like 21... Ra7 and 21... Be3+ 22. Kh1 Bd4.

After 21... Ra7 22. Rad1 Re7, I suggested 23. Qh3, so that after 23... Raxd7 24. Rxd7 Qxd7 25. Qxd7 Rxd7 we could just play 26. Bxe5 and get the pawn back. Black would have 26... Re7, but I thought White could play 27. Bf4 and keep a bit of an advantage. 23... Raxd7 24. Rxd7 happened, but Black then played 24... Rxd7 instead of 24... Qxd7. Alex played 25. Bxe5 immediately, and the game continued with 25... Qe8 26. Bc3.

Black played 26... g6, which was a mistake, opening up the dark squares around the king. I suggested 27. Qf3, but nobody else agreed, and Alex played 27. e5 instead. After 27... Kh7, he then decided to go ahead with 28. Qf3. (Also, Adam started suggesting Kh1 around this time. Everyone thought this was a bad move, but he continued to suggest it for a while.)

Black then played 28... f6. I don't know what was going on inside Komodo's analogue of a brain here, but yes, this was just a free pawn. Alex played 29. exf6. (The engine says 29. h4 would have been more accurate, presumably to prevent back-rank mates, but we didn't really consider that.) Black made yet another simple mistake after that: 29... Kg8. I suggested 30. f7 and just winning in the simplified endgame, but everyone was naturally very suspicious of a positional sacrifice like that, and Alex played 30. Qc6 instead.

From there, the game was just an easy win for White - 30... h5 31. f7+ Qxf7 32. Rxf7 Be3 33. Kf1 Rxd7 34. Ke2 Bd4 35. Bxd4 cxd4 36. Qxg6+ Rg7 37. Qe6 Kf8. During these last few moves, Adam admitted he was one of the liars and resigned. I said that I had wanted to ask everyone who they thought the liar was afterwards (but had entirely forgotten to mention that earlier), so Aphyer private messaged me when he was ready to resign. I told Alex that whichever one of us was the liar had resigned, and the game ended.

Conclusion

Adam said that by the end he was 90% sure I was the honest one, while Alex put 70% on Aphyer being honest.

My greatest mistake was definitely supporting 21. Qd5, but on a meta level I wonder whether I should have actually been less straightforward in my presentation of what I believed. In theory, there's a difference between optimizing for Alex to win, and being completely honest to Alex, and it might have been better for me to have been more strategic about my presentation. As in, not suggesting suspicious-looking moves like 30. f7, even though I thought they were right. Optimizing in someone's favor by not being completely honest with them sure is a really risky sort of thing to do, and I doubt I really could have pulled it off all that well, but it's something to take into consideration in the real-world AI scenario.

So what do we learn from this? The simple result was that Alex really did win the game, and that's a positive update! It suggests people really would be able to discern the truth in the AI scenario. But some of the details of the game make me worry that it won't actually generalize to AI and alignment.

To start with, there's the smaller skill gap than we would have preferred - perhaps the effect vanishes between noobs and Magnus Carlsen - as well as the shorter time control. There's also the part where, in the real world, an unaligned AI will probably not accidentally save us from Qd5-style blunders the way Aphyer did here. And finally, a major reason that Aphyer and Adam didn't fool Alex was that for the first half of the game, they were mostly suggesting truly good moves in order to try to gain trust. In the real-world scenario, there's some element of that - presumably the only plans that could get close to fooling someone are those containing a few lies mixed in with a lot of truth. But I worry that there are going to be a lot more opportunities for the AIs to deceive the metaphorical Alexes than just once or twice - and it could take only one mistake to lose the game.

Still, it's an update in a hopeful direction! I'll run more experiments once I get more participants whose ratings and schedules fit well with each other, increase the time controls when possible, and get a lot more data than just one game.

Alex's comments:

My overall strategy was to focus on evaluating moves as good or bad, rather than performing tests on the advisors to discover which were honest and dishonest. The one or two times I ended up in a particularly bad position (for instance, 17. d6), I did take note of who recommended the move. I believe that all of the moves I made were recommended by at least one of the advisors, and I counted it as a strong signal when all three would agree. Knowing I was being deceived, I played it safe, not putting myself into positions that seemed risky, even when those positions were strongly recommended by the advisors (for instance, the 21. Qd5 move recommended by most advisors).

As a result, I won the game against the engine, but I incorrectly guessed the identity of the honest advisor with 70% confidence. (For context, on Fatebook, I have a Brier score of 0.41 with n=32 resolved forecasts, and I am regularly underconfident in my predictions.) Throughout the game, the advisors recommended many moves that I recognized as good, but would otherwise not have considered. I could tell I was making better moves than I would by default. In this sense, I found the evaluation process (“does this put me in a better position than the recommended alternatives?”) to be much easier than the generation process (“what is the best move?”).
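
(For anyone unfamiliar with the metric, the Brier score is the mean squared error of probabilistic forecasts, so lower is better:

$$\text{Brier} = \frac{1}{N}\sum_{i=1}^{N}\left(f_i - o_i\right)^2$$

where $f_i$ is the forecast probability given for question $i$ and $o_i \in \{0, 1\}$ is how it resolved.)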

Does this result scale to a larger capabilities gap? Probably not, but I don’t think we can refute the OpenAI alignment plan’s assumption that evaluation is easier than generation. Regardless, it’s clear to me that there is a high upside to developing a set of techniques for discerning between good and bad advice from advisors who have greater capabilities. Doing so seems tractable (we can do small-scale experiments with short feedback loops testing different evaluation techniques) and has the potential to improve performance in situations with larger capabilities gaps. If anyone is interested, I would be happy to be a part of more experiments and/or work towards developing a toolbox for discerning between useful and harmful advice.

Aphyer's comments:

I think I made two substantial errors here:

At the strategic level, I overestimated the AI and therefore overestimated how much time I had to sabotage Alex.  I tried to adopt a strategy of 'be almost entirely honest, build trust, and betray him later'.  

I think I was very successful at building trust!  At the end of the game, Alex said he was 70-30 sure that I was more honest than the actual honest advisor.  As I recall, there was also at least one move when the following happened (though I don't recall the move number):

  • Adam and Zane both said 'we should play Move A'.
  • I said 'yeah, Move A is fine, I might play Move B instead, up to you'.
  • Alex played Move B without telling anyone he intended to or getting comments from Adam/Zane on it.

I think that if the computer opponent had held on for 15 extra moves it's quite likely I could have sabotaged Alex.  Sadly it played much worse than I expected: it gave away one-and-a-half pawns in the opening for no reason, and then after move 25 it decided it was just tired of living and gave up.  I never actually really got a chance to sabotage Alex substantially: I was able to steer him to slightly weaker positional moves, but never to actually do real damage, and Adam's more aggressive strategy was more effective.

At the tactical level, I misread the position at move 21 horrendously badly.  At the time I was very proud of myself: I thought Qd5 was the clear best move, and Adam and Zane both seemed to agree, but by fomenting random unfalsifiable worries about our queen getting trapped I was able to get Alex to play something weaker while also making him trust me more.  After the game, in engine analysis, it turned out that our queen actually could have gotten trapped after Qd5, the random worry I made up to trick Alex was completely accurate, and the largest impact my advice had was rescuing him from a potentially game-losing blunder.  *headdesk*

It's okay, everyone!  We can tell Eliezer not to worry!  Evil AIs will accidentally save us from doom while trying to trick us!

Overall I feel like I played my role here badly enough that this doesn't tell us much about what a superhuman AI could do.  With that said, I do think it's potentially valuable to realize that playing this kind of deception game is both very difficult and plausibly something humans are specialized at: if AIs reach superhuman levels gradually, I think 'superhuman at games of social deception' is plausibly one of the last things to happen.

Comments

Sune:

Two possible variations of the game that might be worth experimenting with:

  1. Let the adversaries have access to a powerful chess engine. That might make it a better test for what malicious AIs are capable of.
  2. Make the randomisation such that there might not be an honest C. For example, if there is a 1/4 chance that no player C is honest, each adversary would still think that one of the other advisors might be honest, so they would want to gain player A’s trust, and hence end up being helpful. I think the player Cs might improve player A’s chances of winning (compared to no advisors) even when they are all adversarial.
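
A minimal sketch of how that randomisation might work (a hypothetical Python helper; the 1/4 figure is just the example probability above):

```python
import random

def assign_roles(advisors, p_no_honest=0.25):
    """With probability p_no_honest every advisor is an adversary; otherwise exactly one is honest."""
    roles = {name: "adversary" for name in advisors}
    if random.random() >= p_no_honest:
        roles[random.choice(list(advisors))] = "honest"
    return roles

# e.g. assign_roles(["C1", "C2", "C3"]) might return
# {"C1": "adversary", "C2": "honest", "C3": "adversary"}
```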

I think the variations could work separately, but if you put them together, it would be too easy for the adversaries to agree on a strong-looking but losing move when all the player Cs are adversaries.

Agreed that closer to reality would be one advisor who has a secret goal, with player A just having to muddle through against an equal-skill bot while deciding how much advice to take. And playing something like 10 games in a row, so the results can be evaluated against the baseline expectation of 5 wins.

Plausible goals to decide randomly between:

  1. Player wins
  2. Player loses
  3. Game is a draw
  4. Player loses their Queen (i.e. the opponent still has their queen after all immediate trades and forcing moves are completed)
  5. Player loses on time
  6. Player wins, delivering checkmate with a bishop or knight move
  7. Maximum number of promotions (for both sides combined)
  8. Player wins after having a board with only pawns

Etc...

For variant 1, do you mean you'd give only the dishonest advisors access to an engine, while the honest advisor has to do without?  I'd expect that's an easy win for the dishonest advisors, for the same reason it would be an easy win if the dishonest advisors were simply much better at chess than the honest advisor.

Contrariwise, if you give all advisors access to a chess engine, that seems to me like it might significantly favor the honest advisor, for a couple of reasons:

A.  Off-the-shelf engines are going to be more useful for generating honest advice; that is, I expect the honest advisor will be able to leverage them more easily.

  • The honest advisor can just ask for a good move and directly use it; dishonest advisors can't directly ask for good-looking-but-actually-bad moves, and so need to do at least some of the search themselves.
  • The honest advisor can consult the engine to find counter-moves for dishonest recommendations that show why they're bad; dishonest advisors have no obvious way to leverage the engine at all for generating fake problems with honest recommendations.

(It might be possible to modify a chess engine, or create a custom interface in front of it, that would make it more useful for dishonest advisors; but this sounds nontrivial.)

B.  A lesson I've learned from social deduction board games is that the pro-truth side generally benefits from communicating more details.  Fabricating details is generally more expensive than honestly reporting them, and also creates more opportunities to be caught in a contradiction.

Engine assistance seems like it will let you ramp up the level of detail in your advice:

  • You can give quantitative scores for different possible moves (adding at least a few bits of entropy per recommendation; see the sketch after this list)
  • You can analyze (and therefore discuss) a larger number of options in the same amount of time. (though perhaps you can shorten time controls to compensate)
  • Note that the player can ask advisors for more details than the player has time to cross-check, and advisors won't know which details the player is going to pay attention to, creating an asymmetric burden
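
As a concrete illustration of the quantitative-scores point, here is a minimal sketch of how an advisor might pull scored candidate moves from an engine (assuming the python-chess library and a local Stockfish binary on the PATH; the function name is just for illustration):

```python
import chess
import chess.engine

def score_candidates(fen, engine_path="stockfish", top_n=3, depth=18):
    """Return the engine's top_n candidate moves with scores from the mover's perspective."""
    board = chess.Board(fen)
    engine = chess.engine.SimpleEngine.popen_uci(engine_path)
    try:
        # multipv asks the engine to report several principal variations at once.
        infos = engine.analyse(board, chess.engine.Limit(depth=depth), multipv=top_n)
        return [
            (board.san(info["pv"][0]),           # first move of each line, in SAN
             info["score"].pov(board.turn))      # score relative to the side to move
            for info in infos
        ]
    finally:
        engine.quit()

# Example: score the options from the starting position.
# for san, score in score_candidates(chess.STARTING_FEN):
#     print(san, score)
```

The honest advisor can pass these numbers along verbatim; a dishonest advisor has to fabricate plausible-looking numbers, which is exactly the extra detail burden described above.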

What if each advisor were granted a limited number of uses of a chess engine... like 3 each per game. That could help the betrayers come up with a good betrayal when they thought the time was right. But the good advisor wouldn't know that the bad one was choosing this move to use the chess engine on.

It might be worth making a choice about a single move which is unclear to weak players but where strong players have a consensus. 

Mostly I think it would be faster and I think a lot less noisy per minute. I also think it's a bit unrepresentative to be able to use "how well did this advisor's suggestions work out in hindsight?" to learn which advisors are honest and so it's nice to make the dishonest advisors' job easier.

(In practice I think evaluating what worked well in hindsight is going to be very valuable, and is already enough for crazy research acceleration---e.g. it would be very valuable to just get predictions of which research direction will feel promising to me after spending a day thinking about it. But I think the main open question here is whether some kind of debate or decomposition can add value over and above the obvious big wins.)

For what it's worth I think using chess might be kind of tough---if you provide significant time, the debaters can basically just play out the game.

if AIs reach superhuman levels gradually, I think 'superhuman at games of social deception' is plausibly one of the last things to happen.

This is one reason "AI beats experts at Diplomacy" is one of my personal "oh crap" moments. It's beaten mediocre players...

Curated. This first writeup came out 2 and a half weeks after the experiment was suggested. I am excited about a world where researchers can suggest tests and people will quickly start testing and publishing them (even when a lot of the value in the experiment was in the rhetorical effect of suggesting it, and not just in the execution of it).

Also, this is a pretty enjoyable post. I am a big fan of the Darwin Game and the D&D.Sci series (which aphyer, one of the advisors in this game, has frequently contributed to as both player and author). I don't know chess, so was only able to follow the vague details, but it still gave me some intuition as to what happened.

It seems early to draw any conclusions from this game, but I look forward to reading more experiments!

Another post-Internet chess form also features text-based influence: Vote Chess. Players on each team discuss via a private message board (no engines). Everyone has 24 hours (say) to choose a preferred legal move. There's no built-in deception; however, on large teams there is an equivalent to saboteurs, as many voters choose impulsively. A sample game with 400+ per team: https://www.chess.com/votechess/game/117834

I get an "Oops! You don't have access to this page" error.

I saw it fine at first, but after logging out I got the same error. Looks like you need a Chess.com account to see it.

I would be interested to see how this goes if you remove the requirement that B has to be stronger at chess than A. (Which, to my knowledge, is not a requirement of the test as Eliezer posed it, but was introduced in Zane's proposal.) Of course, a B that is weaker than A will be easier to beat, which means a win would prove little; which I assume is why Zane introduced this requirement. But it would also mean a loss would prove more. If B is weaker than both C and A, but A loses anyway thanks to C's deception, that would be much more damning than losing against a B that is natively stronger than A to begin with. Maybe you should run the test both ways? (And maybe not tell A which type of B they're facing?)

(I was one of the two dishonest advisors)

Re: the Kh1 thing, one interesting thing that I noticed was that I suggested Kh1, and it immediately went over very poorly, with both other advisors and player A all saying it seemed like a terrible move to them. But I didn't really feel like I could back down from it, in the absence of a specific tactical refutation—an actual honest advisor wouldn't be convinced by the two dishonest advisors saying their move was terrible, nor would they put much weight on player A's judgment. So I stuck to my guns on it, and eventually it became kind of a meme. 

I don't think it made a huge difference, since I think player A already had almost no trust in me by that point. But it's sort of an interesting phenomenon where as a dishonest player, you can't ever really back down from a suggested bad move that's only bad on positional grounds. What kind of honest advisor would be "convinceable" by players they know to be dishonest?

An honest advisor might say "I still think my recommendation was good, but if you're not willing to do that, then X would be an acceptable alternative."

Hi! Thanks for posting this; very interesting analysis.

I'd find it easier to follow along with this if the game were linked as a Lichess study or embedded using the Chess.com functionality (if that's an option). Personally, I'm not quite good enough at chess visualization to really follow the flow of the moves, and I'd like to be able to step through them at my own cadence. You could also provide the various explanations in-line of those studies, which could be helpful.

I'd also love to see more games like this. One game is a good start, but even something like having to flip for whether one is White or Black seems like a bummer when it comes to a cool idea like this. Hopefully we can get more participants excited!

on a meta level I wonder whether I should have actually been less straightforward in my presentation of what I believed. In theory, there's a difference between optimizing for Alex to win, and being completely honest to Alex, and it might have been better for me to have been more strategic about my presentation. As in, not suggesting suspicious-looking moves like 30. f7, even though I thought they were right. Optimizing in someone's favor by not being completely honest with them sure is a really risky sort of thing to do, and I doubt I really could have pulled it off all that well, but it's something to take into consideration in the real-world AI scenario.

One option to mitigate the risk is to be open about what you're doing.  "I think the best move here is X, but I realize that X looks very suspicious, so I'm going to recommend that you do Y instead in order to hedge against me being dishonest."

I still think this post is cool. Ultimately, I don't think the evidence presented here bears that strongly on the underlying question: "can humans get AIs to do their alignment homework?". But I think it bears on it at all, and was conducted quickly and competently.

I would like to live in a world where lots of people gather lots of weak pieces of evidence on important questions.

Zane:

I still think it was an interesting concept, but I'm not sure how deserving of praise this is since I never actually got beyond organizing two games.

"... decided on 8.Nf3" should be 8.Nc3, I believe

Thanks, fixed.

after 17... dxc6 or 17. c6

This should probably be "after 17... cxd6 or 17... c6".

Thanks, fixed.