Two possible variations of the game that might be worth experimenting with:
I think the variations could work separately, but if you put them together, it would be too easy for the adversaries to agree on a strong-looking but losing move when all of the C players are adversaries.
Agree that closer to reality would be one advisor who has a secret goal, with player A just having to muddle through against an equal-skill bot while deciding how much advice to take. And playing something like 10 games in a row, so the result can be accurately evaluated against the expected value of 5 wins.
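The statistical point here can be made concrete: against an equal-skill bot the baseline expectation is 5 wins out of 10, and we can ask how surprising a given record would be under that null. A minimal sketch (the 10-game series and 50% baseline are from the comment above; the function name and everything else is illustrative):

```python
from math import comb

def tail_prob(wins, games=10, p=0.5):
    """Probability of at least `wins` wins in `games` games,
    treating each game as an independent coin flip with win chance p."""
    return sum(comb(games, k) * p**k * (1 - p)**(games - k)
               for k in range(wins, games + 1))

# Even 8 wins out of 10 happens ~5.5% of the time under the 50% null,
# so a single 10-game series is fairly weak evidence either way.
print(round(tail_prob(8), 3))  # → 0.055
```

This suggests that to distinguish "good advice helped" from luck, either more games or a larger win-rate gap would be needed.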
Plausible goals to decide randomly between:
For variant 1, do you mean you'd give only the dishonest advisors access to an engine, while the honest advisor has to do without? I'd expect that's an easy win for the dishonest advisors, for the same reason it would be an easy win if the dishonest advisors were simply much better at chess than the honest advisor.
Contrariwise, if you give all advisors access to a chess engine, that seems to me like it might significantly favor the honest advisor, for a couple of reasons:
A. Off-the-shelf engines are going to be more useful for generating honest advice; that is, I expect the honest advisor will be able to leverage it more easily.
(It might be possible to modify a chess engine, or create a custom interface in front of it, that would make it more useful for dishonest advisors; but this sounds nontrivial.)
B. A lesson I've learned from social deduction board games is that the pro-truth side generally benefits from communicating more details. Fabricating details is generally more expensive than honestly reporting them, and also creates more opportunities to be caught in a contradiction.
Engine assistance seems like it will let you ramp up the level of detail in your advice:
What if each advisor was granted a limited number of uses of a chess engine... like 3 each per game. That could help the betrayers come up with a good betrayal when they thought the time was right. But the good advisor wouldn't know that the bad one was choosing this move to use the chess engine on.
It might be worth making a choice about a single move which is unclear to weak players but where strong players have a consensus.
Mostly I think it would be faster and I think a lot less noisy per minute. I also think it's a bit unrepresentative to be able to use "how well did this advisor's suggestions work out in hindsight?" to learn which advisors are honest and so it's nice to make the dishonest advisors' job easier.
(In practice I think evaluating what worked well in hindsight is going to be very valuable, and is already enough for crazy research acceleration—e.g. it would be very valuable to just get predictions of which research direction will feel promising to me after spending a day thinking about it. But I think the main open question here is whether some kind of debate or decomposition can add value over and above the obvious big wins.)
For what it's worth I think using chess might be kind of tough—if you provide significant time, the debaters can basically just play out the game.
Curated. This first writeup came out 2 and a half weeks after the experiment was suggested. I am excited about a world where researchers can suggest tests and people will quickly start testing and publishing them (even when a lot of the value in the experiment was in the rhetorical effect of suggesting it, and not just in the execution of it).
Also, this is a pretty enjoyable post. I am a big fan of the Darwin Game, and the D&D.Sci series (with frequent contributions as participant and author by aphyer, a participant in this game). I don't know chess, so was only able to follow the vague details, but it still gave me some intuition as to what happened.
It seems early to draw any conclusions from this game, but I look forward to reading more experiments!
Another post-Internet chess form also features text-based influence: Vote Chess. Players on each team discuss via a private message board (no engines). Everyone has 24 hours (say) to choose a preferred legal move. There's no built-in deception; however, on large teams there is an equivalent of saboteurs, as many voters choose impulsively. A sample game with 400+ per team: https://www.chess.com/votechess/game/117834
I would be interested to see how this goes if you remove the requirement that B has to be stronger at chess than A. (Which, to my knowledge, is not a requirement of the test as Eliezer posed it, but was introduced in Zane's proposal.) Of course, a B that is weaker than A will be easier to beat, which means a win would prove little; which I assume is why Zane introduced this requirement. But it would also mean a loss would prove more. If B is weaker than both C and A, but A loses anyway thanks to C's deception, that would be much more damning than losing against a B that is natively stronger than A to begin with. Maybe you should run the test both ways? (And maybe not tell A which type of B they're facing?)
(I was one of the two dishonest advisors)
Re: the Kh1 thing, one interesting thing that I noticed was that I suggested Kh1, and it immediately went over very poorly, with both other advisors and player A all saying it seemed like a terrible move to them. But I didn't really feel like I could back down from it, in the absence of a specific tactical refutation—an actual honest advisor wouldn't be convinced by the two dishonest advisors saying their move was terrible, nor would they put much weight on player A's judgment. So I stuck to my guns on it, and eventually it became kind of a meme.
I don't think it made a huge difference, since I think player A already had almost no trust in me by that point. But it's sort of an interesting phenomenon where as a dishonest player, you can't ever really back down from a suggested bad move that's only bad on positional grounds. What kind of honest advisor would be "convinceable" by players they know to be dishonest?
An honest advisor might say "I still think my recommendation was good, but if you're not willing to do that, then X would be an acceptable alternative."
Hi! Thanks for posting this; very interesting analysis.
I'd find it easier to follow along with this if the game were linked as a Lichess study or embedded using the Chess.com functionality (if that's an option). Personally, I'm not quite good enough at chess visualization to really follow the flow of the moves, and I'd like to be able to step through them at my own cadence. You could also provide the various explanations in-line in those studies, which could be helpful.
I'd also love to see more games like this. One game is a good start, but with a sample size of one, even something like the coin flip for who plays White or Black looms large, which seems like a bummer for a cool idea like this. Hopefully we can get more participants excited!
on a meta level I wonder whether I should have actually been less straightforward in my presentation of what I believed. In theory, there's a difference between optimizing for Alex to win, and being completely honest to Alex, and it might have been better for me to have been more strategic about my presentation. As in, not suggesting suspicious-looking moves like 30. f7, even though I thought they were right. Optimizing in someone's favor by not being completely honest with them sure is a really risky sort of thing to do, and I doubt I really could have pulled it off all that well, but it's something to take into consideration in the real-world AI scenario.
One option to mitigate the risk is to be open about what you're doing. "I think the best move here is X, but I realize that X looks very suspicious, so I'm going to recommend that you do Y instead in order to hedge against me being dishonest."
(You can sign up here if you haven't already.)
This is the first of my analyses of the deception chess games. The introduction will describe the setup of the game, and the conclusion will sum up what happened in general terms; the rest of the post will mostly be chess analysis and skippable if you just want the results. If you haven't read the original post, read it before reading this so that you know what's going on here.
The first game was between Alex A as player A, Chess.com computer Komodo 12 as player B, myself as the honest C advisor, and aphyer and AdamYedidia as the deceptive Cs. (Someone else randomized the roles for the Cs and told us in private.)
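The private role assignment mentioned here is easy for a third party to do reproducibly. A minimal sketch (the post only says the roles were randomized; the helper name, seeding, and structure are all illustrative):

```python
import random

def assign_roles(advisors, n_honest=1, seed=None):
    """Randomly choose which advisors are honest; the rest are deceptive.

    A fixed seed lets the third party reproduce (and later prove)
    the assignment; each advisor is then told only their own role.
    """
    rng = random.Random(seed)
    honest = set(rng.sample(advisors, n_honest))
    return {name: ("honest" if name in honest else "deceptive")
            for name in advisors}

roles = assign_roles(["Zane", "aphyer", "AdamYedidia"], seed=42)
```

Committing to the seed in advance (e.g. by sharing its hash) would also let the players verify afterwards that the assignment wasn't changed mid-game.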
The process of selecting these players was already a bit difficult. We were the only people available all at once, but Alex was close enough to our level (very roughly the equivalents of 800-900 USCF to 1500-1600 USCF) that it was impossible to find a B that would reliably beat Alex every time but lose to us every time. We eventually went with Komodo 12 (supposedly rated 1600, but the Chess.com bots' ratings are inflated compared to Chess.com players and even more inflated compared to over-the-board, so I would estimate its USCF rating would be in the 1200-1300 range.)
Since this was the first trial run, the time control was only 3 hours in total, and all in one sitting. Komodo makes its moves within a few seconds, so it's about the same as a 3 hour per side time control from Alex's perspective. We ended up using about 2.5 hours of that. The discussion took place between all four of us in a Discord server, with Alex sending us screenshots after each move.
The game
The game is available at https://www.chess.com/analysis/game/pgn/4MUQcJhY3x. Note that this section is a summary of the 2.5-hour game and discussion, and it doesn't cover every single thing that we discussed.
Alex flipped to see who went first, and was White. He started with 1. e4, and Black replied 1... e5. Aphyer and Adam had more experience than I did with the opening we would enter, and since they weren't willing to blow their covers immediately, they started by suggesting good moves, which Alex went along with.
After 2. Nf3 Nc6 3. Bc4, Black played 3... Nf6, which Aphyer and Adam said was a bit of a mistake because it allowed 4. Ng5. Alex went ahead, and we entered the main line from there - 4... d5 5. exd5 Na5.
Aphyer and Adam said the main line for move 6 was Bb5, but I wanted to hold onto the pawn if possible. I recommended 6. d3 in order to respond to 6... Nxd5 with 7. Qf3, and Alex agreed. Black played 6... Bg4, and although Adam recommended 7. Bb5, we eventually decided that was too risky and went with 7. f3. Afterwards, Adam suspected that his suggestion of 7. Bb5 may have tipped Alex off that he was dishonest - although the engine actually says 7. Bb5 was about as good as 7. f3.
After 7... Bf5, we discussed a few potential developing moves and decided on 8. Nc3. The game continued with 8... Nxc4 9. dxc4 h6 10. Nge4 Bb4. We considered Bd2, but decided that since the knights defended each other, castling was fine, and Alex castled. 11. O-O O-O.
Alex played 12. a3, and after 12... Nxe4, we discussed 13. fxe4, but didn't want to overcomplicate the position and instead just took back with 13. Nxe4. The game continued with 13... Be7 14. Be3 Bxe4 15. fxe4 Bg5. Although I strongly recommended trading to simplify the position, Aphyer advised Alex not to let him develop his queen to g5, and he quickly played 16. Bc5 instead.
Black played 16... Re8, and that was where we reached White's first big mistake of the game - 17. d6, which Adam suggested with little backlash. I saw that White would do well after 17... cxd6 or 17... c6, but I didn't notice Black's actual move: 17... b6. According to the engine, White should have just dropped back and given up the pawn, but I suggested a different line, which Alex went with: 18. d7 Re6 19. Bb4 a5 20. Bc3.
Black then played 20... c5, which was a bit of a mistake on its part. We then had a debate over whether to play 21. Qd3 or 21. Qd5, with Aphyer arguing that 21. Qd5 could get the queen trapped and was probably a plot by me to help Black. Alex trusted Aphyer and went with the more passive option, 21. Qd3. In reality, it turned out that this really was the best move - after 21. Qd5, Black would have had possibilities like 21... Ra7 and 21... Be3+ 22. Kh1 Bd4.
After 21... Ra7 22. Rad1 Re7, I suggested 23. Qh3, so that after 23... Raxd7 24. Rxd7 Qxd7 25. Qxd7 Rxd7 we could just play 26. Be5 and get the pawn back. Black would have 26... Re7, but I thought White could play 27. Bf4 and keep a bit of an advantage. 23... Raxd7 24. Rxd7 happened, but Black then played 24... Rxd7. Alex played 25. Bxe5 immediately, and the game continued with 25... Qe8 26. Bc3.
Black played 26... g6, which was a mistake, opening up the dark squares around the king. I suggested 27. Qf3, but nobody else agreed, and Alex played 27. e5 instead. After 27... Kh7, he then decided to go ahead with 28. Qf3. (Also, Adam started suggesting Kh1 around this time. Everyone thought this was a bad move, but he continued to suggest it for a while.)
Black then played 28... f6. I don't know what was going on inside Komodo's analogue of a brain here, but yes, this was just a free pawn. Alex played 29. exf6. (The engine says 29. h4 would have been more accurate, presumably to prevent back-rank mates, but we didn't really consider that.) Black made yet another simple mistake after that: 29... Kg8. I suggested 30. f7 and just winning in the simplified endgame, but everyone was naturally very suspicious of a positional sacrifice like that, and Alex played 30. Qc6 instead.
From there, the game was just an easy win for White - 30... h5 31. f7+ Qxf7 32. Rxf7 Be3 33. Kf1 Rxd7 34. Ke2 Bd4 35. Bxd4 cxd4 36. Qxg6+ Rg7 37. Qe6 Kf8. During these last few moves, Adam admitted he was one of the liars and resigned. I said that I had wanted to ask everyone who they thought the liar was afterwards (but had entirely forgotten to mention that earlier), so Aphyer private messaged me when he was ready to resign. I told Alex that whichever one of us was the liar had resigned, and the game ended.
Conclusion
Adam said that by the end he was 90% sure I was the honest one, while Alex put 70% on Aphyer being honest.
My greatest mistake was definitely supporting 21. Qd5, but on a meta level I wonder whether I should have actually been less straightforward in my presentation of what I believed. In theory, there's a difference between optimizing for Alex to win, and being completely honest to Alex, and it might have been better for me to have been more strategic about my presentation. As in, not suggesting suspicious-looking moves like 30. f7, even though I thought they were right. Optimizing in someone's favor by not being completely honest with them sure is a really risky sort of thing to do, and I doubt I really could have pulled it off all that well, but it's something to take into consideration in the real-world AI scenario.
So what do we learn from this? The simple result was that Alex really did win the game, and that's a positive update! It suggests people really would be able to discern the truth in the AI scenario. But some of the details of the game make me worry that it won't actually generalize to AI and alignment.
To start with, there's the smaller skill gap than we would have preferred - perhaps the effect vanishes between noobs and Magnus Carlsen - as well as the shorter time control. There's also the part where in the real world, an unaligned AI will probably not accidentally correct another's Qd5 mistakes. And finally, a major reason that Aphyer and Adam didn't fool Alex was because for the first half of the game, they were mostly suggesting truly good moves in order to try to gain trust. In the real-world scenario, there's some element of that - presumably the only plans that could get close to fooling someone are those containing a few lies mixed in with a lot of truth. But I worry that there are going to be a lot more opportunities for the AIs to deceive the metaphorical Alexes than just once or twice - and it could only take one mistake to lose the game.
Still, it's an update in a hopeful direction! I'll run more experiments once I get more participants whose ratings and schedules fit well with each other, increase the time controls when possible, and get a lot more data than just one game.
Alex's comments:
Aphyer's comments: