Yay, I wasn't last!
Still, I'm not surprised that laziness did not pay off. I wrote a simple bot, then noticed that it cooperated against defectbot and defected against itself. I thought to myself, "This is not a good sign." Then I didn't bother changing it.
Can I suggest running a second tournament, including all these bots plus new bots whose authors can examine these bots while writing them? After a few iterations, we might beat random. Or we might develop all sorts of signaling quirks, but that would also be interesting.
When I first announced the tournament, people came up with many suggestions for modifications to the rules, most of which I did not adopt, for the sake of simplicity. Maybe we should try to figure out what rules would make it easier to write more effective bots, even if they would disqualify many of the bots already submitted to this tournament.
I went and ran this another 100 times, so I could see what it would look like without the variance. The mean scores are:
A: 32.03
B: 28.53
C: 32.48
D: 24.94
E: 28.75
F: 29.62
G: 28.42
H: 26.12
I: 26.06
J: 26.10
K: 36.15
L: 27.21
M: 25.14
N: 34.37
O: 31.06
P: 26.55
Q: 34.95
R: 32.93
S: 37.08
T: 26.43
U: 24.24
If you're interested, here's the code for the test (takes a day to run) and the raw output from my run (in an inconvenient format, but it shows the stats for the matchups).
And for the lazy, here are these scores in sorted order (with original scores in parentheses):
S: 37.08 - Quinn (33)
K: 36.15 - selbram (34)
Q: 34.95 - Margaret Sy (39)
N: 34.37 - BloodyShrimp (34)
R: 32.93 - So8res, NateBot (33)
C: 32.48 - THE BLACK KNIGHT (36)
A: 32.03 - rpglover64 (38)
O: 31.06 - caa (32)
F: 29.62 - Billy, Mimic-- (27)
E: 28.75 - Devin Bayer (30)
B: 28.53 - Watson Ladd (27)
G: 28.42 - itaibn (34)
L: 27.21 - Alexei (25)
P: 26.55 - nshepperd (25)
T: 26.43 - HonoreDB (23)
H: 26.12 - CooperateBot (24)
J: 26.10 - oaz (26)
I: 26.06 - Sean Nolan (28)
M: 25.14 - LEmma (25)
D: 24.94 - skepsci (24)
U: 24.24 - SlappedTogetherAtTheLastMinuteBot (20)
Has someone gone through all the programs to figure out what strategy they're supposed to be implementing, then checked that they in fact correctly implement that strategy? Also, given that the bots were allowed to use randomness, it would've been a good idea to average a bunch of rounds together.
A tournament like this would be much more interesting if it involved multiple generations. Here, the results heavily depended upon the pool of submitted strategies, regardless of their actual competitiveness, while a multiple-generations tournament would measure success as performance against other successful strategies.
H, I, and J are identical and received noticeably different scores. Maybe next time, repeat the whole process a few times to even out the RNG?
Also, it would be interesting to see the head-to-head matchups, especially to see if R's complexity actually helped even though it seemed mostly aimed at special cases that never came up.
Hmm. Considering the large number of random strategies, it seems like it might be worth iterating the tournament several more times and summing the point totals, particularly since C (one of the randoms) is above the three-way tie for 4th place between G, K, and N by only two points, and A and C also have a two-point difference despite having functionally identical strategies (50% random Cooperate or Defect). I feel like that implies the standard deviations may make analysis of the results premature.
Nonetheless, analyzing is fun, so I'll do a little bit anyway on the fourth placers.
K appears to be functionally identical to a Defect bot. It has more complicated code, but that code never led it to cooperate. (Side note: by chance, in this run, Margaret Sy's winning strategy, random with an 80% bias toward defection, only cooperated with this bot and defected against all other bots.)
Also, G and N, despite being in 4th place, didn't actually seem to successfully take advantage of any of the 3 pseudosimplistic cooperate bots in this run (H, I, J: simulate something, then cooperate regardless) by defecting against them, which would seem to be one of the key ways a mind reader could try to score points for a win. I ...
I don't know Scheme at all and can't easily understand exactly what all these guys are doing, but:
This is obviously NOT a stable metagame, in both senses.
If we ran this tournament again, the random-bots would get utterly destroyed (since it's very easy/fast to check for them and defect). If we ran this tournament with multiple rounds, where successful strategies survive and unsuccessful ones die, and there was even one defect-against-randombot entry in the mix, it would win as long as it didn't die early.
My guess-happy summary of what happened is this: The core problem is very hard, so most people instead did something simple, often very simple (cooperate bot, defect bot, random bot - which won!), with a small number of people trying for real. The people trying for real, however, spent all their energy planning for other people who were trying for real and trying to fool them, rather than trying to act correctly against all the super-simple programs, because if you put in that much work you assume other people will mostly put in at least some work. (R has an entire long blog post of metagame and strategic analysis which is COMPLETELY wrong in exactly this way; in real life I do this a lot.) That, combined wit...
One additional note on a second tournament: I notice bots weren't played against themselves. Should this be done? It seems useful as a sanity check; I feel like a bot that won't even cooperate against itself ought to get fewer points.
Why? In what circumstance would you ever be offered a choice to cooperate or compete with your concurrent self?
Probably in a respectable fraction of the universes in which other agents have a copy of your source code.
Huh. I'm not entirely surprised that random behavior was dominant, but it's certainly depressing. Nevertheless, I'm glad you ran this.
This may be more a reflection of the fact that programming interesting strategies which run each other's code and fully do what you want them to do is tough. It might be interesting to look at the strategies which did look at the other's code and see if any of them responded in an obviously wrong fashion in that particular instance.
I wasn't expecting there to be so much randomness. If I had been, I guess I would have spent the time to figure out how to test for it in some kind of loop and defect if the opponent was random. Looping in lisp is unnecessarily complicated though, so I didn't bother...
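For what it's worth, a named let would have done the looping; here's a rough, untested sketch of the check I had in mind (assuming the standard one-argument-lambda format and that eval works normally):

    ; Untested sketch: simulate the opponent several times against a
    ; cooperate-bot's source and report whether its answers ever disagree.
    (define (inconsistent? opponent-source)
      (define dummy '(lambda (them) 'C))
      (define first-move ((eval opponent-source) dummy))
      (let loop ((n 9))
        (cond ((zero? n) #f)               ; consistent every time
              ((equal? ((eval opponent-source) dummy) first-move)
               (loop (- n 1)))
              (else #t))))                 ; answers disagree: looks random

A bot could then defect whenever (inconsistent? opponent-source) returns #t.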
That suggests that a large part of this behavior may have been due to limited amounts of programming time.
Fantastic work, thank you =)
For anyone else unpacking the zip file, note you'll want to create a new directory to unzip it into, not just unzip it in the middle of your home folder.
Wait, someone actually submitted CooperateBot? That wasn't just a default entry?
Thoughts on a possible second tournament:
As one of the players who submitted a cooperatebot (yes, I see the irony), allow me to explain my reasoning for doing so. I scoped the comments to see what bots were being suggested (mimicbots, prudentbots, etc) and I saw much more focus on trying to enforce mutual cooperation than trying to exploit bots that can be exploited. I metagamed accordingly, hypothesizing that the other bots would cooperate with a cooperatebot but possibly fail to cooperate with each other. My hypothesis was incorrect, but worth testing IMO.
Considering this was an experimental tournament, learning how certain strategies perform against others seems far more interesting to me than winning, and I can't imagine any strategy I would label as a troll submission. Even strategies solely designed to be obstacles are valid and valuable contributions, and the fact that random strategies skew the results is a fault of the tournament rules and not of the strategies themselves.
:(
I didn't realise that when you said programs should assume eval would work normally, that meant I needed to make my simulated eval take a namespace as an optional argument (and use the proper namespace containing the standard library as a default). So my program crashed whenever it simulated something that uses (eval x). There goes my clever algorithm, lol.
Oh well. I don't think I would have won anyway. The problem seems to be that the REPL behaves very differently to non-REPL contexts, which I certainly wasn't expecting. Actually, it looks like eval in the REPL uses the REPL namespace as a default (so for example you can type (eval '(define a 2)) (write a) and see a 2 printed out), which seems rather unconscionable to me¹, especially since the REPL namespace includes a boatload of miscellaneous modules such as racket/pretty and racket/function, and probably others as well.
Possibly next time it would be useful to publish a "test harness" ahead of time which runs contestants against each other in a clean (or at least well-specified) namespace, and make it clear that the actual contest would be run using this code.
¹ Actually, now that I've noticed this, it looks like you could use this to write a bot that corrupts the global (REPL) namespace, and therefore the namespace of anyone trying to simulate it, maybe even forcing the opponent to cooperate by rewriting if or something. Ha!
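To make the test-harness suggestion above concrete, here is a minimal, untested sketch of what I mean (assuming the one-argument-lambda submission format; a real harness would also need sandboxing and time limits):

    ; Each bot runs in a fresh base namespace, so any (eval x) it performs
    ; sees the same well-specified environment instead of whatever happens
    ; to be loaded into the REPL.
    (define (run-bot my-source opponent-source)
      (define ns (make-base-namespace))
      (parameterize ([current-namespace ns])
        ; Both evaluating the bot and applying it happen under the clean
        ; namespace, so nested evals inside the bot default to it as well.
        ((eval my-source ns) opponent-source)))

    (define (play source-a source-b)
      (list (run-bot source-a source-b)
            (run-bot source-b source-a)))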
If only THE BLACK KNIGHT had won the tournament, I would've revealed my identity then. This way, the identity of THE BLACK KNIGHT shalt forevermore remain a mystery. Giddyup!
In more seriousness though, how could this happen - doing worse than random? Just lacking a clause filtering for the string 'random' (that could be exploitable, too), or alternatively not simulating the opponent, say, 10 times, then going with the majority behavior if there is one, and treating the opponent as a randombot if there isn't (6-4, 5-5, 4-6). Won't filter all randombots, of c...
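Roughly what I mean, as an untested sketch (assuming the usual one-argument-lambda format and the C/D symbols):

    ; Simulate the opponent 10 times against a cooperate-bot and tally the
    ; cooperations. A clear majority gives the predicted move; anything in
    ; the 4-6 band gets treated as a randombot.
    (define (classify opponent-source)
      (define cooperate-bot '(lambda (them) 'C))
      (define cooperations
        (let loop ((trials 10) (count 0))
          (if (zero? trials)
              count
              (loop (- trials 1)
                    (if (eq? ((eval opponent-source) cooperate-bot) 'C)
                        (+ count 1)
                        count)))))
      (cond ((>= cooperations 7) 'mostly-cooperates)
            ((<= cooperations 3) 'mostly-defects)
            (else 'randombot)))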
Given Rice's theorem, there is no source code analysis that is guaranteed to tell you anything interesting about your opponent, unless your opponent has been written in such a way as to facilitate that analysis. This suggests two possible modifications of the competition.
Firstly, the rule could be imposed that source code is unanalysable. A competing program will not be furnished with its opponent's source, but only with an opaque blob. The only operation it can perform on this blob is to supply it with another blob and see what answer it gives. Each program ...
Sorry if this is a stupid question, but this tournament looked to me like a thinly disguised version of:
"Construct a robot that can read code and interpret what it means."
which is a Really Hard Problem.
Is that not a fair description? Was there some other way to approach the problem?
The only way I can see to go about constructing a GOOD entrant to this is to write something that can take the code of the opponent as its input and interpret what it will actually DO, and that can recognize the equivalence between (say):
return DEFECT
and
if 1: return ...
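In the contest's Scheme, say (a toy example of my own, assuming the C/D output convention), that means recognizing that these two are the same bot:

    (lambda (opponent-source) 'D)

    (lambda (opponent-source) (if #t 'D 'C))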
Some notes I've made on the entries:
(Note, this is largely from the perspective of "I want to write a bot that works by doing lots of static analysis to the other bots", since I was working on such a bot. In fact I can see now that the idea I had in mind didn't actually make sense, but, oh well, I have an idea for how to fix it for the next contest...)
First of all, thanks for running the contest Alex and congrats to the winners.
Random behaviour was obviously a good strategy, but I don't feel it teaches us much about making better bots. If I had suspected so much random behaviour, testing for it and defecting whenever it's detected would have been a good addition.
Hmm, three way tie for fourth place.
I thought So8res was trying to set up subtle vulnerabilities in the mimicbot clique to exploit himself! My own exploit (store their size and see if it changes; if so, defect, on the assumption that most bots that change their size will only change it once; play small-n mimicbot so you can induce cooperation on their turn and in the cases where the exploit doesn't fire at all) seems to have cut the problem a little cleaner, picking up more nonrandom DCs. I had essentially thought of this exploit in the first few days of the contest, but I ha...
Sorry if this was answered somewhere else, but do you plan on hosting another tournament? If so, when?
One more question about rules for a second contest: Do we want to keep the rule as stated that "Each player will submit a file containing a single Scheme lambda-function. The function should take one input.", or should we relax it to "Each player will submit a file containing a single Scheme expression, that must evaluate to a function taking one input"?
Benefits of the latter: Allows quines to be written less artificially (although, if we switch to the two-argument version, this becomes unimportant), is just generally more flexible.
Bene...
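To illustrate the less-artificial-quines point (a toy, untested example of my own, assuming the C/D convention): under the relaxed rule, a whole submission can be the classic apply-a-lambda-to-its-own-quotation construction, so the bot gets a copy of its own source without embedding quine machinery inside the lambda body.

    ; An expression (not itself a lambda) that evaluates to a bot which
    ; knows its own source and cooperates only with exact copies of itself.
    ((lambda (half)
       (define my-source (list half (list 'quote half)))
       (lambda (opponent-source)
         (if (equal? opponent-source my-source) 'C 'D)))
     '(lambda (half)
        (define my-source (list half (list 'quote half)))
        (lambda (opponent-source)
          (if (equal? opponent-source my-source) 'C 'D))))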
I would have expected a few more metaevaluator bots which clean up the environment to prevent detection. Of course this can be an expensive strategy, and certainly is in programmer time. A metaevaluator bot would have probably broken several recursion detection strategies, or even defining set! out of the language.
The prisoner's dilemma tournament is over. There were a total of 21 entries. The winner is Margaret Sy, with a total of 39 points. 2nd and 3rd place go to rpglover64 and THE BLACK KNIGHT, with scores of 38 and 36 points respectively. There were some fairly intricate strategies in the tournament, but all three of these top scorers submitted programs that completely ignored the source code of the other player and acted randomly, with the winner having a bias towards defecting.
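(Functionally, the top three amounted to something like the following illustration; this is not any contestant's actual code, and the real entries differed in details such as the exact defection bias.)

    ; A source-ignoring bot biased toward defection, roughly the kind of
    ; strategy that took the top three spots.
    (lambda (opponent-source)
      (if (< (random) 0.8) 'D 'C))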
You can download a chart describing the outcomes here, and the source codes for the entries can be downloaded here.
I represented each submission with a single letter while running the tournament. Here is a directory of the entries, along with their scores: (some people gave me a term to refer to the player by, while others gave me a term to refer to the program. I went with whatever they gave me, and if they gave me both, I put the player first and then the program)
A: rpglover64 (38)
B: Watson Ladd (27)
C: THE BLACK KNIGHT (36)
D: skepsci (24)
E: Devin Bayer (30)
F: Billy, Mimic-- (27)
G: itaibn (34)
H: CooperateBot (24)
I: Sean Nolan (28)
J: oaz (26)
K: selbram (34)
L: Alexei (25)
M: LEmma (25)
N: BloodyShrimp (34)
O: caa (32)
P: nshepperd (25)
Q: Margaret Sy (39)
R: So8res, NateBot (33)
S: Quinn (33)
T: HonoreDB (23)
U: SlappedTogetherAtTheLastMinuteBot (20)