Agreed, re: the limitations of my method. As you suggested, I ran another pass using only the top 7 candidates (wins >= 19 in my previous comment). Here are the results:
3: blue/red
5: blue/green
7: blue/blue
7: green/green
7: green/red
9: green/blue
11: green/yellow
Choosing the top 10 (wins >= 17 from before):
7: blue/red
7: red/green
9: green/green
9: green/red
11: blue/blue
11: blue/green
11: blue/yellow
11: green/blue
11: yellow/yellow
13: green/yellow
Yellow/yellow pops up as a surprise member of the 5-way tie for second place. The green sword is less effective once you introduce these new members. More surprises would probably turn up if you kept varying which members are allowed. And all of this still assumes a normal distribution, which is unlikely.
Pursuing this stupidity to its logical conclusion, I just did an elimination match with 16 rounds. Start with all combinations and cull the weakest member every round. Here's the result: http://pastie.org/1217255
Note that the culling is arbitrary whenever there's a tie for last place. By pass 14, we have a 3-way tie between blue/blue, blue/green, and green/yellow. Those may very well be the best three combinations, or close to it.
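For anyone who wants to try the elimination idea without opening the pastie links, here is a minimal sketch of the loop in Python. It is not the linked program: the function name eliminate, the beats callback, and the toy strength table are all made up for illustration, and the real sword/armor numbers from the image are not reproduced here. Ties for last place are broken by list order, which is exactly the arbitrariness mentioned above.

```python
def eliminate(candidates, beats):
    """Cull the candidate with the fewest wins each round until one remains.

    candidates: list of hashable labels (e.g. "blue/red").
    beats(a, b): hypothetical callback, True if a wins a duel against b.
    Returns (survivors, history), where history holds each round's
    (candidate, wins) pairs sorted from fewest wins to most.
    """
    pool = list(candidates)
    history = []
    while len(pool) > 1:
        # Count each candidate's wins against everyone still in the pool.
        wins = {c: sum(1 for o in pool if o != c and beats(c, o)) for c in pool}
        history.append(sorted(wins.items(), key=lambda kv: kv[1]))
        fewest = min(wins.values())
        # Tie for last place: drop whichever comes first in the pool (arbitrary).
        loser = next(c for c in pool if wins[c] == fewest)
        pool.remove(loser)
    return pool, history


# Toy data, purely illustrative: a higher number always wins.
strength = {"a": 3, "b": 1, "c": 2}


def beats_toy(x, y):
    return strength[x] > strength[y]


survivors, rounds = eliminate(["a", "b", "c"], beats_toy)
print(survivors)
```

To use it on the actual puzzle, you would replace beats_toy with a function that compares two sword/armor combinations using the stats from the image. Note that the result can depend on the order of the input list whenever a round ends in a tie for last place.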
Final version of program here: http://pastie.org/1217284
(Removed randomness and just factored in the probability of evasion i...
Note: this image does not belong to me; I found it on 4chan. It presents an interesting exercise, though, so I'm posting it here for the enjoyment of the Less Wrong community.
For the sake of this thought experiment, assume that all characters have the same amount of HP, which is sufficiently large that random effects can be treated as being equal to their expected values. There are no NPC monsters, critical hits, or other mechanics; gameplay consists of two PCs getting into a duel, and fighting until one or the other loses. The winner is fully healed afterwards.
Which sword and armor combination do you choose, and why?