I liked this one! I was able to have significant amounts of fun with it despite perennial lack-of-time problems.
Pros:
Cons:
It feels like this scenario should be fully knowably solvable, given time, except for the bonus guess at the end, which is very cool.
I think the bonus objective was a good idea in theory but not well tuned. It suffered from the classic puzzle problem of the extraction being the hard part, rather than the cool puzzle being the hard part.
I think it was perfectly reasonable to expect that at some point a player would group by [level, boots] and count and notice there was something to dig into.
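For concreteness, the group-and-count step could look something like the sketch below. The rows and the column names (`race`, `level`, `boots`) are made up for illustration and are not taken from the real dataset; the point is only that rare (level, boots) combinations float to the top.

```python
from collections import Counter

# Made-up toy rows, for illustration only; the real dataset's columns may differ.
rows = [
    {"race": "Elf", "level": 7, "boots": 4},
    {"race": "Elf", "level": 7, "boots": 4},
    {"race": "Human", "level": 7, "boots": 2},
    {"race": "Dwarf", "level": 3, "boots": 0},
    {"race": "Dwarf", "level": 3, "boots": 0},
    {"race": "Human", "level": 3, "boots": 0},
]

# Group by (level, boots) and count how many rows fall in each group.
counts = Counter((r["level"], r["boots"]) for r in rows)

# List the rarest combinations first: those are the ones worth digging into.
for (level, boots), n in sorted(counts.items(), key=lambda kv: kv[1]):
    print(f"level={level} boots=+{boots}: {n} rows")
```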
But, having found the elf anomaly, I don't think it was reasonable to expect that a player would be able to distinguish between
It's perfectly reasonable to expect that a player could generate a number of hypotheses and guess that the most likely was that they shouldn't reveal the +4 boots at all, but they would have no real way of confirming that guess; the fact that they're rewarded for guessing correctly is probably better than the alternative but is not satisfying IMO.
I found myself having done some data exploration but without time to focus and go much deeper. But also with a conviction that bouts were determined in a fairly simple way without persistent hidden variables (see Appendix A). I've done work with genetic programming but it's been many years, so I tried getting ChatGPT-4o w/ canvas to set me up a good structure with crossover and such and fill out the various operation nodes, etc. This was fairly ineffective; perhaps I could have better described the sort of operation trees I wanted, but I've done plenty of LLM generation / tweak / iterate work, and it felt like I would need a good bit of time to get something actually useful.
That said, I believe any halfway decently regularized genetic programming setup would have found either the correct ruleset or close enough that manual inspection would yield the right guess. The setup I had begun contained exactly one source of randomness: an operation "roll a d6". :D
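To give a flavor of the operation trees I had in mind, here is a minimal sketch; it is not the setup I actually built. The node types, the stat names (`level`, `boots`) and the example tree are all hypothetical. It only shows the representation, with "roll a d6" as the single random node, plus a node count that could serve as a regularization penalty; crossover and mutation are left out.

```python
import random
from dataclasses import dataclass
from typing import Union

@dataclass
class Const:
    value: int

@dataclass
class Var:
    name: str  # a fighter stat, e.g. "level" or "boots" (hypothetical names)

@dataclass
class D6:
    pass  # the only source of randomness: roll a d6

@dataclass
class BinOp:
    op: str  # one of the keys of OPS
    left: "Tree"
    right: "Tree"

Tree = Union[Const, Var, D6, BinOp]

OPS = {"+": lambda a, b: a + b, "-": lambda a, b: a - b, "max": max}

def evaluate(tree, stats, rng):
    """Evaluate an operation tree for one fighter's stats."""
    if isinstance(tree, Const):
        return tree.value
    if isinstance(tree, Var):
        return stats[tree.name]
    if isinstance(tree, D6):
        return rng.randint(1, 6)
    return OPS[tree.op](evaluate(tree.left, stats, rng),
                        evaluate(tree.right, stats, rng))

def size(tree):
    """Node count, usable as a parsimony (regularization) penalty."""
    if isinstance(tree, BinOp):
        return 1 + size(tree.left) + size(tree.right)
    return 1

# Hypothetical candidate rule: score = level + boots + d6
tree = BinOp("+", BinOp("+", Var("level"), Var("boots")), D6())
stats = {"level": 5, "boots": 2}
print(evaluate(tree, stats, random.Random(0)), size(tree))
```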
Appendix A: an excerpt from my LLM instructions
I believe the hidden generation is a simple fairly intuitive simulation. For example (this isn't right, just illustrative) maybe first we check for range (affected by class), see if speed (affected by race and boots) changes who goes first, see if strength matters at all for whether you hit (race and gauntlets), determine who gets the first hit (if everything else is tied then 50/50 chance), and first hit wins. Maybe some simple dice rolls are involved.
Given equal level, race, and class, regardless of gauntlets, the fighter with better boots always wins, no exceptions.
A very good predictor of victory for many race/class vs race/class matchups is the difference in level+boots plus a static modifier based on your matchup. Probably when it's not as good we should be taking into account gauntlets. But also ninjas seem to maybe just do something weird. I'm guessing a sneak attack of some sort.
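A rough sketch of that predictor, with made-up numbers: the function name, the dict layout and the modifier value are hypothetical, and gauntlets and whatever ninjas do are ignored entirely.

```python
def advantage(fighter, opponent, matchup_modifier=0):
    """Heuristic advantage: difference in level+boots plus a static
    race/class matchup modifier. Positive means `fighter` is favored."""
    own = fighter["level"] + fighter["boots"]
    other = opponent["level"] + opponent["boots"]
    return own - other + matchup_modifier

# Made-up fighters and modifier, for illustration only.
a = {"level": 5, "boots": 2}
b = {"level": 6, "boots": 0}
print(advantage(a, b, matchup_modifier=2))
```

Seen from the other side, the matchup modifier flips sign, so `advantage(b, a, matchup_modifier=-2)` is the negative of the value above.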
Anyway just manually matching up our available gladiators yields this setup which seems extremely likely to simply win:
# Elf Knight to beat Human Warrior 9 with just 1 adv. Needs Boots 3+
# Elf Fencer to beat Human Knight 9 by a lot but gauntlets might matter. Boots +1 are fine. Send Gauntlets +2.
# Human Monk to beat Elf Ninja 9 with 3 adv but gauntlets might matter. Needs Boots 2+. Send Gauntlets +3.
# Human Ranger to beat Dwarf Monk 9 with just 1 adv. Needs Boots 4+
aka
Give Zelaya the +3 Boots of Speed and the +1 Gauntlets of Power and send them to fight House Adelon's champion.
Give Yalathinel the +1 Boots of Speed and the +2 Gauntlets of Power and send them to fight House Bauchard's champion.
Give Xerxes III the +2 Boots of Speed and the +3 Gauntlets of Power and send him to fight House Cadagal's champion.
Give Willow the +4 Boots of Speed and send her to fight House Deepwrack's champion.
Do not send Uzben or Varina to fight at all.
The problem is that the Elf Ninja might want their +4 Boots. Or might want us to definitely not use them. Or something. As-is, we win; if the Elf Ninja is gonna be irate afterwards maybe winning isn't enough, but I dunno how to reliably win without using the +4 Boots. We can certainly try to schedule Willow's fight first, then after the fight against House Cadagal we can gift the +4 Boots back. I think the only better alternative is if it turns out the Elf Ninja is actually willing to throw the match for the +4 Boots back and be friendly with us afterwards, in which case probably there are better ways to set this up.
I haven't yet gotten into any stats or modeling, just some data exploration, but there are some things I haven't seen mentioned elsewhere yet:
Zeroth: the rows are definitely in order!
First: the arena holds regular single-elimination tournaments with 64 participants (63 bouts in total), and these form contiguous blocks in the dataset with a handful of (unrelated?) bonus rounds in between.
Second: maybe the level 7 Dwarf Monk stole (won?) those +4 boots by winning a tournament (the Elf Ninja's last use was during a final round vs that monk!) and then we acquired the boots from that monk? They appear to have upgraded their boots once before from +1 to +3 when defeating a Dwarf Ninja, though that was during a bonus round, not a tournament.
Does the fact that we see the winners of tournaments 6x more often than those eliminated in round one matter for modeling? It might; if e.g. gladiators have a hidden "skill" stat but for some reason the house champions don't have very high skill, we'll be implicitly significantly overestimating their hidden skill stat.
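The 6x figure is just bracket arithmetic: with 64 participants the winner fights in all 6 rounds, while someone eliminated in round one fights once. A small sketch, assuming the participant count is a power of two (the function name is mine):

```python
def bouts_per_round(participants=64):
    """Number of bouts in each round of a single-elimination bracket.
    Assumes `participants` is a power of two."""
    rounds = []
    while participants > 1:
        participants //= 2  # each bout eliminates one fighter
        rounds.append(participants)
    return rounds

rounds = bouts_per_round(64)
total_bouts = sum(rounds)

# A fighter eliminated in round k appears in k bouts; the winner appears in all.
winner_appearances = len(rounds)
first_round_loser_appearances = 1

print(rounds, total_bouts, winner_appearances // first_round_loser_appearances)
```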
Not to toot my own horn* but we detected it when I was given the project of adapting some of our visualizations to accept QA's format, so that QA could look at their results using those visualizations, and I was like "... so how does QA work here, exactly? Like what's the process?"
I do not know the real-world impact of fixing the overfitting.
*tooting one's own horn always follows this phrase
Once upon a time I worked on language models and we trained on data that was correctly split from tuning data that was correctly split from test data.
And then we sent our results to the QA team who had their own data, and if their results were not good enough, we tried again. Good enough meant "enough lift over previous benchmarks". So back and forth we went until QA reported success. On their dataset. Their unchanging test dataset.
But clearly since we correctly split all of our data, and since we could not see the contents of QA's test dataset, no leakage could be occurring.
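The leak is in the retry loop itself. A toy simulation, entirely made up and nothing like our actual pipeline: every "model" below is a fixed pseudo-random guesser with no real signal, and we keep whichever one scores best on a fixed QA set. The dataset sizes, the number of attempts and all function names are assumptions chosen for illustration.

```python
import random

def make_dataset(rng, n, start_id):
    # Each example is (feature id, label); labels are pure coin flips.
    return [(start_id + i, rng.getrandbits(1)) for i in range(n)]

def predict(model_id, x):
    # A "model" with no real signal: a fixed pseudo-random guess per input.
    return random.Random(model_id * 1_000_003 + x).getrandbits(1)

def accuracy(model_id, data):
    return sum(predict(model_id, x) == y for x, y in data) / len(data)

def iterate_against_qa(n_attempts=50, seed=0):
    rng = random.Random(seed)
    qa_set = make_dataset(rng, 200, start_id=0)         # fixed, never changes
    fresh_set = make_dataset(rng, 2000, start_id=200)   # never used for selection
    # "Try again until QA reports success": keep the attempt QA likes best.
    best = max(range(n_attempts), key=lambda m: accuracy(m, qa_set))
    return accuracy(best, qa_set), accuracy(best, fresh_set)

qa_acc, fresh_acc = iterate_against_qa()
print(f"QA accuracy {qa_acc:.3f}, fresh accuracy {fresh_acc:.3f}")
```

None of the candidates can do better than chance on new data, yet the selected one looks better than chance on the QA set, purely because the QA set was used to pick it.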
It is absolutely true that people find it frustrating losing to players worse than them, in ways that feel unfair. Getting used to that is another skill, similar to the one described above, where you have to learn to feel reward when you make a positive EV decision, rather than when you win money.
This is by far the most valuable thing I learned from poker. Reading Figgie's rules, it does seem like Figgie would teach it too, and faster.
The most common reason I've seen for "modafinil isn't great for me" is trying to use it for something other than
Feeling pain after hearing a bad joke. "That's literally painful to hear" is self-reportedly actually literal for some people (I say this in the same way that I, without a mind's eye, would talk about people who have one).