Notes on my performance:
Well, I feel pretty dumb (which is the feeling of becoming smarter). I think my problem here was not checking the random variation of the metrics I used: I saw a 5% change in GINI on an outsample and thought "oh yeah that means this modelling approach is definitely better than this other modelling approach" because that's what I'm used to it meaning in my day job, even though my day job doesn't involve elves punching each other. (Or, at least, that's my best post hoc explanation for how I kept failing to notice simon's better model was indeed better; it could also have been down to an unsquished bug in my code, and/or LightGBM not living up to the hype.)
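For what it's worth, the check I failed to do is cheap. This is a minimal sketch (nothing here is from the actual scenario): it computes out-of-sample Gini as 2*AUC - 1 and bootstraps the validation rows to see how much the metric moves on its own, so you can tell whether a 5% gap between two models is signal or noise.

```python
import numpy as np

def gini(y_true, y_score):
    """Gini coefficient (2*AUC - 1), via the Mann-Whitney pair-counting statistic."""
    pos = y_score[y_true == 1]
    neg = y_score[y_true == 0]
    # Fraction of (positive, negative) pairs ranked correctly, ties counting half.
    wins = (pos[:, None] > neg[None, :]).mean()
    ties = (pos[:, None] == neg[None, :]).mean()
    return 2 * (wins + 0.5 * ties) - 1

def gini_bootstrap_interval(y_true, y_score, n_boot=1000, seed=0):
    """Bootstrap the out-of-sample Gini to estimate its random variation."""
    rng = np.random.default_rng(seed)
    n = len(y_true)
    stats = []
    for _ in range(n_boot):
        idx = rng.integers(0, n, size=n)  # resample rows with replacement
        stats.append(gini(y_true[idx], y_score[idx]))
    return np.percentile(stats, [2.5, 97.5])
```

If the gap between two models' Ginis sits comfortably inside the interval this returns, "model A beat model B by 5%" doesn't mean much.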
ETA: I have finally tracked down the trivial coding error that ended up distorting my model: I accidentally used kRace in a few places where I should have used kClass while calculating simon's values for Speed and Strength.
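For illustration (kRace and kClass are the actual names from my code, but the table layout and numbers below are invented), the bug was of roughly this shape:

```python
# Hypothetical reconstruction of the bug; the indices and lookup tables
# are made up for illustration.
kClass, kRace = 0, 1                   # column indices into a character row
SPEED_BY_CLASS = {"Ninja": 4, "Ranger": 2, "Warrior": 0}
SPEED_BY_RACE = {"Elf": 2, "Human": 1, "Dwarf": 0}

def speed(row):
    # The buggy version indexed the class table with row[kRace],
    # silently distorting the stat instead of crashing.
    return SPEED_BY_CLASS[row[kClass]] + SPEED_BY_RACE[row[kRace]]
```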
Notes on the scenario:
I thought the bonus objective was executed very well: you told us there was Something Else To Look Out For, and provided just enough information that players could feel confident in their answers after figuring things out. I also really liked the writing. Regarding the actual challenge part of the challenge . . . I'm recusing myself from having an opinion until I figure out how I could have gotten it right; all I can tell you for sure is this wasn't below 4/5 Difficulty. (Making all features' effects conditional on all other features' effects tends to make both Analytic and ML solutions much trickier.)
ETA: I now have an opinion, and my opinion is that it's good. The simple-in-hindsight underlying mechanics were converted seamlessly into complex and hard-but-fair-to-detangle feature effects; the flavortext managed to stay relevant without dominating the data. This scenario also fits in neatly alongside earlier entries with superficially similar premises: we've had "counters matter" games, "archetypes matter" games, and now a "feature engineering matters" game.
I have exactly one criticism, which is that it's a bit puzzlier than I'd have liked. Players get best results by psychoanalyzing the GM and exploiting symmetries in the dataset, even though these aren't skills which transfer to most real-world problems, and the real-world problems they do transfer to don't look like "who would win a fight?"; this could have been addressed by having class and race effects be slightly more arbitrary and less consistent, instead of having uniform +Strength / -Speed gaps for each step. However, my complaint is moderated by the facts that:
- This is an isekai world: simplified mechanics and uncannily well-balanced class systems come with the territory. (I thought the lack of magic-users was a tell for "this one will be realistic-ish", but that's on me tbh.)
- Making the generation function any more complicated would have made it (marginally but nontrivially) less elegant and harder to explain.
- I might just be being a sore loser / only-barely-winner here.
- Puzzles are fun!
> ETA: I have finally tracked down the trivial coding error that ended up distorting my model: I accidentally used kRace in a few places where I should have used kClass while calculating simon's values for Speed and Strength.
Thanks for looking into that: I spent most of the week being very confused about what was happening there but not able to say anything.
Thanks aphyer, this was an interesting challenge! I think I got lucky with finding the power/speed mechanic early - the race-class matchups didn't, I think, in principle have enough info on their own to support a reliable conclusion, but they enabled me to make a genre-savvy guess which I could then refine based on other info. In terms of scenario difficulty, though, I think it could have been deduced in a more systematic way, e.g. by looking at item and level effects for mirror matches.
abstractapplic's and Lorxus's discovery of persistent level 7 characters, and especially SarahSrinivasan's discovery of the tournament/non-tournament structure, meant the players collectively were, I think, quite a long way towards fully solving this. The latter, in addition to being interesting on its own, is very important for finding anything else about the generation, due to its biasing effects.
I agree with abstractapplic on the bonus objective.
One learning experience for me here was trying out LLM-empowered programming after the initial spreadsheet-based solution finding. Claude enables quickly writing (from my perspective as a non-programmer, at least) even a relatively non-trivial program. And you can often ask it to write a program that solves a problem without specifying the algorithm and it will actually give something useful...but if you're not asking for something conventional it might be full of bugs - not just in the writing up but also in the algorithm chosen. I don't object, per se, to doing things that are sketchy mathematically - I do that myself all the time - but when I'm doing it myself I usually have a fairly good sense of how sketchy what I'm doing is*, whereas if you ask Claude to do something it doesn't know how to do in a rigorous way, it seems it will write something sketchy and present it as the solution just the same as if it actually had a rigorous way of doing it. So you have to check. I will probably be doing more of this LLM-based programming in the future, but am thinking of how I can maybe get Claude to check its own work. Some automated way to pipe the output to another (or the same) LLM and ask "how sketchy is this and what are the most likely problems?". Maybe manually looking through to see what it's doing, or at least getting the LLM to explain how the code works, is unavoidable for now.
* when I have a clue what I'm doing which is not the case, e.g. in machine learning.
I found myself having done some data exploration but without time to focus and go much deeper. But also with a conviction that bouts were determined in a fairly simple way without persistent hidden variables (see Appendix A). I've done work with genetic programming but it's been many years, so I tried getting ChatGPT-4o w/ canvas to set me up a good structure with crossover and such and fill out the various operation nodes, etc. This was fairly ineffective; perhaps I could have better described the sort of operation trees I wanted, but I've done plenty of LLM generation / tweak / iterate work, and it felt like I would need a good bit of time to get something actually useful.
That said, I believe any halfway decently regularized genetic programming setup would have found either the correct ruleset or close enough that manual inspection would yield the right guess. The setup I had begun contained exactly one source of randomness: an operation "roll a d6". :D
Appendix A: an excerpt from my LLM instructions
I believe the hidden generation is a simple fairly intuitive simulation. For example (this isn't right, just illustrative) maybe first we check for range (affected by class), see if speed (affected by race and boots) changes who goes first, see if strength matters at all for whether you hit (race and gauntlets), determine who gets the first hit (if everything else is tied then 50/50 chance), and first hit wins. Maybe some simple dice rolls are involved.
Yeah, my recent experience with trying out LLMs has not filled me with confidence.
In my case the correct solution to my problem (how to use kerberos credentials to authenticate a database connection using a certain library) was literally 'do nothing, the library will find a correctly-initialized krb file on its own as long as you don't tell it to use a different authentication approach'. Sadly, AI advice kept inventing ways for me to pass in the path of the krb file, none of which worked.
I'm hopeful that they'll get better going forward, but right now they are a substantial drawback rather than a useful tool.
Thank you for posting this. Overall I would rate this as a middle-of-the-road (i.e. good) scenario. Complexity 3/5, quality 3/5.
I thought the bonus objective was in principle a good addition, though it could have done with an extra couple of known words. As it is, unless you spot the anomaly with Cadagal's boots, it seems next to impossible to figure out what it might mean.
Overall I think a gap of about 2 months is better than short gaps of one month followed by longer gaps of several months. Though possibly not this time as that would put it right in the middle of the Christmas period!
I think the bonus objective was a good idea in theory but not well tuned. It suffered from the classic puzzle problem of the extraction being the hard part, rather than the cool puzzle being the hard part.
I think it was perfectly reasonable to expect that at some point a player would group by [level, boots] and count and notice there was something to dig into.
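Concretely, the kind of check I had in mind is a one-liner; the column names below are stand-ins for illustration, not the dataset's actual schema.

```python
import pandas as pd

# Toy stand-in for the dataset: "level" and "boots" are assumed column names.
df = pd.DataFrame({
    "level": [7, 7, 7, 7, 3, 3],
    "boots": [0, 0, 0, 4, 0, 2],
})

# Count characters by (level, boots); an over- or under-represented cell
# is the anomaly worth digging into.
counts = df.groupby(["level", "boots"]).size().unstack(fill_value=0)
print(counts)
```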
But, having found the elf anomaly, I don't think it was reasonable to expect that a player would be able to distinguish between the candidate explanations.
It's perfectly reasonable to expect that a player could generate a number of hypotheses and guess that the most likely was that they shouldn't reveal the +4 boots at all, but they would have no real way of confirming that guess; the fact that they're rewarded for guessing correctly is probably better than the alternative but is not satisfying IMO.
I think this is just an unavoidable consequence of the bonus objective being outside-the-box in some sense: any remotely-real world is much more complicated than the dataset can ever be.
If you were making this decision at a D&D table, there are a number of questions you might want to ask the GM. I can't realistically explain all of these up front in the scenario! And these are just the questions I can think of - in my last scenario (linked comment contains spoilers for that if you haven't played it yet), the players came up with a zany scheme I hadn't considered myself.
Overall, I think if you realized that the +4 Boots in your inventory came from the Elf Ninja you can count yourself as having accomplished the Bonus Objective regardless of what you decided to do with them. (You can imagine that you discussed the matter with the GM and your companions, asked all the questions above, and made a sensible decision based on the answers).
I liked this one! I was able to have significant amounts of fun with it despite perennial lack-of-time problems.
Pros:
Cons:
It feels like this scenario should be fully knowably solvable, given time, except for the bonus guess at the end, which is very cool.
I didn't enjoy this one as much, but that's likely down to not having had the time/energy to spend on thinking this through deeply. That said... I did not in fact enjoy it as much and I mostly feel like garbage for having done literally worse than chance, and I feel like it probably would have been better if I hadn't participated at all.
I don't think you should feel bad about that! This scenario was pretty complicated and difficult, and even if you didn't solve it I think "tried to solve it but didn't quite manage it" is more impressive than "didn't try at all"!
This is a follow-up to last week's D&D.Sci scenario: if you intend to play that, and haven't done so yet, you should do so now before spoiling yourself.
There is a web interactive here you can use to test your answer, and generation code available here if you're interested, or you can read on for the ruleset and scores.
RULESET
A character's class/race/level/items boil down to giving them scores in two separate stats: Speed and Power.
Stat Calculation
Each parameter of a character has an effect on their stats:
So, for example, the following characters all have Speed 10 and Power 10:
These characters are, from a mechanical perspective, all completely identical. They will behave the same in every possible matchup.
Particular congratulations are due here to simon, who managed to figure out this whole mapping within 24 hours (actually posting an essentially-complete solution before the scenario even made it to the frontpage).
Combat
When two characters fight, each one calculates a Combat Score:
So, taking these three characters:
and considering the matchups between them:
This means that there are in general two good strategies:
STRATEGY
The four House Champions had the following stats (including their items):
Your six characters had the following stats (before giving them items):
First, you want to decide your Speed matchups:
Then you want to allocate your Power boosts optimally:
BONUS OBJECTIVE
If you noticed the right things in the dataset, you would find that:
The Bonus Objective was to realize that your +4 Boots in fact belong to House Cadagal, and that using them is a bad idea:
Congratulations to everyone who figured this out, particularly to abstractapplic, who was the first person to get a reasonably-complete explanation of what had happened.
LEADERBOARD
Congratulations to all players, particularly to simon, whose perfect figuring-out of mechanics led to a well-deserved perfect score!
DATASET GENERATION
The bulk of fights in the dataset were from the Arena's regular tournaments, where 64 competitors take part in a single-elimination tournament. The initial competitors are biased towards low levels, but by later rounds it's usually higher-level characters who remain.
These tournaments are interspersed with occasional duels (used as part of the charming local system of governance to resolve various issues), which are between relatively high-level characters. Duels occur a bit more than twice as often as tournaments, but since each duel is only a single fight they're a very small minority of fights in the data.
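For the curious, the bracket structure is easy to sketch. This is an illustrative reconstruction, not the actual generation code, and the fight resolution here is a coin-flip placeholder rather than the real Combat Score logic.

```python
import random

def run_tournament(competitors, fight):
    """Single-elimination: pair off, winners advance, repeat until one remains.
    Returns the list of (a, b, winner) fights, round by round."""
    fights, field = [], list(competitors)
    while len(field) > 1:
        nxt = []
        for a, b in zip(field[::2], field[1::2]):
            w = fight(a, b)
            fights.append((a, b, w))
            nxt.append(w)
        field = nxt
    return fights

def coin_flip_fight(a, b):
    # Placeholder -- the scenario's actual combat mechanics are not reproduced.
    return random.choice([a, b])
```

A 64-competitor single-elimination bracket yields 63 fights, so even with duels happening more than twice as often as tournaments, each duel adds only one row and duels stay a small minority of the data.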
Congratulations to SarahSrinivasan for identifying this!
REFLECTIONS & FEEDBACK REQUEST
The goal I was shooting for with this scenario was to have the simplest underlying system I could manage that still led to interesting emergent behavior.
I felt fairly happy with how this went! There were a lot of sneaky interactions in the data - e.g. as simon's analysis said:
but once analyzed in detail it was possible to disentangle what led to this.
How did this feel from a player perspective?
I'm also curious how the Bonus Objective felt from a player perspective: did people like it? Did it seem fair? Should I clue players to the existence of such an objective in future?
As usual, I'm also interested to hear any other feedback on what people thought of this scenario. If you played it, what did you like and what did you not like? If you might have played it but decided not to, what drove you away? What would you like to see more of/less of in future? Do you think the scenario was more complicated than you would have liked? Or too simple to have anything interesting/realistic to uncover? Or both at once? Did you like/dislike the story/fluff/theme parts? What complexity/quality scores should I give this scenario in the index?
I'm currently planning to post another scenario in about a month: I would shoot for Nov 29th, except that's Thanksgiving weekend so I might end up going a bit earlier/later. Let me know if some times work well/poorly for you!
Entertainingly, these 1d6 rolls were the only part of the ruleset that simon did not deduce.
You could swap Willow and Zelaya here, but it will make your Power allocation worse in the next step.
There was supposed to be only one of each of these, but due to a bug there were two copies of the Dwarf Monk starting about halfway through the dataset. simon noticed this was a bug, but I think I'm going to go with abstractapplic's interpretation of there being two Dwarf Monks. Perhaps he was duplicated by the Mirror Mage.
Did you steal them directly from House Cadagal? Or did you take them from a Thieves' Guild that stole them from House Cadagal? That's really up to you: the Goddess summons Heroes with a wide variety of moralities. But House Cadagal is not likely to be amused either way.