abstractapplic6mo*90

Notes on my performance:

Well, I feel pretty dumb (which is the feeling of becoming smarter). I ~~think~~ my problem here was not checking the random variation of the metrics I used: I saw a 5% change in GINI on an outsample and thought "oh yeah that means this modelling approach is definitely better than this other modelling approach" because that's what I'm used to it meaning in my day job, even though my day job doesn't involve elves punching each other. (Or, at least, that's my best post hoc explanation for how I kept failing to notice simon's better model was indeed better; it could also have been down to an unsquished bug in my code~~, and/or LightGBM not living up to the hype.)~~

ETA: I have finally tracked down the trivial coding error that ended up distorting my model: I accidentally used kRace in a few places where I should have used kClass while calculating simon's values for Speed and Strength.

Notes on the scenario:

I thought the bonus objective was executed very well: you told us there was Something Else To Look Out For, and provided just enough information that players could feel confident in their answers after figuring things out. I also really liked the writing. Regarding the actual challenge part of the challenge . . . ~~I'm recusing myself from having an opinion until I figure out how I could have gotten it right; all I can tell you for sure is~~ this wasn't below 4/5 Difficulty. (Making all features' effects conditional on all other features' effects tends to make both Analytic and ML solutions much trickier.)

ETA: I now have an opinion, and my opinion is that it's good. The simple-in-hindsight underlying mechanics were converted seamlessly into complex and hard-but-fair-to-detangle feature effects; the flavortext managed to stay relevant without dominating the data. This scenario also fits in neatly alongside earlier entries with superficially similar premises: we've had "counters matter" games, "archetypes matter" games, and now a "feature engineering matters" game.

I have exactly one criticism, which is that it's a bit puzzlier than I'd have liked. Players get best results by psychoanalyzing the GM and exploiting symmetries in the dataset, even though these aren't skills which transfer to most real-world problems, and the real-world problems they do transfer to don't look like "who would win a fight?"; this could have been addressed by having class and race effects be slightly more arbitrary and less consistent, instead of having uniform +Strength / -Speed gaps for each step. However, my complaint is moderated by the facts that:

- This is an isekai-world: simplified mechanics and uncannily well-balanced class systems come with the territory. (I thought the lack of magic-users was a tell for "this one will be realistic-ish" but that's on me tbh.)
- Making the generation function any more complicated would have made it (marginally but nontrivially) less elegant and harder to explain.
- I might just be being a sore ~~loser~~ only-barely-winner here.
- Puzzles are fun!

aphyer6mo20

ETA: I have finally tracked down the trivial coding error that ended up distorting my model: I accidentally used kRace in a few places where I should have used kClass while calculating simon's values for Speed and Strength.

Thanks for looking into that: I spent most of the week being very confused about what was happening there but not able to say anything.

D&D.Sci Coliseum: Arena of Data Evaluation and Ruleset

by aphyer

29th Oct 2024

This is a follow-up to last week's D&D.Sci scenario: if you intend to play that, and haven't done so yet, you should do so now before spoiling yourself.

There is a web interactive here you can use to test your answer, and generation code available here if you're interested, or you can read on for the ruleset and scores.

RULESET

A character's class/race/level/items boil down to giving them scores in two separate stats: Speed and Power.

Stat Calculation

Each parameter of a character has an effect on their stats:

Class: Your class provides base stats as follows:
- Knight: 2 Speed/12 Power
- Warrior: 4 Speed/10 Power
- Ranger: 6 Speed/8 Power
- Monk: 8 Speed/6 Power
- Fencer: 10 Speed/4 Power
- Ninja: 12 Speed/2 Power
Race: Your race modifies stats as follows:
- Dwarf: -3 Speed, +3 Power
- Human: No Effect
- Elf: +3 Speed, -3 Power
Level: Each level provides +1 Speed and +1 Power.
Items: Boots of Speed +X grant +X Speed, Gauntlets of Power +Y grant +Y Power.

So, for example, the following characters all have Speed 10 and Power 10:

A Level 3 Elf Warrior (Speed 4 + 3 + 3, Power 10 - 3 + 3)
A Level 3 Dwarf Fencer (Speed 10 - 3 + 3, Power 4 + 3 + 3)
A Level 2 Human Ranger with Boots of Speed +2 (Speed 6 + 2 + 2, Power 8 + 2)
A Level 2 Human Monk with Gauntlets of Power +2 (Speed 8 + 2, Power 6 + 2 + 2)
A Level 1 Elf Knight with Boots of Speed +4 (Speed 2 + 3 + 1 + 4, Power 12 - 3 + 1)
A Level 1 Dwarf Ninja with Gauntlets of Power +4 (Speed 12 - 3 + 1, Power 2 + 3 + 1 + 4)

These characters are, from a mechanical perspective, all completely identical. They will behave the same in every possible matchup.
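
The stat rules above can be written out as a small lookup. This is a minimal sketch rather than the actual generation code; the names `CLASS_BASE`, `RACE_MOD`, `stats` and `EXAMPLES` are made up for illustration.

```python
# (Speed, Power) per class and per-race modifiers, copied from the ruleset above.
CLASS_BASE = {
    "Knight": (2, 12), "Warrior": (4, 10), "Ranger": (6, 8),
    "Monk": (8, 6), "Fencer": (10, 4), "Ninja": (12, 2),
}
RACE_MOD = {"Dwarf": (-3, 3), "Human": (0, 0), "Elf": (3, -3)}


def stats(char_class, race, level, boots=0, gauntlets=0):
    """Return (Speed, Power). `stats` is a hypothetical helper, not the author's code."""
    base_speed, base_power = CLASS_BASE[char_class]
    speed_mod, power_mod = RACE_MOD[race]
    # Each level adds +1 to both stats; Boots add Speed, Gauntlets add Power.
    return (base_speed + speed_mod + level + boots,
            base_power + power_mod + level + gauntlets)


# The six example characters listed above: (class, race, level, boots, gauntlets).
EXAMPLES = [
    ("Warrior", "Elf", 3, 0, 0),
    ("Fencer", "Dwarf", 3, 0, 0),
    ("Ranger", "Human", 2, 2, 0),
    ("Monk", "Human", 2, 0, 2),
    ("Knight", "Elf", 1, 4, 0),
    ("Ninja", "Dwarf", 1, 0, 4),
]
for example in EXAMPLES:
    print(example, "->", stats(*example))
```

Every row comes out as Speed 10 and Power 10, which is why these six characters are interchangeable in any matchup.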

Particular congratulations are due here to simon, who managed to figure out this whole mapping within 24 hours (actually posting an essentially-complete solution before the scenario even made it to the frontpage).

Combat

When two characters fight, each one calculates a Combat Score:

In the brutal one-on-ones of Arena combat, which can end within seconds, being just a bit faster than your opponent is vital. Whoever's Speed is higher wins initiative and gets the first strike, which counts for +8 to Combat Score.
Power is also very useful, but it doesn't have a sharp cutoff the way Speed does. Each character adds their Power to their Combat Score.
Each player rolls 1d6 and adds that to their Combat Score.^[1]
Whoever ends up higher wins. In the case of a tie, the winner is decided at random.
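
A single fight under these four steps could be simulated roughly as follows. This is a sketch under stated assumptions, not the generation code: `fight` is an illustrative name, the two stat lines are invented, and tied Speed is assumed to give the +8 to nobody, since the rules above only cover the case where one Speed is higher.

```python
import random


def fight(a, b, rng):
    """Return True if `a` beats `b`; each argument is a (Speed, Power) pair."""
    speed_a, power_a = a
    speed_b, power_b = b
    score_a, score_b = power_a, power_b  # Power adds directly to Combat Score
    if speed_a > speed_b:
        score_a += 8                     # first strike for the faster side
    elif speed_b > speed_a:
        score_b += 8
    # Assumption: on equal Speed neither side gets the +8.
    score_a += rng.randint(1, 6)         # each side rolls 1d6
    score_b += rng.randint(1, 6)
    if score_a == score_b:
        return rng.random() < 0.5        # tied totals are decided at random
    return score_a > score_b


# Invented stat lines: slightly faster but 5 Power behind, i.e. 3 points ahead before dice.
fast, strong = (11, 10), (10, 15)
rng = random.Random(0)
trials = 20_000
win_rate = sum(fight(fast, strong, rng) for _ in range(trials)) / trials
print(f"faster character wins about {win_rate:.1%} of fights")
```

Enumerating the 36 dice outcomes by hand gives exactly 7/8 for a 3-point lead, so the simulated rate should land close to that.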

So, taking these three characters:

P has Speed 15 and Power 5.
Q has Speed 10 and Power 10.
R has Speed 5 and Power 15.

and considering the matchups between them:

If P fights Q, Q has 5 more Power, but P is faster and gets a +8 bonus. P is therefore 3 points of Combat Score ahead, and will likely win unless Q rolls very well.
Similarly, if Q fights R, Q ends up 3 points of Combat Score ahead.
But if R fights P, R has 10 more Power, and P's Speed still only gives a +8 bonus. This time R is 2 points of Combat Score ahead, and will probably win.

This means that there are in general two good strategies:

Being a little bit faster than your opponent (to get the +8 bonus), and then staying close enough in Power that this bonus dominates.
Being much slower than your opponent (allowing them the +8 bonus), but trying to beat them on Power by enough that you're ahead anyway.
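
The P/Q/R cycle can be checked with a few lines of arithmetic. A minimal sketch: `combat_lead` is an illustrative helper that returns the Combat Score gap before any dice are rolled, and it assumes that equal Speed gives neither side the bonus.

```python
def combat_lead(a, b):
    """Pre-roll Combat Score of `a` minus that of `b`; each is a (Speed, Power) pair."""
    speed_a, power_a = a
    speed_b, power_b = b
    lead = power_a - power_b
    if speed_a > speed_b:
        lead += 8
    elif speed_b > speed_a:
        lead -= 8
    # Assumption: on equal Speed nobody gets the +8.
    return lead


P, Q, R = (15, 5), (10, 10), (5, 15)
for label, a, b in [("P vs Q", P, Q), ("Q vs R", Q, R), ("R vs P", R, P)]:
    print(f"{label}: {combat_lead(a, b):+d}")
# P vs Q: +3
# Q vs R: +3
# R vs P: +2
```

Each character is ahead of the next one in the cycle, which is what makes both "slightly faster" and "much slower but much stronger" viable.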

STRATEGY

The four House Champions had the following stats (including their items):

House Adelon: 13 Speed/17 Power
House Bauchard: 11 Speed/20 Power
House Cadagal: 24 Speed/9 Power
House Deepwrack: 14 Speed/17 Power

Your six characters had the following stats (before giving them items):

Uzben Grimblade: 14 Speed/10 Power
Varina Dourstone: 6 Speed/18 Power
Willow Brown: 11 Speed/13 Power
Xerxes III of Calantha: 13 Speed/11 Power
Yalathinel Leafstrider: 18 Speed/6 Power
Zelaya Sunwalker: 11 Speed/15 Power

First, you want to decide your Speed matchups:

You cannot outspeed House Cadagal's Ninja. All you can do is send your highest-Power character, Varina.
The other three Champions you can outspeed, but you want to do that with the slowest characters you can to save points for Power. To do that, you send:
- Willow, with at least +3 Boots, against House Adelon.
- Zelaya, with at least +1 Boots, against House Bauchard.^[2]
- Xerxes, with at least +2 Boots, against House Deepwrack.
You don't want to use Uzben or Yalathinel (your faster characters) in the optimal solution: since you can outspeed all three of these Champions without them, all swapping them in does is waste Power.
You also don't actually need to use the +4 Boots of Speed: they provide some wiggle room in your solution (with them you could e.g. send Willow against House Deepwrack), but don't actually allow you to send any more powerful characters/outspeed anyone you couldn't already. (This is suspiciously fortunate for you: see the Bonus Objective below.)

Then you want to allocate your Power boosts optimally:

Before you distribute any of your Gauntlets:
- Willow against House Adelon has 13 Power vs 17 (4 lower), but is faster (+8), for a net of +4 Combat Score.
- Zelaya against House Bauchard has 15 Power vs 20 (5 lower), but is faster (+8), for a net of +3 Combat Score.
- Varina against House Cadagal has 18 Power vs 9 (9 higher), but is slower (-8), for a net of +1 Combat Score.
- Xerxes against House Deepwrack has 11 Power vs 17 (6 lower), but is faster (+8), for a net of +2 Combat Score.
Since Power boosts have more effect the closer to even the fight is (e.g. going from +1 to +2 is more important than going from +4 to +5), you want to give out your Gauntlets to even this up:
- Willow gets no Gauntlets
- Zelaya gets +1 Gauntlets
- Varina gets +3 Gauntlets
- Xerxes gets +2 Gauntlets
This brings you to +4 Combat Score against every Champion.
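
The same allocation can be found by brute force. The sketch below rests on several assumptions: the item pools (Boots +1 to +4, Gauntlets +1 to +3) are inferred from the loadouts on the leaderboard rather than stated in this post, `0` stands for "no item", tied Speed is taken to give nobody the +8, and all function names are illustrative. Because an item never lowers a Combat Score, it only tries loadouts that hand out the whole pool.

```python
from functools import lru_cache
from itertools import permutations, product

# (Speed, Power) from the lists above; the Champions' numbers include their items.
CHAMPIONS = {"Adelon": (13, 17), "Bauchard": (11, 20),
             "Cadagal": (24, 9), "Deepwrack": (14, 17)}
ROSTER = {"Uzben": (14, 10), "Varina": (6, 18), "Willow": (11, 13),
          "Xerxes": (13, 11), "Yalathinel": (18, 6), "Zelaya": (11, 15)}


@lru_cache(maxsize=None)
def win_probability(lead):
    """Chance of winning from a pre-roll lead over both 1d6 rolls; a tie counts half."""
    margins = (lead + mine - theirs for mine, theirs in product(range(1, 7), repeat=2))
    return sum(1.0 if m > 0 else 0.5 if m == 0 else 0.0 for m in margins) / 36


def pre_roll_lead(fighter, boots, champion):
    """Lead before Gauntlets and dice. Assumption: tied Speed gives nobody the +8."""
    speed, power = ROSTER[fighter]
    champ_speed, champ_power = CHAMPIONS[champion]
    speed += boots
    bonus = 8 if speed > champ_speed else -8 if speed < champ_speed else 0
    return power - champ_power + bonus


def best_lineup(boots_pool=(0, 1, 2, 3), gauntlets_pool=(0, 1, 2, 3)):
    """Exhaustive search. The default pools deliberately leave the +4 Boots out."""
    champions = list(CHAMPIONS)
    best_score, best_plan = -1.0, None
    for team in permutations(ROSTER, 4):
        for boots in permutations(boots_pool, 4):
            base = [pre_roll_lead(f, b, c) for f, b, c in zip(team, boots, champions)]
            for gauntlets in permutations(gauntlets_pool, 4):
                score = sum(win_probability(l + g) for l, g in zip(base, gauntlets))
                if score > best_score:
                    best_score = score
                    best_plan = list(zip(champions, team, boots, gauntlets))
    return best_score, best_plan


score, plan = best_lineup()
print(f"expected wins: {score:.3f}")
for champion, fighter, boots, gauntlets in plan:
    print(f"vs {champion}: {fighter} with +{boots} Boots, +{gauntlets} Gauntlets")
```

Calling `best_lineup(boots_pool=(1, 2, 3, 4))` instead is one way to check whether the +4 Boots ever buy a better result.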

BONUS OBJECTIVE

If you noticed the right things in the dataset, you would find that:

Level 1-6 characters are all very common. There are a variety of characters of every race/class/level combination up to Level 6.
Similarly, items up to +3 are very common. You can find +1 through +3 items used everywhere.
Level 7 characters are not common. There are only four different combinations in the dataset: Elf Knight, Human Warrior, Dwarf Monk,^[3] and Elf Ninja.
Similarly, +4 items are not common. The only place +4 Gauntlets show up in the dataset is on the Level 7 Dwarf Monk(s). And the only place +4 Boots show up is on the Level 7 Elf Ninja.
Well, +4 Boots used to show up on the Level 7 Elf Ninja. About 97% of the way through the dataset, the Level 7 Elf Ninja downgraded to +2 Boots for some reason.
Your set of dubiously-acquired magical items includes +4 Boots.^[4]

The Bonus Objective was to realize that your +4 Boots in fact belong to House Cadagal, and that using them is a bad idea:

If you use the +4 Boots of Speed, the Lady Cadagal will believe that you were responsible for their recent theft from her House.
You will earn her lasting enmity, and you'll lose standing with her even if you win. (Counts as -1 win).
By contrast, if you show your honor by returning them to her, you will win her friendship even if you lose in the Arena. (Counts as +1 win).

Congratulations to everyone who figured this out, particularly to abstractapplic, who was the first person to get a reasonably-complete explanation of what had happened.

LEADERBOARD

Player	vs A (Combat Diff)	vs B (Combat Diff)	vs C (Combat Diff)	vs D (Combat Diff)	Expected Wins
simon	Willow w/+3/0 (+4)	Zelaya w/+1/1 (+4)	Varina w/+0/3 (+4)	Xerxes w/+2/2 (+4)	3.778
Optimal Play	Willow w/+3/0 (+4)	Zelaya w/+1/1 (+4)	Varina w/+0/3 (+4)	Xerxes w/+2/2 (+4)	3.778
Yonge	Uzben w/+1/2 (+3)	Zelaya w/+3/1 (+4)	Varina w/+2/3 (+4)	Xerxes w/+4/0 (+2)	3.542
abstractapplic	Uzben w/+3/0 (+1)	Xerxes w/+2/1 (+0)	Varina w/+1/3 (+4)	Yalathinel w/+0/2 (-1)	2.444
SarahSrinivasan	Zelaya w/+3/1 (+7)	Yalathinel w/+1/2 (-4)	Xerxes w/+2/3 (-3)	Willow w/+4/0 (+4)	2.125
Random Play (with +4 boots)	?	?	?	?	1.710
Random Play (without +4 boots)	?	?	?	?	1.371
Lorxus	Willow w/+2/1 (-3)	Varina w/+4/3 (-7)	Xerxes w/+1/2 (-4)	Yalathinel w/+3/0 (-3)	0.306
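
The Expected Wins column follows from the Combat Diffs alone. A minimal sketch, assuming the only randomness is the two 1d6 rolls plus a coin flip on tied totals; `win_probability` and `COMBAT_DIFFS` are illustrative names, and the diffs are copied from the rows above.

```python
from fractions import Fraction
from itertools import product


def win_probability(lead):
    """Exact chance of winning from a pre-roll lead; a tied total is a coin flip."""
    total = Fraction(0)
    for mine, theirs in product(range(1, 7), repeat=2):
        margin = lead + mine - theirs
        if margin > 0:
            total += Fraction(1, 36)
        elif margin == 0:
            total += Fraction(1, 72)
    return total


# Combat Diffs per player, in the order vs A, vs B, vs C, vs D.
COMBAT_DIFFS = {
    "simon": (4, 4, 4, 4),
    "Yonge": (3, 4, 4, 2),
    "abstractapplic": (1, 0, 4, -1),
    "SarahSrinivasan": (7, -4, -3, 4),
    "Lorxus": (-3, -7, -4, -3),
}
expected_wins = {
    player: float(sum(win_probability(diff) for diff in diffs))
    for player, diffs in COMBAT_DIFFS.items()
}
for player, wins in expected_wins.items():
    print(f"{player}: {wins:.3f}")
# simon: 3.778
# Yonge: 3.542
# abstractapplic: 2.444
# SarahSrinivasan: 2.125
# Lorxus: 0.306
```

A lead of +4 wins 17/18 of the time, so four such fights give the 3.778 shown for optimal play. These figures cover combat only and do not include the Bonus Objective adjustment.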

Congratulations to all players, particularly to simon, whose perfect figuring-out of mechanics led to a well-deserved perfect score!

DATASET GENERATION

The bulk of fights in the dataset were from the Arena's regular tournaments, where 64 competitors take part in a single-elimination tournament. The initial competitors are biased towards low levels, but by later rounds it's usually higher-level characters who remain.

These tournaments are interspersed with occasional duels (used as part of the charming local system of governance to resolve various issues), which are between relatively high-level characters. Duels occur a bit more than twice as often as tournaments, but since each duel is only a single fight they're a very small minority of fights in the data.
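
To see why duels are such a small share of the data, a quick back-of-the-envelope calculation helps. This is only a sketch: the exact duel rate is not given here, so `duels_per_tournament=2.0` is a stand-in for "a bit more than twice as often", and both function names are illustrative.

```python
def tournament_fights(competitors=64):
    """Single elimination: every fight removes exactly one competitor."""
    return competitors - 1


def duel_share(duels_per_tournament, competitors=64):
    """Fraction of all fights that are duels, given the duel-to-tournament ratio."""
    fights = tournament_fights(competitors)
    return duels_per_tournament / (duels_per_tournament + fights)


print(tournament_fights())         # 63
print(f"{duel_share(2.0):.1%}")    # 3.1%
```

Even with a somewhat higher duel rate, the 63 fights per tournament dominate the dataset.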

Congratulations to SarahSrinivasan for identifying this!

REFLECTIONS & FEEDBACK REQUEST

The goal I was shooting for with this scenario was to have the simplest underlying system I could manage that still led to interesting emergent behavior.

I felt fairly happy with how this went! There were a lot of sneaky interactions in the data - e.g. as simon's analysis said:

For example, a same-class Elf will tend to beat a same-class Dwarf. And a same-race Fencer will tend to beat a same-race Warrior. But if an Elf Fencer faces a Dwarf Warrior, the Dwarf Warrior will most likely win. Another example with Fencers and Warriors: same-class Elves tend to beat Humans - but not only will a Human Warrior tend to beat an Elf Fencer, but also a Human Fencer will tend to beat an Elf Warrior by a larger ratio than for a same-race Fencer/Warrior matchup???

but once analyzed in detail it was possible to disentangle what led to this.
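
The matchups simon describes fall straight out of the Speed/Power rules. A small sketch, assuming both fighters have the same level and no items (so those terms cancel) and that tied Speed gives nobody the +8; `stats` and `lead` are illustrative names.

```python
# (Speed, Power) for the two classes in the quote, plus the race modifiers.
CLASS_BASE = {"Warrior": (4, 10), "Fencer": (10, 4)}
RACE_MOD = {"Dwarf": (-3, 3), "Human": (0, 0), "Elf": (3, -3)}


def stats(race, char_class):
    """(Speed, Power) ignoring level and items, which cancel when both sides match."""
    base_speed, base_power = CLASS_BASE[char_class]
    speed_mod, power_mod = RACE_MOD[race]
    return base_speed + speed_mod, base_power + power_mod


def lead(a, b):
    """Pre-roll Combat Score of `a` minus that of `b`."""
    (speed_a, power_a), (speed_b, power_b) = stats(*a), stats(*b)
    bonus = 8 if speed_a > speed_b else -8 if speed_a < speed_b else 0
    return power_a - power_b + bonus


MATCHUPS = [
    (("Elf", "Fencer"), ("Dwarf", "Fencer")),    # same class, Elf vs Dwarf
    (("Human", "Fencer"), ("Human", "Warrior")), # same race, Fencer vs Warrior
    (("Elf", "Fencer"), ("Dwarf", "Warrior")),
    (("Elf", "Fencer"), ("Human", "Warrior")),
    (("Human", "Fencer"), ("Elf", "Warrior")),
]
for a, b in MATCHUPS:
    print(f"{' '.join(a)} vs {' '.join(b)}: {lead(a, b):+d}")
# Elf Fencer vs Dwarf Fencer: +2
# Human Fencer vs Human Warrior: +2
# Elf Fencer vs Dwarf Warrior: -4
# Elf Fencer vs Human Warrior: -1
# Human Fencer vs Elf Warrior: +5
```

The Elf Fencer gives up so much Power that the +8 no longer covers it, while the Human Fencer against the Elf Warrior keeps the +8 and only gives up 3 Power, hence the larger ratio.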

How did this feel from a player perspective?

I'm also curious how the Bonus Objective felt from a player perspective: did people like it? Did it seem fair? Should I clue players to the existence of such an objective in future?

As usual, I'm also interested to hear any other feedback on what people thought of this scenario. If you played it, what did you like and what did you not like? If you might have played it but decided not to, what drove you away? What would you like to see more of/less of in future? Do you think the scenario was more complicated than you would have liked? Or too simple to have anything interesting/realistic to uncover? Or both at once? Did you like/dislike the story/fluff/theme parts? What complexity/quality scores should I give this scenario in the index?

I'm currently planning to post another scenario in about a month: I would shoot for Nov 29th, except that's Thanksgiving weekend so I might end up going a bit earlier/later. Let me know if some times work well/poorly for you!

^[1] Entertainingly, these 1d6 rolls were the only part of the ruleset that simon did not deduce.
^[2] You could swap Willow and Zelaya here, but it will make your Power allocation worse in the next step.
^[3] There was supposed to be only one of each of these, but due to a bug there were two copies of the Dwarf Monk starting about halfway through the dataset. simon noticed this was a bug, but I think I'm going to go with abstractapplic's interpretation of there being two Dwarf Monks. Perhaps he was duplicated by the Mirror Mage.
^[4] Did you steal them directly from House Cadagal? Or did you take them from a Thieves' Guild that stole them from House Cadagal? That's really up to you: the Goddess summons Heroes with a wide variety of moralities. But House Cadagal is not likely to be amused either way.

[-]simon6mo*50

Thanks aphyer, this was an interesting challenge! I think I got lucky with finding the power/speed mechanic early - the race-class matchups really didn't, I think, in principle have enough info on their own to make a reliable conclusion from but enabled me to make a genre savvy guess which I could refine based on other info - in terms of scenario difficulty though I think it could have been deducible in a more systematic way by e.g. looking at item and level effects for mirror matches.

abstractapplic and Lorxus's discovery of persistent level 7 characters, and especially SarahSrinivasan's discovery of the tournament/non tournament structure meant the players collectively were I think quite a long ways towards fully solving this. The latter in addition to being interesting on its own is very important to finding anything else about the generation due to its biasing effects.

I agree with abstractapplic on the bonus objective.

[-]simon6mo40

One learning experience for me here was trying out LLM-empowered programming after the initial spreadsheet-based solution finding. Claude enables quickly writing (from my perspective as a non-programmer, at least) even a relatively non-trivial program. And you can often ask it to write a program that solves a problem without specifying the algorithm and it will actually give something useful...but if you're not asking for something conventional it might be full of bugs - not just in the writing up but also in the algorithm chosen. I don't object, per se, to doing things that are sketchy mathematically - I do that myself all the time - but when I'm doing it myself I usually have a fairly good sense of how sketchy what I'm doing is*, whereas if you ask Claude to do something it doesn't know how to do in a rigorous way, it seems it will write something sketchy and present it as the solution just the same as if it actually had a rigorous way of doing it. So you have to check. I will probably be doing more of this LLM-based programming in the future, but am thinking of how I can maybe get Claude to check its own work. Some automated way to pipe the output to another (or the same) LLM and ask "how sketchy is this and what are the most likely problems?". Maybe manually looking through to see what it's doing, or at least getting the LLM to explain how the code works, is unavoidable for now.

* when I have a clue what I'm doing which is not the case, e.g. in machine learning.

[-]SarahNibs6mo40

I found myself having done some data exploration but without time to focus and go much deeper. But also with a conviction that bouts were determined in a fairly simple way without persistent hidden variables (see Appendix A). I've done work with genetic programming but it's been many years, so I tried getting ChatGPT-4o w/ canvas to set me up a good structure with crossover and such and fill out the various operation nodes, etc. This was fairly ineffective; perhaps I could have better described the sort of operation trees I wanted, but I've done plenty of LLM generation / tweak / iterate work, and it felt like I would need a good bit of time to get something actually useful.

That said, I believe any halfway decently regularized genetic programming setup would have found either the correct ruleset or close enough that manual inspection would yield the right guess. The setup I had begun contained exactly one source of randomness: an operation "roll a d6". :D

Appendix A: an excerpt from my LLM instructions

I believe the hidden generation is a simple fairly intuitive simulation. For example (this isn't right, just illustrative) maybe first we check for range (affected by class), see if speed (affected by race and boots) changes who goes first, see if strength matters at all for whether you hit (race and gauntlets), determine who gets the first hit (if everything else is tied then 50/50 chance), and first hit wins. Maybe some simple dice rolls are involved.

[-]aphyer6mo40

Yeah, my recent experience with trying out LLMs has not filled me with confidence.

In my case the correct solution to my problem (how to use Kerberos credentials to authenticate a database connection using a certain library) was literally 'do nothing, the library will find a correctly-initialized krb file on its own as long as you don't tell it to use a different authentication approach'. Sadly, AI advice kept inventing ways for me to pass in the path of the krb file, none of which worked.

I'm hopeful that they'll get better going forward, but right now they are a substantial drawback rather than a useful tool.

[-]Yonge6mo30

Thank you for posting this. Overall I would rate this as a middle of the road (i.e. good) scenario. Complexity 3/5, quality 3/5.

I thought the bonus objective was in principle a good addition, though it could have done with an extra couple of known words. As it is, unless you spot the anomaly with Cadagal's boots, it seems next to impossible to figure out what it might mean.

Overall I think a gap of about 2 months is better than short gaps of one month followed by longer gaps of several months. Though possibly not this time as that would put it right in the middle of the Christmas period!

[-]SarahNibs6mo10

I think the bonus objective was a good idea in theory but not well tuned. It suffered from the classic puzzle problem of the extraction being the hard part, rather than the cool puzzle being the hard part.

I think it was perfectly reasonable to expect that at some point a player would group by [level, boots] and count and notice there was something to dig into.

But, having found the elf anomaly, I don't think it was reasonable to expect that a player would be able to distinguish between

do not reveal the +4 boots at all
do not use the +4 boots vs the elf ninja
give the elf ninja the +4 boots to be used in their combat
give the elf ninja the +4 boots afterwards but go ahead and use them first

It's perfectly reasonable to expect that a player could generate a number of hypotheses and guess that the most likely was that they shouldn't reveal the +4 boots at all, but they would have no real way of confirming that guess; the fact that they're rewarded for guessing correctly is probably better than the alternative but is not satisfying IMO.

[-]aphyer6mo30

I think this is just an unavoidable consequence of the bonus objective being outside-the-box in some sense: any remotely-real world is much more complicated than the dataset can ever be.

If you were making this decision at a D&D table, you might want to ask the GM:

How easy is it to identify magic items? Can you tell what items your opponent uses while fighting him? Can you tell what items the contestants use while spectating a fight?
Can we disguise magic items? If we paint the totally powerful Boots of Speed lime green, will they still be recognizable?
How exactly did we get these +4 Boots? Did we take them (or can we convincingly claim to have taken them) from people who stole them, rather than stealing them ourselves?
How honorable is House Cadagal's reputation? If we give the Boots back, will they be grateful enough that it's worth it rather than keeping the Boots?

I can't realistically explain all of these up front in the scenario! And these are just the questions I can think of - in my last scenario (linked comment contains spoilers for that if you haven't played it yet) the players came up with a zany scheme I hadn't considered myself.

Overall, I think if you realized that the +4 Boots in your inventory came from the Elf Ninja you can count yourself as having accomplished the Bonus Objective regardless of what you decided to do with them. (You can imagine that you discussed the matter with the GM and your companions, asked all the questions above, and made a sensible decision based on the answers).

[-]abstractapplic4mo20

Just realized I forgot to mention this: I really like how the interactive handled the Bonus Objective, i.e. if the player is thinking along the right lines their character automatically makes the in-universe sensible/optimal decision for them (which means you can set up a fair Bonus Objective for players who don't live in that universe and so don't have all the context).

[-]SarahNibs6mo20

I liked this one! I was able to have significant amounts of fun with it despite perennial lack-of-time problems.

Pros:

simple enough underlying mechanism to be realistically discoverable
some debias-able selection bias
I could get pretty far by relatively simple data exploration
+4 Boots was fun

Cons:

I really wanted the in-between-tournament matches to mean something, like the winners took the losers' equipment or whatnot and you could see that show up later in the dataset, but of course that particular meaning would have added a lot of complexity for no gain.
bonus objective was not confirmable (yep real life is like that but still :D)

It feels like this scenario should be fully knowably solvable, given time, except for the bonus guess at the end, which is very cool.

[-]Lorxus6mo10

I didn't enjoy this one as much, but that's likely down to not having had the time/energy to spend on thinking this through deeply. That said... I did not in fact enjoy it as much and I mostly feel like garbage for having done literally worse than chance, and I feel like it probably would have been better if I hadn't participated at all.

[-]aphyer6mo40

I don't think you should feel bad about that! This scenario was pretty complicated and difficult, and even if you didn't solve it I think "tried to solve it but didn't quite manage it" is more impressive than "didn't try at all"!
