Typo in title: "SOLELY" has more Ls in it.
ETA: fixed now.
This seems to imply there are better and worse times of day to try and get immediate feedback. If so, what are they?
Obligatory shill/reminder for any teacher reading this: if they want a never-before-seen educational data science challenge which can't be solo'd by current-gen AI (and is, incidentally, super easy to mark), they can just DM me. I might want some kind of compensation if it needs to be extensively customized and/or never released to the public, but just sharing scenarios a few months before I post them on LW is something I'd absolutely do for the love of the game. (And if they want any other kind of challenge, then uh, good luck with that.)
I’m not sure exactly what quantity you are calculating when you refer to the singularity date. Is this the extrapolated date for 50% success at 1 month (167hr) tasks?
. . . not quite: I'd forgotten that your threshold was a man-month (167 hours), instead of a month of clock time (730 hours). I'll redo things with the task length being a month of work for people who do need to eat/sleep/etc.; luckily this doesn't change results much, since 730 hours and 167 hours are right next door on a log(t) scale.
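For concreteness, the "right next door" arithmetic, as a minimal sketch: I'm assuming the usual straight-line-on-log(t) extrapolation, and the ~7-month doubling time is an illustrative round number rather than any particular fitted value.

```python
import math

def months_until(threshold_hours, current_horizon_hours, doubling_months):
    """Months until the 50%-success time horizon reaches threshold_hours,
    assuming clean exponential growth (a straight line on a log(t) plot)."""
    return math.log2(threshold_hours / current_horizon_hours) * doubling_months

# Moving the threshold from 167h to 730h costs log2(730/167) ~= 2.1 doublings:
# ~15 months at a 7-month doubling time, which is small relative to the
# error bars on this kind of forecast.
print(months_until(730, 167, 7))  # ~14.9
```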
SWAA fits most naturally into the ‘fully_private’ category in the HCAST parlance
Your diagnosis was on the money. Filtering for the union of fully_private HCAST tasks and SWAA tasks (while keeping the three models which caused crashes without SWAAs) does still make forecasts more optimistic, but only nets half an extra year for the every-model model, and two extra years for the since-4o model.
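(For anyone else replicating: a sketch of the filter described above, assuming the runs sit in a pandas DataFrame; `task_suite` and `visibility` are my placeholder column names, not METR's actual schema.)

```python
import pandas as pd

def filter_runs(runs: pd.DataFrame) -> pd.DataFrame:
    # Keep the union of fully_private HCAST tasks and all SWAA tasks.
    is_private_hcast = (runs["task_suite"] == "HCAST") & (runs["visibility"] == "fully_private")
    is_swaa = runs["task_suite"] == "SWAA"
    return runs[is_private_hcast | is_swaa]
```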
I'll edit the OP appropriately; thank you for your help. (In retrospect, I probably should have run the numbery stuff past METR before posting, instead of just my qualitative concerns; I figured that if I was successfully reproducing the headline results I would be getting everything else right, but it would still have made sense to get a second opinion.)
Potential biases:
Looked into it more and you're right: conventional symbolic regression libraries don't seem to have the "calculate a quantity then use that as a new variable going forward" behavior I'd have needed to get Total Value and then decide&apply a tax rate based on that. I . . . probably should have coded up a proof-of-concept before impugning everyone including myself.
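To make the missing behavior concrete: the target function has a two-stage shape like the sketch below, which an off-the-shelf search over compositions of primitives applied to the raw inputs doesn't naturally express. (Thresholds and rates here are placeholders, not the scenario's actual rules.)

```python
def tax_owed(item_values):
    # Stage 1: derive an intermediate quantity from the raw inputs.
    total_value = sum(item_values)
    # Stage 2: choose a rate *based on* that intermediate (placeholder brackets)...
    if total_value > 1000:
        rate = 0.30
    elif total_value > 100:
        rate = 0.15
    else:
        rate = 0.05
    # ...then apply it back to the stage-1 quantity.
    return total_value * rate
```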
Reflections on my performance:
There's an interesting sense in which we all failed this one. Most other players used AI to help them accomplish tasks they'd personally picked out; I eschewed AI altogether and constructed my model with brute force and elbow grease; after reaching a perfect solution, I finally went back and used AI correctly, by describing the problem on a high level (manually/meatbrainedly distilled from my initial observations) and asking the machine demiurge what approach would make most sense[1]. From this I learned about the fascinating concept of Symbolic Regression and some associated python libraries, which I eagerly anticipate using to (attempt to) steamroll similarly-shaped problems.
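For the similarly curious, a minimal taste of the technique, using gplearn (PySR being the other library I keep seeing recommended); the hidden relation below is a toy stand-in, not anything from this scenario:

```python
import numpy as np
from gplearn.genetic import SymbolicRegressor

rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(500, 2))
y = X[:, 0] * X[:, 1] + X[:, 0]  # the "secret" formula to rediscover

model = SymbolicRegressor(population_size=1000, generations=20,
                          function_set=("add", "sub", "mul"),
                          random_state=0)
model.fit(X, y)
print(model._program)  # something equivalent to add(mul(X0, X1), X0)
```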
(There's a more mundane sense in which I specifically failed this one, since even after building a perfect input-output relation and recognizing the two best archetypes as rebatemaxxing and corpsemaxxing, I still somehow fell at the last hurdle and failed to get a (locally-)optimal corpsemaxxing solution; if the system had followed the original plan, I'd be down a silver coin and up a silver medal. Fortunately for my character's fortunes and fortune, Fortune chose to smile.)
Reflections on the challenge:
A straightforward scenario, but timed and executed flawlessly. In particular, I found the figuring-things-out gradient (admittedly decoupled from the actually-getting-a-good-answer gradient) blessedly smooth, starting with picking up on the zero-randomness premise[2] and ending with the fun twist that the optimal solution doesn't involve anything being taxed at the lowest rate[3].
I personally got a lot out of this one: for an evening's exacting but enjoyable efforts, I learned about an entire new form of model-building, about the utility and limits of modern AI, and about Banker's Rounding. I vote four-out-of-five for both Quality and Complexity . . . though I recognize that such puzzle-y low-variance games are liable to have higher variance in how they're received, and I might be towards the upper end of a bell curve here.
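(For anyone else meeting it for the first time: Banker's Rounding is round-half-to-even, and it's what Python's built-in round() does; worth knowing before you assume .5 always rounds up.)

```python
# Round-half-to-even in action: ties go to the nearest even integer.
print([round(x) for x in (0.5, 1.5, 2.5, 3.5)])  # [0, 2, 2, 4]
```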
For a lark, I also tried turning on all ChatGPT's free capabilities and telling it to solve the problem from scratch. It thought for ~30 seconds and then spat out a perfect solution; I spent ~30 further seconds with paperclips dancing before my eyes; I then discovered it hadn't even managed to download the dataset, and was instead applying the not-unreasonable heuristic "if abstractapplic and simon agree on an answer it's probably true".
There's something fun about how "magic", "games", "bureaucracy", and "magical game bureaucracy" are equally good justifications for a "wait, what paradigm am I even in here?" layer of difficulty.
I know that part wasn't intentional, but I think rebatemaxxing>corpsemaxxing is nontrivially more compelling than the other way round.
Meta musing:
It looks like the optimal allocation is borderline fraudulent. When I think of in-universe reasons for the TAE to set up Cockatrice Eye rebates the way they did, my best guess is "there's a bounty on these monsters in particular, and the taxmen figure someone showing up with n Cockatrice Eyes will have killed ceil(n/2) of them". This makes splitting our four eyes (presumably collected from two monsters) four ways deceptive; my only consolation is that the apparently-standard divide-the-loot-as-evenly-as-possible thing most other adventuring teams seem to be doing also frequently ends up taking advantage of this incentive structure.
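Quantifying that incentive under my guessed rationale (to be clear, the ceil(n/2) rule is my in-universe speculation, not confirmed mechanics):

```python
from math import ceil

eyes = 4
pooled = ceil(eyes / 2)                     # 2 presumed kills if one adventurer declares all four
split = sum(ceil(1 / 2) for _ in range(4))  # 4 presumed kills if we each declare one
print(pooled, split)  # 2 4
```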
framing contradictory evidence as biased or manipulated
Most contradictory evidence is, to some extent (regardless of what it's contradicting).
dismissing critics as [...] deluded, or self-interested
Most critics are, to some extent (regardless of what they're criticizing).
Your question cuts to the heart of our movement: it's been over two decades and (afaik, afaict) we still don't have a robust & general way to test whether people are getting more rational.