
This seems to imply there are better and worse times of day to try and get immediate feedback. If so, what are they?

Obligatory shill/reminder for any teacher reading this that if they want a never-before-seen educational data science challenge which can't be solo'd by current-gen AI (and is incidentally super easy to mark) they can just DM me; I might want some kind of compensation if it needs to be extensively customized and/or never released to the public, but just sharing scenarios a few months before I post them on LW is something I'd absolutely do for the love of the game. (and if they want any other kind of challenge then uh good luck with that)

I’m not sure exactly what quantity you are calculating when you refer to the singularity date. Is this the extrapolated date for 50% success at 1 month (167hr) tasks?

. . . not quite: I'd forgotten that your threshold was a man-month, instead of a month of clock time. I'll redo things with the task length being a month of work for people who do need to eat/sleep/etc: luckily this doesn't change results much, since 730 hours and 167 hours are right next door on a log(t) scale.
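(Quantifying "right next door": the gap between the two thresholds is only a couple of doublings of task length, which is small next to the distance the trend has to cover anyway. A quick arithmetic sketch, using nothing but the two numbers themselves:)

```python
import math

work_month = 167    # hours: roughly a month of actual work
clock_month = 730   # hours: roughly a month of wall-clock time

# Gap between the two thresholds on a log(t) scale, in doublings of task length
gap = math.log2(clock_month / work_month)
print(f"{gap:.2f} doublings apart")  # ~2.13

# For comparison: the number of doublings from a one-hour horizon to either threshold
print(f"{math.log2(work_month):.1f} vs {math.log2(clock_month):.1f} doublings from 1 hour")
# ~7.4 vs ~9.5
```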

SWAA fits most naturally into the ‘fully_private’ category in the HCAST parlance

Your diagnosis was on the money. Filtering for the union of fully_private HCAST tasks and SWAA tasks (while keeping the three models which caused crashes without SWAAs) does still make forecasts more optimistic, but only nets half an extra year for the every-model model, and two extra years for the since-4o model.
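(For reference, the filter itself is nothing fancy; a minimal sketch of the kind of thing described, with made-up file and column names standing in for METR's actual schema:)

```python
import pandas as pd

# Hypothetical file/column names, not METR's actual release schema.
runs = pd.read_csv("runs.csv")  # one row per (model, task) attempt

keep = (
    ((runs["task_suite"] == "HCAST") & (runs["privacy_level"] == "fully_private"))
    | (runs["task_suite"] == "SWAA")
)
filtered_runs = runs[keep]
```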

I'll edit the OP appropriately; thank you for your help. (In retrospect, I probably should have run the numbery stuff past METR before posting, instead of just my qualitative concerns; I figured that if I was successfully reproducing the headline results I would be getting everything else right, but it would still have made sense to get a second opinion.)

Potential biases:

  • The aforementioned fact that privacy levels are upper bounds on how private things might be: afaik, METR has no airtight way to know what fully_private tasks were leaked or plagiarized or had extremely similar doppelganger tasks coincidentally created by unrelated third parties, or which public_problems had solutions posted somewhere they couldn't see but which nevertheless made it into the training data, or whether some public_solutions had really good summaries written down somewhere online.
  • The way later models' training sets can only ever have more benchmark tasks present in them than earlier ones. (I reiterate that at least some of these tasks were first created around the apparent gradient discontinuity starting with GPT-4o.)
  • The fact they're using non-fully_private challenges at all, and therefore testing (in part) the ability of LLMs to solve problems present in (some of their) training datasets. (I get that this isn't necessarily reassuring as some problems we'd find it scary for AI to solve might also have specific solutions (or hints, or analogies) in the training data.)
  • The fact they're using preternaturally clean code-y tasks to measure competence. (I get that this isn't necessarily reassuring as AI development is arguably a preternaturally clean code-y task.)
  • The way Baselined tasks tend to be easier and Baselining seems (definitely slightly) biased towards making them look easier still, while Estimated tasks tend to be harder and Estimation seems (potentially greatly) biased towards making them look harder still: the combined effect would be to make progress gradients look artificially steep in analyses where Baselined and Estimated tasks both matter. (To my surprise, Estimated tasks didn't make much difference to (my reconstruction of) the analysis, due (possibly/plausibly/partially) to current task horizons being under an hour; but if someone used HCAST to evaluate more capable future models without doing another round of Baselining . . .)
  • Possibly some other stuff I forgot, idk. (All I can tell you is I don't remember thinking "this seems like it's potentially biased against reaching scary conclusions" at any point when reading the papers.)

Looked into it more and you're right: conventional symbolic regression libraries don't seem to have the "calculate a quantity then use that as a new variable going forward" behavior I'd have needed to get Total Value and then decide&apply a tax rate based on that. I . . . probably should have coded up a proof-of-concept before impugning everyone including myself.
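(The closest substitute I can see is doing that first step by hand: compute the intermediate quantity yourself, then hand it to the regressor as just another input column. A rough sketch with gplearn, using made-up column names rather than the scenario's actual data:)

```python
import pandas as pd
from gplearn.genetic import SymbolicRegressor

# Made-up file/column names, standing in for the scenario's actual data.
df = pd.read_csv("loot.csv")

# Step 1 (manual): build the intermediate quantity the library can't
# invent-and-reuse on its own.
df["total_value"] = df[["coins", "gems", "trophies"]].sum(axis=1)

# Step 2: let symbolic regression search for a formula, with the
# hand-built feature available as an ordinary input.
X = df[["coins", "gems", "trophies", "total_value"]]
y = df["tax_paid"]

model = SymbolicRegressor(population_size=2000, generations=30,
                          function_set=("add", "sub", "mul", "div"),
                          random_state=0)
model.fit(X, y)
print(model._program)  # best expression found
```

(Even then, a threshold-based tax rate would presumably also want some kind of step or comparison primitive in the function set, which as far as I can tell means writing custom functions, so "use the library out of the box" still doesn't quite get there.)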

Reflections on my performance:

There's an interesting sense in which we all failed this one. Most other players used AI to help them accomplish tasks they'd personally picked out; I eschewed AI altogether and constructed my model with brute force and elbow grease; after reaching a perfect solution, I finally went back and used AI correctly, by describing the problem on a high level (manually/meatbrainedly distilled from my initial observations) and asking the machine demiurge what approach would make most sense[1]. From this I learned about the fascinating concept of Symbolic Regression and some associated python libraries, which I eagerly anticipate using to (attempt to) steamroll similarly-shaped problems.

(There's a more mundane sense in which I specifically failed this one, since even after building a perfect input-output relation and recognizing the two best archetypes as rebatemaxxing and corpsemaxxing, I still somehow fell at the last hurdle and failed to get a (locally-)optimal corpsemaxxing solution; if the system had followed the original plan, I'd be down a silver coin and up a silver medal. Fortunately for my character's fortunes and fortune, Fortune chose to smile.)

Reflections on the challenge:

A straightforward scenario, but timed and executed flawlessly. In particular, I found the figuring-things-out gradient (admittedly decoupled from the actually-getting-a-good-answer gradient) blessedly smooth, starting with picking up on the zero-randomness premise[2] and ending with the fun twist that the optimal solution doesn't involve anything being taxed at the lowest rate[3].

I personally got a lot out of this one: for an evening's exacting but enjoyable efforts, I learned about an entire new form of model-building, about the utility and limits of modern AI, and about Banker's Rounding. I vote four-out-of-five for both Quality and Complexity . . . though I recognize that such puzzle-y low-variance games are liable to have higher variance in how they're received, and I might be towards the upper end of a bell curve here.
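(For anyone else who hadn't met it: Banker's Rounding is "round half to even", which is what Python's built-in round() does.)

```python
# Banker's Rounding: exact halves go to the nearest even digit, so
# repeated rounding doesn't systematically drift upward.
print(round(0.5), round(1.5), round(2.5), round(3.5), round(4.5))
# 0 2 2 4 4
```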

  1. ^

    For a lark, I also tried turning on all ChatGPT's free capabilities and telling it to solve the problem from scratch. It thought for ~30 seconds and then spat out a perfect solution; I spent ~30 further seconds with paperclips dancing before my eyes; I then discovered it hadn't even managed to download the dataset, and was instead applying the not-unreasonable heuristic "if abstractapplic and simon agree on an answer it's probably true".

  2. ^

    There's something fun about how "magic", "games", "bureaucracy", and "magical game bureaucracy" are equally good justifications for a "wait, what paradigm am I even in here?" layer of difficulty.

  3. ^

    I know that part wasn't intentional, but I think rebatemaxxing>corpsemaxxing is nontrivially more compelling than the other way round.

Meta musing:

It looks like the optimal allocation is borderline fraudulent. When I think of in-universe reasons for the TAE to set up Cockatrice Eye rebates the way they did, my best guess is "there's a bounty on these monsters in particular, and the taxmen figure someone showing up with n Cockatrice Eyes will have killed ceil(n/2) of them". This makes splitting our four eyes (presumably collected from two monsters) four ways deceptive; my only consolation is that the apparently-standard divide-the-loot-as-evenly-as-possible thing most other adventuring teams seem to be doing also frequently ends up taking advantage of this incentive structure.
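(Under that guessed rule, the distortion is easy to quantify:)

```python
from math import ceil

def presumed_kills(eyes_per_member):
    # Guessed taxman heuristic: holding n eyes implies ceil(n/2) cockatrice kills.
    return sum(ceil(n / 2) for n in eyes_per_member)

print(presumed_kills([4]))           # 2 -- one member holds all four eyes
print(presumed_kills([1, 1, 1, 1]))  # 4 -- split four ways, double the real kill count
```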

framing contradictory evidence as biased or manipulated

Most contradictory evidence is, to some extent (regardless of what it's contradicting).

dismissing critics as [...] deluded, or self-interested

Most critics are, to some extent (regardless of what they're criticizing).

Assuming I didn't make any mistakes in my deductions or decisions, optimal plan goes like this:

Give everyone a Cockatrice Eye (to get the most out of the associated rebate) and a Dragon Head (to dodge the taxing-you-twice-on-every-Head-after-the-first thing).

Give the mage and the rogue a Unicorn Horn and a Zombie Hand each, and give the cleric four Zombie Hands; this should get them all as close to the 30sp threshold as possible without wrecking anything else.

Give literally everything else to the fighter, allowing them to bear the entire 212sp cost; if they get mad about it, analogize it to being a meatshield in the financial world as well as the physical.

Thanks for your reply, and (re-)welcome to LW!

My conclusion is that I'm pretty sure you're wrong in ways that are fun and useful to discuss!

I hope so! Let's discuss.

(Jsyk you can spoiler possible spoilers on Desktop using ">!" at the start of paragraphs, in case you want to make sure no LWers are spoiled on the contents of a most-of-a-century-old play.)

Regarding the witnesses:

I agree - emphatically! - that eyewitness testimony is a lot less reliable than most people believe. I mostly only brought the witnesses up in my discussion because I thought the jury dismissed them for bad reasons, instead of a general policy of "eyewitnesses are unreliable". (In retrospect, I could have been a lot clearer on this point.)

Regarding the knife:

I agree that the knife being unique would have made things a lot more clear-cut, but disagree about the implications.

If no-one is deliberately trying to frame the accused, the odds of the real killer happening to use the same brand of knife as the one he favors are very low. (What fraction of knives* available to potential suspects are of that exact type? One in a hundred, maybe? If we assume no frame-up or suicide and start with your prior probability of 10%, then a naive Bayesian update and a factor of 100 moves that to >90% even without other evidence**.)

If he is actively being framed . . . that's not overwhelmingly implausible, since it's not a secret what kind of knife he uses, and the real killer would be highly motivated to shift blame. However, the idea that he'd have lost his knife, by coincidence, at the same time that someone was using an exact duplicate to frame him (and then couldn't find it afterwards, even though it would be decisive for his defense) . . . strains credulity. I'm less sure about how to quantify the possibility a real killer took his knife without him knowing, got into the victim's apartment, and performed the kill all while the accused was out at the movies; but I feel pretty confident the accused's knife was the murder weapon.

*I'm ignoring the effects of the murder weapon being a knife at all because they're surprisingly weak. The accused owns a knife and favors using it, but so would many alternative suspects; and the accused cohabiting with the victim implies he also has easy access to many alternative methods - poison, arranging an accident - that Hypothetical Killer X wouldn't.

**Full disclosure, I didn't actually perform the calculation until I started writing this post; I admit to being surprised by how little a factor of ~100 changes a ~10% prior probability, though I still feel it's a stronger effect than you're accounting for, and for that matter think your base rates are too low to start with (the fight wasn't just a fight, it was the culmination of years of persistent abuse).
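(For completeness, the naive odds-form update described above, with the one-in-a-hundred guess taken at face value:)

```python
prior = 0.10            # starting probability of guilt
likelihood_ratio = 100  # ~1-in-100 knives match, so a match is ~100x likelier if guilty

prior_odds = prior / (1 - prior)                 # 1 : 9
posterior_odds = prior_odds * likelihood_ratio   # ~11.1 : 1
posterior = posterior_odds / (1 + posterior_odds)
print(f"{posterior:.1%}")  # ~91.7%
```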

Regarding my conspiracy theories:

I agree that the protagonist having ideological or personal reasons to make the case turn out this way is much more likely than him having been successfully bribed or threatened; aside from anything else, the accused doesn't seem terribly wealthy or well-connected.

I also agree with your analysis of the racist juror's emotional state as presented, though I continue to think it's slightly suspicious that things happened to break that conveniently. (The Doylist explanation is of course that the director wanted the bigot to come off as weak and/or needed things to wrap up satisfyingly inside a two-hour runtime, but I'm an incorrigible Watsonian.)
