
This seems to imply there are better and worse times of day to try and get immediate feedback. If so, what are they?

Obligatory shill/reminder for any teacher reading this that if they want a never-before-seen educational data science challenge which can't be solo'd by current-gen AI (and is incidentally super easy to mark) they can just DM me; I might want some kind of compensation if it needs to be extensively customized and/or never released to the public, but just sharing scenarios a few months before I post them on LW is something I'd absolutely do for the love of the game. (and if they want any other kind of challenge then uh good luck with that)

I’m not sure exactly what quantity you are calculating when you refer to the singularity date. Is this the extrapolated date for 50% success at 1 month (167hr) tasks?

. . . not quite: I'd forgotten that your threshold was a man-month, instead of a month of clock time. I'll redo things with the task length being a month of work for people who do need to eat/sleep/etc: luckily this doesn't change results much, since 730 hours and 167 hours are right next door on a log(t) scale.
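(Quantifying "right next door": the gap between the two thresholds is only a couple of doublings of task length, which is small next to the distance the trend has to cover anyway. A quick arithmetic sketch, using nothing but the two numbers themselves:)

```python
import math

work_month = 167    # hours: roughly a month of actual work
clock_month = 730   # hours: roughly a month of wall-clock time

# Gap between the two thresholds on a log(t) scale, in doublings of task length
gap = math.log2(clock_month / work_month)
print(f"{gap:.2f} doublings apart")  # ~2.13

# For comparison: the number of doublings from a one-hour horizon to either threshold
print(f"{math.log2(work_month):.1f} vs {math.log2(clock_month):.1f} doublings from 1 hour")
# ~7.4 vs ~9.5
```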

SWAA fits most naturally into the ‘fully_private’ category in the HCAST parlance

Your diagnosis was on the money. Filtering for the union of fully_private HCAST tasks and SWAA tasks (while keeping the three models which caused crashes without SWAAs) does still make forecasts more optimistic, but only nets half an extra year for the every-model model, and two extra years for the since-4o model.
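(For reference, the filter itself is nothing fancy; a minimal sketch of the kind of thing described, with made-up file and column names standing in for METR's actual schema:)

```python
import pandas as pd

# Hypothetical file/column names, not METR's actual release schema.
runs = pd.read_csv("runs.csv")  # one row per (model, task) attempt

keep = (
    ((runs["task_suite"] == "HCAST") & (runs["privacy_level"] == "fully_private"))
    | (runs["task_suite"] == "SWAA")
)
filtered_runs = runs[keep]
```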

I'll edit the OP appropriately; thank you for your help. (In retrospect, I probably should have run the numbery stuff past METR before posting, instead of just my qualitative concerns; I figured that if I was successfully reproducing the headline results I would be getting everything else right, but it would still have made sense to get a second opinion.)

Potential biases:

  • The aforementioned fact that privacy levels are upper bounds on how private things might be: afaik, METR has no airtight way to know what fully_private tasks were leaked or plagiarized or had extremely similar doppelganger tasks coincidentally created by unrelated third parties, or which public_problems had solutions posted somewhere they couldn't see but which nevertheless made it into the training data, or whether some public_solutions had really good summaries written down somewhere online.
  • The way later models' training sets can only ever have more benchmark tasks present in them than earlier ones. (I reiterate that at least some of these tasks were first created around the apparent gradient discontinuity starting with GPT-4o.)
  • The fact they're using non-fully_private challenges at all, and therefore testing (in part) the ability of LLMs to solve problems present in (some of their) training datasets. (I get that this isn't necessarily reassuring as some problems we'd find it scary for AI to solve might also have specific solutions (or hints, or analogies) in the training data.)
  • The fact they're using preternaturally clean code-y tasks to measure competence. (I get that this isn't necessarily reassuring as AI development is arguably a preternaturally clean code-y task.)
  • The way Baselined tasks tend to be easier and Baselining seems (definitely slightly) biased towards making them look easier still, while Estimated tasks tend to be harder and Estimation seems (potentially greatly) biased towards making them look harder still: the combined effect would be to make progress gradients look artificially steep in analyses where Baselined and Estimated tasks both matter. (To my surprise, Estimated tasks didn't make much difference to (my reconstruction of) the analysis, due (possibly/plausibly/partially) to current task horizons being under an hour; but if someone used HCAST to evaluate more capable future models without doing another round of Baselining . . .)
  • Possibly some other stuff I forgot, idk. (All I can tell you is I don't remember thinking "this seems like it's potentially biased against reaching scary conclusions" at any point when reading the papers.)

Looked into it more and you're right: conventional symbolic regression libraries don't seem to have the "calculate a quantity then use that as a new variable going forward" behavior I'd have needed to get Total Value and then decide&apply a tax rate based on that. I . . . probably should have coded up a proof-of-concept before impugning everyone including myself.
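(The closest substitute I can see is doing that first step by hand: compute the intermediate quantity yourself, then hand it to the regressor as just another input column. A rough sketch with gplearn, using made-up column names rather than the scenario's actual data:)

```python
import pandas as pd
from gplearn.genetic import SymbolicRegressor

# Made-up file/column names, standing in for the scenario's actual data.
df = pd.read_csv("loot.csv")

# Step 1 (manual): build the intermediate quantity the library can't
# invent-and-reuse on its own.
df["total_value"] = df[["coins", "gems", "trophies"]].sum(axis=1)

# Step 2: let symbolic regression search for a formula, with the
# hand-built feature available as an ordinary input.
X = df[["coins", "gems", "trophies", "total_value"]]
y = df["tax_paid"]

model = SymbolicRegressor(population_size=2000, generations=30,
                          function_set=("add", "sub", "mul", "div"),
                          random_state=0)
model.fit(X, y)
print(model._program)  # best expression found
```

(Even then, a threshold-based tax rate would presumably also want some kind of step or comparison primitive in the function set, which as far as I can tell means writing custom functions, so "use the library out of the box" still doesn't quite get there.)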

Reflections on my performance:

There's an interesting sense in which we all failed this one. Most other players used AI to help them accomplish tasks they'd personally picked out; I eschewed AI altogether and constructed my model with brute force and elbow grease; after reaching a perfect solution, I finally went back and used AI correctly, by describing the problem on a high level (manually/meatbrainedly distilled from my initial observations) and asking the machine demiurge what approach would make most sense[1]. From this I learned about the fascinating concept of Symbolic Regression and some associated python libraries, which I eagerly anticipate using to (attempt to) steamroll similarly-shaped problems.

(There's a more mundane sense in which I specifically failed this one, since even after building a perfect input-output relation and recognizing the two best archetypes as rebatemaxxing and corpsemaxxing, I still somehow fell at the last hurdle and failed to get a (locally-)optimal corpsemaxxing solution; if the system had followed the original plan, I'd be down a silver coin and up a silver medal. Fortunately for my character's fortunes and fortune, Fortune chose to smile.)

Reflections on the challenge:

A straightforward scenario, but timed and executed flawlessly. In particular, I found the figuring-things-out gradient (admittedly decoupled from the actually-getting-a-good-answer gradient) blessedly smooth, starting with picking up on the zero-randomness premise[2] and ending with the fun twist that the optimal solution doesn't involve anything being taxed at the lowest rate[3].

I personally got a lot out of this one: for an evening's exacting but enjoyable efforts, I learned about an entire new form of model-building, about the utility and limits of modern AI, and about Banker's Rounding. I vote four-out-of-five for both Quality and Complexity . . . though I recognize that such puzzle-y low-variance games are liable to have higher variance in how they're received, and I might be towards the upper end of a bell curve here.
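(For anyone else who hadn't met it: Banker's Rounding is "round half to even", which is what Python's built-in round() does.)

```python
# Banker's Rounding: exact halves go to the nearest even digit, so
# repeated rounding doesn't systematically drift upward.
print(round(0.5), round(1.5), round(2.5), round(3.5), round(4.5))
# 0 2 2 4 4
```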

  1. ^

    For a lark, I also tried turning on all ChatGPT's free capabilities and telling it to solve the problem from scratch. It thought for ~30 seconds and then spat out a perfect solution; I spent ~30 further seconds with paperclips dancing before my eyes; I then discovered it hadn't even managed to download the dataset, and was instead applying the not-unreasonable heuristic "if abstractapplic and simon agree on an answer it's probably true".

  2. ^

    There's something fun about how "magic", "games", "bureaucracy", and "magical game bureaucracy" are equally good justifications for a "wait, what paradigm am I even in here?" layer of difficulty.

  3. ^

    I know that part wasn't intentional, but I think rebatemaxxing>corpsemaxxing is nontrivially more compelling than the other way round.

Meta musing:

It looks like the optimal allocation is borderline fraudulent. When I think of in-universe reasons for the TAE to set up Cockatrice Eye rebates the way they did, my best guess is "there's a bounty on these monsters in particular, and the taxmen figure someone showing up with n Cockatrice Eyes will have killed ceil(n/2) of them". This makes splitting our four eyes (presumably collected from two monsters) four ways deceptive; my only consolation is that the apparently-standard divide-the-loot-as-evenly-as-possible thing most other adventuring teams seem to be doing also frequently ends up taking advantage of this incentive structure.
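(Under that guessed rule, the distortion is easy to quantify:)

```python
from math import ceil

def presumed_kills(eyes_per_member):
    # Guessed taxman heuristic: holding n eyes implies ceil(n/2) cockatrice kills.
    return sum(ceil(n / 2) for n in eyes_per_member)

print(presumed_kills([4]))           # 2 -- one member holds all four eyes
print(presumed_kills([1, 1, 1, 1]))  # 4 -- split four ways, double the real kill count
```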

framing contradictory evidence as biased or manipulated

Most contradictory evidence is, to some extent (regardless of what it's contradicting).

dismissing critics as [...] deluded, or self-interested

Most critics are, to some extent (regardless of what they're criticizing).

Assuming I didn't make any mistakes in my deductions or decisions, optimal plan goes like this:

Give everyone a Cockatrice Eye (to get the most out of the associated rebate) and a Dragon Head (to dodge the taxing-you-twice-on-every-Head-after-the-first thing).

Give the mage and the rogue a Unicorn Horn and a Zombie Hand each, and give the cleric four Zombie Hands; this should get them all as close to the 30sp threshold as possible without wrecking anything else.

Give literally everything else to the fighter, allowing them to bear the entire 212sp cost; if they get mad about it, analogize it to being a meatshield in the financial world as well as the physical.

Thanks for your reply, and (re-)welcome to LW!

My conclusion is that I'm pretty sure you're wrong in ways that are fun and useful to discuss!

I hope so! Let's discuss.

(Jsyk you can spoiler possible spoilers on Desktop using ">!" at the start of paragraphs, in case you want to make sure no LWers are spoiled on the contents of a most-of-a-century-old play.)

Regarding the witnesses:

I agree - emphatically! - that eyewitness testimony is a lot less reliable than most people believe. I mostly only brought the witnesses up in my discussion because I thought the jury dismissed them for bad reasons, instead of a general policy of "eyewitnesses are unreliable". (In retrospect, I could have been a lot clearer on this point.)

Regarding the knife:

I agree that the knife being unique would have made things a lot more clear-cut, but disagree about the implications.

If no-one is deliberately trying to frame the accused, the odds of the real killer happening to use the same brand of knife as the one he favors are very low. (What fraction of knives* available to potential suspects are of that exact type? One in a hundred, maybe? If we assume no frame-up or suicide and start with your prior probability of 10%, then a naive Bayesian update and a factor of 100 moves that to >90% even without other evidence**.)

If he is actively being framed . . . that's not overwhelmingly implausible, since it's not a secret what kind of knife he uses, and the real killer would be highly motivated to shift blame. However, the idea that he'd have lost his knife, by coincidence, at the same time that someone was using an exact duplicate to frame him (and then couldn't find it afterwards, even though it would be decisive for his defense) . . . strains credulity. I'm less sure about how to quantify the possibility a real killer took his knife without him knowing, got into the victim's apartment, and performed the kill all while the accused was out at the movies; but I feel pretty confident the accused's knife was the murder weapon.

*I'm ignoring the effects of the murder weapon being a knife at all because they're surprisingly weak. The accused owns a knife and favors using it, but so would many alternative suspects; and the accused cohabiting with the victim implies he also has easy access to many alternative methods - poison, arranging an accident - that Hypothetical Killer X wouldn't.

**Full disclosure, I didn't actually perform the calculation until I started writing this post; I admit to being surprised by how little a factor of ~100 changes a ~10% prior probability, though I still feel it's a stronger effect than you're accounting for, and for that matter think your base rates are too low to start with (the fight wasn't just a fight, it was the culmination of years of persistent abuse).
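(For completeness, the naive odds-form update described above, with the one-in-a-hundred guess taken at face value:)

```python
prior = 0.10            # starting probability of guilt
likelihood_ratio = 100  # ~1-in-100 knives match, so a match is ~100x likelier if guilty

prior_odds = prior / (1 - prior)                 # 1 : 9
posterior_odds = prior_odds * likelihood_ratio   # ~11.1 : 1
posterior = posterior_odds / (1 + posterior_odds)
print(f"{posterior:.1%}")  # ~91.7%
```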

Regarding my conspiracy theories:

I agree that the protagonist having ideological or personal reasons to make the case turn out this way is much more likely than him having been successfully bribed or threatened; aside from anything else, the accused doesn't seem terribly wealthy or well-connected.

I also agree with your analysis of the racist juror's emotional state as presented, though I continue to think it's slightly suspicious that things happened to break that conveniently. (The Doylist explanation is of course that the director wanted the bigot to come off as weak and/or needed things to wrap up satisfyingly inside a two-hour runtime, but I'm an incorrigible Watsonian.)
