Tao Lin

Send us example gnarly bugs

Update: We are no longer accepting gnarly bug submissions. However, we are still accepting submissions for our Task Bounty! Tl;dr: Looking for hard debugging tasks for evals, paying greater of $60/hr or $200 per example. METR (formerly ARC Evals) is interested in producing hard debugging tasks for models to attempt as part of an agentic capabilities evaluation. To create these tasks, we’re seeking repos containing extremely tricky bugs. If you send us a codebase that meets the criteria for submission (listed below), we will pay you $60/hr for time spent putting it into our required format, or $200, whichever is greater. (We won’t pay for submissions that don’t meet these requirements.) If we’re particularly excited about your submission, we may also be interested in purchasing IP rights to it. We expect to want about 10-30 examples overall depending on the diversity. We're likely to be putting bounties on additional types of tasks over the next few weeks. Criteria for submission: * Contains a bug that would take at least 6 hours for an experienced programmer to solve, and ideally >20hrs * More specifically, ">6 hours for a decent engineer who doesn't have context on this particular codebase". E.g. a randomly selected engineer who's paid $100-$200 per hour who's familiar with the language and overall stack that's being used, but not the person who wrote the code, and not an expert in the particular component that is causing the bug. * Ideally, has not been posted publicly in the past * (Though note that we may still accept submissions from public repositories given that they are not already in a SWE-bench dataset and meet the rest of our requirements. Check with us first.) * You have the legal right to share it with us (e.g. please don’t send us other people’s proprietary code or anything you signed an NDA about) * Ideally, the task should work well with static resources - e.g. you can have a local copy of the documentation for all the relevant libra

77Dec 10, 2023

Tao Lin

Message

1001

181

Send us example gnarly bugs

Dec 10, 202377

Causal scrubbing: results on induction heads

* Authors sorted alphabetically. This is a more detailed look at our work applying causal scrubbing to induction heads. The results are also summarized here. Introduction In this post, we’ll apply the causal scrubbing methodology to investigate how induction heads work in a particular 2-layer attention-only language model.[1] While we...

Dec 3, 202234

Causal scrubbing: results on a paren balance checker

* Authors sorted alphabetically. This is a more detailed look at our work applying causal scrubbing to an algorithmic model. The results are also summarized here. Introduction In earlier work (unpublished), we dissected a tiny transformer that classifies whether a string of parentheses is balanced or unbalanced.[1] We hypothesized the...

Dec 3, 202239

Tao Lin's Shortform

Jul 30, 20212

LESSWRONG
LW

LESSWRONG
LW

Tao Lin

Tao Lin

Tao Lin

Send us example gnarly bugs

Causal scrubbing: results on a paren balance checker

Causal scrubbing: results on induction heads

Tao Lin's Shortform

Tao Lin

Send us example gnarly bugs

Causal scrubbing: results on induction heads

Causal scrubbing: results on a paren balance checker

Tao Lin's Shortform

Send us example gnarly bugs

Causal scrubbing: results on induction heads

Causal scrubbing: results on a paren balance checker

Tao Lin's Shortform

Send us example gnarly bugs

Causal scrubbing: results on a paren balance checker

Causal scrubbing: results on induction heads

Tao Lin's Shortform