For anyone too lazy to look through SWELancer themselves: 

  1. There are some sub-benchmarks, but the main coding benchmark is pretty much "here's a real issue description, fix it." There are also real dollar values attached to the issues (based on bounties the open-source maintainers put on them).
  2. The paper claims these issues are filtered less than SWE-Bench Verified (which e.g. filters for PRs that have automated tests). But it's not clear this is true. For example, if you look through the issues that e.g. Expensify has in the benchmark, they say things like "I'm not sure this is a great issue for external contractors. @expensify does this feel well-contained enough for you to bounty it?" In other words, don't think of this as a sample of all PRs from Expensify -- it's actually issues selected to be low-context.
  3. Notably, the current benchmark is quite unfair to the agent:
    1. The benchmark does not include the comments on the issue in the context given to the agent. In the 3 issues I sampled, the comments included design work that described how the solution should be implemented!
    2. Agents did not have multi-modal inputs. 75% of the issues had either videos or screenshots of the solutions (either in the main body of the issue, or in the comments). Agents couldn't see these!
    3. Agents were given a user tool that ran their proposed changes -- but they couldn't look at the screenshots of the resulting user trace (they only got to see text trace logs).
  4. From the brief look I've taken at the test cases, they don't look bad, but they also don't look super robust. I expect models to cheese them sometimes!

Overall, my expectation is that:

  1. The issues are in fact filtered quite aggressively for ones that are well-contained and require little context (so they can be bountied).
  2. A better-elicited agent would in fact do much better: give it multimodal inputs and the extra comments on the issue.
  3. This benchmark exists in part to connect agents to economic value. The $1M number sounds really nice! I think there's a lot left to be said about measuring economic value with benchmarks -- and I'm personally excited to push on this!

How do you define transformative AI? If ChatGPT gets 10x better (e.g. it can write most code, answer most questions as well as the experts in a subject, etc.) -- would this qualify?

How would you even force an AI to use the weights (a simplification) that correspond to fact vs. fiction anyway?

Also, what really is the difference between our history textbooks and our fiction to an AI that's just reading a bunch of text? I'm not being flippant; I'm genuinely wondering here! If you don't imbue these models with an explicit world-model, why would one always be privileged over the other?