This is METR’s collection of resources for evaluating potentially dangerous autonomous capabilities of frontier models. The resources include a task suite, some software tooling, and guidelines on how to ensure an accurate measurement of model capability. Building on those, we’ve written an example evaluation protocol. While intended as a “beta”...
Update 3/14/2024: This post is out of date. For current information on the task bounty, see our Task Development Guide. Summary METR (formerly ARC Evals) is looking for (1) ideas, (2) detailed specifications, and (3) well-tested implementations for tasks to measure performance of autonomous LLM agents. Quick description of key...
Update: We are no longer accepting gnarly bug submissions. However, we are still accepting submissions for our Task Bounty! TL;DR: Looking for hard debugging tasks for evals, paying the greater of $60/hr or $200 per example. METR (formerly ARC Evals) is interested in producing hard debugging tasks for models to attempt...
Authors' Contributions: Both authors contributed equally to this project as a whole. Evan did the majority of implementation work, as well as the work for writing this post. Megan was more involved at the beginning of the project, and did the majority of experiment design. While Megan did give some...
The first half of this post uses causal tracing to explore differences in how GPT2-XL handles completing cached phrases vs. completing factual statements. The second half details my attempt to build intuitions about the high-level structure of GPT2-XL and is speculation-heavy. Some familiarity with transformer architecture is assumed but...
I try out "let's gather the relevant facts" as a zero-shot question-answering aid on TruthfulQA. It doesn't help more than other helpful prompts, though it might work better on more typical factual questions. This post could be useful to people interested in playing with OpenAI's API, or who...