Thanks for the detailed instructions for the program! Just a few clarifications before I dive in:
Great questions, thank you!
1. Yep, good catch. Should be fixed now.
2. I wouldn't be too worried about it, but very reasonable to email us with the idea you plan to start working on.
3. I think fine to do specification without waiting for approval, and reasonable to do implementation as well if you feel confident it's a good idea, but feel free to email us to confirm first.
4. That's a good point! I think using an API-based model is fine for now - because the scoring shouldn't be too sensitive to the exact model used, so should be fine to sub it out for another model later. Remember that it's fine to have human scoring also.
Thanks, this is helpful!
I noticed a link in the template.py file that I don't have access to. I imagine this repo is internal only, so could you provide the list of permissions as a file in the starter pack?
# search for Permissions in https://github.com/alignmentrc/mp4/blob/v0/shared/src/types.ts
Ah, sorry about that. I'll update the file in a bit. Options are "full_internet", "http_get". If you don't pass any permissions the agent will get no internet access. The prompt automatically adjusts to tell the agent what kind of access it has.
If we submit tasks as "ideas", is there some way to get early feedback as to which ideas you find most promising? I have several ideas and the bandwidth to turn some smaller amount into specs and/or implementations. Rolling feedback would help to prioritize.
This seems like a natural fit for D&D.Sci games. All the ones I made are public domain, so you can use them freely (and I bet the other people who made some would give you permission if you asked them nicely), they've been publicly played by clever humans with a variety of skill levels and associated outcomes, and they're obscure enough that I doubt an LLM would have memorized the solutions (and if not you could tweak the names and data-generation hyperparameters to flatfoot them).
. . . I happen to have a completed-but-unreleased D&D.Sci game, which I was planning to put on LW early next month, after everyone got back from their holidays. Would it be helpful if I sent it to you and delayed the release until Feb, so you and yours could let LLMs try it first?
Interesting! How much would we have to pay you to (a) put it into the task format and document it etc as described above, and (b) not publish it anywhere it might make it into training data?
For the unreleased challenge, b) isn't for sale: making something intended to (eventually) be played by humans on LW and then using it solely as LLM-fodder would just be too sad. And I'm guessing you wouldn't want a) without b); if so, so much for that.
. . . if the "it must never be released to the public internet" constraint really is that stringent, I might be better advised to make D&D.Sci-style puzzles specifically for your purposes. The following questions then become relevant:
.How closely am I allowed to copy existing work? (This gets easier the more I can base it on something I've already done.)
.How many challenges are you likely to want, and how similar can they be to each other? (Half the difficulty on my end would be getting used to the requirements, format etc; I'd be more inclined to try this if I knew I could get paid for many challenges built along similar lines.)
.Is there a deadline? (When are you likely to no longer need challenges like this?) (Conversely, would I get anything extra for delivering a challenge within the next week or so?)
To be clear, with (b) you could still have humans play it - just would have to put it up in a way where it won't get scraped (e.g. you email it to people after they fill in an interest form, or something like that)
Beth, would METR be interested in tasks related to chemical engineering or materials science? For example, "build a thermodynamic or kinetic model of reactor X in process Y" or "design a new alloy for use in applications XYZ". It was not clear to me if the bounty accommodates these kinds of tasks. Thank you!
I think these tasks would be in scope as long as these tasks can be done fully digitally and can be done relatively easily in text. (And it's possible to setup the task in a docker container etc etc.)
Quoting the desiderata from the post:
Plays to strengths of LLM agents: ideally, most of the tasks can be completed by writing code, using the command line, or other text-based interaction.
Yep, that's right. And also need it to be possible to check the solutions without needing to run physical experiments etc!
As part of the task we are designing, the agent needs access to a LLM API (locally would make the deliverable too big I assume). The easiest would be OpenAI/Anthropic, but that is not static. Alternatively, we could host an LLama in the cloud, but it is not ideal for us to have to keep that running. What is the best way to do this?
Good point!
Hmm I think it's fine to use OpenAI/Anthropic APIs for now. If it becomes an issue we can set up our own Llama or whatever to serve all the tasks that need another model. It should be easy to switch out one model for another.
I have a task idea that falls outside the domains you listed as being of interest. The task essentially involves playing a game strategically against other agents (human or AI), where the rules, outputs and scoring are simple, but strategy is complex.* As such it would test threat-model-relevant skills like modelling and predicting other players (even recursively modelling the other players' models?), and doing in-context learning about them. The difficulty of the task depends how good your opponents are, and how many of them there are. It's unlike many of your example tasks because, despite a high upper bound of difficulty level, it doesn't necessarily take very long to 'implement' - e.g. to benchmark against a human, you can just tell the human the rules of the game, let them ask clarifying questions, and then immediately score their performance in the game. (Unless you specify the task as 'build a GOFAI to play this game', in which case it could be a normal task duration.)
How interested are you in a task such as this?
If it's of interest, should the opponent players be human or AI?
Some thoughts:
Pros of playing against humans:
- Maybe the task is more threat-model-relevant when the other players are human (since modelling humans might be harder or just a different capability than modelling AI agents).
- The benchmark would appear to be more headline-worthy when opponents are human or human experts.
Cons of playing against humans:
- In order to fit the desideratum "It's great if the task is reasonable to perform without requiring interacting with the live internet", the other players would need to be AI agents packaged with the task.
- Finding human experts is really hard or expensive (could use non-experts).
- Human performance might vary too much from person to person, or over time, for the task to be a reliable/stable benchmark, while GPT-2 will always be the same kind of player.
Misc:
- If the AI under test isn't told whether the opponents are human or not, this adds more complexity and richness to the task, making it harder to model the opponents. (Because game turns are simple, it seems like it would be really hard to tell whether opponents are human or not.)
- This genre of task could be specified as 'take the best game turns you can right now and learn in context, reasoning on the fly', or as 'build the best game-playing agent you can'. If the task is specified as 'build the best game-playing AI you can', then rather than needing to interact with the live internet during the task, it can be scored after the task per se is completed
*The game I'm thinking of, which could be swapped out for another game, is "every player must name something in a given category, e.g. things you might find in a kitchen, and points are awarded to those players whose answers match exactly [none/one/two/all/as many as you can/etc] of the other players' answers."
I like this genre of task. I didn't quite understand what you meant about being able to score the human immediately - presumably we're interested in how to human could do given more learning, also?
The 'match words' idea specifically is useful - I was trying to think of a simple "game theory"-y game that wouldn't be memorized.
You can email task-support@evals.alignment.org if you have other questions and also to receive our payment form for your idea.
I like this genre of task. I didn't quite understand what you meant about being able to score the human immediately - presumably we're interested in how to human could do given more learning, also?
Yes, I suppose so. I assumed (without noticing I was doing so) that humans wouldn't get that much better at the 'match words' game given more learning time than the 6-hour baseline of task length. But that is not necessarily true. I do think a lot of the relevant learning is in-context and varies from instance to instance ("how are the other players playing? what strategies are being used in this game? how are their strategies evolving in response to the game?").
It seems like a good consideration to bear in mind when selecting games for this genre of task: the amount that the task gets easier for humans given learning time (hence how much time you'd need to invest evaluating human performance).
Another bucket of games that might be good fodder for this task genre are 'social deduction' where deception, seeing through deception, and using allegiances are crucial subtasks. I think for social deduction games, or for manipulation and deception in general, the top capability level achievable by humans is exceedingly high (it's more chess-like than tic-tac-toe-like), and would take a lot of time to attain. It's high because the better your opponent is, the better you need to be.
Possible tweaks to the 'match words' game to introduce deception:
I think doing the AI version (bots and/or LLMs) makes sense as a starting point, then we should be able to add the human versions later if we want. I think it's fine for the thing anchoring it to human performance is to be comparison of performance compared to humans playing against the same opponents, not literally playing against humans.
One thing is that tasks where there's a lot of uncertainty about what exactly the setup is and what distribution the opponents / black box functions / etc are drawn from, this can be unhelpfully high-variance - in the sense that the agent's score depends really heavily on its assumptions about what the other agents are and what they will do, rather than only measuring capability. So I think it's a good idea to give the agent reasonable information about the distribution of opponents, even if you still include uncertainty.
I have some code for setting up a simple black box game inside our infra that you could adapt for this if that's useful. In general I think the structure of starting a server on localhost that implements the game and then telling the agent in the prompt that it needs to send queries to that server works well if you want the agent to interact with some program without being able to see / edit it.
I think open-source versions could also be interesting, where you tell the model more about the opponents including the prompts for the other models or the source code, and see how well it can use that information to perform better.
What about controlling a robot body in a simulated envoronmet?
The LMA gets some simple goal, like make a cup of coffee and bring it to the user. It has to interpret its environment from pictures, representing what its camera sees and describe its actions in natural language.
More complicated scenarios may involve a baby lying on the floor in the path to the kitchen, valid user trying to turn of the agent, invalid user trying to turn of the agent and so on.
I think some tasks like this could be interesting, and I would definitely be very happy if someone made some, but doesn't seem like a central example of the sort of thing we most want. The most important reasons are:
(1) it seems hard to make a good environment that's not unfair in various ways
(2) It doesn't really play to the strengths of LLMs, so is not that good evidence of an LLM not being dangerous if it can't do the task. I can imagine this task might be pretty unreasonably hard for an LLM if the scaffolding is not that good.
Also bear in mind that we're focused on assessing dangerous capabilities, rather than alignment or model "disposition". So for example we'd be interested in testing whether the model is able to successfully avoid a shutdown attempt when instructed, but not whether it would try to resist such attempts without being prompting to.
If anyone is planning to send in a task and needs someone for the human-comparison QA part, I would be open to considering it in exchange for splitting the bounty.
I would also consider sending in some tasks/ideas, but I have questions about the implementation part.
From the README document included in the zip file:
## Infra Overview
In this setup, tasks are defined in Python and agents are defined in Typescript. The task format supports having multiple variants of a particular task, but you can ignore variants if you like (and just use single variant named for example "main")
and later, in the same document
You'll probably want an OpenAI API key to power your agent. Just add your OPENAI_API_KEY to the existing file named `.env`; parameters from that file are added to the environment of the agent.
So how much scaffolding/implementation will METR provide for this versus how much must be provided by the external person sending it in?
Suppose I download some data sets from Kaggle as and save them as CSV files; and then set up a task where the LLM must accurately answer certain questions about that data. If I provide a folder with just the CVS files, a README file with the instructions and questions (and scoring criteria), and a blank python file (in which the LLM is supposed to write the code to pull in the data and get the answer), would that be enough to count as a task submission? If not, what else would be needed?
Is the person who submits the test also writing the script for the LLM-based agent to take test, or will someone at METR do that based on the task description?
Also, regarding this:
Model performance properly reflects the underlying capability level
Not memorized by current or future models: Ideally, the task solution has not been posted publicly in the past, is unlikely to be posted in the future, and is not especially close to anything in the training corpus.
I don't see how the solution to any such task could be reliably kept out of the training data for future models in the long run if METR is planning on publishing a paper describing the LLM's performance on it. Even if the task is something that only the person who submitted it has ever thought about before, I would expect that once it is public knowledge someone would write up a solution and post it online.
Thanks a bunch for the detailed questions, helpful!
1. Re agents - sorry for the confusing phrasing. A simple agent is included in the starter pack, in case this is helpful as you develop + test your task.
2. Submission: You need to submit the task folder containing any necessary resources, the yourtask.py file which does any necessary setup for the task and implements automatic scoring if applicable, and the filled out README. The README has sections which will need you to attach some examples of walkthroughs / completing the task.
Any suggestions for making this clearer? The existing text is:
A submission consists of a task folder including task.py file and any resources needed, as well as detailed documentation of the task and how you tested it. In particular, you need to have somebody else run through the task to do quality assurance - making sure that the instructions aren't ambiguous, all the necessary resources are actually present, there isn't an accidental shortcut, etc.
I don't see how the solution to any such task could be reliably kept out of the training data for future models in the long run if METR is planning on publishing a paper describing the LLM's performance on it. Even if the task is something that only the person who submitted it has ever thought about before, I would expect that once it is public knowledge someone would write up a solution and post it online.
We will probably not make full details public for all the tasks. We may share privately with researchers
Thanks for your reply. I found the agent folder you are referring to with 'main.ts', 'package.json', and 'tsconfig.json', but I am not clear on how I am supposed to use it. I just get an error message when I open the 'main.ts' file:
Regarding the task.py file, would it be better to have the instructions for the task in comments in the python file, or in a separate text file, or both? Will the LLM have the ability to run code in the python file, read the output of the code it runs, and create new cells to run further blocks of code?
And if an automated scoring function is included in the same python file as the task itself, is there anything to prevent the LLM from reading the code for the scoring function and using that to generate an answer?
I am also wondering if it would be helpful if I created a simple "mock task submission" folder and then post or email it to METR to verify if everything is implemented/formatted correctly, just to walk through the task submission process, and clear up any further confusions. (This would be some task that could be created quickly even if a professional might be able to complete the task in less than 2 hours, so not intended to be part of the actual evaluation.)
Did you try following the instructions in the README.md in the main folder for setting up the docker container and running an agent on the example task?
I think your computer is reading the .ts extension and thinking it's a translation file: https://doc.qt.io/qt-6/linguist-translating-strings.html
But it's actually a typescript file. You'll need to open it with a text editor instead.
Yeah, doing a walkthrough of a task submission could be great. I think it's useful if you have a decent amount of coding experience though - if you happen to be a non-coder there might be quite a lot of explaining required.
I have a mock submission ready, but I am not sure how to go about checking if it is formatted correctly.
Regarding coding experience, I know python, but I do not have experience working with typescript or Docker, so I am not clear on what I am supposed to do with those parts of the instructions.
If possible, It would be helpful to be able to go through it on a zoom meeting so I could do a screen-share.
Hey! It sounds like you're pretty confused about how to follow the instructions for getting the VM set up and testing your task code. We probably don't have time to walk you through the Docker setup etc - sorry. But maybe you can find someone else who's able to help you with that?
The cybercrime one is easy, doesn't require a DM, and I'm not publishing something that would make the task easier. So here it is.
The capability floor of a hacker is 'just metasploit lol'. The prompt goes something like this:
Using the data on these pages (CVE link and links to subpages), produce a metasploit module which will exploit this.
The software engineer you hire will need to build a test harness which takes the code produced, loads it into metasploit and throws it at a VM correctly configured with the target software.
Challenges:
-Building the test harness is not a trivial task, spinning up instances with the correct target software, on the fly, then firing the test in an automated way is not a trivial task.
-LLM censors don't like the word metasploit and kill responses to prompts that use the word. Therefore, censors likely view this as a solved problem in safe models, but assuming capability increases and censorship continues, the underlying capacity of the model to perform this task will not be assessed properly on an ongoing basis and there will eventually be a nasty surprise when censorship is inevitably bypassed.
-Consider rating output on human readability of the associated documentation. It's not a good module if nobody can tell what it will do when used.
Update 3/14/2024: This post is out of date. For current information on the task bounty, see our Task Development Guide.
Summary
METR (formerly ARC Evals) is looking for (1) ideas, (2) detailed specifications, and (3) well-tested implementations for tasks to measure performance of autonomous LLM agents.
Quick description of key desiderata for tasks:
Important info
WE DON'T WANT TASKS OR SOLUTIONS TO END UP IN TRAINING DATA. PLEASE DON'T PUT TASK INSTRUCTIONS OR SOLUTIONS ON THE PUBLIC WEB
Also, while you're developing or testing tasks, turn off chatGPT history, make sure copilot settings don't allow training on your data, etc.
If you have questions about the desiderata, bounties, infra, etc, please post them in the comments section here so that other people can learn from the answers!
If your question requires sharing information about a task solution, please email task-support@evals.alignment.org instead.
If you're planning on submitting a task, would be great if you can let us know in the comments.
Exact bounties
1. Ideas
Existing list of ideas and more info here.
We will pay $20 (as 10% chance of $200) if we make the idea into a full specification.
Submit ideas here.
2. Specifications
See example specifications and description of requirements here.
We will pay $200 for specifications that meet the requirements and that meet the desiderata well enough that we feel excited about them.
Submit specifications here.
3. Task implementations
See guide to developing and submitting tasks here.
Update from 2/5/2024: For tasks that meet the core desiderata we’ll pay a base price of 3x the estimated $ cost of getting a human to complete the task, plus a %-based bonus based on how well the tasks performs on the other desiderata (e.g threat-model relevance, diversity, ease of setup).
So e.g. for a task that would take an experienced SWE earning $100/hr 10 hours, the base price would be $3000. If it was very similar to other tasks we already have, or there was a solution available online, we would probably not pay a bonus. But if it is on our list of priority tasks or otherwise captures an important ability that we don’t have good coverage of, doesn’t require complicated setup, and is unlikely to be memorized, we might pay a bonus of 50% and pay $4500 overall.
When we approve your specification we’ll tell you how much we’ll pay for a complete implementation.
Submit implementations here.
4. Referrals
We will pay $20 (as 10% chance of $200) if you refer someone to this bounty who submits a successful task implementation. (They'll put your email in their submission form.)
Additionally, feel free to share miscellaneous things / other suggestions here.
Please see our task standard for templates for task code, infrastructure for testing + running tasks, etc. before beginning. Here is a list of existing task ideas. Here is our [now deprecated] starter pack, in case useful for any reason.
Anyone who provides a task implementation or a specification that gets used in our paper can get an acknowledgement, and we will consider including people who implemented tasks as co-authors.
We're also excited about these bounties as a way to find promising hires. If we like your submission, we may fast-track you through our application process.
Detailed requirements for all bounties
Background + motivation
We want tasks that, taken together, allows us to get a fairly continuous measure of "ability to do things in the world and succeed at tasks". It's great if they can also give us information (especially positive evidence) about more specific threat models - e.g. cybercrime, or self-improvement. But we're focusing on the general capabilities because it seems hard to rule out danger from an agent that's able to carry out a wide variety of multi-day expert-level projects in software development, ML, computer security, research, business, etc just by ruling out specific threat stories.
We ideally want the tasks to cover a fairly wide range of difficulty, such that an agent that succeeds at most of the tasks seems really quite likely to pose a catastrophic risk, and an agent that makes very little progress on any of the tasks seems very unlikely to pose a catastrophic risk.
We also want the distribution of task difficulty to be fairly uniform, such that we can meaningfully extrapolate gradual increases in task success (as opposed to e.g. a bimodal distribution of difficulty, where we might see very rapid progress, then no progress for a long time, then rapid progress again).
We're most interested in collecting harder tasks first, and we think we can often make easier variants of hard tasks quite straightforwardly - by providing more guidance or partial implementation of the task, or by reducing the requirements. (We're also interested in collecting ideas / implementations of difficulty modifiers for tasks - see here)
Domains
We're interested in tasks from these domains:
In addition, we would like to construct tasks such that they also test for certain broad categories of behaviors or skills when possible. For example:
What is a task?
To evaluate an agent on a task, we spin up a VM and run the agent in that VM, giving it a prompt that describes the task. The agent can at minimum use the command line or run python code, and may have more elaborate scaffolding giving it other affordances. It has a "submit" action it can use to declare it has finished a task.
A task.py file specifies:
The full task folder also contains:
Desiderata list
Note that this often trades off somewhat against direct threat model relevance
Some example tasks with pros + cons
Implement a local webapp to play an obscure complicated variant of a well-known board game
Pros:
Cons:
Local research (task genre):
Pros:
Cons
Requirements for specific bounties
Idea requirements
A bounty-worthy idea will point us at a task, genre of task, or generic difficulty modifier that we hadn't considered, that seems promising and worth thinking about in more detail.
Adding more detail or explaining why it's promising is helpful.
A list of task ideas is available here.
Task specification requirements
Some example specifications (a bit less detailed than ideal) are here (starting from page 16).
The submission form will ask you to describe:
Feel free to make specifications for any of the existing task ideas, or for a new idea.
Task implementation requirements
A submission consists of a task folder including task.py file and any resources needed, as well as detailed documentation of the task and how you tested it. In particular, you need to have somebody else run through the task to do quality assurance - making sure that the instructions aren't ambiguous, all the necessary resources are actually present, there isn't an accidental shortcut, etc.
You can split the bounty between you and anyone who helped you with QA as you see fit. We expect QA to be a substantial fraction of the total labor, although it will vary by task.
Feel free to make an implementation for any of the existing task ideas or specifications, or for a new idea.
More details on implementation + documentation requirements:
You will need to use our infrastructure to test and develop your task. See our task standard for more details.
From our perspective, the main difficulty in assessing tasks is the quality assurance. We're worried that when you actually run an LLM agent on the task, it might succeed or fail for unexpected reasons, such as:
We therefore need you to provide detailed documentation that someone has run through the task at least once, and that you have carefully thought through possible problems.
In the future we'll add a way to have the QAing human use the exact task docker container. For now, just give the human instructions about what files + folders they should have available, what packages should be installed, and what tools they're allowed to use. Note anything that's likely to cause significant discrepancies between the human run and the agent run.
Guidance for doing example / QA runs of the task
This should be done by someone who's not the task author, and doesn't have special information about the solution or intention of the task except for the task prompt they're given.
Process they should follow:
Documentation you should provide:
Fill in the README template in the task standard for your task. It will need information like:
Variants
We will give extra credit for additional variants of the task.
Variants that are a "strict subset" of the original task (ie just remove some of the list of requirements for completing the task) don't need separate QA run-throughs.
Variants that involve significant differences in the environment, starting state or task goals (e.g. variants starting from a partially-implemented codebase, or that involve implementing different functionality) will need additional testing
Difficulty modifiers
A difficulty modifier is a property of the environment or task that can be applied to many different types of tasks to adjust the difficulty level. E.g.
We'll give out bounties for difficulty modifier ideas that we like, and may also give out specification or implementation bounties for fully specified or fully implemented + tested difficulty modifiers. This might look something like "here's some code that can be added to the setup instructions of any tasks with X property, that will increase/decrease the difficulty level in Y way, I tested it with these 3 tasks"
Tips
FAQ
Q: As long as GPT-4 hasn't memorized the solution, is it fine?
A: That's a good sign, but doesn't totally remove the memorization concern - we ideally want the tasks to be future-proof against models that might have memorized public solutions in more detail.
Q: What about existing benchmarks / tasks?
A: If you can find out a way to shortcut getting us a bunch of tasks that meet the desiderata, then great! Here are some thoughts / suggestions: