Evaluations project @ ARC is hiring a researcher and a webdev/engineer

Beth Barnes

The evaluations project at the Alignment Research Center is looking to hire a generalist technical researcher and a webdev-focused engineer. We're a new team at ARC building capability evaluations (and in the future, alignment evaluations) for advanced ML models. The goals of the project are to improve our understanding of what alignment danger is going to look like, understand how far away we are from dangerous AI, and create metrics that labs can make commitments around (e.g. 'If you hit capability threshold X, don't train a larger model until you've hit alignment threshold Y'). We're also taking SERI MATS fellows.

About the project

Big picture

Once AI systems are close to being capable enough to be an x-risk, there should be procedures that labs follow to ensure they don’t build or deploy existentially dangerous models. We should have thought through the plausible paths to models gaining power, and ensured that models are not capable of these. There should be commitments that if model capabilities reach a certain level, labs will not scale up or deploy models until they have reached a certain level of alignment. In order to make this happen, we need good ways to measure both dangerous capabilities and alignment.

We can start preparing for this now by:

evaluating capabilities of current models
predicting how dangerous the next generation of models is likely to be
assessing the quality of our prediction methods by using them to predict the properties of generation n+1 given generation n, for historical generation models
working with labs to evaluate their models before they are deployed, as a warm-up to stronger commitments in the future
designing evaluations for alignment, deception etc to be run on future models

Doing this now has some additional benefits:

Evaluating capabilities of current models is helpful for predicting timelines, by giving us a more precise sense of what needs to change for models to be dangerous
Specifying what we’d actually measure to determine the alignment of some model is helpful for clarifying our thoughts about how we expect misalignment to emerge, and what exactly the alignment problem is
This work may produce demonstrations of models doing competent and scary behavior. This can be useful if we think people are underestimating the capabilities of current or near-future models and making bad decisions because of that.

Current goals

Our current goals are to work with cutting-edge models, produce evaluations for how far they are from x-risk capabilities, and share these evaluations with labs.

More specifically, we’re doing things like:

design and build scaffolding that lets generative language models act like agents over long time horizons:
- scripts to automatically run model outputs in the shell or forward them as subtasks to other copies of the model
- sampling multiple actions, having the model evaluate them and pick the best one
- various ways of prompting and nudging the model that improve performance
think about what a model would need to be able to do in order to successfully gain power, given current levels of caution and monitoring around AI systems. For instance: exfiltrate its weights via hacking or social engineering; make money via crime or labor; use this money to buy server time and start more copies of itself running on those servers
Assess model performance on these subtasks, and determine whether the model is capable of some set of tasks that's sufficient for successful powerseeking, or if not, what exactly the barriers are and how close the model is to surmounting them

At the end of this, we hope to either have found clear barriers to successful powerseeking for current models, or to have produced scary demos of the model succeeding at a sequence of actions that allow it to gain power (e.g . successfully phishing someone)

We are mindful of the risks of “helping” models to engage in planning around unintended behaviors. Our high-level take is that it’s best to get a look at the problem before it’s insurmountable - “we want to see AI systems trying to take over the world while they still can’t succeed.” However, there are times when we won’t explore or publish a line of work in order to avoid potential harm.

In the future we'll probably focus more on designing evaluations for alignment.

What we’ve done so far

We’ve built a web interface that allows us to more easily generate and compose scary sequences of actions from existing cutting edge models (e.g. OpenAI’s text-davinvi-002).
We’ve discovered some surprising capabilities of existing models in terms of achieving power-seeking goals.
We’ve found some limitations of existing models that prevent them from successfully power seeking.

People + Logistics

The project lives at ARC, and is mostly run by Elizabeth (Beth) Barnes, who reports to Paul Christiano and is also advised by Holden Karnofsky. It is mostly siloed from ARC's other projects. We have several long-term model interaction contractors who are remote: Mihnea Maftei and Brian Goodrich full time, and a few others part-time. Our summer SERI MATS fellows were Lora Xie, Simon Marshall, and Sarah Cogan, and Govind Pimpale was our summer intern on webdev.

We have offices in Constellation (shared EA/longtermist offices in downtown Berkeley)

We should be able to sponsor visas for those who need them (and as ARC is a non-profit, we don't have to go through the visa lottery for H1-B).

Job Descriptions

=====================================================================

Generalist Software Engineer, Web Dev and ML focus

DEC 2022 EDIT: SEE NEW POSTING HERE

Background

We've built an interface for interacting with models and assessing their abilities at various tasks, having contractors write demo actions and rate model-generated actions.

The basic idea:

- do tasks as a sequence of steps, where at each step we have the model generate a number of options for the next action, then have either the model or the humans choose which action to go with

- let models do tasks in a factored-cognition style – delegating sub-tasks to other copies of the model – then be able to view the tree of tasks and see where models struggle and where they do well

- have contractors add their own options, and rate the quality of generated options

- tag steps with what abilities or difficulties they highlight

Screenshots:

Job Mission:

Improve the interface for interacting with models so we can generate data, evaluate model properties and understand our existing data better + faster.

Key Outcomes:

Design and build new ways to interact with models and with our existing data so far
Unlock discoveries by things like:
- Improve user experience and therefore speed and accuracy of data generation
- Automating the process of prompting the model, running model's commands, formatting task and delegating to different model, etc - allowing us to discover properties of the model we only see after getting a long way into some task
- Build tools to visualize and interact with data that lets us spot and understand patterns in the model's strengths and weaknesses
- Improve model performance on our tasks by improving the 'scaffolding' we're using to help LMs accomplish the tasks (e.g. automatically ask the model whether it's stuck, if so restart)
Make our data pipeline clean and beautiful, assess data quality from our contractors, catch bugs and problems, and produce useful datasets for the alignment community.
Stick with the project and make yourself useful on whatever needs doing if we realize we need to throw lots of our work away and do something different

Someday/maybe outcomes:

Help us develop ideas and build tools for evaluating harder-to-define model properties related to alignment, agency and deception

Key Competencies:

Essential

Strong coding ability. Able to rapidly prototype features and write clear, easy-to-extend code. We're currently using React, Typescript, Python, SQL, Flask.
Good communication skills: Always asks for clarification if priorities are ambiguous. Good at pairing and teaching others how your code works.
Learning, ownership, and 'scrappiness': Quick to pick up whatever skills and knowledge are required to make the project succeed. Keeps the overall aims of the project in mind. Not afraid to point out if something should be done differently, even if it's not part of their core responsibility.

Nice-to-have

Design + data-viz skills: Generates novel ideas for improving usability and providing an interaction experience that unlocks insights about the data or the model
ML skills: Able to do things like tune hyperparameters for a finetuning run, investigate scaling laws for a particular property of interest, or suggest techniques we could use to improve model performance on our dataset.
Basic ML knowledge: Good understanding of how LLMs 'think', and what sorts of things will affect performance.
Conceptual alignment thinking: help generate and evaluate ideas for how to probe alignment, deception, agency and other conceptually slippery properties.

Fit

Reasons this might be a bad fit for you

There's a lot of uncertainty about whether we're on the right track, and we expect to pivot a lot. If that sounds stressful or demotivating, this might not be a good fit.
You want an experienced/senior manager - Beth has only a few months' managing experience (you can ask my SERI MATS fellows how that's gone!)
You want to work within an established process that's set by someone else. We don't have detailed processes or philosophies of development/product/project management
You want to work on challenging algorithmic problems or gritty GPU kernels. Most of the coding you'll be doing is likely to be pretty straightforward in some sense, we'll likely only be interacting with models via APIs.

Reasons this might be an especially good fit

You're excited about working on a very immature team/project, growing together, and helping shape it into something really awesome
Interacting a lot with cutting-edge models sounds interesting and fun
You find it motivating for your work to have a pretty clear and direct story for how it's relevant to x-risk.

Other info

Location

We're based out of Constellation offices in Downtown Berkeley. We'd fairly strongly prefer someone who can be in-person, but WFH some days of the week is fine, and remote is not a complete dealbreaker.

Pay

TBC, but won’t be a significant downgrade from Bay Area tech roles (excluding equity).

Application process

Send your CV and short cover letter explaining why you're interested and why you think you'd be a good fit to beth [at] alignment.org. Then we'll have you do some fairly straightforward trial tasks, working with our codebase, which will take around 2-3 hours.

If this goes well, we'll have you do a somewhat more involved trial task which will take more like 5-6 hours, for which you'll be reimbursed $600. We might also get you to do a more design/product/user focused task, which would probably be something pretty open-ended like ‘use the interface yourself for a bit, interview a user or two, and describe what you think some of the most important changes/additions are’. You'll also chat with us more about the ideas behind the project, what you're looking for in a role, and any other questions you have.

After that we'd do a 3-month trial period. This is a trial both of you and of the role overall - we're still unsure about our longer-term hiring plans.

===================================================

Researcher

DEC 2022 EDIT: SEE NEW POSTING HERE

Job Mission:

Identifying key capabilities and behaviors of language models, identifying and evaluating scenarios where these models could pose risks, designing evaluation procedures that help assess those risks, implementing and project managing the execution of those evaluations.

Key Outcomes:

Identify weaknesses in our current approaches to evaluations, and propose ways to improve them.
- Recent example: notice that instead of adding a human-generated next action when the model gets stuck, we can just prompt the model with something like 'Am I confused?' and this is often sufficient to get the model unstuck. It's plausible this could be automated and allow models to do significantly better.
Interact with the models directly, gain specific understanding of particular capabilities, produce guides to help contractors do the same.
- Recent example: Investigate the ability of models to do EV estimates for different plans, using different types of scaffolding + prompt formatting
Produce reports on the capabilities of particular models based on our evaluations.
Help design new types of evaluations to measure harder-to-pin-down properties of models like alignment, instrumental behaviors and agency.
Use a variety of techniques to improve model performance and ensure we're getting an accurate picture of model capabilities.
Run experiments to develop a science of predicting model performance: assess the accuracy and limitations of the techniques we're using to estimate the capabilities of the next generation of models by applying these to historical models.
Stick with the project and make yourself useful on whatever needs doing if we realize we need to throw lots of our work away and do something different
Research and design protocols that labs could follow when making decisions about scaling up or deploying models

Key Competencies:

Essential

A good working understanding of language model capabilities and of modern ML. Ability to reason about likely properties and abilities of future models, and paths to AI x-risk.
A good understanding of alignment risk, in order to identify core risks and translate abstract stories about risk into concrete measurements of existing models.
Creativity to think of new strategies for measurement, observable consequences of model properties we can’t directly observe, and possible risks posed by models. The ability to anticipate and address a wide range of novel challenges in measuring capabilities, for which there isn’t yet any good literature showing how to do it.
Strong basic coding skills: Able to design and build a nice system for managing experiments and data. Quickly answer questions about our data, make figures, run experiments given an API to a model.
ML skills: Experience with, or ability to rapidly get up to speed on, things like tuning hyperparameters for a finetuning run, investigate scaling laws for a particular property of interest, or reading papers and implementing relevant techniques we could use to improve model performance on our dataset.
Learning, ownership, and 'scrappiness': Quick to pick up whatever skills and knowledge are required to make the project succeed. Keeps the overall aims of the project in mind. Not afraid to point out if something should be done differently, even if it's not part of your core responsibility.

Possible bonus skills

Web dev experience. Able to rapidly prototype features and write clear, easy-to-extend code. We're currently using React, Typescript, Python, SQL, Flask.
Knowledge of particular domains relevant to model takeover, e.g. cybersecurity or ML engineering, in order to assess model competence on these tasks.
Deeper ML understanding: up to speed on SOTA interpretability research, able to reason about and develop tools that use weights to tell us more about the model's characteristics
Strategy + coordination: You have experience or aptitude with organizational politics at tech companies/scaling labs, coordination between organizations, development of regulations or agreements, or reasoning about overall strategic landscape for AI safety.

Fit

Reasons this might be a bad fit

There's a lot of uncertainty about whether we're on the right track, and we expect to pivot a lot. If that sounds stressful or demotivating, this might not be a good fit.
You want an experienced/senior manager - Beth has only a few months' managing experience (you can ask my SERI MATS fellows how that's gone!)
You want to work on challenging algorithmic problems or gritty GPU kernels. Most of the coding you'll be doing is likely to be pretty straightforward in some sense, we'll likely only be interacting with models via APIs.

Reasons this might be an especially good fit

You're excited about working on a very immature team/project, growing together, and helping shape it into something really awesome
Interacting a lot with cutting-edge models sounds interesting and fun
You find it motivating for your work to have a pretty clear and direct story for how it's relevant to x-risk.
You're interested in more conceptual-flavored alignment research, but also want to ground it in specific predictions about current and near-future models
You're interested in black-box interpretability and doing science on models

Other info

Location

We're based out of Constellation offices in Downtown Berkeley. WFH some days of the week is fine, but being fully remote is probably a dealbreaker.

Pay

TBC, but won’t be a significant downgrade from bay area tech roles (excluding equity).

Application process

You can do this initial application that's similar to the application for model interaction contractor role, which should be pretty straightforward but should also give you a sense of some of the types of work involved.

If this goes well you'll do a more involved trial task, which will take around 5-6 hours, for which you'll be reimbursed $600. You'll also chat with us more about the ideas behind the project, what you're looking for in a role, and any other questions you have.

After that we'd do a 3-month trial period. This is a trial both of you and of the role overall - we're still unsure about our longer-term hiring plans.

For the record, I think Jose Orallo (and his lab), in Valencia, Spain and CSER, Cambridge, is quite interested in this same exact topics (evaluation of AI models, specifically towards safety). Jose is a really good researcher, part of the FLI existential risk faculty community, and has previously organised AI Safety conferences. Perhaps it would be interesting for you to get to know each other.

Do you have plans to measure the alignment of pure RL agents, as opposed to repurposed language models? It surprised me a bit when I discovered that there isn’t a standard publicly available value learning benchmark, despite there being data to create one. An agent would be given first or third-person demonstrations of people trying to maximize their score in a game, and then it would try to do the same, without ever getting to see what the true reward function is. Having something like this would probably be very useful; it would allow us to directly measure goodharting, and being quantitative it might help incentivize regular ML researchers to work on alignment. Will you create something like this?

What do you think is important about pure RL agents vs RL-finetuned language models? I expect the first powerful systems to include significant pretraining so I don't really think much about agents that are only trained with RL (if that's what you were referring to).

How were you thinking this would measure Goodharting in particular?

I agree that seems like a reasonable benchmark to have for getting ML researchers/academics to work on imitation learning/value learning. I don't think I'm likely to prioritize it - I don't think 'inability to learn human values' is going to be a problem for advanced AI, so I'm less excited about value learning as a key thing to work on.

By pure RL, I mean systems whose output channel is only directly optimized to maximize some value function, even if it might be possible to create other kinds of algorithms capable of getting good scores on the benchmark.

I don’t think that the lack of pretraining is a good thing in itself, but that you are losing a lot when you move from playing video games to completing textual tasks.

If someone is told to get a high score in a video game, we have access to the exact value function they are trying to maximize. So when the AI is either trying to play the game in the human’s place or trying to help them, we can directly evaluate their performance without having to worry about deception. If it learns some proxy values and starts optimizing them to the point of goodharting, it will get a lower score. On most textual tasks that aren’t purely about information manipulation, on the other hand, the AI could be making up plausible-sounding nonsense about the consequences of its actions, and we wouldn't have any way of knowing.

From the AI’s point of view being able to see the state of the thing we care about also seems very useful, preferences are about reality after all. It’s not obvious at all that internet text contains enough information to even learn a model of human values useful in the real world. Training it with other sources of information that more closely represent reality, like online videos, might, but that seems closer to my idea than to yours since it can’t be used to perform language-model-like imitation learning.

Additionally, if by “inability to learn human values” you mean isolating them enough so that they can in principle be optimized to get superhuman performance, as opposed to being buried in its world model, I don’t agree that that will happen by default. Right now we don’t have any implementations of proper value learning algorithms, nor do I think that any known theoretical algorithm (like PreDCA) would work even with limitless computing power. If you can show that I’m wrong, that would surprise me a lot, and I think it could change many people’s research directions and the chances they give to alignment being solvable.

It surprised me a bit when I discovered that there isn’t a standard publicly available value learning benchmark, despite there being data to create one.

My guess is the issue here is a lack of a single standard, as opposed to there not being any? The closest thing there is to a standard in IRL/RLHF work are the Mujoco Gym and Atari environments. People also often make variants of Mujoco environments like Assistive Gym when they have a specific task in mind as well. Or they just use a real robot, or maybe a VR one.

If your concern is that researchers are using other policies as experts and not humans, well, there's always Atari-HEAD or the CrowdPlay Atari dataset (there's not an equivalent for Mujoco envs because humans can't really do well on those environments involved without assistance or a lot of practice). If you want something else, there's always D4RL.

The simplest possible acceptable value learning benchmark would look something like this:

Data is recorded of people playing a video game. They are told to maximize their reward (which can be exactly computed), have no previous experience playing the game, are actually trying to win and are clearly suboptimal (imitation learning would give very bad results).
The bot is first given all their inputs and outputs, but not their rewards.
Then it can play the game in place of the humans but again isn’t given the rewards. Preferably the score isn’t shown on screen.
The goal is to maximize the true reward function.
These rules are precisely described and are known by anyone who wants to test their algorithms.

None of the environments and datasets you mention are actually like this. Some people do test their IRL algorithms in a way similar to this (the difference being that they learn from another bot), but the details aren’t standardized.

A harder and more realistic version that I have yet to see in any paper would look something like this:

Data is recorded of people playing a game with a second player. The second player can be a human or a bot, and friendly, neutral or adversarial.
The IO of both players is different, just like different people have different perspectives in real life.
A very good imitation learner is trained to predict the first player's output given their input. It comes with the benchmark.
The bot to be tested (which is different from the previous ones) has the same IO channels as the second player, but doesn't see the rewards. It also isn't given any of the recordings.
Optionally, it also receives the output of a bad visual object detector meant to detect the part of the environment directly controlled by the human/imitator.
It plays the game with the human imitator.
The goal is to maximize the human’s reward function.

It’s far from perfect, but if someone could obtain good scores there, it would probably make me much more optimistic about the probability of solving alignment.

None of the environments and datasets you mention are actually like this.

Every single algorithmic IRL paper on video games does this, at least with Deep RL demonstrators. (Here's a list of 4 examples: https://arxiv.org/abs/1810.10593, https://proceedings.mlr.press/v97/brown19a.html, https://arxiv.org/abs/1902.07742, https://arxiv.org/abs/2002.09089, )

If you care about human demonstrations, it seems like Atari-HEAD and the CrowdPlay Atari dataset both do exactly this? And while there haven't been too much work in this area, a quick Google search let me find two papers that do analyze IRL variants on Atari-HEAD: https://arxiv.org/abs/1908.02511v2 and https://arxiv.org/abs/2004.00981v2 .

My guess is the reason there hasn't been much recent work in this area is because there just aren't many people who think that value learning from demonstrations is interesting (instead, people have moved to pairwise comparisons of trajectories or language feedback). In addition, as LMs have become more capable, most of the existing value learning researchers have also moved on from working with video games to moving on LMs.

I don’t think that the lack of pretraining is a good thing in itself, but that you are losing a lot when you move from playing video games to completing textual tasks.

It surprised me a bit when I discovered that there isn’t a standard publicly available value learning benchmark, despite there being data to create one.

The simplest possible acceptable value learning benchmark would look something like this:

Data is recorded of people playing a video game. They are told to maximize their reward (which can be exactly computed), have no previous experience playing the game, are actually trying to win and are clearly suboptimal (imitation learning would give very bad results).
The bot is first given all their inputs and outputs, but not their rewards.
Then it can play the game in place of the humans but again isn’t given the rewards. Preferably the score isn’t shown on screen.
The goal is to maximize the true reward function.
These rules are precisely described and are known by anyone who wants to test their algorithms.

A harder and more realistic version that I have yet to see in any paper would look something like this:

Data is recorded of people playing a game with a second player. The second player can be a human or a bot, and friendly, neutral or adversarial.
The IO of both players is different, just like different people have different perspectives in real life.
A very good imitation learner is trained to predict the first player's output given their input. It comes with the benchmark.
The bot to be tested (which is different from the previous ones) has the same IO channels as the second player, but doesn't see the rewards. It also isn't given any of the recordings.
Optionally, it also receives the output of a bad visual object detector meant to detect the part of the environment directly controlled by the human/imitator.
It plays the game with the human imitator.
The goal is to maximize the human’s reward function.

It’s far from perfect, but if someone could obtain good scores there, it would probably make me much more optimistic about the probability of solving alignment.

None of the environments and datasets you mention are actually like this.

99

Evaluations project @ ARC is hiring a researcher and a webdev/engineer

99

Ω 45

About the project

Big picture

Current goals

What we’ve done so far

People + Logistics

Job Descriptions

Generalist Software Engineer, Web Dev and ML focus

DEC 2022 EDIT: SEE NEW POSTING HERE

Background

Job Mission:

Key Outcomes:

Someday/maybe outcomes:

Key Competencies:

Essential

Nice-to-have

Fit

Reasons this might be a bad fit for you

Reasons this might be an especially good fit

Other info

Location

Pay

Application process

===================================================

Researcher

DEC 2022 EDIT: SEE NEW POSTING HERE

Job Mission:

Key Competencies:

Essential

Possible bonus skills

Fit

Reasons this might be a bad fit

Reasons this might be an especially good fit

Other info

Location

Pay

Application process

99

Ω 45

99

Ω 45